System Design: Ad Click Tracking — High-Throughput Event Ingestion and Fraud Detection
The interview prompt is deceptively simple: “Design Google’s ad click tracking system.” Underneath lies one of the most financially critical data pipelines in tech: 10 billion clicks per day, each potentially representing billable revenue. Miss a click? An advertiser underpays. Double-count a click? An advertiser overpays and may dispute the charge. Let a bot click through? That feeds a fraud problem estimated at $120 billion per year globally.
This post walks through the full system: ingestion, deduplication, fraud detection, aggregation, and billing. Each piece involves real engineering tradeoffs with no obviously correct answer.
1. Scale and requirements
Start with numbers. “10 billion clicks/day” needs to be translated into engineering constraints before you can design anything.
| Requirement | Value |
|---|---|
| Clicks per day | 10,000,000,000 |
| Average clicks/sec | ~115,000 |
| Peak clicks/sec (5× average) | ~575,000 |
| Click payload size | ~500 bytes |
| Click-to-redirect latency | < 200ms (user-visible) |
| Data durability | Zero loss — each click is billable |
| Deduplication window | 5 minutes per (userId, adId) pair |
| Report freshness | Real-time (<1 min delay) + full history |
| Retention | 7 years (legal and billing audit) |
The redirect latency requirement is the hardest constraint. A user clicked an ad and is waiting to land on advertiser.com. The tracking infrastructure sits in their critical path. Any slowdown they notice degrades the perceived quality of search results — Google’s core product.
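An average of 115,000 clicks/sec means the system must scale horizontally with no single hot shard, and the 5× peak of ~575,000 clicks/sec is what capacity must actually be sized for. At 500 bytes each, the average rate is ~57.5 MB/sec of click data, ~5 TB/day raw, and ~1.8 PB/year raw before compression.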
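The arithmetic above can be reproduced with a few lines of Go. This is a minimal sketch: the constants come from the requirements table, while the function names are illustrative only. Note that the exact average is slightly above the rounded 115,000 used in the text.

```go
package main

import "fmt"

// Sizing inputs taken from the requirements table.
const (
	clicksPerDay   = 10_000_000_000
	payloadBytes   = 500
	peakMultiplier = 5
	secondsPerDay  = 86_400
)

// avgClicksPerSec returns the average click rate implied by the daily volume.
func avgClicksPerSec() float64 {
	return float64(clicksPerDay) / secondsPerDay
}

// ingestMBPerSec converts a click rate into decimal megabytes per second.
func ingestMBPerSec(clicksPerSec float64) float64 {
	return clicksPerSec * payloadBytes / 1e6
}

// rawTBPerDay returns the raw daily volume in decimal terabytes.
func rawTBPerDay() float64 {
	return float64(clicksPerDay) * payloadBytes / 1e12
}

func main() {
	avg := avgClicksPerSec()
	fmt.Printf("average clicks/sec: %.0f\n", avg)
	fmt.Printf("peak clicks/sec:    %.0f\n", avg*peakMultiplier)
	fmt.Printf("average ingest:     %.1f MB/sec\n", ingestMBPerSec(avg))
	fmt.Printf("raw volume:         %.1f TB/day, %.2f PB/year\n",
		rawTBPerDay(), rawTBPerDay()*365/1000)
}
```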
2. The click redirect flow
When a user clicks an ad, the browser follows a URL like:
    -- The rendered anchor in a Google Search result
    https://ads.google.com/click?adId=A123&pub=P456&redirect=https%3A%2F%2Fadvertiser.com%2Fproduct
The user never sees advertiser.com until after the redirect lands. This gives the tracking service a window to record the click. Three implementation options:
Option A — Synchronous (simplest, adds latency):
Record the click in a database, then return 302 Found. Every click touches storage before the user navigates. Adds 50–200ms. Unacceptable at 115k clicks/sec.
Option B — Fire-and-forget beacon (fastest, some loss):
Return the redirect immediately. The browser fires navigator.sendBeacon() in the background. Risk: beacons are dropped on fast page unload, network failure, or certain mobile browsers. Loss rate 0.1–1%. For a billing system, this is a financial liability.
Option C — Hybrid async (production choice):
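The Click Collector receives the request, publishes the event to Kafka, then returns 302. With acks=1 the producer waits only for the partition leader (typically under 5ms), so a leader failure before replication can still lose a click; a strict zero-loss requirement points to acks=all with min.insync.replicas=2, at the cost of a few extra milliseconds. Kafka provides durability, and background processors handle everything else. Users see ~20–40ms of total overhead, which is imperceptible.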
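    // Click Collector request handler (Go pseudocode).
    // extractUserId, geoIP, kafkaProducer, and isAllowedRedirect are assumed helpers;
    // isAllowedRedirect is hypothetical and checks the target against the ad's landing page.
    func HandleClick(w http.ResponseWriter, r *http.Request) {
        adId := r.URL.Query().Get("adId")
        redirect := r.URL.Query().Get("redirect")

        // Reject unknown destinations so the endpoint cannot be used as an open redirector.
        if adId == "" || !isAllowedRedirect(adId, redirect) {
            http.Error(w, "invalid click URL", http.StatusBadRequest)
            return
        }

        // r.RemoteAddr has the form "host:port"; keep only the host for IP and geo lookup.
        ip, _, err := net.SplitHostPort(r.RemoteAddr)
        if err != nil {
            ip = r.RemoteAddr
        }

        click := ClickEvent{
            ID:         uuid.New().String(), // idempotency key
            AdID:       adId,
            UserID:     extractUserId(r),
            IP:         ip,
            UserAgent:  r.Header.Get("User-Agent"),
            Referrer:   r.Referer(),
            Timestamp:  time.Now().UTC(),
            GeoCountry: geoIP.Lookup(ip),
        }

        // Non-blocking: hand off to the Kafka producer, do not wait for any consumer
        kafkaProducer.ProduceAsync(click)

        // Immediately redirect the user; tracking is off the critical path
        http.Redirect(w, r, redirect, http.StatusFound)
    }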
The idempotency UUID is generated at collection time. If the request retries (browser back/forward, network retry), a new UUID is issued — but downstream deduplication catches duplicates by (userId, adId, 5-minute window).
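One way to keep the publish step off the request path is a bounded in-process queue that a background worker drains into the broker client. The sketch below is a hedged illustration using only the standard library: `AsyncPublisher`, `Enqueue`, `Drain`, and `ErrBufferFull` are hypothetical names, and the `send` callback stands in for a real Kafka producer call.

```go
package main

import (
	"errors"
	"fmt"
)

// ClickEvent is a trimmed-down version of the event built by the collector.
type ClickEvent struct {
	ID   string
	AdID string
}

// ErrBufferFull signals that the in-process queue cannot accept more events.
var ErrBufferFull = errors.New("publish buffer full")

// AsyncPublisher is a hypothetical wrapper around a bounded queue.
type AsyncPublisher struct {
	queue chan ClickEvent
}

// NewAsyncPublisher creates a publisher that buffers up to capacity events.
func NewAsyncPublisher(capacity int) *AsyncPublisher {
	return &AsyncPublisher{queue: make(chan ClickEvent, capacity)}
}

// Enqueue never blocks the request goroutine. It fails fast when the buffer
// is full so the caller can spill to local disk instead of silently dropping.
func (p *AsyncPublisher) Enqueue(e ClickEvent) error {
	select {
	case p.queue <- e:
		return nil
	default:
		return ErrBufferFull
	}
}

// Drain passes every currently queued event to send and returns the count.
func (p *AsyncPublisher) Drain(send func(ClickEvent)) int {
	n := 0
	for {
		select {
		case e := <-p.queue:
			send(e)
			n++
		default:
			return n
		}
	}
}

func main() {
	p := NewAsyncPublisher(2)
	for i := 0; i < 3; i++ {
		err := p.Enqueue(ClickEvent{ID: fmt.Sprintf("c%d", i), AdID: "A123"})
		fmt.Println("enqueue", i, "error:", err)
	}
	sent := p.Drain(func(e ClickEvent) { fmt.Println("sent", e.ID) })
	fmt.Println("drained:", sent)
}
```

The important design choice is what happens on `ErrBufferFull`: for a billing pipeline the event should be written somewhere durable (for example a local append-only file) rather than discarded.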
3. The ingestion pipeline
The full pipeline from click to storage has five stages. Each is independently scalable. Kafka decouples producers from consumers — if the fraud detector is slow, it simply builds consumer lag without blocking the user's redirect or the Click Collector.
[Pipeline diagram not reproduced. Recoverable labels: Click Collector → Kafka topic ad-clicks → Stream Processor → ClickHouse + Redis.]
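Kafka is partitioned by adId so all clicks for a given ad go to the same partition. This preserves per-ad ordering and makes deduplication and per-ad aggregation efficient. With 100 partitions at 115k clicks/sec average, each partition handles ~1,150 clicks/sec, or about 575 KB/sec at 500 bytes per click (roughly 2.9 MB/sec at the 5× peak), far below what a single Kafka partition can sustain. The caveat is key skew: a very popular ad concentrates its traffic on one partition, so hot adIds may need a salted key to keep any single partition from becoming the bottleneck.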
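Key-based partitioning can be sketched as a hash of the key modulo the partition count. The example below is an illustration only: it uses FNV-1a from the standard library, whereas Kafka's default partitioner uses a different hash, and `saltedKey` is a hypothetical helper for spreading a hot ad across sub-keys. Salting by user keeps every (userId, adId) pair on one partition, so dedup locality is preserved.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a key to one of n partitions using FNV-1a (n must be > 0).
func partitionFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// saltedKey spreads one ad across fanout sub-keys, chosen by hashing the user.
// All clicks from the same user on the same ad still share a key.
func saltedKey(adID, userID string, fanout int) string {
	return fmt.Sprintf("%s#%d", adID, partitionFor(userID, fanout))
}

func main() {
	const partitions = 100

	// Plain adId key: every click for A123 lands on one partition.
	fmt.Println("A123 partition:", partitionFor("A123", partitions))

	// Salted key: a hot ad is spread over up to 8 partitions.
	for _, user := range []string{"user-1", "user-2", "user-3"} {
		key := saltedKey("A123", user, 8)
		fmt.Println(user, "key:", key, "partition:", partitionFor(key, partitions))
	}
}
```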
The Stream Processor (Kafka Streams or Apache Flink) handles three tasks in a single pipeline pass:
- Deduplication: check Redis SET NX, mark duplicates before writing to ClickHouse
- Fraud scoring: apply rule-based signals and ML model score; invalidate fraudulent clicks
- Aggregation: increment Redis real-time counters; batch-write raw events to ClickHouse
4. Deduplication
The business rule: the same user clicking the same ad twice within 5 minutes counts as one billable click. This is a set-membership problem with a TTL.
Redis SET NX with expiry:
    -- Redis command: atomic, no race condition possible
    SET "click:user-8821:adA-456" "1" NX EX 300
    -- Returns OK  → first click in the 5-minute window → billable
    -- Returns nil → key already existed → duplicate → not billable
NX means “set only if Not eXists” and is atomic. No race condition between two concurrent processors checking the same (userId, adId) pair.
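Memory calculation: At 115,000 clicks/sec with a 300-second window, the maximum number of live keys is 115,000 × 300 = 34.5 million (an upper bound that assumes every click is a distinct pair). At ~50 bytes per key (prefix + userId + adId), that is approximately 1.7 GB of key data. Redis adds per-key overhead for the dictionary entry and expiry metadata, so real usage will be higher, and a sustained 5× peak multiplies the key count accordingly. With Redis Cluster across three shards, each shard holds ~600 MB of key data, which is still small.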
The dedup key is intentionally scoped to (userId, adId), not (IP, adId). A household with five people can each legitimately click the same ad. IP-based dedup would produce false positives for shared networks (offices, universities, mobile carrier NAT).
    // Stream processor dedup logic (Go pseudocode)
    func processClick(click ClickEvent) error {
        key := "click:" + click.UserID + ":" + click.AdID
        isNew, err := redis.SetNX(key, "1", 5*time.Minute)
        if err != nil {
            // Do not guess on a Redis failure: return the error so the Kafka
            // offset is not committed and the click is retried.
            return err
        }
        if !isNew {
            click.Billable = false
            click.DedupReason = "window_duplicate"
        } else {
            click.Billable = true
        }
        return writeToClickHouse(click) // always write, both billable and deduped
    }
Non-billable clicks are still persisted to ClickHouse. Advertisers can audit total clicks vs. unique billable clicks — that transparency is important for trust.
Handling Redis failure: Redis can lose data on restart without AOF persistence. The tradeoff: if a dedup key is lost, a duplicate gets billed (advertiser slightly overpays). The pragmatic choice is appendfsync everysec — accept up to 1 second of dedup data loss over adding 1–2ms disk latency per click. Nightly reconciliation catches and credits these rare discrepancies.
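One subtlety is worth a sketch: if a Kafka message is redelivered, the second pass finds the dedup key already set and would mark the original click as a duplicate. A possible mitigation, shown here as an assumption rather than the author's design, is to store the winning click ID as the key's value (in Redis: `SET key <clickId> NX EX 300`, then `GET` and compare on a miss). `DedupStore` below is an in-memory stand-in for Redis, is not safe for concurrent use, and exists only to show the logic.

```go
package main

import (
	"fmt"
	"time"
)

// dedupWindow mirrors the 5-minute business rule.
const dedupWindow = 5 * time.Minute

// demoStart is an arbitrary fixed instant used by the example.
var demoStart = time.Date(2026, 6, 19, 14, 0, 0, 0, time.UTC)

type entry struct {
	clickID string
	expires time.Time
}

// DedupStore is an in-memory stand-in for Redis SET NX EX (illustration only).
type DedupStore struct {
	keys map[string]entry
}

func NewDedupStore() *DedupStore {
	return &DedupStore{keys: map[string]entry{}}
}

// IsBillable records the first click per (userID, adID) within the window.
// A redelivered copy of that same click ID is still reported as billable,
// while a different click ID inside the window is a duplicate.
func (s *DedupStore) IsBillable(userID, adID, clickID string, now time.Time) bool {
	key := "click:" + userID + ":" + adID
	e, ok := s.keys[key]
	if !ok || !now.Before(e.expires) {
		s.keys[key] = entry{clickID: clickID, expires: now.Add(dedupWindow)}
		return true
	}
	return e.clickID == clickID
}

func main() {
	s := NewDedupStore()
	fmt.Println("first click:   ", s.IsBillable("user-8821", "adA-456", "c1", demoStart))
	fmt.Println("second click:  ", s.IsBillable("user-8821", "adA-456", "c2", demoStart.Add(time.Minute)))
	fmt.Println("replay of c1:  ", s.IsBillable("user-8821", "adA-456", "c1", demoStart.Add(2*time.Minute)))
	fmt.Println("after window:  ", s.IsBillable("user-8821", "adA-456", "c3", demoStart.Add(dedupWindow)))
}
```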
5. Fraud detection
Click fraud is an arms race: bot operators continuously evolve to evade detection. The five categories of rule-based signals are applied in real time.
[Interactive demo not reproduced: a live click log with columns Time, IP Address, User ID, Ad ID, Status, and Rule Triggered.]
Beyond rule-based detection, production systems add a gradient-boosted ML model that scores each click 0–100 for fraud probability. Features include: device fingerprint entropy, mouse movement patterns before the click, session behavior (did the user scroll? hover over the ad?), historical conversion rate for the publisher domain, and network topology signals (VPN exit nodes, datacenter IP ranges, Tor exits).
The ML model runs asynchronously — clicks are initially accepted, then retroactively invalidated within a 2-hour window if the model flags them. This prevents blocking the user redirect on ML inference latency (which may be 20–100ms on a complex model). Advertisers are credited for any clicks invalidated during the grace window.
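The retroactive invalidation step can be sketched as a small function that applies a late-arriving model score to an already accepted click. This is a hedged illustration: the 2-hour grace window comes from the text, while `fraudThreshold`, the `Click` fields, and `ApplyModelScore` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// graceWindow is the 2-hour invalidation window described in the text.
const graceWindow = 2 * time.Hour

// fraudThreshold is an assumed cutoff on the 0-100 score, for illustration.
const fraudThreshold = 80.0

// demoClickTime is an arbitrary fixed instant used by the example.
var demoClickTime = time.Date(2026, 6, 19, 14, 0, 0, 0, time.UTC)

// Click holds only the fields this sketch needs.
type Click struct {
	ID            string
	At            time.Time
	CPCMicroCents int64
	Billable      bool
}

// ApplyModelScore invalidates a billable click when the score crosses the
// threshold inside the grace window, and returns the credit owed to the
// advertiser in micro-cents (0 when nothing changes).
func ApplyModelScore(c *Click, score float64, scoredAt time.Time) int64 {
	if !c.Billable || score < fraudThreshold {
		return 0
	}
	if scoredAt.Sub(c.At) > graceWindow {
		return 0 // too late: the click stays billed and goes to manual review
	}
	c.Billable = false
	return c.CPCMicroCents
}

func main() {
	c := Click{ID: "c1", At: demoClickTime, CPCMicroCents: 35_000_000, Billable: true}
	credit := ApplyModelScore(&c, 95, demoClickTime.Add(30*time.Minute))
	fmt.Println("credit (micro-cents):", credit, "billable:", c.Billable)
}
```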
6. Aggregation and reporting
Advertisers query a dashboard: “Show me clicks, impressions, CTR, and spend for campaign C-789 for the last 30 days, broken down by keyword and device type.”
This is an OLAP workload. ClickHouse is purpose-built for it: columnar storage, vectorized query execution, and excellent compression on repetitive values (adId, campaignId appear billions of times and compress 10:1+).
ClickHouse schema:
    CREATE TABLE ad_clicks (
        click_id      UUID,
        ad_id         String,
        campaign_id   String,
        advertiser_id String,
        publisher_id  String,
        user_id       String,
        ip_address    String,
        geo_country   FixedString(2),
        device_type   Enum8('desktop'=1, 'mobile'=2, 'tablet'=3),
        keyword       String,
        timestamp     DateTime,
        is_billable   UInt8,
        fraud_score   Float32,
        cpc           Decimal(10, 6)
    ) ENGINE = MergeTree()
    PARTITION BY toYYYYMM(timestamp)
    ORDER BY (advertiser_id, campaign_id, timestamp)
    TTL timestamp + INTERVAL 7 YEAR;
Advertiser 30-day report query:
    SELECT
        toDate(timestamp)             AS day,
        keyword,
        device_type,
        count(*)                      AS total_clicks,
        countIf(is_billable = 1)      AS billable_clicks,
        sumIf(cpc, is_billable = 1)   AS spend,
        avg(fraud_score)              AS avg_fraud_score
    FROM ad_clicks
    WHERE advertiser_id = 'ADV-12345'
      AND campaign_id = 'C-789'
      -- Half-open range covering exactly 30 days; BETWEEN would also include
      -- the first instant of the end date.
      AND timestamp >= '2026-05-20 00:00:00'
      AND timestamp <  '2026-06-19 00:00:00'
    GROUP BY day, keyword, device_type
    ORDER BY day DESC, spend DESC;
ClickHouse executes this over 30 days of data with partition pruning on toYYYYMM(timestamp) and primary-key skipping on (advertiser_id, campaign_id). Expected latency: under 500ms even at full scale.
Real-time dashboard (last 60 minutes):
Redis counters give sub-second freshness. The stream processor increments atomically:
    -- Atomic pipeline for each billable click (Redis pipeline)
    INCR "stats:adA123:clicks:2026061914"              -- hourly bucket key
    INCR "stats:adA123:billable:2026061914"
    INCRBYFLOAT "stats:adA123:spend:2026061914" 0.35
    EXPIRE "stats:adA123:clicks:2026061914" 7200
    EXPIRE "stats:adA123:billable:2026061914" 7200
    EXPIRE "stats:adA123:spend:2026061914" 7200
The dashboard API reads Redis keys for the current and previous hour (real-time view) and queries ClickHouse for anything older. This gives advertisers a unified experience where recent data feels live.
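The hourly bucket keys the dashboard reads can be derived from the current time. The sketch below follows the `stats:<adId>:<metric>:<YYYYMMDDHH>` pattern used in the Redis commands above; the function names are illustrative, and treating the buckets as UTC is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// sampleNow is an arbitrary fixed instant used by the example.
var sampleNow = time.Date(2026, 6, 19, 14, 30, 0, 0, time.UTC)

// hourBucketKey builds the Redis counter key for one metric and one hour.
// The layout "2006010215" is Go's reference-time form of YYYYMMDDHH.
func hourBucketKey(adID, metric string, t time.Time) string {
	return fmt.Sprintf("stats:%s:%s:%s", adID, metric, t.UTC().Format("2006010215"))
}

// realtimeKeys returns the keys for the current and the previous hour,
// which together back the "last 60 minutes" dashboard view.
func realtimeKeys(adID, metric string, now time.Time) []string {
	return []string{
		hourBucketKey(adID, metric, now),
		hourBucketKey(adID, metric, now.Add(-time.Hour)),
	}
}

func main() {
	for _, key := range realtimeKeys("adA123", "clicks", sampleNow) {
		fmt.Println(key)
	}
}
```

Anything older than the previous hour falls outside these two keys and would be served from ClickHouse.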
7. Exactly-once processing challenge
The most critical correctness requirement: each click must be billed exactly once — not zero times (lost revenue), not twice (advertiser overpay, chargeback risk).
The failure modes:
| Scenario | What happens | Solution |
|---|---|---|
| Processor crashes mid-batch | Kafka offset not committed; batch reprocessed | Idempotent write keyed on click_id (ClickHouse has no unique constraints, so dedup relies on ReplacingMergeTree) |
| Network timeout writing to ClickHouse | Retry produces duplicate row | click_id UUID dedup key |
| Kafka message delivered twice | Consumer processes same message twice | UUID idempotency + Redis NX |
| Click Collector crashes after Kafka ack | Message in Kafka, collector never confirms | Kafka durability guarantees delivery |
| Redis dedup key expires before window ends | Duplicate click billed in same window | Accept; reconcile in nightly batch |
Idempotent writes to ClickHouse using ReplacingMergeTree:
    -- ReplacingMergeTree deduplicates rows with same ORDER BY key in background
    CREATE TABLE ad_clicks_deduped (
        click_id UUID,
        -- ... other columns ...
        version  DateTime DEFAULT now()
    ) ENGINE = ReplacingMergeTree(version)
    ORDER BY click_id;

    -- FINAL forces merge at query time; use for billing queries only
    SELECT count(*)
    FROM ad_clicks_deduped FINAL
    WHERE advertiser_id = 'ADV-12345'
      AND is_billable = 1;
FINAL forces ClickHouse to merge duplicate rows at query time, guaranteeing exactly-once semantics for billing reads. It is slower than a regular scan, so use it only for billing — dashboards can skip it and accept minor inaccuracy.
The at-least-once contract: Production systems accept at-least-once delivery from Kafka and enforce exactly-once semantics at the write layer via idempotency keys. Kafka’s native exactly-once transactions exist but add ~30% latency overhead and significant operational complexity, and they only cover reads and writes that stay inside Kafka, not side effects in Redis or ClickHouse. That cost is not worth paying when the write layer can dedup by click_id with ReplacingMergeTree.
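The idea of an idempotent write layer can be illustrated with an in-memory sink that keeps one row per click ID, choosing the highest version much as ReplacingMergeTree does after a merge. This is a sketch under stated assumptions: `Sink`, `Row`, and the integer `Version` field are hypothetical and stand in for the ClickHouse table.

```go
package main

import "fmt"

// Row is a minimal billing row keyed by ClickID.
type Row struct {
	ClickID  string
	Version  int64
	Billable bool
}

// Sink keeps at most one row per ClickID, so replays do not inflate counts.
type Sink struct {
	rows map[string]Row
}

func NewSink() *Sink {
	return &Sink{rows: map[string]Row{}}
}

// Write is idempotent: writing the same row again changes nothing, and a
// row with a higher (or equal) version replaces the stored one.
func (s *Sink) Write(r Row) {
	cur, ok := s.rows[r.ClickID]
	if !ok || r.Version >= cur.Version {
		s.rows[r.ClickID] = r
	}
}

// BillableCount counts distinct billable clicks, like a FINAL billing read.
func (s *Sink) BillableCount() int {
	n := 0
	for _, r := range s.rows {
		if r.Billable {
			n++
		}
	}
	return n
}

func main() {
	s := NewSink()
	batch := []Row{
		{ClickID: "c1", Version: 1, Billable: true},
		{ClickID: "c2", Version: 1, Billable: true},
	}
	// Simulate at-least-once delivery: the whole batch is processed twice.
	for pass := 0; pass < 2; pass++ {
		for _, r := range batch {
			s.Write(r)
		}
	}
	fmt.Println("billable clicks:", s.BillableCount())
}
```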
8. Billing pipeline
Every verified click — non-fraud, non-deduplicated — triggers a budget decrement. Advertisers set a daily budget; when it is exhausted, their ads stop showing immediately.
    // Budget decrement: atomic Redis operation (Go pseudocode)
    // Budget stored as micro-cents to avoid float precision issues
    remaining := redis.DecrBy("budget:ADV-12345:daily", cpcMicroCents)
    if remaining <= 0 {
        // Publish to the ad-paused topic; Ad Servers stop bidding within ~100ms
        kafka.Produce("ad-paused", AdPauseEvent{
            AdvertiserID: "ADV-12345",
            Reason:       "daily_budget_exhausted",
            Timestamp:    time.Now().UTC(),
        })
    }
The Ad Server subscribes to the ad-paused topic. On receiving a pause event, it removes the advertiser from the active bid pool within ~100ms. This is eventual consistency: a handful of clicks may slip through immediately after budget exhaustion, and that is an accepted cost. In this design the overshoot is bounded by a hard cap at budget × 1.2, so advertisers are never charged more than 20% over their daily budget. (Google Ads' published overdelivery policy has changed over time, so treat the 20% figure as this design's parameter rather than a current contractual term.)
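A small refinement to the decrement logic is to emit the pause event only on the click that crosses zero, instead of on every click after exhaustion. The sketch below models the Redis counter with `sync/atomic` purely for illustration; `Budget` and `Charge` are hypothetical names, and amounts are in micro-cents as in the text.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Budget holds the remaining daily budget in micro-cents.
// The field is accessed only through sync/atomic.
type Budget struct {
	remaining int64
}

func NewBudget(microCents int64) *Budget {
	return &Budget{remaining: microCents}
}

// Charge atomically subtracts cost and reports the new balance.
// pause is true only for the single charge that takes the balance
// from positive to zero or below, so one pause event is published.
func (b *Budget) Charge(cost int64) (remaining int64, pause bool) {
	after := atomic.AddInt64(&b.remaining, -cost)
	before := after + cost
	return after, before > 0 && after <= 0
}

func main() {
	b := NewBudget(100)
	for i := 0; i < 3; i++ {
		remaining, pause := b.Charge(60)
		fmt.Println("remaining:", remaining, "publish pause event:", pause)
	}
}
```

With Redis, the same check works because `DECRBY` returns the new value: the caller adds the cost back to recover the previous balance.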
Daily budget reset:
    -- Midnight UTC cron: reset budget and set 24-hour expiry
    SET "budget:ADV-12345:daily" 50000000000     -- $500.00 = 50,000,000,000 micro-cents
    EXPIRE "budget:ADV-12345:daily" 86400        -- auto-expire after 24 hours
Reconciliation: a nightly batch job compares the total decrements recorded in Redis against the sum of cpc for is_billable = 1 rows in ClickHouse for the same day. Discrepancies over 0.01% trigger a PagerDuty alert. Discrepancies under 0.01% (expected from crash windows) are credited to the advertiser automatically.
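The reconciliation rule can be written as a pure function over the two daily totals. This is a minimal sketch: the 0.01% threshold is from the text, while the function name, the action labels, and the choice of the ClickHouse total as the baseline are assumptions.

```go
package main

import "fmt"

// reconcile compares the Redis decrement total with the ClickHouse billable
// total (both in micro-cents) and decides what the nightly job should do.
// Integer math avoids float rounding: 0.01% equals 1 part in 10,000.
func reconcile(redisTotal, clickhouseTotal int64) (diff int64, action string) {
	diff = redisTotal - clickhouseTotal
	abs := diff
	if abs < 0 {
		abs = -abs
	}
	switch {
	case abs == 0:
		return diff, "match"
	case abs*10_000 > clickhouseTotal:
		return diff, "alert" // over 0.01%: page the on-call engineer
	default:
		return diff, "auto_credit" // within 0.01%: credit the advertiser
	}
}

func main() {
	cases := [][2]int64{
		{1_000_000, 1_000_000},
		{1_000_050, 1_000_000},
		{1_000_200, 1_000_000},
	}
	for _, c := range cases {
		diff, action := reconcile(c[0], c[1])
		fmt.Println("diff:", diff, "action:", action)
	}
}
```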
9. Capacity estimate
| Component | Specification | Reasoning |
|---|---|---|
| Click Collector servers | 50 × 2,300 clicks/sec | ~1.15 MB/s each at average load (size for the 5× peak); limited by Kafka produce latency |
| Kafka partitions | 100 | 1,150 clicks/partition/sec, well below Kafka limits |
| Kafka retention | 7 days | Full replay window for disaster recovery |
| Kafka brokers | 15 | Replication factor 3; ~7 partition leaders and ~20 partition replicas per broker |
| Redis dedup cluster | 3 shards × 2 GB RAM | 34.5M keys × ~50 bytes = ~1.7 GB + 20% headroom |
| Redis budget cluster | 1 shard | Small dataset — one entry per advertiser |
| Stream Processor nodes | 30 | 100 partitions / ~3 partitions per node |
| ClickHouse cluster | 10 shards × 3 replicas | ~5 TB/day raw; ~500 GB/day compressed in total, ~50 GB/day per shard |
| ClickHouse storage/year | ~180 TB compressed | 10:1 compression on repetitive ad attribution data |
| Raw storage/year | ~1.8 PB | Before compression; tiered to cold object storage after 90 days |
Back-of-envelope check:
- 115,000 clicks/sec × 500 bytes = 57.5 MB/sec ingest rate
- ClickHouse handles up to 1 GB/sec per server with columnar storage
- 10 shards means ~5.75 MB/sec per shard — enormous headroom, room to grow 100×
- 30-day query at 10B rows/day: ~500ms with partition pruning, sub-second on warm cache
The bottleneck in most real deployments is ClickHouse disk I/O during background merges on ingest-heavy days. Use separate replica groups for ingest vs. query to prevent merge I/O from spiking query latency.
10. Notes and observations
Click fraud is a $120 billion per year problem (2023 estimate by Juniper Research). Google’s Invalid Click Detection is one of the most sophisticated fraud detection systems ever built — they claim to catch over 99% of fraudulent clicks before billing advertisers. The remaining under 1% is refunded when advertisers report it through the Invalid Activity report in Google Ads.
The three genuinely hard problems in this system are not the ones that sound hard. Kafka partitioning, Redis dedup, ClickHouse schema — these are well-understood patterns with documented solutions. The hard problems are:
- Fraud detection at the feature level. Distinguishing a legitimate power user (a real estate agent clicking 50 competitor ads to research pricing) from a bot. Rule-based systems generate too many false positives. ML models require labeled training data, and fraudsters adapt continuously. The feedback loop between fraudster and detector runs faster than any release cycle.
- Budget enforcement with 100ms global consistency. An advertiser’s ad must stop showing within seconds of budget exhaustion, across thousands of Ad Server instances deployed in 30+ regions globally. This requires a distributed cache invalidation protocol that is faster than the incoming click rate. Pub/sub (Kafka or Redis) to all Ad Server pods is the production approach, but it is not a solved problem when pods number in the thousands.
- Seven-year billing auditability. Regulators and enterprise advertisers require click-level audit trails. Every click, with its fraud score, dedup decision, billing record, and the exact version of the fraud detection model that scored it, must be queryable for 7 years. ClickHouse with tiered storage (hot NVMe SSD → warm HDD → cold object storage) handles this. The model versioning is the underappreciated part: you must be able to replay fraud scoring with a historical model version to answer advertiser disputes.
The first documented click fraud case was in 2004 — a website owner was clicking competitors’ ads to drain their Google Ads budgets. Google and competitors had to rebuild their entire billing infrastructure around fraud detection within the first few years of AdWords. This is why Google’s system treats fraud detection as a first-class architectural concern rather than an afterthought bolted on later.
The redirect flow also carries a subtle UX detail: Google’s click tracking URL persists in browser history, so a user navigating back sees ads.google.com/click?... instead of advertiser.com. This is intentional — it lets Google distinguish ad-click return visits from organic return visits, which matters for conversion attribution modelling.
Google processes over 8.5 billion searches per day, each potentially showing 3–5 ads. That is roughly 25–42 billion ad impressions, so even at a 1% click-through rate it comes to about 255–425 million ad clicks from Search alone. YouTube, Gmail, and the Display Network add to that total. The click tracking infrastructure is, financially speaking, what funds Google’s entire AI research program. The engineering investment in making it 99.99% reliable and fraud-resistant is not optional; it is the business.
The system described here is deliberately not over-engineered. Many companies start with Kafka → PostgreSQL. The jump to ClickHouse, Redis Cluster, and ML fraud scoring is earned as scale demands it. If you are designing this in an interview, identifying the right scaling triggers — when does a single Postgres instance break down? when does Redis dedup need clustering? — demonstrates more depth than naming every technology upfront.