System Design: Ad Click Tracking — High-Throughput Event Ingestion and Fraud Detection

The interview prompt is deceptively simple: “Design Google’s ad click tracking system.” Underneath lies one of the most financially critical data pipelines in tech — 10 billion clicks per day, each click potentially representing billable revenue. Miss a click? An advertiser underpays. Double-count a click? An advertiser overpays and may dispute the charge. Let a bot click through? $120 billion per year in fraud globally.

This post walks through the full system: ingestion, deduplication, fraud detection, aggregation, and billing. Each piece involves real engineering tradeoffs with no obviously correct answer.


1. Scale and requirements

Start with numbers. “10 billion clicks/day” needs to be translated into engineering constraints before you can design anything.

| Requirement | Value |
|---|---|
| Clicks per day | 10,000,000,000 |
| Average clicks/sec | ~115,000 |
| Peak clicks/sec (5× average) | ~575,000 |
| Click payload size | ~500 bytes |
| Click-to-redirect latency | < 200ms (user-visible) |
| Data durability | Zero loss (each click is billable) |
| Deduplication window | 5 minutes per (userId, adId) pair |
| Report freshness | Real-time (<1 min delay) + full history |
| Retention | 7 years (legal and billing audit) |

The redirect latency requirement is the hardest constraint. A user clicked an ad and is waiting to land on advertiser.com. The tracking infrastructure sits in their critical path. Any slowdown they notice degrades the perceived quality of search results — Google’s core product.

115,000 clicks/sec average means the system must be horizontally scalable with no single hot shard. At 500 bytes each, that is 57.5 MB/sec of click data, ~5 TB/day raw, and ~1.8 PB/year raw before compression.
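
The arithmetic above can be reproduced with a few lines of Go. This is a minimal sketch: the constants come straight from the requirements table, while the `Estimate` type and `Derive` function are names invented for illustration. The exact average is about 115,700 clicks/sec, which the post rounds down to 115,000.

```go
package main

// Inputs taken from the requirements table.
const (
	clicksPerDay   = 10_000_000_000
	payloadBytes   = 500
	peakMultiplier = 5
	secondsPerDay  = 86_400
)

// Estimate holds the derived engineering constraints (hypothetical type).
type Estimate struct {
	AvgClicksPerSec  float64
	PeakClicksPerSec float64
	IngestMBPerSec   float64 // decimal megabytes per second
	RawTBPerDay      float64
	RawPBPerYear     float64
}

// Derive turns the daily click volume into rates and storage figures.
func Derive() Estimate {
	avg := float64(clicksPerDay) / secondsPerDay
	dailyBytes := float64(clicksPerDay) * payloadBytes
	return Estimate{
		AvgClicksPerSec:  avg,
		PeakClicksPerSec: avg * peakMultiplier,
		IngestMBPerSec:   avg * payloadBytes / 1e6,
		RawTBPerDay:      dailyBytes / 1e12,
		RawPBPerYear:     dailyBytes * 365 / 1e15,
	}
}
```

Keeping the inputs as named constants makes it easy to rerun the estimate when the interviewer changes the payload size or the peak multiplier.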


2. The click redirect flow

When a user clicks an ad, the browser follows a URL like:

-- The rendered anchor in a Google Search result
https://ads.google.com/click?adId=A123&pub=P456&redirect=https%3A%2F%2Fadvertiser.com%2Fproduct

The user never sees advertiser.com until after the redirect lands. This gives the tracking service a window to record the click. Three implementation options:

Option A — Synchronous (simplest, adds latency): Record the click in a database, then return 302 Found. Every click touches storage before the user navigates. Adds 50–200ms. Unacceptable at 115k clicks/sec.

Option B — Fire-and-forget beacon (fastest, some loss): Return the redirect immediately. The browser fires navigator.sendBeacon() in the background. Risk: beacons are dropped on fast page unload, network failure, or certain mobile browsers. Loss rate 0.1–1%. For a billing system, this is a financial liability.

Option C — Hybrid async (production choice): Click Collector receives the request, immediately publishes to Kafka (acks=1, <5ms), then returns 302. Kafka provides durability. Background processors handle everything else. Users see ~20–40ms overhead — imperceptible.

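// Click Collector request handler (Go pseudocode)
// Assumed helpers, not defined here: extractUserId, isAllowedRedirect, geoIP, kafkaProducer
func HandleClick(w http.ResponseWriter, r *http.Request) {
    adId     := r.URL.Query().Get("adId")
    redirect := r.URL.Query().Get("redirect")

    // Reject unknown targets so the endpoint cannot be abused as an open redirect
    if adId == "" || !isAllowedRedirect(adId, redirect) {
        http.Error(w, "invalid click", http.StatusBadRequest)
        return
    }

    // r.RemoteAddr is "ip:port"; behind a load balancer, use the forwarded client IP instead
    ip, _, err := net.SplitHostPort(r.RemoteAddr)
    if err != nil {
        ip = r.RemoteAddr
    }

    click := ClickEvent{
        ID:         uuid.New().String(),   // idempotency key
        AdID:       adId,
        UserID:     extractUserId(r),
        IP:         ip,
        UserAgent:  r.Header.Get("User-Agent"),
        Referrer:   r.Referer(),
        Timestamp:  time.Now().UTC(),
        GeoCountry: geoIP.Lookup(ip),
    }

    // Non-blocking: enqueue for Kafka and do not wait for any consumer.
    // A delivery callback (not shown) should spool failed sends for retry,
    // because a click that never reaches Kafka is lost revenue.
    // Note: acks=1 confirms only the leader write; a strict zero-loss target
    // would use acks=all with min.insync.replicas >= 2.
    kafkaProducer.ProduceAsync(click)

    // Immediately redirect the user; downstream processing is off the critical path
    http.Redirect(w, r, redirect, http.StatusFound)
}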

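The idempotency UUID is generated at collection time and protects against duplicate processing downstream of the collector. If the browser itself re-sends the request (back/forward navigation, network retry), a new UUID is issued, so those duplicates are caught instead by downstream deduplication on (userId, adId) within the 5-minute window.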


3. The ingestion pipeline

The full pipeline from click to storage has five stages. Each is independently scalable. Kafka decouples producers from consumers — if the fraud detector is slow, it simply builds consumer lag without blocking the user's redirect or the Click Collector.

[Pipeline diagram: User Click → Click Collector → Kafka (ad-clicks topic) → Stream Processor → ClickHouse + Redis. The original page also displayed live counters for throughput, Kafka consumer lag, and fraud score.]

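Kafka is partitioned by adId so all clicks for a given ad go to the same partition. This preserves per-ad ordering and makes deduplication and per-ad aggregation efficient. With 100 partitions at 115k clicks/sec average, each partition handles ~1,150 clicks/sec, or roughly 0.6 MB/sec at 500 bytes per click, which is far below what a single Kafka partition can sustain. One caveat: keying on adId alone concentrates a very popular ad on one partition, so hot ads may need a composite key (for example adId plus a hash bucket of userId) to keep partitions balanced while still routing each (userId, adId) pair to one place.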
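
Key-based partitioning can be sketched in a few lines. This is an illustration under stated assumptions, not Kafka's actual code: `PartitionFor` and `PerPartitionLoad` are hypothetical names, and FNV-1a stands in for the murmur2 hash that Kafka's built-in partitioner applies to the serialized key. The property that matters is the same: a given adId always maps to the same partition.

```go
package main

import "hash/fnv"

// PartitionFor maps an ad ID onto one of numPartitions partitions.
// FNV-1a is used purely for illustration; Kafka's built-in partitioner
// hashes the serialized key with murmur2.
func PartitionFor(adID string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(adID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numPartitions
}

// PerPartitionLoad returns clicks/sec and MB/sec per partition,
// assuming clicks are spread evenly across partitions.
func PerPartitionLoad(clicksPerSec float64, payloadBytes, numPartitions int) (float64, float64) {
	perPartition := clicksPerSec / float64(numPartitions)
	return perPartition, perPartition * float64(payloadBytes) / 1e6
}
```

The even-spread assumption in `PerPartitionLoad` is the optimistic case; real ad traffic is skewed, so per-partition monitoring matters more than the average.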

The Stream Processor (Kafka Streams or Apache Flink) handles three tasks in a single pipeline pass:

  1. Deduplication: check Redis SET NX, mark duplicates before writing to ClickHouse
  2. Fraud scoring: apply rule-based signals and ML model score; invalidate fraudulent clicks
  3. Aggregation: increment Redis real-time counters; batch-write raw events to ClickHouse

4. Deduplication

The business rule: the same user clicking the same ad twice within 5 minutes counts as one billable click. This is a set-membership problem with a TTL.

Redis SET NX with expiry:

-- Redis command — atomic, no race condition possible
SET "click:user-8821:adA-456" "1" NX EX 300

-- Returns OK  → first click in the 5-minute window → billable
-- Returns nil → key already existed → duplicate → not billable

NX means “set only if Not eXists” and is atomic. No race condition between two concurrent processors checking the same (userId, adId) pair.

Memory calculation: At 115,000 clicks/sec with a 300-second window, the maximum number of live keys is 115,000 × 300 = 34.5 million (assuming every click creates a new key). At ~50 bytes per key (prefix + userId + adId), that is approximately 1.7 GB of Redis RAM. Redis adds its own per-key overhead on top, so treat this as a lower bound. With Redis Cluster across three shards, each shard holds ~600 MB, which is trivial.

The dedup key is intentionally scoped to (userId, adId), not (IP, adId). A household with five people can each legitimately click the same ad. IP-based dedup would produce false positives for shared networks (offices, universities, mobile carrier NAT).

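// Stream processor dedup logic (Go pseudocode)
func processClick(click ClickEvent) error {
    key := "click:" + click.UserID + ":" + click.AdID

    // SET key "1" NX EX 300: isNew is true only for the first click in the window
    isNew, err := redis.SetNX(key, "1", 5*time.Minute)
    if err != nil {
        // Do not guess billability when Redis is unreachable; let the caller retry
        return err
    }

    if isNew {
        click.Billable    = true
    } else {
        click.Billable    = false
        click.DedupReason = "window_duplicate"
    }

    writeToClickHouse(click)  // always write, both billable and deduped
    return nil
}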

Non-billable clicks are still persisted to ClickHouse. Advertisers can audit total clicks vs. unique billable clicks — that transparency is important for trust.

Handling Redis failure: Redis can lose data on restart without AOF persistence. The tradeoff: if a dedup key is lost, a duplicate gets billed (advertiser slightly overpays). The pragmatic choice is appendfsync everysec — accept up to 1 second of dedup data loss over adding 1–2ms disk latency per click. Nightly reconciliation catches and credits these rare discrepancies.


5. Fraud detection

Click fraud costs advertisers over $120 billion per year (2023, Juniper Research). Google's Invalid Click Detection is among the most sophisticated fraud detection systems ever built: they claim to catch over 99% of fraudulent clicks before billing advertisers. The remaining fraction is refunded when advertisers file an Invalid Activity report in Google Ads.

Click fraud is an arms race: bot operators continuously evolve to evade detection. Five categories of rule-based signals are applied in real time, and detection sensitivity is tunable: more aggressive settings raise the fraud catch rate.

[Interactive demo omitted: a live fraud detector that logged each click's time, IP address, user ID, ad ID, status, and the rule triggered, with running totals of clicks and blocked clicks.]

Beyond rule-based detection, production systems add a gradient-boosted ML model that scores each click 0–100 for fraud probability. Features include: device fingerprint entropy, mouse movement patterns before the click, session behavior (did the user scroll? hover over the ad?), historical conversion rate for the publisher domain, and network topology signals (VPN exit nodes, datacenter IP ranges, Tor exits).

The ML model runs asynchronously — clicks are initially accepted, then retroactively invalidated within a 2-hour window if the model flags them. This prevents blocking the user redirect on ML inference latency (which may be 20–100ms on a complex model). Advertisers are credited for any clicks invalidated during the grace window.


6. Aggregation and reporting

Advertisers query a dashboard: “Show me clicks, impressions, CTR, and spend for campaign C-789 for the last 30 days, broken down by keyword and device type.”

This is an OLAP workload. ClickHouse is purpose-built for it: columnar storage, vectorized query execution, and excellent compression on repetitive values (adId, campaignId appear billions of times and compress 10:1+).

ClickHouse schema:

CREATE TABLE ad_clicks (
    click_id       UUID,
    ad_id          String,
    campaign_id    String,
    advertiser_id  String,
    publisher_id   String,
    user_id        String,
    ip_address     String,
    geo_country    FixedString(2),
    device_type    Enum8('desktop'=1, 'mobile'=2, 'tablet'=3),
    keyword        String,
    timestamp      DateTime,
    is_billable    UInt8,
    fraud_score    Float32,
    cpc            Decimal(10, 6)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY    (advertiser_id, campaign_id, timestamp)
TTL         timestamp + INTERVAL 7 YEAR;

Advertiser 30-day report query:

SELECT
    toDate(timestamp)              AS day,
    keyword,
    device_type,
    count(*)                       AS total_clicks,
    countIf(is_billable = 1)        AS billable_clicks,
    sumIf(cpc, is_billable = 1)     AS spend,
    avg(fraud_score)               AS avg_fraud_score
FROM  ad_clicks
WHERE
    advertiser_id = 'ADV-12345'
    AND campaign_id = 'C-789'
    AND timestamp >= '2026-05-20' AND timestamp < '2026-06-19'  -- 30 full days, half-open range
GROUP BY day, keyword, device_type
ORDER BY day DESC, spend DESC;

ClickHouse executes this over 30 days of data with partition pruning on toYYYYMM(timestamp) and primary-key skipping on (advertiser_id, campaign_id). Expected latency: under 500ms even at full scale.

Real-time dashboard (last 60 minutes):

Redis counters give sub-second freshness. The stream processor increments atomically:

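-- Pipelined commands for each billable click (each INCR is atomic on its own;
-- a pipeline only batches round trips, so wrap in MULTI/EXEC if the group must be atomic)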
INCR        "stats:adA123:clicks:2026061914"    -- hourly bucket key
INCR        "stats:adA123:billable:2026061914"
INCRBYFLOAT "stats:adA123:spend:2026061914"  0.35
EXPIRE      "stats:adA123:clicks:2026061914"  7200
EXPIRE      "stats:adA123:billable:2026061914" 7200
EXPIRE      "stats:adA123:spend:2026061914"   7200

The dashboard API reads Redis keys for the current and previous hour (real-time view) and queries ClickHouse for anything older. This gives advertisers a unified experience where recent data feels live.


7. Exactly-once processing challenge

The most critical correctness requirement: each click must be billed exactly once — not zero times (lost revenue), not twice (advertiser overpay, chargeback risk).

The failure modes:

| Scenario | What happens | Solution |
|---|---|---|
| Processor crashes mid-batch | Kafka offset not committed; batch reprocessed | Idempotency key unique constraint in DB |
| Network timeout writing to ClickHouse | Retry produces duplicate row | click_id UUID dedup key |
| Kafka message delivered twice | Consumer processes same message twice | UUID idempotency + Redis NX |
| Click Collector crashes after Kafka ack | Message in Kafka, collector never confirms | Kafka durability guarantees delivery |
| Redis dedup key expires before window ends | Duplicate click billed in same window | Accept; reconcile in nightly batch |

Idempotent writes to ClickHouse using ReplacingMergeTree:

-- ReplacingMergeTree deduplicates rows with same ORDER BY key in background
CREATE TABLE ad_clicks_deduped (
    click_id    UUID,
    -- ... other columns ...
    version     DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(version)
ORDER BY click_id;

-- FINAL forces merge at query time — use for billing queries only
SELECT count(*) FROM ad_clicks_deduped FINAL
WHERE advertiser_id = 'ADV-12345'
  AND is_billable = 1;

FINAL forces ClickHouse to merge duplicate rows at query time, guaranteeing exactly-once semantics for billing reads. It is slower than a regular scan, so use it only for billing — dashboards can skip it and accept minor inaccuracy.

The at-least-once contract: Production systems accept at-least-once delivery from Kafka and enforce exactly-once semantics at the write layer via idempotency keys. Kafka’s native exactly-once transactions exist but add ~30% latency overhead and significant operational complexity. That cost is not worth paying when the write layer can dedup via a unique constraint or ReplacingMergeTree.
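
The combination of at-least-once delivery and an idempotent write can be illustrated with an in-memory ledger keyed by click ID. This is a sketch only: `BillingLedger` is a hypothetical stand-in for the write layer, amounts are in micro-cents, and it is not safe for concurrent use.

```go
package main

// BillingLedger is a hypothetical idempotent sink keyed by click ID.
type BillingLedger struct {
	seen  map[string]struct{}
	total int64 // billed amount in micro-cents
}

func NewBillingLedger() *BillingLedger {
	return &BillingLedger{seen: make(map[string]struct{})}
}

// Apply bills a click at most once. It returns true only when this call
// changed the ledger, so redelivered messages are harmless.
func (l *BillingLedger) Apply(clickID string, cpcMicroCents int64) bool {
	if _, dup := l.seen[clickID]; dup {
		return false
	}
	l.seen[clickID] = struct{}{}
	l.total += cpcMicroCents
	return true
}

// ApplyBatch applies a batch and returns how many clicks were newly billed.
func (l *BillingLedger) ApplyBatch(clickIDs []string, cpcMicroCents int64) int {
	applied := 0
	for _, id := range clickIDs {
		if l.Apply(id, cpcMicroCents) {
			applied++
		}
	}
	return applied
}

// Total returns the billed amount in micro-cents.
func (l *BillingLedger) Total() int64 { return l.total }
```

Replaying a partially processed batch after a crash bills only the clicks that had not been applied yet, which is the behavior the failure-mode table relies on.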


8. Billing pipeline

Every verified click — non-fraud, non-deduplicated — triggers a budget decrement. Advertisers set a daily budget; when it is exhausted, their ads stop showing immediately.

// Budget decrement: atomic Redis operation
// Budget stored as micro-cents to avoid float precision issues
remaining, err := redis.DecrBy("budget:ADV-12345:daily", cpcMicroCents)
if err != nil {
    // Surface the error; never treat a failed decrement as budget exhaustion
    return err
}

if remaining <= 0 {
    // Publish to the ad-paused topic; Ad Servers stop bidding within ~100ms
    kafka.Produce("ad-paused", AdPauseEvent{
        AdvertiserID: "ADV-12345",
        Reason:       "daily_budget_exhausted",
        Timestamp:    time.Now().UTC(),
    })
}

The Ad Server subscribes to the ad-paused topic. On receiving a pause event, it removes the advertiser from the active bid pool within ~100ms. This is eventual consistency: a handful of clicks may slip through immediately after budget exhaustion. That is an accepted cost. In this design, advertisers are never charged more than 20% over their daily budget, enforced by a hard cap at budget × 1.2. (The overdelivery limit that Google Ads publishes has changed over time, so treat the 1.2 multiplier as a configurable parameter.)
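
The interaction between the pause trigger and the hard cap can be sketched with an in-memory budget. Assumptions are loud here: `Budget`, `Charge`, and the three outcomes are hypothetical names, a mutex stands in for Redis atomicity, and the cap is passed as a percentage (120 for budget × 1.2) so the arithmetic stays in integers.

```go
package main

import "sync"

// Outcome describes what happened to a single charge attempt.
type Outcome int

const (
	Charged          Outcome = iota // billed, budget still available
	ChargedAndPaused                // billed, and the daily budget is now exhausted
	Rejected                        // not billed: would exceed the hard cap
)

// Budget tracks one advertiser's daily spend in micro-cents (hypothetical type).
type Budget struct {
	mu         sync.Mutex
	daily      int64
	capPercent int64 // hard cap as a percentage of daily, e.g. 120
	spent      int64
}

func NewBudget(daily, capPercent int64) *Budget {
	return &Budget{daily: daily, capPercent: capPercent}
}

// Charge bills one click unless doing so would push spend past the hard cap.
func (b *Budget) Charge(cpc int64) Outcome {
	b.mu.Lock()
	defer b.mu.Unlock()
	hardCap := b.daily * b.capPercent / 100
	if b.spent+cpc > hardCap {
		return Rejected
	}
	b.spent += cpc
	if b.spent >= b.daily {
		return ChargedAndPaused // caller publishes the pause event
	}
	return Charged
}

// Spent returns the amount billed so far.
func (b *Budget) Spent() int64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.spent
}
```

Clicks that arrive in the gap between exhaustion and the pause taking effect are billed only up to the cap; anything beyond it is recorded but not charged.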

Daily budget reset:

-- Midnight UTC cron: reset budget and set 24-hour expiry
SET    "budget:ADV-12345:daily"  50000000000  -- $500.00 = 50,000 cents = 50,000,000,000 micro-cents
EXPIRE "budget:ADV-12345:daily"  86400        -- auto-expire after 24 hours

Reconciliation: a nightly batch job compares the total decrements recorded in Redis against the sum of cpc for is_billable = 1 rows in ClickHouse for the same day. Discrepancies over 0.01% trigger a PagerDuty alert. Discrepancies under 0.01% (expected from crash windows) are credited to the advertiser automatically.
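
The reconciliation rule can be sketched as a pure function over the two daily totals. The 0.01% threshold comes from the text; the `Reconcile` function, the `Verdict` names, and the choice of the ClickHouse total as the denominator are assumptions. A discrepancy of exactly 0.01% is treated as auto-credit here, since the text does not say which side it falls on.

```go
package main

// Verdict is the outcome of a nightly reconciliation run.
type Verdict int

const (
	Match      Verdict = iota // totals agree exactly
	AutoCredit                // small discrepancy, credited automatically
	Alert                     // discrepancy above 0.01%, page a human
)

// Reconcile compares the Redis decrement total against the ClickHouse
// billable total (both in micro-cents) and classifies the difference.
func Reconcile(redisTotal, clickhouseTotal int64) (int64, Verdict) {
	diff := redisTotal - clickhouseTotal
	if diff == 0 {
		return 0, Match
	}
	abs := diff
	if abs < 0 {
		abs = -abs
	}
	// 0.01% is one part in 10,000. Dividing the total (instead of
	// multiplying the difference) avoids int64 overflow on large totals.
	if clickhouseTotal <= 0 || abs > clickhouseTotal/10000 {
		return diff, Alert
	}
	return diff, AutoCredit
}
```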


9. Capacity estimate

| Component | Specification | Reasoning |
|---|---|---|
| Click Collector servers | 50 × 2,300 clicks/sec | ~1.15 MB/s each at average load (57.5 MB/s in total); limited by Kafka produce latency |
| Kafka partitions | 100 | 1,150 clicks/partition/sec, well below Kafka limits |
| Kafka retention | 7 days | Full replay window for disaster recovery |
| Kafka brokers | 15 | Replication factor 3; ~7 partition leaders (~20 replicas) per broker |
| Redis dedup cluster | 3 shards × 2 GB RAM | 34.5M keys × ~50 bytes = ~1.7 GB + 20% headroom |
| Redis budget cluster | 1 shard | Small dataset: one entry per advertiser |
| Stream Processor nodes | 30 | 100 partitions ÷ 30 nodes ≈ 3 partitions per node |
| ClickHouse cluster | 10 shards × 3 replicas | ~5 TB/day raw; ~500 GB/day compressed in total (~50 GB/day per shard) |
| ClickHouse storage/year | ~180 TB compressed | 10:1 compression on repetitive ad attribution data |
| Raw storage/year | ~1.8 PB | Before compression; tiered to cold object storage after 90 days |

Back-of-envelope check:

  • 115,000 clicks/sec × 500 bytes = 57.5 MB/sec ingest rate
  • ClickHouse handles up to 1 GB/sec per server with columnar storage
  • 10 shards means ~5.75 MB/sec per shard — enormous headroom, room to grow 100×
  • 30-day query at 10B rows/day: ~500ms with partition pruning, sub-second on warm cache

The bottleneck in most real deployments is ClickHouse disk I/O during background merges on ingest-heavy days. Use separate replica groups for ingest vs. query to prevent merge I/O from spiking query latency.


10. Notes and observations

The three genuinely hard problems in this system are not the ones that sound hard. Kafka partitioning, Redis dedup, ClickHouse schema — these are well-understood patterns with documented solutions. The hard problems are:

  1. Fraud detection at the feature level. Distinguishing a legitimate power user — a real estate agent clicking 50 competitor ads to research pricing — from a bot. Rule-based systems generate too many false positives. ML models require labeled training data, and fraudsters adapt continuously. The feedback loop between fraudster and detector runs faster than any release cycle.

  2. Budget enforcement with 100ms global consistency. An advertiser’s ad must stop showing within seconds of budget exhaustion, across thousands of Ad Server instances deployed in 30+ regions globally. This requires a distributed cache invalidation protocol that is faster than the incoming click rate. Pub/sub (Kafka or Redis) to all Ad Server pods is the production approach — it is not a solved problem when pods number in the thousands.

  3. Seven-year billing auditability. Regulators and enterprise advertisers require click-level audit trails. Every click — with its fraud score, dedup decision, billing record, and the exact version of the fraud detection model that scored it — must be queryable for 7 years. ClickHouse with tiered storage (hot NVMe SSD → warm HDD → cold object storage) handles this. The model versioning is the underappreciated part: you must be able to replay fraud scoring with a historical model version to answer advertiser disputes.

The first documented click fraud case was in 2004 — a website owner was clicking competitors’ ads to drain their Google Ads budgets. Google and competitors had to rebuild their entire billing infrastructure around fraud detection within the first few years of AdWords. This is why Google’s system treats fraud detection as a first-class architectural concern rather than an afterthought bolted on later.

The redirect flow also carries a subtle UX detail: Google’s click tracking URL persists in browser history, so a user navigating back sees ads.google.com/click?... instead of advertiser.com. This is intentional — it lets Google distinguish ad-click return visits from organic return visits, which matters for conversion attribution modelling.

Google processes over 8.5 billion searches per day, each potentially showing 3–5 ads. Even at a 1% click-through rate per ad shown, that is roughly 255–425 million ad clicks per day from Search alone. YouTube, Gmail, and the Display Network add hundreds of millions more. The click tracking infrastructure is, financially speaking, what funds Google’s entire AI research program. The engineering investment in making it 99.99% reliable and fraud-resistant is not optional; it is the business.

The system described here is deliberately not over-engineered. Many companies start with Kafka → PostgreSQL. The jump to ClickHouse, Redis Cluster, and ML fraud scoring is earned as scale demands it. If you are designing this in an interview, identifying the right scaling triggers — when does a single Postgres instance break down? when does Redis dedup need clustering? — demonstrates more depth than naming every technology upfront.