System Design: Web Analytics Pipeline — Building Google Analytics from Scratch

Series System Design: Web Scenarios

ClickHouse was open-
sourced by Yandex in
2016. It was built to
power Yandex.Metrica,
which processes 25+
billion events per day.
Now used by Cloudflare,
Uber, and thousands
more.

Design a web analytics system like Google Analytics. Website owners embed a tracking script. Every page view, click, and event is collected. The dashboard shows real-time active users, page views, sessions, bounce rate, top pages, and geographic breakdown. Scale: 10 billion events per day across all customers.

The question: Design a web analytics platform like Google Analytics. Website owners embed a JS snippet. Every event is collected, stored, and queryable in near-real-time. Support 10 billion events per day.


1. Scale & Constraints

Nail the numbers before proposing any architecture. Ten billion events per day sounds large in the abstract — let’s make it concrete.

10B
Events / day
~115K
Events / sec (avg)
~500K
Events / sec (peak)
5 TB
Ingested / day
3.6 PB
Raw storage (2 yr)
<1 s
Dashboard query SLA

Three key insights jump out:

  1. Write throughput dominates. 115K events/sec average, 500K peak. No SQL database handles this naively.
  2. Read patterns are analytical. Dashboard queries are GROUP BY, COUNT, time-range scans — the opposite of OLTP point lookups.
  3. Retention is long. 2 years of raw events, but lifetime of pre-aggregated rollups. Storage is the dominant cost.

2. The Tracking Snippet

Before designing servers, understand how data enters the system. Every analytics product starts with a few lines of JavaScript embedded on the customer’s website.

html
<!-- Embed on every page -->
<script src="//analytics.example.com/track.js"></script>

The loaded script collects context and fires a beacon on every meaningful event:

javascript
// track.js — simplified
var Analytics = (function() {
  var SITE_ID = getScriptParam('site');
  var SESSION_ID = getOrCreateSession();
  var ENDPOINT = '//analytics.example.com/collect';

  function buildPayload(eventType, extra) {
    return JSON.stringify({
      siteId:    SITE_ID,
      sessionId: SESSION_ID,
      pageUrl:   location.href,
      referrer:  document.referrer,
      userAgent: navigator.userAgent,
      screenRes: screen.width + 'x' + screen.height,
      timestamp: Date.now(),
      eventType: eventType,
      extra:     extra || {}
    });
  }

  function send(eventType, extra) {
    // sendBeacon: fire-and-forget, survives page unload
    if (navigator.sendBeacon) {
      navigator.sendBeacon(ENDPOINT, buildPayload(eventType, extra));
    } else {
      // Fallback: synchronous XHR (blocks page close briefly)
      var xhr = new XMLHttpRequest();
      xhr.open('POST', ENDPOINT, false);
      xhr.send(buildPayload(eventType, extra));
    }
  }

  // Fire pageview on load
  send('pageview');

  // Track single-page app navigation
  var _pushState = history.pushState;
  history.pushState = function() {
    _pushState.apply(this, arguments);
    send('pageview');
  };

  return { track: function(name, data) { send(name, data); } };
})();

Why sendBeacon and not fetch? sendBeacon is designed for analytics: it queues the request in the browser’s network stack so it completes even when the user navigates away or closes the tab. fetch without keepalive: true is cancelled on page unload, silently dropping the event.

Why not <img src="?data=..."> (the classic tracking pixel)? Limited to GET requests, URL length cap (~2 KB), no JSON body. Fine for ad impression tracking, wrong for rich event payloads.

Interview signal: Mentioning sendBeacon vs fetch vs image pixel and knowing why each exists immediately differentiates a systems-thinking candidate from someone who just described "an HTTP POST."

Interactive: Tracking Snippet Demo

▶ Mock website — watch events fire
mock website
My Awesome Blog
Welcome! Click around to generate analytics events.
event log (collector receives)
Events will appear here...
Events fired: 0 Session: s_demo1

3. Level 1 — Naive: Direct SQL Writes

The instinct: every beacon fires an INSERT into an events table.

sql
-- The naive schema
CREATE TABLE events (
  id         BIGSERIAL PRIMARY KEY,
  site_id    TEXT,
  session_id TEXT,
  page_url   TEXT,
  event_type TEXT,
  user_agent TEXT,
  ip_address TEXT,
  ts         TIMESTAMPTZ
);

-- One INSERT per event: 115,000/sec average
INSERT INTO events (site_id, session_id, page_url, event_type, ip_address, ts)
VALUES ('site_abc', 'sess_xyz', '/home', 'pageview', '203.0.113.5', NOW());

Why this fails immediately:

  • PostgreSQL handles ~5–10K simple inserts/sec on a single node. You need 115K average.
  • OLTP databases are row-oriented — every analytical query (GROUP BY page_url, COUNT(*) WHERE site_id = ?) does a full sequential scan.
  • ACID overhead (WAL, MVCC) wastes I/O on write-once, never-update data.
  • At 5 TB/day, storage costs on a SQL instance are punishing.
Red flag in an interview: Jumping straight to "shard the SQL database." Sharding adds complexity without fixing the fundamental mismatch between OLTP row storage and analytical read patterns.

4. Level 2 — Write Buffer + Batch Insert

Buffer events in memory (or Redis), flush every 5 seconds.

python
# Collector service — buffered writer
import asyncio, collections

buffer = collections.deque()
BATCH_SIZE = 5000
FLUSH_INTERVAL = 5  # seconds

async def receive_event(event):
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        await flush()

async def flush():
    batch = [buffer.popleft() for _ in range(min(BATCH_SIZE, len(buffer)))]
    if batch:
        await db.executemany("INSERT INTO events ...", batch)

async def periodic_flush():
    while True:
        await asyncio.sleep(FLUSH_INTERVAL)
        await flush()

Improvement: Reduces write amplification from 115K individual INSERTs/sec to ~1 batch per 5 seconds. Batch inserts are 10–50× cheaper per row.

Still broken: The fundamental read-performance problem remains. SQL is still scanning billions of rows for every dashboard query. A COUNT(*) over 30 days of data requires reading every row in that time range. And you lose all events in the buffer if the collector crashes.


5. Level 3 — Kafka + ClickHouse (The Right Architecture)

This is where the real design begins. Two insights drive the architecture:

  1. Decouple ingestion from storage using a durable message queue (Kafka). Collectors write fast; consumers write at their own pace.
  2. Use a columnar OLAP database (ClickHouse) instead of a row-oriented SQL database.

The Pipeline

▶ Animated pipeline — click "Fire Events" to watch data flow
JS Snippet
Collector
50 servers
Kafka
raw-events
ClickHouse
raw + rollups
Redis
real-time
Kafka lag
0 msg
ClickHouse rows
0
Redis HLL keys
0

Why Kafka?

  • Durability: Events are persisted to disk. Collector crashes don’t lose data.
  • Backpressure isolation: If ClickHouse is slow, Kafka absorbs the burst. Collectors never block.
  • Replay: Re-process 2 years of events by rewinding Kafka offset — no separate archival system needed.
  • Fan-out: Multiple consumers (ClickHouse writer, session stitcher, alerting) read the same topic independently.

Why ClickHouse?

Row-oriented (PostgreSQL)

Each row is stored together. Great for retrieving a single record. Terrible for SELECT page, COUNT(*) — must read every column of every row even if you only need two columns.

Columnar (ClickHouse)

Each column is stored in its own file. Analytical query reads only the columns it needs. All values of a column are the same type → compress 10:1 or better. Vectorized SIMD execution over compressed blocks.

sql
-- ClickHouse events table
CREATE TABLE events (
  site_id    LowCardinality(String),   -- dictionary-encoded
  session_id String,
  page_url   String,
  referrer   String,
  event_type LowCardinality(String),
  country    LowCardinality(FixedString(2)),
  device     LowCardinality(String),
  browser    LowCardinality(String),
  ts         DateTime
) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(ts)      -- one partition per month
  ORDER BY   (site_id, ts)        -- primary sort key
  SETTINGS   index_granularity = 8192;

-- This query runs in <1 second over 1 billion rows:
SELECT page_url, COUNT(*) AS views
FROM   events
WHERE  site_id = 'site_abc'
  AND  ts >= now() - INTERVAL 30 DAY
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;

The MergeTree engine in ClickHouse sorts data by the ORDER BY key and stores a sparse index. Queries with WHERE site_id = ? skip entire parts that don’t match, then scan only the page_url and ts columns — no others loaded.


6. Level 4 — Real-Time vs. Batch Aggregation

The dashboard has two fundamentally different query types with different freshness requirements.

QueryFreshnessData SourceLatency
Active users right now< 5 secondsRedis HyperLogLog~1 ms
Page views in last hour~30 secondsClickHouse real-time~50 ms
Top pages last 7 days~5 minutesClickHouse raw table~200 ms
Monthly report last year~1 hourClickHouse daily rollup~80 ms

Real-Time: Redis HyperLogLog

“How many users are on the site RIGHT NOW?” must update in seconds and cannot query raw events.

python
# Flink stream processor: update Redis on every event
def process_event(event):
    site   = event['siteId']
    user   = event['sessionId']  # proxy for unique user

    # HyperLogLog: ~1% error, O(1) space per key (~12 KB)
    redis.pfadd('active:' + site + ':5min', user)
    redis.expire('active:' + site + ':5min', 300)   # 5-minute window

    # Dashboard reads: PFCOUNT active:site_abc:5min => ~147

Historical: Pre-Aggregated Rollups

For longer time ranges, pre-aggregate at write time using a Materialized View in ClickHouse:

sql
-- Materialized view auto-aggregates as data is written
CREATE MATERIALIZED VIEW daily_page_views
ENGINE = SummingMergeTree()
ORDER BY (site_id, page_url, day) AS
SELECT
  site_id,
  page_url,
  toDate(ts)              AS day,
  count()                 AS views,
  uniqHLL12(session_id)  AS unique_sessions
FROM events
GROUP BY site_id, page_url, day;

SummingMergeTree automatically merges rows with the same key, summing the numeric columns. The 30-day top-pages query now reads a tiny rollup table instead of raw events.

Lambda vs. Kappa Architecture

Lambda Architecture

Two parallel paths: a speed layer (stream processing, less accurate) and a batch layer (MapReduce/Spark, accurate). A query layer merges both. Complex to maintain: two codebases must agree on the same business logic.

Kappa Architecture ✓

Single stream (Kafka) as the source of truth. To reprocess history, replay from beginning with new code. Simpler: one codebase, one pipeline. ClickHouse materialized views replace the batch layer entirely.


7. Level 5 — Session Stitching

Raw events are individual beacons. The dashboard needs sessions — grouped sequences of events from the same user within a 30-minute inactivity window.

Google Analytics 4 (GA4)
moved to an event-based
model in 2023, replacing
the session-based Universal
Analytics. The backend
runs on Google BigQuery
— customers can query
their own raw data via
BigQuery export.

A session ends when 30 minutes pass with no new events from that user. The Flink stream processor uses a session window — a gap-based window that closes after 30 minutes of silence.

java (flink pseudocode)
// Flink: sessionize events per user
events
  .keyBy(e -> e.siteId + ":" + e.sessionId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
  .aggregate(new SessionAggregator())

// SessionAggregator computes per session:
//   - duration (last_ts - first_ts)
//   - page_count
//   - entry_page (first event url)
//   - exit_page (last event url)
//   - is_bounce (page_count == 1)

Session Stitching Visualizer

▶ Events stream in — watch sessions form (30-min gap = new session)

8. Level 6 — Dashboard Design

The dashboard is the customer-facing product. It must feel instantaneous even when querying months of data.

The “bounce rate” metric
is more nuanced than it
appears. A bounce is a
single-page session — but
a user who reads your
entire blog post and
leaves is also a “bounce.”
GA4 replaced it with
“engagement rate.”

📊 Analytics Dashboard — myblog.com
Live:  142 users
284,120
Page Views
▲ +12%
41,830
Unique Visitors
▲ +8%
58,200
Sessions
▲ +5%
2m 14s
Avg Duration
▼ -3%
62%
Bounce Rate
▼ +4%
Top Pages
Top Countries

9. Level 7 — Privacy & GDPR

Analytics collects user behavior data. This triggers legal obligations in many jurisdictions.

IP AnonymizationZero the last octet before storing: 203.0.113.47203.0.113.0. You retain geo precision (city-level) without storing a unique identifier. Required in Germany, recommended everywhere.
Cookie ConsentOnly set _ga session cookie after explicit opt-in. Without consent: use fingerprinting-free session IDs derived from anonymized IP + User-Agent hash, scoped to a single request window.
Right to ErasureClickHouse makes DELETE expensive — it requires rewriting entire data parts. Use a partition-level trick: store userId in its own partition column. Erasure = drop partition for that user. Requires partition design at schema time.
First-Party vs. Third-PartyChrome blocks 3rd-party cookies. Modern analytics uses first-party collection: the tracking endpoint is served from the same domain as the customer site (e.g. analytics.yourdomain.com via CNAME), so cookies are first-party.
python
import ipaddress

def anonymize_ip(raw_ip: str) -> str:
    try:
        addr = ipaddress.ip_address(raw_ip)
        if addr.version == 4:
            parts = raw_ip.split('.')
            return '.'.join(parts[:3]) + '.0'     # zero last octet
        else:
            # IPv6: zero last 80 bits
            network = ipaddress.ip_network(raw_ip + '/48', strict=False)
            return str(network.network_address)
    except ValueError:
        return '0.0.0.0'

10. Capacity Estimate

MetricNumber
Events/sec average115,000
Events/sec peak500,000
Kafka throughput~57 MB/sec (500 bytes × 115K)
Kafka partitions needed~600 (at 100 MB/sec/partition cap)
ClickHouse storage raw (2 yr)~3.6 PB raw, ~360 TB compressed (10:1)
ClickHouse rollup storage~5 TB (daily aggregates per site)
Redis real-time footprint~10 GB (HyperLogLog per site × 5-min windows)
Collector servers (10K events/sec each)~50 avg, ~100 peak
Flink task managers~20 (session windowing is CPU-heavy)
ClickHouse cluster3 shards × 2 replicas = 6 nodes, 60 TB SSD each

The dominant cost is ClickHouse storage. At 10:1 compression, 2 years of raw events compresses to ~360 TB. At $0.02/GB-month (hot NVMe), that’s ~$7,200/month just for raw event storage — before rollups, compute, Kafka, and Redis.


11. Component Deep Dive: The Collector Service

The collector is the front door — it receives 115K events/sec, must be stateless and horizontally scalable.

python (fastapi)
from fastapi import FastAPI, Request, Response
from aiokafka import AIOKafkaProducer
import orjson, ua_parser, geoip2.database

app      = FastAPI()
producer = AIOKafkaProducer(bootstrap_servers='kafka:9092')
geo_db   = geoip2.database.Reader('/data/GeoLite2-City.mmdb')

@app.post('/collect')
async def collect(request: Request):
    raw  = await request.body()
    data = orjson.loads(raw)

    # Enrich: IP -> country, UA -> device/browser
    ip      = request.client.host
    country = 'XX'
    try:
        rec     = geo_db.city(ip)
        country = rec.country.iso_code or 'XX'
    except:
        pass

    ua_str  = request.headers.get('user-agent', '')
    ua_info = ua_parser.parse(ua_str)

    data['country']  = country
    data['device']   = ua_info['device']['family']
    data['browser']  = ua_info['user_agent']['family']
    data['ip_anon']  = anonymize_ip(ip)

    await producer.send(
        'raw-events',
        key=data['siteId'].encode(),   # partition by siteId
        value=orjson.dumps(data)
    )
    return Response(content=b'', status_code=204)  # No Content

Key decisions:

  • Partition Kafka by siteId: All events for a single customer land on the same partition. Session stitching in Flink is keyed by siteId:sessionId — no cross-partition shuffles.
  • 204 No Content response: The tracker doesn’t care about the response body. Returning nothing saves ~500 bytes × 115K/sec = 57 MB/sec of egress.
  • IP enrichment at ingest: Do geo-lookup and UA parsing once at collection time, store the result. Never re-parse 10 billion events at query time.

12. Architecture Summary

The complete architecture in one sentence: A stateless Collector fleet enriches events and writes to Kafka; Flink consumers sessionize the stream and write to ClickHouse (raw) and Redis (real-time HyperLogLog); a Query API serves the dashboard from ClickHouse materialized views for historical queries and Redis for live counters.
architecture
Browser / Mobile App
  |
  |  sendBeacon POST /collect
  v
Collector Service          x50 stateless pods
  |  validate, enrich (IP→geo, UA→device)
  |  partition key = siteId
  v
Kafka  raw-events topic    600 partitions, 7-day retention
  |
  +-------+----------+
  |               |
  v               v
Flink           Flink
Session         Realtime
Stitcher        Aggregator
  |               |
  v               v
ClickHouse     Redis
events table   HyperLogLog
+ MV rollups   active:site:5min
  |               |
  +-------+--------+
          |
          v
    Query API
    Dashboard UI

Trade-offs Worth Discussing in an Interview

DecisionChosenAlternativeWhy chosen
Stream brokerKafkaKinesis, PulsarSelf-hosted, unlimited retention, strong ordering guarantees
OLAP storeClickHouseBigQuery, Druid, PinotSub-second queries, efficient compression, on-premise option
Real-time counterRedis HyperLogLogExact count in ClickHouseO(1) space, ~1% error acceptable, millisecond reads
Stream processorFlinkSpark Streaming, StormTrue streaming (not micro-batch), session windows native
Aggregation strategyKappa (single stream)Lambda (batch + speed)Simpler operations, reprocess by replaying Kafka

Design a system like Google Analytics, and you touch nearly every major systems concept: high-throughput ingestion, durable messaging, columnar storage, streaming computation, approximate algorithms, GDPR compliance, and real-time dashboarding. The key insight is that analytics data is write-once, read-analytically — and every architectural decision should follow from that observation.