System Design: Web Analytics Pipeline — Building Google Analytics from Scratch
ClickHouse was open-
sourced by Yandex in
2016. It was built to
power Yandex.Metrica,
which processes 25+
billion events per day.
Now used by Cloudflare,
Uber, and thousands
more.
Design a web analytics system like Google Analytics. Website owners embed a tracking script. Every page view, click, and event is collected. The dashboard shows real-time active users, page views, sessions, bounce rate, top pages, and geographic breakdown. Scale: 10 billion events per day across all customers.
The question: Design a web analytics platform like Google Analytics. Website owners embed a JS snippet. Every event is collected, stored, and queryable in near-real-time. Support 10 billion events per day.
1. Scale & Constraints
Nail the numbers before proposing any architecture. Ten billion events per day sounds large in the abstract — let’s make it concrete.
Three key insights jump out:
- Write throughput dominates. 115K events/sec average, 500K peak. No SQL database handles this naively.
- Read patterns are analytical. Dashboard queries are
GROUP BY,COUNT, time-range scans — the opposite of OLTP point lookups. - Retention is long. 2 years of raw events, but lifetime of pre-aggregated rollups. Storage is the dominant cost.
2. The Tracking Snippet
Before designing servers, understand how data enters the system. Every analytics product starts with a few lines of JavaScript embedded on the customer’s website.
<!-- Embed on every page --> <script src="//analytics.example.com/track.js"></script>
The loaded script collects context and fires a beacon on every meaningful event:
// track.js — simplified var Analytics = (function() { var SITE_ID = getScriptParam('site'); var SESSION_ID = getOrCreateSession(); var ENDPOINT = '//analytics.example.com/collect'; function buildPayload(eventType, extra) { return JSON.stringify({ siteId: SITE_ID, sessionId: SESSION_ID, pageUrl: location.href, referrer: document.referrer, userAgent: navigator.userAgent, screenRes: screen.width + 'x' + screen.height, timestamp: Date.now(), eventType: eventType, extra: extra || {} }); } function send(eventType, extra) { // sendBeacon: fire-and-forget, survives page unload if (navigator.sendBeacon) { navigator.sendBeacon(ENDPOINT, buildPayload(eventType, extra)); } else { // Fallback: synchronous XHR (blocks page close briefly) var xhr = new XMLHttpRequest(); xhr.open('POST', ENDPOINT, false); xhr.send(buildPayload(eventType, extra)); } } // Fire pageview on load send('pageview'); // Track single-page app navigation var _pushState = history.pushState; history.pushState = function() { _pushState.apply(this, arguments); send('pageview'); }; return { track: function(name, data) { send(name, data); } }; })();
Why sendBeacon and not fetch? sendBeacon is designed for analytics: it queues the request in the browser’s network stack so it completes even when the user navigates away or closes the tab. fetch without keepalive: true is cancelled on page unload, silently dropping the event.
Why not <img src="?data=..."> (the classic tracking pixel)? Limited to GET requests, URL length cap (~2 KB), no JSON body. Fine for ad impression tracking, wrong for rich event payloads.
sendBeacon vs fetch vs image pixel and knowing why each exists immediately differentiates a systems-thinking candidate from someone who just described "an HTTP POST."
Interactive: Tracking Snippet Demo
3. Level 1 — Naive: Direct SQL Writes
The instinct: every beacon fires an INSERT into an events table.
-- The naive schema CREATE TABLE events ( id BIGSERIAL PRIMARY KEY, site_id TEXT, session_id TEXT, page_url TEXT, event_type TEXT, user_agent TEXT, ip_address TEXT, ts TIMESTAMPTZ ); -- One INSERT per event: 115,000/sec average INSERT INTO events (site_id, session_id, page_url, event_type, ip_address, ts) VALUES ('site_abc', 'sess_xyz', '/home', 'pageview', '203.0.113.5', NOW());
Why this fails immediately:
- PostgreSQL handles ~5–10K simple inserts/sec on a single node. You need 115K average.
- OLTP databases are row-oriented — every analytical query (
GROUP BY page_url,COUNT(*) WHERE site_id = ?) does a full sequential scan. - ACID overhead (WAL, MVCC) wastes I/O on write-once, never-update data.
- At 5 TB/day, storage costs on a SQL instance are punishing.
4. Level 2 — Write Buffer + Batch Insert
Buffer events in memory (or Redis), flush every 5 seconds.
# Collector service — buffered writer import asyncio, collections buffer = collections.deque() BATCH_SIZE = 5000 FLUSH_INTERVAL = 5 # seconds async def receive_event(event): buffer.append(event) if len(buffer) >= BATCH_SIZE: await flush() async def flush(): batch = [buffer.popleft() for _ in range(min(BATCH_SIZE, len(buffer)))] if batch: await db.executemany("INSERT INTO events ...", batch) async def periodic_flush(): while True: await asyncio.sleep(FLUSH_INTERVAL) await flush()
Improvement: Reduces write amplification from 115K individual INSERTs/sec to ~1 batch per 5 seconds. Batch inserts are 10–50× cheaper per row.
Still broken: The fundamental read-performance problem remains. SQL is still scanning billions of rows for every dashboard query. A COUNT(*) over 30 days of data requires reading every row in that time range. And you lose all events in the buffer if the collector crashes.
5. Level 3 — Kafka + ClickHouse (The Right Architecture)
This is where the real design begins. Two insights drive the architecture:
- Decouple ingestion from storage using a durable message queue (Kafka). Collectors write fast; consumers write at their own pace.
- Use a columnar OLAP database (ClickHouse) instead of a row-oriented SQL database.
The Pipeline
50 servers
raw-events
stream proc
raw + rollups
real-time
Why Kafka?
- Durability: Events are persisted to disk. Collector crashes don’t lose data.
- Backpressure isolation: If ClickHouse is slow, Kafka absorbs the burst. Collectors never block.
- Replay: Re-process 2 years of events by rewinding Kafka offset — no separate archival system needed.
- Fan-out: Multiple consumers (ClickHouse writer, session stitcher, alerting) read the same topic independently.
Why ClickHouse?
Row-oriented (PostgreSQL)
Each row is stored together. Great for retrieving a single record. Terrible for SELECT page, COUNT(*) — must read every column of every row even if you only need two columns.
Columnar (ClickHouse)
Each column is stored in its own file. Analytical query reads only the columns it needs. All values of a column are the same type → compress 10:1 or better. Vectorized SIMD execution over compressed blocks.
-- ClickHouse events table CREATE TABLE events ( site_id LowCardinality(String), -- dictionary-encoded session_id String, page_url String, referrer String, event_type LowCardinality(String), country LowCardinality(FixedString(2)), device LowCardinality(String), browser LowCardinality(String), ts DateTime ) ENGINE = MergeTree() PARTITION BY toYYYYMM(ts) -- one partition per month ORDER BY (site_id, ts) -- primary sort key SETTINGS index_granularity = 8192; -- This query runs in <1 second over 1 billion rows: SELECT page_url, COUNT(*) AS views FROM events WHERE site_id = 'site_abc' AND ts >= now() - INTERVAL 30 DAY GROUP BY page_url ORDER BY views DESC LIMIT 10;
The MergeTree engine in ClickHouse sorts data by the ORDER BY key and stores a sparse index. Queries with WHERE site_id = ? skip entire parts that don’t match, then scan only the page_url and ts columns — no others loaded.
6. Level 4 — Real-Time vs. Batch Aggregation
The dashboard has two fundamentally different query types with different freshness requirements.
| Query | Freshness | Data Source | Latency |
|---|---|---|---|
| Active users right now | < 5 seconds | Redis HyperLogLog | ~1 ms |
| Page views in last hour | ~30 seconds | ClickHouse real-time | ~50 ms |
| Top pages last 7 days | ~5 minutes | ClickHouse raw table | ~200 ms |
| Monthly report last year | ~1 hour | ClickHouse daily rollup | ~80 ms |
Real-Time: Redis HyperLogLog
“How many users are on the site RIGHT NOW?” must update in seconds and cannot query raw events.
# Flink stream processor: update Redis on every event def process_event(event): site = event['siteId'] user = event['sessionId'] # proxy for unique user # HyperLogLog: ~1% error, O(1) space per key (~12 KB) redis.pfadd('active:' + site + ':5min', user) redis.expire('active:' + site + ':5min', 300) # 5-minute window # Dashboard reads: PFCOUNT active:site_abc:5min => ~147
Historical: Pre-Aggregated Rollups
For longer time ranges, pre-aggregate at write time using a Materialized View in ClickHouse:
-- Materialized view auto-aggregates as data is written CREATE MATERIALIZED VIEW daily_page_views ENGINE = SummingMergeTree() ORDER BY (site_id, page_url, day) AS SELECT site_id, page_url, toDate(ts) AS day, count() AS views, uniqHLL12(session_id) AS unique_sessions FROM events GROUP BY site_id, page_url, day;
SummingMergeTree automatically merges rows with the same key, summing the numeric columns. The 30-day top-pages query now reads a tiny rollup table instead of raw events.
Lambda vs. Kappa Architecture
Lambda Architecture
Two parallel paths: a speed layer (stream processing, less accurate) and a batch layer (MapReduce/Spark, accurate). A query layer merges both. Complex to maintain: two codebases must agree on the same business logic.
Kappa Architecture ✓
Single stream (Kafka) as the source of truth. To reprocess history, replay from beginning with new code. Simpler: one codebase, one pipeline. ClickHouse materialized views replace the batch layer entirely.
7. Level 5 — Session Stitching
Raw events are individual beacons. The dashboard needs sessions — grouped sequences of events from the same user within a 30-minute inactivity window.
Google Analytics 4 (GA4)
moved to an event-based
model in 2023, replacing
the session-based Universal
Analytics. The backend
runs on Google BigQuery
— customers can query
their own raw data via
BigQuery export.
A session ends when 30 minutes pass with no new events from that user. The Flink stream processor uses a session window — a gap-based window that closes after 30 minutes of silence.
// Flink: sessionize events per user events .keyBy(e -> e.siteId + ":" + e.sessionId) .window(EventTimeSessionWindows.withGap(Time.minutes(30))) .aggregate(new SessionAggregator()) // SessionAggregator computes per session: // - duration (last_ts - first_ts) // - page_count // - entry_page (first event url) // - exit_page (last event url) // - is_bounce (page_count == 1)
Session Stitching Visualizer
8. Level 6 — Dashboard Design
The dashboard is the customer-facing product. It must feel instantaneous even when querying months of data.
The “bounce rate” metric
is more nuanced than it
appears. A bounce is a
single-page session — but
a user who reads your
entire blog post and
leaves is also a “bounce.”
GA4 replaced it with
“engagement rate.”
9. Level 7 — Privacy & GDPR
Analytics collects user behavior data. This triggers legal obligations in many jurisdictions.
203.0.113.47 → 203.0.113.0. You retain geo precision (city-level) without storing a unique identifier. Required in Germany, recommended everywhere._ga session cookie after explicit opt-in. Without consent: use fingerprinting-free session IDs derived from anonymized IP + User-Agent hash, scoped to a single request window.userId in its own partition column. Erasure = drop partition for that user. Requires partition design at schema time.analytics.yourdomain.com via CNAME), so cookies are first-party.import ipaddress def anonymize_ip(raw_ip: str) -> str: try: addr = ipaddress.ip_address(raw_ip) if addr.version == 4: parts = raw_ip.split('.') return '.'.join(parts[:3]) + '.0' # zero last octet else: # IPv6: zero last 80 bits network = ipaddress.ip_network(raw_ip + '/48', strict=False) return str(network.network_address) except ValueError: return '0.0.0.0'
10. Capacity Estimate
| Metric | Number |
|---|---|
| Events/sec average | 115,000 |
| Events/sec peak | 500,000 |
| Kafka throughput | ~57 MB/sec (500 bytes × 115K) |
| Kafka partitions needed | ~600 (at 100 MB/sec/partition cap) |
| ClickHouse storage raw (2 yr) | ~3.6 PB raw, ~360 TB compressed (10:1) |
| ClickHouse rollup storage | ~5 TB (daily aggregates per site) |
| Redis real-time footprint | ~10 GB (HyperLogLog per site × 5-min windows) |
| Collector servers (10K events/sec each) | ~50 avg, ~100 peak |
| Flink task managers | ~20 (session windowing is CPU-heavy) |
| ClickHouse cluster | 3 shards × 2 replicas = 6 nodes, 60 TB SSD each |
The dominant cost is ClickHouse storage. At 10:1 compression, 2 years of raw events compresses to ~360 TB. At $0.02/GB-month (hot NVMe), that’s ~$7,200/month just for raw event storage — before rollups, compute, Kafka, and Redis.
11. Component Deep Dive: The Collector Service
The collector is the front door — it receives 115K events/sec, must be stateless and horizontally scalable.
from fastapi import FastAPI, Request, Response from aiokafka import AIOKafkaProducer import orjson, ua_parser, geoip2.database app = FastAPI() producer = AIOKafkaProducer(bootstrap_servers='kafka:9092') geo_db = geoip2.database.Reader('/data/GeoLite2-City.mmdb') @app.post('/collect') async def collect(request: Request): raw = await request.body() data = orjson.loads(raw) # Enrich: IP -> country, UA -> device/browser ip = request.client.host country = 'XX' try: rec = geo_db.city(ip) country = rec.country.iso_code or 'XX' except: pass ua_str = request.headers.get('user-agent', '') ua_info = ua_parser.parse(ua_str) data['country'] = country data['device'] = ua_info['device']['family'] data['browser'] = ua_info['user_agent']['family'] data['ip_anon'] = anonymize_ip(ip) await producer.send( 'raw-events', key=data['siteId'].encode(), # partition by siteId value=orjson.dumps(data) ) return Response(content=b'', status_code=204) # No Content
Key decisions:
- Partition Kafka by
siteId: All events for a single customer land on the same partition. Session stitching in Flink is keyed bysiteId:sessionId— no cross-partition shuffles. - 204 No Content response: The tracker doesn’t care about the response body. Returning nothing saves ~500 bytes × 115K/sec = 57 MB/sec of egress.
- IP enrichment at ingest: Do geo-lookup and UA parsing once at collection time, store the result. Never re-parse 10 billion events at query time.
12. Architecture Summary
Browser / Mobile App | | sendBeacon POST /collect v Collector Service x50 stateless pods | validate, enrich (IP→geo, UA→device) | partition key = siteId v Kafka raw-events topic 600 partitions, 7-day retention | +-------+----------+ | | v v Flink Flink Session Realtime Stitcher Aggregator | | v v ClickHouse Redis events table HyperLogLog + MV rollups active:site:5min | | +-------+--------+ | v Query API Dashboard UI
Trade-offs Worth Discussing in an Interview
| Decision | Chosen | Alternative | Why chosen |
|---|---|---|---|
| Stream broker | Kafka | Kinesis, Pulsar | Self-hosted, unlimited retention, strong ordering guarantees |
| OLAP store | ClickHouse | BigQuery, Druid, Pinot | Sub-second queries, efficient compression, on-premise option |
| Real-time counter | Redis HyperLogLog | Exact count in ClickHouse | O(1) space, ~1% error acceptable, millisecond reads |
| Stream processor | Flink | Spark Streaming, Storm | True streaming (not micro-batch), session windows native |
| Aggregation strategy | Kappa (single stream) | Lambda (batch + speed) | Simpler operations, reprocess by replaying Kafka |
Design a system like Google Analytics, and you touch nearly every major systems concept: high-throughput ingestion, durable messaging, columnar storage, streaming computation, approximate algorithms, GDPR compliance, and real-time dashboarding. The key insight is that analytics data is write-once, read-analytically — and every architectural decision should follow from that observation.