System Design: Live Chat at Scale — YouTube Live, Twitch, and 100k Concurrent Viewers
Twitch chat during major
esports events receives
50,000+ messages per
minute. Twitch uses a
custom IRC-over-WebSocket
protocol and client-side
throttling — your client
silently drops messages
if the stream is too fast.
Design the live chat for YouTube Live. A streamer has 100,000 concurrent viewers. Each viewer can send messages. Messages must appear to all viewers within 2 seconds. The system must survive chat “explosions” — when a streamer says something controversial and everyone types at once.
This is a real-world distributed systems problem with a clean progression of solutions, each solving a real failure mode of the previous. It’s also a favourite interview question because it touches WebSockets, pub/sub, Kafka, rate limiting, and fan-out simultaneously.
1. Scale & Constraints
The numbers immediately rule out naive approaches. Let’s walk through the levels.
2. Level 1 — Naive Polling
The simplest implementation: every client polls an HTTP endpoint every second asking “any new messages?”
# Client polls every 1 second GET /chat?streamId=abc&since=msg_1234 → 200 OK {"messages": [...], "lastId": "msg_1250"} # If no new messages GET /chat?streamId=abc&since=msg_1250 → 200 OK {"messages": []}
Why it fails:
- 100,000 clients × 1 req/sec = 100,000 requests/second for a single stream
- Latency up to 1 second (you only see a message at the next poll interval)
- Most requests return an empty array — pure waste
- Thundering herd: all clients poll in sync, creating spikes
- A busy platform with 10,000 concurrent streams → 1 billion req/sec
3. Level 2 — Long Polling
Long polling keeps the HTTP connection open until a message arrives (or a timeout fires). The server holds the request and responds only when there’s new data.
function longPollHandler(req, res) { const timeout = setTimeout(() => res.json([]), 30000); chatStore.subscribe(req.query.streamId, (msg) => { clearTimeout(timeout); res.json([msg]); // respond immediately on new message }); }
This halves the request rate, but the fundamental problem remains: every viewer still needs a stateful HTTP connection parked on the server. A typical HTTP server handles a few thousand concurrent connections. At 100,000 viewers, you would need 20–30 web servers just to hold connections open — and you still have no mechanism to fan a message out to connections on different servers.
4. Level 3 — WebSockets
WebSockets give us a persistent bidirectional connection. The server can push messages to clients the instant they arrive. No polling. No wasted requests.
const ws = new WebSocket('wss://chat.example.com/stream/abc'); ws.onopen = () => { ws.send(JSON.stringify({ type: 'join', streamId: 'abc' })); }; ws.onmessage = (event) => { const msg = JSON.parse(event.data); renderChatMessage(msg); // instant push, no polling }; ws.onclose = () => { setTimeout(() => reconnect(), 2000); // reconnect with backoff };
The new problem: cross-server fan-out.
A single server can hold roughly 10,000 WebSocket connections (limited by file descriptors and memory, not CPU). For 100,000 viewers you need at least 10 chat servers. When a message arrives on Chat Server 1, it must also reach viewers connected to Chat Servers 2 through 10.
How does Server 1 tell the others? This is the fan-out problem.
5. Level 4 — Redis Pub/Sub for Cross-Server Fan-Out
Redis Pub/Sub provides a lightweight message bus. Every chat server subscribes to the same channel. When any server receives a message, it publishes to Redis, and all servers receive it simultaneously.
// On startup: subscribe to this stream's chat channel redis.subscribe('stream:' + streamId + ':chat', (message) => { // Received from Redis — push to ALL local WebSocket clients localClients.forEach(client => client.send(message)); }); // When a viewer sends a message ws.onmessage = (rawMsg) => { const msg = validate(rawMsg); redis.publish('stream:' + streamId + ':chat', JSON.stringify(msg)); };
The fan-out path:
- Viewer on Server 3 sends a message via WebSocket
- Server 3 publishes the message to Redis channel
stream:abc:chat - Redis delivers it to all subscribers — Server 1, Server 2, Server 3, …, Server 10
- Each server pushes to its local WebSocket clients
- All 100,000 viewers receive the message — typically in under 50ms
Interactive — WebSocket Fan-Out Visualizer
6. Level 5 — Kafka for Durability and Replay
YouTube Live introduced
‘Top Chat’ — an ML
model that runs on every
message in real-time to
filter the highest-quality
ones from the flood.
At 50k messages/min
that’s serious inference
at scale.
Redis Pub/Sub has a critical flaw: it has no persistence. Messages are delivered to current subscribers and immediately discarded. If a chat server restarts mid-stream, it misses every message during downtime. A viewer who joins 30 minutes in cannot fetch chat history.
Kafka solves this. Each stream’s chat becomes a Kafka topic, partitioned by stream ID. Messages are retained on disk for the configured retention period (here: 3 hours). Chat servers are Kafka consumer groups. New servers joining simply pick up from the last committed offset.
// When a viewer sends a message kafka.produce({ topic: 'live-chat', partition: hash(streamId) % NUM_PARTITIONS, key: streamId, value: JSON.stringify({ msgId: uuid(), streamId: streamId, userId: req.user.id, text: sanitized, ts: Date.now() }) });
// Chat server subscribes as a consumer group kafka.subscribe({ topics: ['live-chat'], groupId: 'chat-servers-group' }, (message) => { const streamId = message.key; // Push to all local WebSocket clients watching this stream localSubs.get(streamId).forEach(client => { client.send(message.value); }); }); // New viewer joining mid-stream: fetch last 100 messages const history = await kafka.fetchFromOffset({ topic: 'live-chat', partition: hash(streamId), fromOffset: 'latest', limit: 100 });
Kafka vs Redis trade-off:
| Property | Redis Pub/Sub | Kafka |
|---|---|---|
| Latency | < 5ms | 50–100ms |
| Persistence | None — fire and forget | Durable (configurable retention) |
| Replay / history | No | Yes — seek to any offset |
| Fan-out to many consumers | Instant broadcast | Consumer group — partitioned |
| Operational complexity | Low | High (ZooKeeper/KRaft, brokers) |
Best practice used by major platforms: use both. Redis Pub/Sub for the hot path (sub-10ms live delivery to connected servers), Kafka for durability and replay. The write path publishes to both. Kafka’s consumer group is used for message history, analytics, and moderation pipelines — not for the real-time fan-out.
7. Level 6 — Rate Limiting and Moderation
A stream with 100,000 viewers where every viewer can send messages without limits becomes unusable within seconds. Rate limiting and moderation are not optional features — they are prerequisites for a functioning chat.
7a. Per-User Rate Limiting (Sliding Window in Redis)
-- Key: ratelimit:{userId}:{streamId} -- Allow max 2 messages per 1000ms sliding window local key = KEYS[1] local now = tonumber(ARGV[1]) local window = tonumber(ARGV[2]) -- 1000 ms local limit = tonumber(ARGV[3]) -- 2 messages -- Remove events outside the window redis.call('ZREMRANGEBYSCORE', key, 0, now - window) -- Count remaining events in window local count = redis.call('ZCARD', key) if count < limit then redis.call('ZADD', key, now, now) redis.call('EXPIRE', key, 2) return 1 -- allowed else return 0 -- throttled end
7b. Streamer-Controlled Modes
- Slow mode: 1 message per N seconds per user — enforced with a Redis key with TTL
- Subscriber-only mode: check
userIdagainst subscriber set in Redis - Emote-only mode: validate message against regex on the chat server before publishing
async function canSendInSlowMode(userId, streamId, cooldownSecs) { const key = 'slowmode:' + streamId + ':' + userId; const exists = await redis.exists(key); if (exists) return false; await redis.set(key, 1, 'EX', cooldownSecs); return true; }
7c. Auto-Moderation Pipeline
Messages pass through a moderation pipeline before being published:
- Keyword filter — Redis SET of banned words, O(1) lookup per word token
- Regex patterns — phone numbers, emails, URLs (configurable per channel)
- ML classifier — harassment/hate-speech model; runs asynchronously, messages flagged after delivery can be retroactively removed
- Channel-specific blocklists — streamers maintain their own ban lists
Interactive — Live Chat Demo
8. Level 7 — Viewer Count (HyperLogLog)
Discord built its
real-time engine on
Elixir/Erlang — a
runtime designed for
telecom switches that
must handle millions of
lightweight processes
concurrently. At
Discord’s scale this
wasn’t premature
optimisation; it was
the right tool
from day one.
Displaying an accurate live viewer count sounds simple. It isn’t.
Naive approach — count WebSocket connections:
-- Called every second: terrible idea SELECT COUNT(*) FROM connections WHERE stream_id = 'abc' AND last_heartbeat > NOW() - INTERVAL '30 seconds';
At millions of connections per second, this is a write-heavy table. The query touches the full index for every stream every second. It scales linearly with stream count — a disaster.
Production approach — Redis HyperLogLog:
HyperLogLog is a probabilistic data structure that estimates cardinality in O(1) time with only 12KB of memory, regardless of how many unique elements it has seen. The error rate is ±0.81%.
# Viewer joins or sends heartbeat PFADD stream:abc:viewers user:12345 # Get approximate viewer count (12 KB memory, ±0.81% error) PFCOUNT stream:abc:viewers # Merge across regional clusters PFMERGE stream:abc:viewers:global stream:abc:viewers:us stream:abc:viewers:eu
Viewers send a heartbeat every 30 seconds. On disconnect or missed heartbeat the HLL key is not modified — HLL only adds, never removes. Instead, you rotate the HLL key every minute: copy it, expire the old one, start fresh. The count displayed is a 1-minute rolling window.
9. Geographic Distribution
A single regional cluster introduces a problem: a viewer in Tokyo watching a US stream routes all WebSocket traffic to US data centres — adding 150ms of network latency on top of application latency. For chat, which should feel instantaneous, this is noticeable.
Multi-region architecture:
User (Tokyo) → Anycast DNS / GeoDNS → Asia-Pacific Chat Cluster (WebSocket) → AP Redis Pub/Sub # local fan-out < 5ms → AP Kafka Broker # local durability → Kafka MirrorMaker 2.0 → US-East Kafka Broker # cross-region replication ~100ms → EU Kafka Broker
Messages from a Tokyo viewer reach other Tokyo viewers in under 10ms via local Redis. The cross-region Kafka replication ensures US-East and EU viewers also receive the message — with an additional ~80–120ms of network transit. For chat, this is completely acceptable and unnoticeable to users.
Consistency model: eventual consistency across regions. Chat is not a financial transaction — a 100ms ordering discrepancy between a Tokyo viewer and a New York viewer is invisible and irrelevant.
10. Capacity Estimate
| Metric | Number | Notes |
|---|---|---|
| WebSocket connections / server | ~10,000 | Limited by fd ulimit + memory (~10KB per conn) |
| Chat servers for 100k viewers | 10 | Plus 3–5 spare for rolling deploys |
| Peak messages / sec | 1,000 | Single popular stream |
| Redis pub/sub throughput | < 5ms | In-memory, no disk I/O |
| Kafka write throughput | ~500 KB/sec | 500 bytes avg × 1,000 msg/sec |
| Kafka retention (3hr) | ~5.4 GB | Per popular stream (500B × 1000/s × 10,800s) |
| Redis HLL per stream | 12 KB | Regardless of viewer count |
| Rate limit keys in Redis | ~100k | One per active user, TTL 2s |
| Platform-wide Kafka throughput | ~50 MB/sec | 100 popular streams simultaneously |
11. Full Architecture Summary
sticky sessions / L4
WebSocket + rate limit
fan-out < 5ms
durability + replay
keyword + ML
Flink / Spark
Message lifecycle:
- Viewer sends message via WebSocket to Chat Server
- Chat Server checks rate limit (Redis, sliding window) — reject if exceeded
- Chat Server runs synchronous moderation (keyword filter) — reject if flagged
- Chat Server PUBLISHes to Redis
stream:{id}:chatchannel (live fan-out) - Chat Server produces to Kafka
live-chattopic (durability + async moderation) - All Chat Servers subscribed to Redis channel receive the message and push via WebSocket
- Kafka consumers (moderation service, analytics) process the message asynchronously
- If async ML moderation flags the message, a “retract” event is published to Redis + Kafka
12. Interview Checklist
When presenting this in an interview, hit these beats:
✅ Non-functional confirmed: high availability, horizontal scalability, durability (3hr retention)
✅ Estimation done: 10 chat servers for 100k viewers, 500KB/s Kafka, 5.4GB retention per stream
✅ Protocol chosen and justified: WebSocket over polling — permanent connection, server-push
✅ Fan-out solved: Redis Pub/Sub for cross-server broadcast
✅ Durability solved: Kafka for persistence, replay, and pipeline consumers
✅ Rate limiting designed: sliding window in Redis, slow mode, subscriber-only
✅ Viewer count solved: HyperLogLog — O(1) memory, O(1) time, ±0.81% accuracy
✅ Geographic distribution mentioned: GeoDNS, regional clusters, Kafka MirrorMaker
Common follow-up questions:
- “What happens when a chat server crashes?” — Clients reconnect (with exponential backoff). Kafka consumers rebalance within the consumer group. No messages are lost because they are already in Kafka. The reconnecting server subscribes to Redis and resumes.
- “How do you handle a celebrity joining a stream, suddenly jumping from 1k to 500k viewers?” — Horizontal auto-scaling of chat servers behind the load balancer. Redis Pub/Sub handles the fan-out channel automatically; new servers just subscribe. The load balancer redistributes connections across the new fleet.
- “How do you display the last 100 messages to a viewer joining late?” — Chat servers query Kafka by seeking to
(latest offset - 100)for the stream’s partition, fetch the messages, and send them to the client before joining the live Redis channel.
Next in this series: Designing Real-Time Collaborative Editing — Google Docs at Scale.