System Design: Live Chat at Scale — YouTube Live, Twitch, and 100k Concurrent Viewers

Series System Design: Web Scenarios

Twitch chat during major
esports events receives
50,000+ messages per
minute. Twitch uses a
custom IRC-over-WebSocket
protocol and client-side
throttling — your client
silently drops messages
if the stream is too fast.

Design the live chat for YouTube Live. A streamer has 100,000 concurrent viewers. Each viewer can send messages. Messages must appear to all viewers within 2 seconds. The system must survive chat “explosions” — when a streamer says something controversial and everyone types at once.

This is a real-world distributed systems problem with a clean progression of solutions, each solving a real failure mode of the previous. It’s also a favourite interview question because it touches WebSockets, pub/sub, Kafka, rate limiting, and fan-out simultaneously.

1. Scale & Constraints

100K

Concurrent viewers / stream

1,000

Messages / sec peak

< 2s

Delivery latency target

3 hrs

Message retention for replay

Millions

Platform-wide concurrent users

1,000s

Simultaneous live streams

The numbers immediately rule out naive approaches. Let’s walk through the levels.

2. Level 1 — Naive Polling

Level 1 — Fails at Scale

The simplest implementation: every client polls an HTTP endpoint every second asking “any new messages?”

http

# Client polls every 1 second
GET /chat?streamId=abc&since=msg_1234
→ 200 OK  {"messages": [...], "lastId": "msg_1250"}

# If no new messages
GET /chat?streamId=abc&since=msg_1250
→ 200 OK  {"messages": []}

Why it fails:

100,000 clients × 1 req/sec = 100,000 requests/second for a single stream
Latency up to 1 second (you only see a message at the next poll interval)
Most requests return an empty array — pure waste
Thundering herd: all clients poll in sync, creating spikes
A busy platform with 10,000 concurrent streams → 1 billion req/sec

Rule: Any solution where N clients generate N requests per second is a non-starter for live media. The request rate must be sub-linear in the viewer count.

3. Level 2 — Long Polling

Level 2 — Better, but Stateful HTTP

Long polling keeps the HTTP connection open until a message arrives (or a timeout fires). The server holds the request and responds only when there’s new data.

pseudocode — server

function longPollHandler(req, res) {
  const timeout = setTimeout(() => res.json([]), 30000);

  chatStore.subscribe(req.query.streamId, (msg) => {
    clearTimeout(timeout);
    res.json([msg]);    // respond immediately on new message
  });
}

This halves the request rate, but the fundamental problem remains: every viewer still needs a stateful HTTP connection parked on the server. A typical HTTP server handles a few thousand concurrent connections. At 100,000 viewers, you would need 20–30 web servers just to hold connections open — and you still have no mechanism to fan a message out to connections on different servers.

4. Level 3 — WebSockets

Level 3 — Right Protocol, New Problem

WebSockets give us a persistent bidirectional connection. The server can push messages to clients the instant they arrive. No polling. No wasted requests.

javascript — client

const ws = new WebSocket('wss://chat.example.com/stream/abc');

ws.onopen = () => {
  ws.send(JSON.stringify({ type: 'join', streamId: 'abc' }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  renderChatMessage(msg);   // instant push, no polling
};

ws.onclose = () => {
  setTimeout(() => reconnect(), 2000);  // reconnect with backoff
};

The new problem: cross-server fan-out.

A single server can hold roughly 10,000 WebSocket connections (limited by file descriptors and memory, not CPU). For 100,000 viewers you need at least 10 chat servers. When a message arrives on Chat Server 1, it must also reach viewers connected to Chat Servers 2 through 10.

How does Server 1 tell the others? This is the fan-out problem.

5. Level 4 — Redis Pub/Sub for Cross-Server Fan-Out

Level 4 — Production-Ready Fan-Out

Redis Pub/Sub provides a lightweight message bus. Every chat server subscribes to the same channel. When any server receives a message, it publishes to Redis, and all servers receive it simultaneously.

pseudocode — chat server

// On startup: subscribe to this stream's chat channel
redis.subscribe('stream:' + streamId + ':chat', (message) => {
  // Received from Redis — push to ALL local WebSocket clients
  localClients.forEach(client => client.send(message));
});

// When a viewer sends a message
ws.onmessage = (rawMsg) => {
  const msg = validate(rawMsg);
  redis.publish('stream:' + streamId + ':chat', JSON.stringify(msg));
};

The fan-out path:

Viewer on Server 3 sends a message via WebSocket
Server 3 publishes the message to Redis channel stream:abc:chat
Redis delivers it to all subscribers — Server 1, Server 2, Server 3, …, Server 10
Each server pushes to its local WebSocket clients
All 100,000 viewers receive the message — typically in under 50ms

Limitation: Redis Pub/Sub is fire-and-forget. If a chat server restarts, it loses all messages during the downtime window. There is no persistence, no replay, no offset. A viewer joining mid-stream cannot retrieve the last 100 messages.

Interactive — WebSocket Fan-Out Visualizer

▶ WebSocket Fan-Out & Redis Pub/Sub — Live Animation

6. Level 5 — Kafka for Durability and Replay

Level 5 — Durable Delivery

YouTube Live introduced
‘Top Chat’ — an ML
model that runs on every
message in real-time to
filter the highest-quality
ones from the flood.
At 50k messages/min
that’s serious inference
at scale.

Redis Pub/Sub has a critical flaw: it has no persistence. Messages are delivered to current subscribers and immediately discarded. If a chat server restarts mid-stream, it misses every message during downtime. A viewer who joins 30 minutes in cannot fetch chat history.

Kafka solves this. Each stream’s chat becomes a Kafka topic, partitioned by stream ID. Messages are retained on disk for the configured retention period (here: 3 hours). Chat servers are Kafka consumer groups. New servers joining simply pick up from the last committed offset.

pseudocode — kafka producer

// When a viewer sends a message
kafka.produce({
  topic: 'live-chat',
  partition: hash(streamId) % NUM_PARTITIONS,
  key: streamId,
  value: JSON.stringify({
    msgId:    uuid(),
    streamId: streamId,
    userId:   req.user.id,
    text:     sanitized,
    ts:       Date.now()
  })
});

pseudocode — kafka consumer (chat server)

// Chat server subscribes as a consumer group
kafka.subscribe({
  topics: ['live-chat'],
  groupId: 'chat-servers-group'
}, (message) => {
  const streamId = message.key;
  // Push to all local WebSocket clients watching this stream
  localSubs.get(streamId).forEach(client => {
    client.send(message.value);
  });
});

// New viewer joining mid-stream: fetch last 100 messages
const history = await kafka.fetchFromOffset({
  topic: 'live-chat',
  partition: hash(streamId),
  fromOffset: 'latest',
  limit: 100
});

Kafka vs Redis trade-off:

Property	Redis Pub/Sub	Kafka
Latency	< 5ms	50–100ms
Persistence	None — fire and forget	Durable (configurable retention)
Replay / history	No	Yes — seek to any offset
Fan-out to many consumers	Instant broadcast	Consumer group — partitioned
Operational complexity	Low	High (ZooKeeper/KRaft, brokers)

Best practice used by major platforms: use both. Redis Pub/Sub for the hot path (sub-10ms live delivery to connected servers), Kafka for durability and replay. The write path publishes to both. Kafka’s consumer group is used for message history, analytics, and moderation pipelines — not for the real-time fan-out.

Architecture: Viewer sends message → Chat Server → (1) Redis PUBLISH for live fan-out AND (2) Kafka produce for persistence. Chat Server also consumes Kafka to serve chat replay to late-joining viewers.

7. Level 6 — Rate Limiting and Moderation

Level 6 — Production Hardening

A stream with 100,000 viewers where every viewer can send messages without limits becomes unusable within seconds. Rate limiting and moderation are not optional features — they are prerequisites for a functioning chat.

7a. Per-User Rate Limiting (Sliding Window in Redis)

lua — redis sliding window rate limiter

-- Key: ratelimit:{userId}:{streamId}
-- Allow max 2 messages per 1000ms sliding window
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])  -- 1000 ms
local limit  = tonumber(ARGV[3])  -- 2 messages

-- Remove events outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count remaining events in window
local count = redis.call('ZCARD', key)

if count < limit then
  redis.call('ZADD', key, now, now)
  redis.call('EXPIRE', key, 2)
  return 1  -- allowed
else
  return 0  -- throttled
end

7b. Streamer-Controlled Modes

Slow mode: 1 message per N seconds per user — enforced with a Redis key with TTL
Subscriber-only mode: check userId against subscriber set in Redis
Emote-only mode: validate message against regex on the chat server before publishing

pseudocode — slow mode check

async function canSendInSlowMode(userId, streamId, cooldownSecs) {
  const key = 'slowmode:' + streamId + ':' + userId;
  const exists = await redis.exists(key);
  if (exists) return false;
  await redis.set(key, 1, 'EX', cooldownSecs);
  return true;
}

7c. Auto-Moderation Pipeline

Messages pass through a moderation pipeline before being published:

Keyword filter — Redis SET of banned words, O(1) lookup per word token
Regex patterns — phone numbers, emails, URLs (configurable per channel)
ML classifier — harassment/hate-speech model; runs asynchronously, messages flagged after delivery can be retroactively removed
Channel-specific blocklists — streamers maintain their own ban lists

Interactive — Live Chat Demo

▶ Live Chat Demo — Rate Limiting & Moderation

Live Chat

System Stats

Mode: normal
Msgs delivered: 0
Msgs throttled: 0
Msgs removed: 0
Msgs/sec: 0
Rate limit: 2/sec
Kafka lag: 12ms
Redis RTT: 2ms

Moderation Log

8. Level 7 — Viewer Count (HyperLogLog)

Discord built its
real-time engine on
Elixir/Erlang — a
runtime designed for
telecom switches that
must handle millions of
lightweight processes
concurrently. At
Discord’s scale this
wasn’t premature
optimisation; it was
the right tool
from day one.

Displaying an accurate live viewer count sounds simple. It isn’t.

Naive approach — count WebSocket connections:

sql

-- Called every second: terrible idea
SELECT COUNT(*) FROM connections
WHERE stream_id = 'abc'
  AND last_heartbeat > NOW() - INTERVAL '30 seconds';

At millions of connections per second, this is a write-heavy table. The query touches the full index for every stream every second. It scales linearly with stream count — a disaster.

Production approach — Redis HyperLogLog:

HyperLogLog is a probabilistic data structure that estimates cardinality in O(1) time with only 12KB of memory, regardless of how many unique elements it has seen. The error rate is ±0.81%.

redis commands

# Viewer joins or sends heartbeat
PFADD  stream:abc:viewers  user:12345

# Get approximate viewer count (12 KB memory, ±0.81% error)
PFCOUNT  stream:abc:viewers

# Merge across regional clusters
PFMERGE  stream:abc:viewers:global  stream:abc:viewers:us  stream:abc:viewers:eu

Viewers send a heartbeat every 30 seconds. On disconnect or missed heartbeat the HLL key is not modified — HLL only adds, never removes. Instead, you rotate the HLL key every minute: copy it, expire the old one, start fresh. The count displayed is a 1-minute rolling window.

Trade-off accepted: ±0.81% error on a viewer count is imperceptible to humans. "100,000 viewers" vs "100,810 viewers" — nobody cares. Trading exactness for O(1) memory and O(1) time is the correct engineering call.

9. Geographic Distribution

A single regional cluster introduces a problem: a viewer in Tokyo watching a US stream routes all WebSocket traffic to US data centres — adding 150ms of network latency on top of application latency. For chat, which should feel instantaneous, this is noticeable.

Multi-region architecture:

architecture

User (Tokyo) 
  → Anycast DNS / GeoDNS 
  → Asia-Pacific Chat Cluster (WebSocket)
    → AP Redis Pub/Sub  # local fan-out < 5ms
    → AP Kafka Broker   # local durability
      → Kafka MirrorMaker 2.0
        → US-East Kafka Broker  # cross-region replication ~100ms
        → EU Kafka Broker

Messages from a Tokyo viewer reach other Tokyo viewers in under 10ms via local Redis. The cross-region Kafka replication ensures US-East and EU viewers also receive the message — with an additional ~80–120ms of network transit. For chat, this is completely acceptable and unnoticeable to users.

Consistency model: eventual consistency across regions. Chat is not a financial transaction — a 100ms ordering discrepancy between a Tokyo viewer and a New York viewer is invisible and irrelevant.

10. Capacity Estimate

Metric	Number	Notes
WebSocket connections / server	~10,000	Limited by fd ulimit + memory (~10KB per conn)
Chat servers for 100k viewers	10	Plus 3–5 spare for rolling deploys
Peak messages / sec	1,000	Single popular stream
Redis pub/sub throughput	< 5ms	In-memory, no disk I/O
Kafka write throughput	~500 KB/sec	500 bytes avg × 1,000 msg/sec
Kafka retention (3hr)	~5.4 GB	Per popular stream (500B × 1000/s × 10,800s)
Redis HLL per stream	12 KB	Regardless of viewer count
Rate limit keys in Redis	~100k	One per active user, TTL 2s
Platform-wide Kafka throughput	~50 MB/sec	100 popular streams simultaneously

11. Full Architecture Summary

▶ Complete System Architecture

Viewer
→
Load Balancer
sticky sessions / L4
→
Chat Server
WebSocket + rate limit
↓ publish
Redis Pub/Sub
fan-out < 5ms
Kafka
durability + replay
Moderation Service
keyword + ML
← Kafka consumer
Analytics Pipeline
Flink / Spark

Message lifecycle:

Viewer sends message via WebSocket to Chat Server
Chat Server checks rate limit (Redis, sliding window) — reject if exceeded
Chat Server runs synchronous moderation (keyword filter) — reject if flagged
Chat Server PUBLISHes to Redis stream:{id}:chat channel (live fan-out)
Chat Server produces to Kafka live-chat topic (durability + async moderation)
All Chat Servers subscribed to Redis channel receive the message and push via WebSocket
Kafka consumers (moderation service, analytics) process the message asynchronously
If async ML moderation flags the message, a “retract” event is published to Redis + Kafka

12. Interview Checklist

When presenting this in an interview, hit these beats:

✅ Functional requirements confirmed: message delivery < 2s, 100k concurrent viewers, chat history for replay, rate limiting and moderation
✅ Non-functional confirmed: high availability, horizontal scalability, durability (3hr retention)
✅ Estimation done: 10 chat servers for 100k viewers, 500KB/s Kafka, 5.4GB retention per stream
✅ Protocol chosen and justified: WebSocket over polling — permanent connection, server-push
✅ Fan-out solved: Redis Pub/Sub for cross-server broadcast
✅ Durability solved: Kafka for persistence, replay, and pipeline consumers
✅ Rate limiting designed: sliding window in Redis, slow mode, subscriber-only
✅ Viewer count solved: HyperLogLog — O(1) memory, O(1) time, ±0.81% accuracy
✅ Geographic distribution mentioned: GeoDNS, regional clusters, Kafka MirrorMaker

Common follow-up questions:

“What happens when a chat server crashes?” — Clients reconnect (with exponential backoff). Kafka consumers rebalance within the consumer group. No messages are lost because they are already in Kafka. The reconnecting server subscribes to Redis and resumes.
“How do you handle a celebrity joining a stream, suddenly jumping from 1k to 500k viewers?” — Horizontal auto-scaling of chat servers behind the load balancer. Redis Pub/Sub handles the fan-out channel automatically; new servers just subscribe. The load balancer redistributes connections across the new fleet.
“How do you display the last 100 messages to a viewer joining late?” — Chat servers query Kafka by seeking to (latest offset - 100) for the stream’s partition, fetch the messages, and send them to the client before joining the live Redis channel.

Next in this series: Designing Real-Time Collaborative Editing — Google Docs at Scale.