System Design: Live Chat at Scale — YouTube Live, Twitch, and 100k Concurrent Viewers

Series System Design: Web Scenarios

Twitch chat during major
esports events receives
50,000+ messages per
minute. Twitch uses a
custom IRC-over-WebSocket
protocol and client-side
throttling — your client
silently drops messages
if the stream is too fast.

Design the live chat for YouTube Live. A streamer has 100,000 concurrent viewers. Each viewer can send messages. Messages must appear to all viewers within 2 seconds. The system must survive chat “explosions” — when a streamer says something controversial and everyone types at once.

This is a real-world distributed systems problem with a clean progression of solutions, each solving a real failure mode of the previous. It’s also a favourite interview question because it touches WebSockets, pub/sub, Kafka, rate limiting, and fan-out simultaneously.


1. Scale & Constraints

100K
Concurrent viewers / stream
1,000
Messages / sec peak
< 2s
Delivery latency target
3 hrs
Message retention for replay
Millions
Platform-wide concurrent users
1,000s
Simultaneous live streams

The numbers immediately rule out naive approaches. Let’s walk through the levels.


2. Level 1 — Naive Polling

Level 1 — Fails at Scale

The simplest implementation: every client polls an HTTP endpoint every second asking “any new messages?”

http
# Client polls every 1 second
GET /chat?streamId=abc&since=msg_1234
 200 OK  {"messages": [...], "lastId": "msg_1250"}

# If no new messages
GET /chat?streamId=abc&since=msg_1250
 200 OK  {"messages": []}

Why it fails:

  • 100,000 clients × 1 req/sec = 100,000 requests/second for a single stream
  • Latency up to 1 second (you only see a message at the next poll interval)
  • Most requests return an empty array — pure waste
  • Thundering herd: all clients poll in sync, creating spikes
  • A busy platform with 10,000 concurrent streams → 1 billion req/sec
Rule: Any solution where N clients generate N requests per second is a non-starter for live media. The request rate must be sub-linear in the viewer count.

3. Level 2 — Long Polling

Level 2 — Better, but Stateful HTTP

Long polling keeps the HTTP connection open until a message arrives (or a timeout fires). The server holds the request and responds only when there’s new data.

pseudocode — server
function longPollHandler(req, res) {
  const timeout = setTimeout(() => res.json([]), 30000);

  chatStore.subscribe(req.query.streamId, (msg) => {
    clearTimeout(timeout);
    res.json([msg]);    // respond immediately on new message
  });
}

This halves the request rate, but the fundamental problem remains: every viewer still needs a stateful HTTP connection parked on the server. A typical HTTP server handles a few thousand concurrent connections. At 100,000 viewers, you would need 20–30 web servers just to hold connections open — and you still have no mechanism to fan a message out to connections on different servers.


4. Level 3 — WebSockets

Level 3 — Right Protocol, New Problem

WebSockets give us a persistent bidirectional connection. The server can push messages to clients the instant they arrive. No polling. No wasted requests.

javascript — client
const ws = new WebSocket('wss://chat.example.com/stream/abc');

ws.onopen = () => {
  ws.send(JSON.stringify({ type: 'join', streamId: 'abc' }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  renderChatMessage(msg);   // instant push, no polling
};

ws.onclose = () => {
  setTimeout(() => reconnect(), 2000);  // reconnect with backoff
};

The new problem: cross-server fan-out.

A single server can hold roughly 10,000 WebSocket connections (limited by file descriptors and memory, not CPU). For 100,000 viewers you need at least 10 chat servers. When a message arrives on Chat Server 1, it must also reach viewers connected to Chat Servers 2 through 10.

How does Server 1 tell the others? This is the fan-out problem.


5. Level 4 — Redis Pub/Sub for Cross-Server Fan-Out

Level 4 — Production-Ready Fan-Out

Redis Pub/Sub provides a lightweight message bus. Every chat server subscribes to the same channel. When any server receives a message, it publishes to Redis, and all servers receive it simultaneously.

pseudocode — chat server
// On startup: subscribe to this stream's chat channel
redis.subscribe('stream:' + streamId + ':chat', (message) => {
  // Received from Redis — push to ALL local WebSocket clients
  localClients.forEach(client => client.send(message));
});

// When a viewer sends a message
ws.onmessage = (rawMsg) => {
  const msg = validate(rawMsg);
  redis.publish('stream:' + streamId + ':chat', JSON.stringify(msg));
};

The fan-out path:

  1. Viewer on Server 3 sends a message via WebSocket
  2. Server 3 publishes the message to Redis channel stream:abc:chat
  3. Redis delivers it to all subscribers — Server 1, Server 2, Server 3, …, Server 10
  4. Each server pushes to its local WebSocket clients
  5. All 100,000 viewers receive the message — typically in under 50ms
Limitation: Redis Pub/Sub is fire-and-forget. If a chat server restarts, it loses all messages during the downtime window. There is no persistence, no replay, no offset. A viewer joining mid-stream cannot retrieve the last 100 messages.

Interactive — WebSocket Fan-Out Visualizer

▶ WebSocket Fan-Out & Redis Pub/Sub — Live Animation

6. Level 5 — Kafka for Durability and Replay

Level 5 — Durable Delivery

YouTube Live introduced
‘Top Chat’ — an ML
model that runs on every
message in real-time to
filter the highest-quality
ones from the flood.
At 50k messages/min
that’s serious inference
at scale.

Redis Pub/Sub has a critical flaw: it has no persistence. Messages are delivered to current subscribers and immediately discarded. If a chat server restarts mid-stream, it misses every message during downtime. A viewer who joins 30 minutes in cannot fetch chat history.

Kafka solves this. Each stream’s chat becomes a Kafka topic, partitioned by stream ID. Messages are retained on disk for the configured retention period (here: 3 hours). Chat servers are Kafka consumer groups. New servers joining simply pick up from the last committed offset.

pseudocode — kafka producer
// When a viewer sends a message
kafka.produce({
  topic: 'live-chat',
  partition: hash(streamId) % NUM_PARTITIONS,
  key: streamId,
  value: JSON.stringify({
    msgId:    uuid(),
    streamId: streamId,
    userId:   req.user.id,
    text:     sanitized,
    ts:       Date.now()
  })
});
pseudocode — kafka consumer (chat server)
// Chat server subscribes as a consumer group
kafka.subscribe({
  topics: ['live-chat'],
  groupId: 'chat-servers-group'
}, (message) => {
  const streamId = message.key;
  // Push to all local WebSocket clients watching this stream
  localSubs.get(streamId).forEach(client => {
    client.send(message.value);
  });
});

// New viewer joining mid-stream: fetch last 100 messages
const history = await kafka.fetchFromOffset({
  topic: 'live-chat',
  partition: hash(streamId),
  fromOffset: 'latest',
  limit: 100
});

Kafka vs Redis trade-off:

PropertyRedis Pub/SubKafka
Latency< 5ms50–100ms
PersistenceNone — fire and forgetDurable (configurable retention)
Replay / historyNoYes — seek to any offset
Fan-out to many consumersInstant broadcastConsumer group — partitioned
Operational complexityLowHigh (ZooKeeper/KRaft, brokers)

Best practice used by major platforms: use both. Redis Pub/Sub for the hot path (sub-10ms live delivery to connected servers), Kafka for durability and replay. The write path publishes to both. Kafka’s consumer group is used for message history, analytics, and moderation pipelines — not for the real-time fan-out.

Architecture: Viewer sends message → Chat Server → (1) Redis PUBLISH for live fan-out AND (2) Kafka produce for persistence. Chat Server also consumes Kafka to serve chat replay to late-joining viewers.

7. Level 6 — Rate Limiting and Moderation

Level 6 — Production Hardening

A stream with 100,000 viewers where every viewer can send messages without limits becomes unusable within seconds. Rate limiting and moderation are not optional features — they are prerequisites for a functioning chat.

7a. Per-User Rate Limiting (Sliding Window in Redis)

lua — redis sliding window rate limiter
-- Key: ratelimit:{userId}:{streamId}
-- Allow max 2 messages per 1000ms sliding window
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])  -- 1000 ms
local limit  = tonumber(ARGV[3])  -- 2 messages

-- Remove events outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count remaining events in window
local count = redis.call('ZCARD', key)

if count < limit then
  redis.call('ZADD', key, now, now)
  redis.call('EXPIRE', key, 2)
  return 1  -- allowed
else
  return 0  -- throttled
end

7b. Streamer-Controlled Modes

  • Slow mode: 1 message per N seconds per user — enforced with a Redis key with TTL
  • Subscriber-only mode: check userId against subscriber set in Redis
  • Emote-only mode: validate message against regex on the chat server before publishing
pseudocode — slow mode check
async function canSendInSlowMode(userId, streamId, cooldownSecs) {
  const key = 'slowmode:' + streamId + ':' + userId;
  const exists = await redis.exists(key);
  if (exists) return false;
  await redis.set(key, 1, 'EX', cooldownSecs);
  return true;
}

7c. Auto-Moderation Pipeline

Messages pass through a moderation pipeline before being published:

  1. Keyword filter — Redis SET of banned words, O(1) lookup per word token
  2. Regex patterns — phone numbers, emails, URLs (configurable per channel)
  3. ML classifier — harassment/hate-speech model; runs asynchronously, messages flagged after delivery can be retroactively removed
  4. Channel-specific blocklists — streamers maintain their own ban lists

Interactive — Live Chat Demo

▶ Live Chat Demo — Rate Limiting & Moderation
Live Chat
System Stats
Mode: normal
Msgs delivered: 0
Msgs throttled: 0
Msgs removed: 0
Msgs/sec: 0
Rate limit: 2/sec
Kafka lag: 12ms
Redis RTT: 2ms
Moderation Log

8. Level 7 — Viewer Count (HyperLogLog)

Discord built its
real-time engine on
Elixir/Erlang — a
runtime designed for
telecom switches that
must handle millions of
lightweight processes
concurrently. At
Discord’s scale this
wasn’t premature
optimisation; it was
the right tool
from day one.

Displaying an accurate live viewer count sounds simple. It isn’t.

Naive approach — count WebSocket connections:

sql
-- Called every second: terrible idea
SELECT COUNT(*) FROM connections
WHERE stream_id = 'abc'
  AND last_heartbeat > NOW() - INTERVAL '30 seconds';

At millions of connections per second, this is a write-heavy table. The query touches the full index for every stream every second. It scales linearly with stream count — a disaster.

Production approach — Redis HyperLogLog:

HyperLogLog is a probabilistic data structure that estimates cardinality in O(1) time with only 12KB of memory, regardless of how many unique elements it has seen. The error rate is ±0.81%.

redis commands
# Viewer joins or sends heartbeat
PFADD  stream:abc:viewers  user:12345

# Get approximate viewer count (12 KB memory, ±0.81% error)
PFCOUNT  stream:abc:viewers

# Merge across regional clusters
PFMERGE  stream:abc:viewers:global  stream:abc:viewers:us  stream:abc:viewers:eu

Viewers send a heartbeat every 30 seconds. On disconnect or missed heartbeat the HLL key is not modified — HLL only adds, never removes. Instead, you rotate the HLL key every minute: copy it, expire the old one, start fresh. The count displayed is a 1-minute rolling window.

Trade-off accepted: ±0.81% error on a viewer count is imperceptible to humans. "100,000 viewers" vs "100,810 viewers" — nobody cares. Trading exactness for O(1) memory and O(1) time is the correct engineering call.

9. Geographic Distribution

A single regional cluster introduces a problem: a viewer in Tokyo watching a US stream routes all WebSocket traffic to US data centres — adding 150ms of network latency on top of application latency. For chat, which should feel instantaneous, this is noticeable.

Multi-region architecture:

architecture
User (Tokyo) 
   Anycast DNS / GeoDNS 
   Asia-Pacific Chat Cluster (WebSocket)
     AP Redis Pub/Sub  # local fan-out < 5ms
     AP Kafka Broker   # local durability
       Kafka MirrorMaker 2.0
         US-East Kafka Broker  # cross-region replication ~100ms
         EU Kafka Broker

Messages from a Tokyo viewer reach other Tokyo viewers in under 10ms via local Redis. The cross-region Kafka replication ensures US-East and EU viewers also receive the message — with an additional ~80–120ms of network transit. For chat, this is completely acceptable and unnoticeable to users.

Consistency model: eventual consistency across regions. Chat is not a financial transaction — a 100ms ordering discrepancy between a Tokyo viewer and a New York viewer is invisible and irrelevant.


10. Capacity Estimate

MetricNumberNotes
WebSocket connections / server~10,000Limited by fd ulimit + memory (~10KB per conn)
Chat servers for 100k viewers10Plus 3–5 spare for rolling deploys
Peak messages / sec1,000Single popular stream
Redis pub/sub throughput< 5msIn-memory, no disk I/O
Kafka write throughput~500 KB/sec500 bytes avg × 1,000 msg/sec
Kafka retention (3hr)~5.4 GBPer popular stream (500B × 1000/s × 10,800s)
Redis HLL per stream12 KBRegardless of viewer count
Rate limit keys in Redis~100kOne per active user, TTL 2s
Platform-wide Kafka throughput~50 MB/sec100 popular streams simultaneously

11. Full Architecture Summary

▶ Complete System Architecture
Viewer
Load Balancer
sticky sessions / L4
Chat Server
WebSocket + rate limit
↓ publish
Redis Pub/Sub
fan-out < 5ms
Kafka
durability + replay
Moderation Service
keyword + ML
← Kafka consumer
Analytics Pipeline
Flink / Spark

Message lifecycle:

  1. Viewer sends message via WebSocket to Chat Server
  2. Chat Server checks rate limit (Redis, sliding window) — reject if exceeded
  3. Chat Server runs synchronous moderation (keyword filter) — reject if flagged
  4. Chat Server PUBLISHes to Redis stream:{id}:chat channel (live fan-out)
  5. Chat Server produces to Kafka live-chat topic (durability + async moderation)
  6. All Chat Servers subscribed to Redis channel receive the message and push via WebSocket
  7. Kafka consumers (moderation service, analytics) process the message asynchronously
  8. If async ML moderation flags the message, a “retract” event is published to Redis + Kafka

12. Interview Checklist

When presenting this in an interview, hit these beats:

✅ Functional requirements confirmed: message delivery < 2s, 100k concurrent viewers, chat history for replay, rate limiting and moderation
✅ Non-functional confirmed: high availability, horizontal scalability, durability (3hr retention)
✅ Estimation done: 10 chat servers for 100k viewers, 500KB/s Kafka, 5.4GB retention per stream
✅ Protocol chosen and justified: WebSocket over polling — permanent connection, server-push
✅ Fan-out solved: Redis Pub/Sub for cross-server broadcast
✅ Durability solved: Kafka for persistence, replay, and pipeline consumers
✅ Rate limiting designed: sliding window in Redis, slow mode, subscriber-only
✅ Viewer count solved: HyperLogLog — O(1) memory, O(1) time, ±0.81% accuracy
✅ Geographic distribution mentioned: GeoDNS, regional clusters, Kafka MirrorMaker

Common follow-up questions:

  • “What happens when a chat server crashes?” — Clients reconnect (with exponential backoff). Kafka consumers rebalance within the consumer group. No messages are lost because they are already in Kafka. The reconnecting server subscribes to Redis and resumes.
  • “How do you handle a celebrity joining a stream, suddenly jumping from 1k to 500k viewers?” — Horizontal auto-scaling of chat servers behind the load balancer. Redis Pub/Sub handles the fan-out channel automatically; new servers just subscribe. The load balancer redistributes connections across the new fleet.
  • “How do you display the last 100 messages to a viewer joining late?” — Chat servers query Kafka by seeking to (latest offset - 100) for the stream’s partition, fetch the messages, and send them to the client before joining the live Redis channel.

Next in this series: Designing Real-Time Collaborative Editing — Google Docs at Scale.