System Design: The Like Button — Counting at Billions of Clicks per Second

Series System Design: Web Scenarios

Facebook’s “Reactions”
(Like, Love, Haha, Wow,
Sad, Angry) are
architecturally the same
problem — just 6 counters
per post instead of 1.
The 2016 launch added
~6x write load to their
like infrastructure
overnight.

Design the “Like” button for YouTube. At peak, a viral video receives 500,000 likes per minute. Likes must be accurate, fast, and eventually consistent. Users should see a near-real-time count. Likes must be idempotent — clicking twice must not double-count.

This is one of the most common system design interview questions. It looks trivial. It is not.


1. Scale & Constraints

800M
YouTube DAU
8,333
Likes/sec (viral video)
~2M
Platform likes/sec peak
1000:1
Read : Write ratio
<100ms
Like latency target
~1s
Acceptable count lag

The read:write asymmetry is critical. For every person clicking Like, roughly 1,000 users are just viewing the count. This means the display path must be ultra-cheap (cache-heavy), while the write path can tolerate slightly more latency and can be eventually consistent.

The constraints immediately rule out naive relational approaches. Let’s walk through each level.


2. Level 1 — Naive SQL

The first instinct is to increment a counter in the database:

sql
-- Like a video
UPDATE videos
SET    like_count = like_count + 1
WHERE  id = 'video_abc';

-- Idempotency via unique constraint
INSERT INTO user_likes (user_id, video_id)
VALUES ('user_123', 'video_abc')
ON DUPLICATE KEY UPDATE user_id = user_id;

Why it breaks at scale:

The UPDATE videos SET like_count = like_count + 1 statement acquires a row-level lock on that single row for the duration of the transaction. At 8,333 writes/sec all hitting the same video_abc row, you get:

  • Lock contention: writes queue up, latency climbs from milliseconds to seconds
  • Connection pool exhaustion: threads holding locks block new connections
  • Thundering herd: a cache expiry causes all readers to hit DB simultaneously
Benchmark reality check: A single MySQL row can handle roughly 5,000–10,000 simple updates/sec under ideal conditions. But "simple" means no contention. When 8,333 concurrent writers target the same row, effective throughput collapses to a few hundred writes/sec — the rest queue or timeout. A single viral video breaks the database.

The idempotency table (user_likes) also creates a secondary write on every like, doubling DB load. And reads at 1000× the write rate hit the same DB unless you add read replicas — which don’t help write throughput at all.

Verdict: Works for a small site. Fails at YouTube scale on a single hot video.


3. Level 2 — Write-Through Redis Counter

Replace the DB write with an atomic Redis operation:

redis-cli
# Atomic increment — O(1), no locks needed
INCR video:likes:abc123
# Returns: (integer) 500001

# Idempotency: only INCR if user hasn't liked yet
SET  user:liked:user123:abc123  1  NX  EX 86400
# NX = only set if Not eXists → returns OK or nil
# EX 86400 = expire after 24h (memory management)

# Read the count (served from Redis — 100k ops/sec)
GET  video:likes:abc123

Redis INCR is atomic because Redis is single-threaded for command execution — no locks, no contention. A single Redis node handles ~100,000 simple operations/second, which comfortably handles 8,333 writes/sec for one video.

The Redis INCR command is
atomic because Redis is
single-threaded for command
execution. No locks needed
— it’s one of the reasons
Redis counters are so popular
for this exact use case.

Problems with Level 2:

  • Redis is in-memory: if the node crashes, like counts reset to zero
  • The idempotency keys (user:liked:…) use significant memory across millions of users and videos
  • We still haven’t addressed persistence to a database

Interactive Demo: Redis Like Counter

▶ Live Like Counter — Click the heart. Click again to unlike. Try rapid-clicking.

4. Level 3 — Write-Behind with Batch Flush

The solution to the durability problem: keep Redis as the live counter, but asynchronously flush deltas to the database.

Redis
Live counter
Background Job
Every 30s
MySQL/Postgres
Source of truth
INCR video:likes:{id} — instant, in-memory
UPDATE videos SET likes = likes + delta
python — background flush job
import redis, mysql.connector, time

r = redis.Redis()
db = mysql.connector.connect(...)

def flush_like_counts():
    # GETDEL atomically reads and removes the delta key
    cursor = r.scan_iter("video:likes:delta:*")
    for key in cursor:
        delta = r.getdel(key)       # atomic read + delete
        if delta:
            video_id = key.decode().split(':')[-1]
            db.cursor().execute(
                "UPDATE videos SET likes = likes + %s WHERE id = %s",
                (int(delta), video_id)
            )
    db.commit()

while True:
    flush_like_counts()
    time.sleep(30)     # flush every 30 seconds

Two Redis keys per video:

redis-cli
# Live display counter (absolute, loaded from DB + live delta)
GET   video:likes:abc123         # → "500,423" (shown to user)

# Delta buffer (how many likes since last DB flush)
INCR  video:likes:delta:abc123   # → incremented atomically

# On flush: read delta, write to DB, delete delta key
GETDEL video:likes:delta:abc123  # → "8333" (30s of likes)
Write-behind properties: Reads are still ultra-fast (Redis GET). Writes hit Redis only (in-memory). DB writes are batched — instead of 8,333 writes/sec, the DB sees 1 write per 30 seconds per video. If Redis crashes, we lose at most 30 seconds of likes — usually acceptable.

What to say in an interview: “The trade-off is a 30-second window of data loss on Redis crash. We mitigate this with Redis persistence (AOF/RDB snapshots) and Redis Sentinel for HA. For a like count, losing 30 seconds of likes on a node failure is an acceptable trade-off versus the DB being the hot write path.”


5. Level 4 — Event Streaming with Kafka

For true scale, analytics, and full decoupling: publish every like/unlike as an event.

json — like event schema
{
  "userId":    "user_a1b2c3",
  "videoId":   "video_abc123",
  "action":    "like",           // "like" | "unlike"
  "timestamp": 1748736000000,    // Unix ms
  "region":    "us-east-1",
  "sessionId": "sess_xyz789"
}
python — kafka producer
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka-1:9092', 'kafka-2:9092'],
    value_serializer=lambda v: json.dumps(v).encode()
)

def publish_like_event(user_id, video_id, action):
    producer.send(
        topic='video-likes',
        key=video_id.encode(),     # partition by videoId
        value={
            'userId':    user_id,
            'videoId':   video_id,
            'action':    action,
            'timestamp': time_ms()
        }
    )

Stream processor (Flink/Spark Streaming):

pseudocode — 1-second tumbling window
// For each 1-second window of events:
events
  .filter(e => e.action == "like")
  .keyBy(e => e.videoId)
  .window(TumblingEventTimeWindows.of(1, SECONDS))
  .aggregate(count)
  .sink(redisSink)             // INCRBY video:likes:X delta

// Every 60 seconds: snapshot Redis → MySQL
// (same flush pattern as Level 3)

Why Kafka unlocks more:

  • Replay: if the aggregator has a bug, replay all events to recompute counts
  • Analytics: who liked what, from where, at what time — fan out to a data warehouse
  • Multiple consumers: the like feed can power recommendations, notifications, trending algorithms — all from the same event stream
  • Backpressure handling: Kafka buffers spikes; the aggregator processes at its own pace

Interactive: Event Stream Visualizer

▶ Event Stream Visualizer — Watch like events flow through the pipeline
UserKafkaAggregatorRedisDisplay
Events/sec
0
Aggregated Count
500,000
Window Batches
0

6. Level 5 — Idempotency at Scale

The hardest constraint: one user = one like, even across distributed nodes. Compare the options:

Option Mechanism Pros Cons Verdict
A — Redis SET NX SET user:liked:{uid}:{vid} 1 NX Fast, atomic, no DB touch Memory: grows O(users × videos); eviction loses data OK for hot videos
B — DB unique constraint UNIQUE(user_id, video_id) in likes table Perfectly accurate; no memory issue DB write on every like; hot table at scale Best for correctness
C — Bloom filter Per-video probabilistic set membership Sub-MB memory per video; ultra-fast False positives → rare legitimate likes dropped; no undo Not for unlikes
D — UserID partition Shard by userId; each shard checks locally Distributed; each shard is independent Cross-shard queries needed for analytics; shard rebalancing Best at extreme scale

Recommended hybrid for an interview:

Recommended approach: Use Redis SET NX as the fast path (in-memory idempotency for recent likes). Back it with a DB unique constraint as the authoritative check. Redis handles the hot path; the DB enforces correctness. If Redis evicts a key (memory pressure), the DB constraint catches the duplicate.
python — hybrid idempotency
def like_video(user_id, video_id):
    key = 'user:liked:' + user_id + ':' + video_id

    # Fast path: Redis NX check (in-memory)
    if not r.set(key, 1, nx=True, ex=86400):
        return "already_liked"   # idempotent — no-op

    # Increment the display counter
    r.incr('video:likes:' + video_id)
    r.incr('video:likes:delta:' + video_id)

    # Async: write to DB (background job or queue)
    # DB has UNIQUE(user_id, video_id) as safety net
    queue.enqueue('persist_like', user_id, video_id)

    return "liked"

7. Level 6 — Sharding & Geographic Distribution

A single Redis node handles ~100k ops/sec. Platform-wide, we need ~2M ops/sec. The solution: shard Redis by videoId.

python — consistent hashing shard selection
import hashlib

REDIS_SHARDS = [
    'redis-shard-0:6379',
    'redis-shard-1:6379',
    'redis-shard-2:6379',
    'redis-shard-3:6379',
]

def get_shard(video_id):
    h = int(hashlib.md5(video_id.encode()).hexdigest(), 16)
    return REDIS_SHARDS[h % len(REDIS_SHARDS)]

# video_abc → shard-2, video_xyz → shard-0
# Each shard owns ~25% of videos; ~500k ops/sec each

Global distribution:

For a truly global video (viral in both Tokyo and New York simultaneously), regional Redis clusters reduce latency and distribute load:

Client Browser
CDN / Edge
Regional API
Regional Redis
Async Region Sync
Global Redis / DB
Client Browser: The user clicks Like. The browser sends a POST /api/like to the nearest regional API endpoint. The like count shown is fetched from Redis on page load via GET /api/video/:id/likes.
CDN / Edge: Like counts (reads) are served from CDN edge nodes with a short TTL (5–10 seconds). This absorbs the massive read:write asymmetry. Writes (likes) bypass CDN and go directly to the regional API.
Regional API: Handles authentication, rate limiting, idempotency checks. Publishes the like event to the nearest Kafka cluster, increments the regional Redis counter, and returns immediately. P99 latency target: <50ms.
Regional Redis: Holds the like count for users in this region. Counts are initialized from the global DB and incremented locally. A background job periodically ships deltas to global storage. Users see slightly different counts across regions — acceptable for ~1 second.
Async Region Sync: Every 1–5 seconds, regional Redis clusters push their deltas to a global aggregator. This keeps all regions within ~5 seconds of each other. The sync uses Kafka as the transport to ensure no delta is lost.
Global Redis / DB: The authoritative count. Regional counts converge here. MySQL holds the permanent record for video metadata + like counts. Redis Cluster (global) serves the canonical count for cold-start and cross-region reads.

Eventual consistency in practice: A user in Tokyo and a user in NYC may see like counts that differ by a few thousand for ~1 second. This is acceptable — like counts are inherently approximate displays, not financial ledgers. YouTube itself shows rounded counts (“1.2M likes”) for popular videos, which further masks small transient differences.

YouTube doesn’t show exact
like counts anymore for
videos under ~10k likes
(they show approximations).
This reduces the psychological
“one more click matters”
effect and — conveniently
— reduces the idempotency
enforcement cost.


8. The Unlike Problem & CRDTs

Eventual consistency gets non-trivial when users change their minds:

Scenario: User A likes a video at T=0. The like propagates to all 3 regional clusters by T=1. At T=2, User A unlikes. The unlike event starts propagating. At T=3, Region B has received the unlike but Region C hasn't. Region C still shows the count as +1. What is the correct state?

The naive G-Counter (grow-only counter) cannot model this — it has no decrement. You need a PN-Counter (Positive-Negative Counter):

P (Likes)
3
N (Unlikes)
1
=
Net Count
2
redis-cli — PN-Counter pattern
# Instead of one counter, maintain two
INCR video:likes:p:abc123     # positive counter (likes)
INCR video:likes:n:abc123     # negative counter (unlikes)

# Display count = P - N (always >= 0)
GET video:likes:p:abc123      # → 500,423
GET video:likes:n:abc123      # → 50,001
# net = 500,423 - 50,001 = 450,422

# Merging regions: take MAX of each regional P and N counter
# P_global = max(P_us, P_eu, P_asia)
# N_global = max(N_us, N_eu, N_asia)

Why MAX for merging? Each region only increments its own counter and never decrements it. If Region A has seen 3 likes and Region B has seen 5 likes for the same user actions, the global truth is 5 (Region B has more complete information). Taking MAX of monotonically-increasing G-Counters gives the correct CRDT merge.


9. Capacity Estimate

ComponentNumbersNotes
Viral video peak writes8,333 / sec500k likes/min ÷ 60
Platform-wide like events~2M / sec peak800M DAU, avg 150 likes/day each
Kafka throughput needed~10 MB/sec2M events × ~50 bytes/event
Redis memory per video~80 bytesP counter + N counter + delta + metadata
Top 1M videos in Redis~80 MBTrivial; Redis can hold billions of small keys
Idempotency keys (Redis NX)~50 bytes eachFor 10M active likers × top 10k videos = 500 GB — use TTL or DB fallback
DB write rate (after flush)1 write / 30s / videovs 8,333/s without batching
Like table in DB~500 bytes / like rowuserId(8) + videoId(8) + timestamp(8) + indexes + overhead
Annual like storage~150 TB / year~300B likes/year × 500 bytes

10. Full Architecture — Clickable Pipeline

Click each stage to see implementation details:

📱
Client
🌐
CDN
🚪
API Gateway

Like Service
↓         ↓

Redis Cluster
&
📨
Kafka Topic
                      ↓
🔄
Stream Processor
📊
Redis Display
&
🗄
DB Snapshot
Client: Browser or mobile app. Sends POST /v1/likes with JWT auth. Optimistically updates the UI (increments counter immediately) without waiting for server confirmation. If the server returns an error, the UI rolls back. This is the "optimistic update" pattern used by YouTube, Twitter, etc.
CDN (Read Path Only): Like counts (read-only) are served from CDN with a 5–10 second TTL. At 1000:1 read:write, this means 999 out of 1000 requests never reach the origin. Writes (likes/unlikes) bypass CDN and go directly to the API gateway with cache-busting headers.
API Gateway: Rate limiting (max 10 likes/minute per user per video to prevent abuse), authentication (JWT validation), request routing to the Like Service. Also handles DDoS protection and bot detection. A viral video announcement can trigger coordinated like-bot attacks.
Like Service: Stateless microservice (horizontally scalable). Responsibilities: (1) Idempotency check via Redis NX, (2) increment Redis counter, (3) publish event to Kafka, (4) return response. Target: P99 latency under 50ms. Deployed in 3+ regions with auto-scaling triggered at 70% CPU.
Redis Cluster: Sharded by videoId using consistent hashing. 4–8 shards, each handling ~500k ops/sec. Redis Sentinel for HA with automatic failover. AOF persistence enabled (fsync every second) to limit data loss to 1 second on crash. Master + 2 replicas per shard.
Kafka Topic (video-likes): Partitioned by videoId (ensures all events for a video go to the same partition, preserving ordering). Replication factor 3. Retention: 7 days (allows replay for analytics or bug fixes). At 2M events/sec × 50 bytes = ~10 MB/sec throughput — well within a 3-broker Kafka cluster's capacity.
Redis Display Layer: A separate Redis cluster optimized for reads. Stores the display count (what the user sees). Updated by the stream processor every 1 second. CDN pulls from here. This separation means a write-path Redis failure doesn't affect the read path — users keep seeing the last known count.
DB Snapshot (MySQL): Receives periodic batch writes from the flush job (every 30–60 seconds). Holds the permanent record: like counts per video, and the full likes table (userId, videoId, timestamp) for analytics and user-facing "your liked videos" history. Sharded by videoId for write scale.

11. Interview Cheat Sheet

When asked “Design the Like button” in an interview, structure your answer around these escalation levels:

LevelApproachMax ThroughputKey Trade-off
1 SQL UPDATE ... SET likes = likes + 1 ~500 writes/sec (hot row) Simple, correct, doesn't scale
2 Redis INCR + write-through ~100k writes/sec Fast; data loss on crash
3 Redis INCR + write-behind (30s flush) ~100k writes/sec Durable; 30s loss window
4 Kafka events + stream aggregation ~2M writes/sec Fully decoupled; operationally complex
5 Sharded Redis + geo-distribution + PN-Counters Theoretically unlimited Eventually consistent; ~1s lag
The key insight interviewers look for: The like button is a write-heavy, read-heavier problem. The answer is not "use a faster database" — it's decouple reads from writes (Redis as read cache), batch writes (flush job), and use eventual consistency where strict consistency isn't needed (like counts, not bank balances).

Summary

The “Like” button is a masterclass in the gap between appearances and complexity. A single UPDATE statement works for your side project. At YouTube scale, it requires:

  1. Redis atomic counters for in-memory, lock-free increment/decrement
  2. Write-behind batching to protect the database from hot-row contention
  3. Kafka event streaming for durability, analytics, and decoupling
  4. Hybrid idempotency (Redis NX fast path + DB unique constraint fallback)
  5. PN-Counters for correct CRDT semantics when merging regional like/unlike data
  6. CDN-cached read path to absorb the 1000:1 read:write asymmetry

Every design decision is a trade-off: memory vs. durability, consistency vs. latency, simplicity vs. scale. The right answer depends on where on that curve your system needs to be.

“Facebook’s ‘Reactions’
(Like, Love, Haha, Wow,
Sad, Angry) are
architecturally the same
problem — just 6 counters
per post instead of 1.
The 2016 launch added
~6x write load to their
like infrastructure
overnight.”