System Design: Video Streaming — How YouTube Serves 1 Billion Hours Per Day

Series System Design Interview Prep — #10 of 15

YouTube processes 500 hours of video every single minute. The infrastructure to receive, transcode, and deliver all of it is one of the largest media pipelines ever built.

Video streaming is the hardest content delivery problem in existence. Files are enormous (a two-hour 4K movie exceeds 50 GB even after Blu-ray-grade compression), users are on wildly variable connections, the audience is global, and latency directly destroys the experience. YouTube serves 1 billion hours of video per day — roughly 114,000 years of footage, every 24 hours. This article builds the system from a naive file server to a planet-scale adaptive streaming platform, one level at a time.

The question: Design a video streaming platform like YouTube. Handle 500 hours of video uploaded per minute, 1 billion hours watched per day. Support fast start, smooth playback, and global reach.


1. The Problem

500 h — uploaded per minute (~8.3 hours of video every second)
1B — hours watched per day (~114,000 years of footage)
~42M — average concurrent viewers (1B viewer-hours ÷ 24 h)
~1.8 EB — daily CDN egress at a 4 Mbps average bitrate

Video is the hardest content type to serve at scale. Three fundamental constraints make it uniquely difficult:

  1. Size. A raw 1-hour 4K camera recording can exceed 100 GB. You cannot store or serve that directly.
  2. Variable bandwidth. A user on 5G gets 100 Mbps; a user on rural LTE might see 500 kbps. The same file cannot serve both.
  3. Seeking. Users skip to the middle constantly. A naive “download from byte 0” approach makes every seek a full restart.

Every architectural decision in this article is a direct response to one of these three constraints.
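These constraints are just arithmetic. A quick back-of-envelope helper (a sketch, not production code) makes the size constraint concrete:

```python
def stream_size_gb(duration_hours: float, bitrate_mbps: float) -> float:
    """Approximate size of a constant-bitrate stream, in gigabytes."""
    bits = duration_hours * 3600 * bitrate_mbps * 1_000_000
    return bits / 8 / 1_000_000_000

# One hour at the 8 Mbps 4K target is ~3.6 GB ...
print(stream_size_gb(1, 8))    # 3.6
# ... while the same hour at 0.5 Mbps (360p) is under a quarter of a gigabyte.
print(stream_size_gb(1, 0.5))  # 0.225
```

The 16× spread between those two numbers is why a single file can never serve every viewer.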


2. Level 1 — Naive: Serve the Raw File


The simplest approach: users upload a video file to a server; other users download it over HTTP. A single nginx serving files from disk.

User Upload → Origin Server → HTTP GET /video.mp4 → Viewer Downloads

Where it fails immediately:

  • A 4K video at 8 Mbps needs a sustained 8 Mbps connection to avoid buffering. A large share of internet users worldwide cannot sustain that.
  • Raw camera footage (ProRes, AVCHD) is not browser-playable. You would serve it and get zero playback.
  • Seeking to minute 45 means the browser must either have downloaded the preceding 45 minutes or issue an HTTP range request — and without the container's index (e.g. the MP4 moov atom) the player cannot even map a timestamp to a byte offset.
  • Serving ~42 million concurrent viewers at 4 Mbps means ~167 Tbps of egress. One machine has ~10 Gbps of network capacity, so you would need roughly 17,000 machines, all perfectly coordinated, serving the same files.
  • Mobile users on 360p do not need — and cannot stream — the same file as desktop 4K viewers.

This fails on every axis: codec compatibility, variable bandwidth, seeking, and scale.


3. Level 2 — Video Transcoding Pipeline


“Netflix uses VMAF (Video Multi-Method Assessment Fusion), a perceptual quality metric, to decide how each title is encoded and which quality level to stream — it’s not just bitrate, it’s perceptual quality.”

The first fix: never store or serve the raw upload. Feed every upload into a transcoding pipeline that produces multiple output variants in browser-native codecs (H.264, VP9, AV1).

Each resolution gets a different target bitrate tuned for that resolution’s pixel count:

| Quality    | Resolution  | Target bitrate | Approx. file size (1 h) |
|------------|-------------|----------------|-------------------------|
| 4K         | 3840 × 2160 | 8 Mbps         | 3.6 GB                  |
| 1080p      | 1920 × 1080 | 4 Mbps         | 1.8 GB                  |
| 720p       | 1280 × 720  | 2.5 Mbps       | 1.1 GB                  |
| 480p       | 854 × 480   | 1 Mbps         | 450 MB                  |
| 360p       | 640 × 360   | 0.5 Mbps       | 225 MB                  |
| 240p       | 426 × 240   | 0.25 Mbps      | 112 MB                  |
| Audio only | —           | 128 kbps       | 57 MB                   |


Transcoding is CPU-intensive. YouTube uses a farm of dedicated transcoding workers (horizontal scale). A 1-hour 4K video can take 10–30 minutes to transcode fully, which is why newly uploaded videos sometimes show only low quality initially.
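As a sketch of what one worker does, the fan-out can be expressed as one FFmpeg invocation per ladder rung. The flags and output paths below are illustrative assumptions, not YouTube's actual encoder settings:

```python
# Quality ladder: (name, resolution, video bitrate) per variant.
LADDER = [
    ("4k",    "3840x2160", "8000k"),
    ("1080p", "1920x1080", "4000k"),
    ("720p",  "1280x720",  "2500k"),
    ("480p",  "854x480",   "1000k"),
    ("360p",  "640x360",   "500k"),
    ("240p",  "426x240",   "250k"),
]

def ffmpeg_cmd(src, name, res, bitrate):
    """Build one FFmpeg invocation producing a single H.264 variant."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-b:v", bitrate, "-s", res,
        "-c:a", "aac", "-b:a", "128k",
        f"{name}/index.mp4",
    ]

# One raw upload fans out into six independent jobs; a worker farm runs
# each via subprocess.run(cmd) on whatever machine picks the job up.
jobs = [ffmpeg_cmd("raw_upload.mov", *variant) for variant in LADDER]
```

Because each variant is independent, the ladder parallelizes trivially across workers.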


4. Level 3 — Chunked Streaming (HLS / DASH)


Even with multiple quality levels, serving a 1.8 GB 1080p file means the viewer must wait for the entire download before seeking or even starting playback. The solution: segment the video.

Split every quality variant into small chunks (typically 2–10 seconds each). Create a manifest file that lists all segments in order. The player downloads the manifest first, then fetches segments sequentially. After the first 2–3 segments arrive, playback begins.

Apple’s HLS (HTTP Live Streaming) uses .m3u8 manifest files:

HLS Manifest (master.m3u8)
#EXTM3U
#EXT-X-VERSION:3
# Master playlist — lists quality variants
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=3840x2160
4k/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=1920x1080
1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/index.m3u8
HLS Media Playlist (1080p/index.m3u8)
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0

# Each segment is exactly 6 seconds of video
#EXTINF:6.0,
segment_000.ts
#EXTINF:6.0,
segment_001.ts
#EXTINF:6.0,
segment_002.ts
# … 598 more segments for a 1-hour video …
#EXT-X-ENDLIST

Why this solves seeking: to jump to minute 45, the player calculates which segment contains that timestamp (45 * 60 / 6 = segment_450.ts) and fetches only that segment. No need to download anything before it. The server never needs to maintain state between requests — segments are just static files.
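The timestamp-to-segment mapping is a one-line calculation; a minimal sketch, assuming the fixed 6-second segments from the manifest above:

```python
SEGMENT_DURATION_S = 6.0  # from #EXT-X-TARGETDURATION in the media playlist

def segment_for(seek_seconds: float) -> str:
    """Map a seek position to the static segment file that contains it."""
    index = int(seek_seconds // SEGMENT_DURATION_S)
    return f"segment_{index:03d}.ts"

# Jumping to minute 45 means fetching exactly one file:
print(segment_for(45 * 60))  # segment_450.ts
```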

DASH (Dynamic Adaptive Streaming over HTTP, used by YouTube and Netflix) works on the same principle but describes segments in an XML .mpd manifest instead of .m3u8, typically with fMP4 segments. Container formats and player heuristics differ, but the core concept — static segments plus a manifest — is identical.

Start latency: With 6-second segments, a viewer can start playing after downloading just 1 manifest file + 2–3 segment files (~12–18 seconds of video). Combined with CDN edge caching, this means time-to-first-frame under 2 seconds for popular content.

5. Level 4 — Adaptive Bitrate Streaming (ABR)


“The first 2–3 seconds of buffering before a video starts is almost entirely network round-trips to fetch the manifest and first segment — CDN edge nodes cut this to <200 ms.”

The master manifest lists multiple quality variants. The player monitors available bandwidth every few seconds and dynamically switches which quality tier it fetches next. On a fast connection it upgrades to 1080p. When the network degrades it drops to 360p — seamlessly, mid-playback.

The player maintains a buffer: a rolling window of pre-fetched segments ahead of the playhead. Buffer is the resilience reserve. As long as there is buffer, a temporary bandwidth drop does not cause a freeze. A good ABR algorithm tries to keep 15–30 seconds of buffer while targeting the highest quality the connection can sustain.


The algorithm that decides when to upgrade or downgrade is non-trivial. Upgrading too aggressively causes rebuffering when bandwidth drops suddenly; upgrading too conservatively leaves users on fast connections watching 480p unnecessarily. Some modern players use model predictive control, estimating future bandwidth from recent history to switch pre-emptively.
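A minimal throughput-based ABR rule, as a sketch (the 20% headroom and 5-second panic threshold are illustrative tuning choices, far simpler than a real player's controller):

```python
def pick_variant(est_bandwidth_kbps: float, buffer_s: float, ladder: list[int]) -> int:
    """Throughput-based ABR: pick the highest bitrate that fits under a
    safety margin; drop to the lowest rung when the buffer is nearly empty."""
    if buffer_s < 5:                       # panic: about to rebuffer
        return min(ladder)
    usable = est_bandwidth_kbps * 0.8      # keep 20% headroom vs. the estimate
    fitting = [b for b in ladder if b <= usable]
    return max(fitting) if fitting else min(ladder)

LADDER_KBPS = [250, 500, 1000, 2500, 4000, 8000]
print(pick_variant(6000, 20, LADDER_KBPS))  # 4000 (the 1080p tier)
print(pick_variant(6000, 3,  LADDER_KBPS))  # 250  (buffer critical, floor)
```

Note that the decision uses both signals: bandwidth picks the target, but buffer level overrides it.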


6. Level 5 — CDN Architecture


With ~42 million concurrent viewers on average, serving every request from origin servers is physically impossible. A single data centre cannot sustain 167+ Tbps of egress. The solution: Content Delivery Networks.

A CDN is a network of 200–2,000 geographically distributed Points of Presence (PoPs) — edge servers that cache content close to users. A viewer in Tokyo fetches segments from a Tokyo PoP, not from a US origin. Round-trip latency drops from ~180 ms to ~8 ms. For popular videos, cache hit rates exceed 99% — the origin never sees the request.


CDN cache strategy for video segments:

  • Popular videos (top 5%): pre-warm CDN caches proactively. Push segments to all PoPs before the video goes viral.
  • Long-tail videos (95%): cache on first access. First viewer triggers an origin fetch; subsequent viewers get the cached copy.
  • Cache TTL: typically 24h for video segments (content never changes), shorter for manifests (may update for live streams).
  • Invalidation: if a video is taken down or re-encoded, send invalidation commands to all PoPs simultaneously.
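The cache-on-first-access strategy for long-tail videos can be sketched as a toy edge cache (the TTL and invalidation behavior mirror the bullets above; the storage is just an in-memory dict):

```python
import time

class EdgeCache:
    """Toy PoP: segments are cached on first access, with a TTL."""

    def __init__(self, origin_fetch, ttl_s=24 * 3600, clock=time.time):
        self.origin_fetch = origin_fetch   # callable: one round-trip to origin
        self.ttl_s = ttl_s
        self.clock = clock
        self.store = {}                    # key -> (expires_at, payload)
        self.origin_hits = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[0] > self.clock():
            return entry[1]                # hit: served from the edge
        self.origin_hits += 1              # miss: fetch from origin and cache
        data = self.origin_fetch(key)
        self.store[key] = (self.clock() + self.ttl_s, data)
        return data

    def invalidate(self, key):
        self.store.pop(key, None)          # takedown / re-encode

edge = EdgeCache(origin_fetch=lambda key: b"segment-bytes")
edge.get("v1/1080p/segment_000.ts")        # first viewer: origin fetch
edge.get("v1/1080p/segment_000.ts")        # every later viewer: edge hit
assert edge.origin_hits == 1
```

Pre-warming for popular videos is just calling `get` (or a bulk push) on every PoP before traffic arrives.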

7. Level 6 — Upload Pipeline


“YouTube re-encodes popular videos with VP9, Google’s open-source codec, which achieves roughly 30–50% better compression than H.264 at the same perceptual quality — at this scale, that alone saves petabytes of egress daily.”

The upload path is entirely separate from the playback path. It is an asynchronous data processing pipeline, not a synchronous API call:

📱 User (browser / app) — chunked upload via HTTPS
    ↓
🔁 Upload Service — validates, assembles chunks, assigns a video ID
    ↓
🗄 Raw Storage (S3) — immutable raw upload in a private bucket
    ↓
📨 Message Queue (Kafka) — decouples upload from processing, fans out to workers
    ↓
⚙️ Transcoding Workers (×N) — parallel, stateless, auto-scaled, FFmpeg
    ↓
📦 Processed Storage (S3) — segments + manifests, public, serves as CDN origin
    ↓
🌐 CDN pre-warm + metadata DB — video becomes available, search-indexed, notifications sent
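The decoupling the queue provides can be sketched in-process: the upload call returns immediately while a worker drains jobs asynchronously. Here a `queue.Queue` stands in for the Kafka topic:

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for the Kafka topic
done = []

def upload_service(video_id, raw_path):
    """Returns as soon as the raw file is persisted and a job is enqueued."""
    jobs.put({"video_id": video_id, "raw": raw_path})
    return {"status": "processing", "video_id": video_id}

def transcoding_worker():
    while True:
        job = jobs.get()
        if job is None:                    # shutdown signal
            break
        done.append(job["video_id"])       # real worker: run FFmpeg, write segments
        jobs.task_done()

worker = threading.Thread(target=transcoding_worker)
worker.start()
resp = upload_service("abc123", "s3://raw/abc123.mov")  # returns immediately
jobs.join()                                # processing completes asynchronously
jobs.put(None)
worker.join()
assert resp["status"] == "processing" and done == ["abc123"]
```

The key property: the uploader never waits on transcoding, and workers can be scaled independently of upload traffic.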

8. Level 7 — Thumbnail Generation & Hover Preview


YouTube’s scrubbing preview (hover over the progress bar → see a frame from that timestamp) is powered by sprite sheets — a single image file containing hundreds of thumbnail frames laid out in a grid.

The math:

  • 1-hour video, 1 frame extracted every 10 seconds = 360 thumbnails
  • Each thumbnail: 160 × 90 px (keep them small — hover previews are small)
  • Grid layout: 20 columns × 18 rows = 360 cells
  • Single sprite sheet: 3200 × 1620 px = ~500 KB (JPEG compressed)

Why a sprite sheet instead of 360 separate files?

360 HTTP requests vs 1 HTTP request. The player uses CSS background-position to show the right frame:

Sprite sheet hover preview (CSS)
/* Seek to 4m 30s = frame at 270s / 10 = frame #27 */
/* Grid col = 27 % 20 = 7, row = floor(27 / 20) = 1 */
.preview-thumb {
  width:  160px;
  height: 90px;
  background-image:    url(sprite_sheet.jpg);
  background-size:     3200px 1620px;
  background-position: -1120px -90px; /* col×160, row×90 */
}

The sprite sheet generation is another Kafka consumer — a separate lightweight worker that extracts frames with FFmpeg (ffmpeg -i input.mp4 -vf fps=1/10 -s 160x90 frame_%04d.jpg) and stitches them into a grid with ImageMagick.
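The CSS offsets above generalize to a tiny function; a sketch assuming the same 160 × 90 thumbnails, 20 columns, and 10-second sampling interval:

```python
THUMB_W, THUMB_H = 160, 90   # one cell of the sprite grid
COLS = 20                    # cells per row
INTERVAL_S = 10              # one frame extracted every 10 seconds

def sprite_position(seek_s: float) -> tuple[int, int]:
    """CSS background-position offsets for the preview frame at seek_s."""
    frame = int(seek_s // INTERVAL_S)
    col, row = frame % COLS, frame // COLS
    return (-col * THUMB_W, -row * THUMB_H)

# Hover at 4m30s -> frame 27 -> column 7, row 1:
print(sprite_position(270))  # (-1120, -90)
```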


9. Level 8 — Resumable Uploads


A creator uploads a 10 GB 4K video on a spotty connection. At 70% — 7 GB transferred — the connection drops. Without resumable uploads, they restart from zero.

The Google Resumable Upload API pattern:

Resumable upload flow
// Step 1: Initiate — get a resumable upload URI
POST /upload/videos?uploadType=resumable
X-Upload-Content-Length: 10737418240  // 10 GB
X-Upload-Content-Type: video/mp4
// Response: 200 OK
Location: https://upload.example.com/upload/videos?upload_id=xa298sd

// Step 2: Upload in 10 MB chunks
PUT /upload/videos?upload_id=xa298sd
Content-Range: bytes 0-10485759/10737418240
// … chunk body … Response: 308 Resume Incomplete

// Step 3: After connection drop — query resume position
PUT /upload/videos?upload_id=xa298sd
Content-Range: bytes */10737418240
// Response: 308, Range: bytes=0-7340031999
// Server confirms: "I have bytes 0–7GB. Resume from byte 7,340,032,000."

// Step 4: Resume from confirmed position
PUT /upload/videos?upload_id=xa298sd
Content-Range: bytes 7340032000-10737418239/10737418240

The server tracks received byte ranges per upload session in a fast key-value store (Redis). Session expires after 24 hours of inactivity. No need to re-upload confirmed chunks.
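A minimal sketch of that bookkeeping, assuming chunks arrive in order and tracking only the confirmed contiguous prefix (a real server, like the protocol above, would key this by upload_id in Redis):

```python
# upload_id -> session state; a real deployment keeps this in Redis
# with a 24 h expiry instead of a process-local dict.
sessions = {}

def start_upload(upload_id, total_bytes):
    sessions[upload_id] = {"total": total_bytes, "received": 0}

def put_chunk(upload_id, start, data):
    """Returns (status, resume_from): 308 while incomplete, 200 when done."""
    s = sessions[upload_id]
    if start != s["received"]:          # stale, duplicate, or out-of-order chunk
        return (308, s["received"])     # tell the client where to resume
    s["received"] += len(data)
    status = 200 if s["received"] == s["total"] else 308
    return (status, s["received"])

start_upload("xa298sd", 30)
put_chunk("xa298sd", 0, b"0123456789")    # -> (308, 10)
put_chunk("xa298sd", 0, b"0123456789")    # retry after a drop -> (308, 10)
put_chunk("xa298sd", 10, b"0123456789")   # -> (308, 20)
put_chunk("xa298sd", 20, b"0123456789")   # -> (200, 30): complete
```

A duplicate chunk costs nothing: the server just replies with the confirmed offset, exactly like the `Content-Range: bytes */…` probe in Step 3.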

Implementation detail: Chunk size matters. 10 MB chunks are a common default — small enough that a retry only re-sends 10 MB, large enough that the overhead of one HTTP request per chunk is negligible at typical upload speeds. At very low bandwidth, use 256 KB chunks. At fiber speeds, use 32 MB.

10. Capacity Estimation

| Metric                                 | Calculation                                        | Value            |
|----------------------------------------|----------------------------------------------------|------------------|
| Upload rate                            | Given                                              | 500 h/min        |
| Raw upload storage/hr                  | 500 h/min × 60 min × ~1 GB/h avg raw               | 30 TB/hr         |
| After transcoding (6 variants + overhead) | 30 TB × ~5×                                     | 150 TB/hr        |
| New storage per day                    | 150 TB × 24 h                                      | 3.6 PB/day       |
| Concurrent viewers (average)           | 1B viewer-hours/day ÷ 24 h                         | ~42M             |
| CDN egress bandwidth                   | 42M × 4 Mbps avg                                   | ~167 Tbps        |
| CDN egress per day (data)              | 1B h × 1.8 GB/h (at 4 Mbps)                        | ~1.8 EB/day      |
| Transcoding workers needed             | 500 h/min = 30,000 video-min/min, at ~real-time per worker | ~30,000  |
| Thumbnail sprite sheets/day            | 500 h/min × 1,440 min ÷ ~30 min avg video          | ~1.4M videos/day |
| Segments per 1 h video (all variants)  | 6 qualities × 600 segments                         | 3,600 S3 objects |
The uncomfortable number: ~1.8 EB/day of CDN egress at even $0.005/GB is roughly $9M/day in bandwidth alone. YouTube's CDN is almost entirely self-operated (Google's own network), which cuts this cost dramatically. For a startup, CDN cost is the single largest variable cost in video streaming.

11. Protocol Comparison: HLS vs DASH vs WebRTC

| Feature          | HLS                  | DASH                   | WebRTC               |
|------------------|----------------------|------------------------|----------------------|
| Typical latency  | 6–30 s               | 6–30 s                 | <500 ms              |
| Adaptive bitrate | Yes                  | Yes                    | Limited              |
| Segment format   | .ts (MPEG-TS)        | .mp4 (fMP4)            | RTP packets          |
| Manifest format  | .m3u8                | .mpd (XML)             | SDP (ICE)            |
| Apple native     | Yes (Safari)         | No (needs MSE)         | Yes                  |
| CDN-friendly     | Yes (static files)   | Yes (static files)     | No (peer-to-peer)    |
| DRM support      | FairPlay             | Widevine / PlayReady   | DTLS-SRTP            |
| Use case         | VOD, live streaming  | VOD, live streaming    | Video calls, gaming  |
| Who uses it      | Apple TV+, Twitch    | YouTube, Netflix       | Meet, Zoom, Discord  |

Low-latency variants: Both HLS and DASH have low-latency extensions (LL-HLS, LL-DASH) that achieve 1–3 s latency by using partial segments. Twitch achieves roughly 3 s of live latency with a low-latency HLS variant.


12. Summary — Levels at a Glance

| Level | Problem solved                 | Technique                                        |
|-------|--------------------------------|--------------------------------------------------|
| 1     | Baseline                       | Serve raw file over HTTP                         |
| 2     | Codec compat + bandwidth tiers | Transcoding pipeline (FFmpeg, 6 variants)        |
| 3     | Buffering, seeking             | Chunked HLS/DASH segments + manifest             |
| 4     | Variable bandwidth             | Adaptive bitrate (ABR) switching                 |
| 5     | Global scale, latency          | CDN with 200+ PoPs, 99%+ cache hit               |
| 6     | Async processing               | Kafka-driven upload pipeline                     |
| 7     | Hover preview                  | Sprite-sheet thumbnails (1 file = 360 frames)    |
| 8     | Large-file resilience          | Resumable chunked uploads                        |

In an interview, walk through Levels 2–5 in order — transcoding, chunking, ABR, CDN. That covers 90% of what interviewers want. Then add the upload pipeline as a follow-up. Mention resumable uploads if asked about reliability. Always end with the capacity numbers — exabyte-scale daily egress signals that you understand the true scale.