System Design: Video Streaming — How YouTube Serves 1 Billion Hours Per Day
YouTube processes 500 hours of video every single minute. The infrastructure to receive, transcode, and deliver all of it is one of the largest media pipelines ever built.
Video streaming is the hardest content delivery problem in existence. Files are enormous (a high-bitrate 4K movie master can exceed 50 GB; truly uncompressed footage runs to terabytes per hour), users are on wildly variable connections, the audience is global, and latency directly destroys the experience. YouTube serves 1 billion hours of video per day — roughly 114,000 years of footage, every 24 hours. This article builds the system from a naive file server to a planet-scale adaptive streaming platform, one level at a time.
The question: Design a video streaming platform like YouTube. Handle 500 hours of video uploaded per minute, 1 billion hours watched per day. Support fast start, smooth playback, and global reach.
1. The Problem
Video is the hardest content type to serve at scale. Three fundamental constraints make it uniquely difficult:
- Size. A raw 1-hour 4K camera recording can exceed 100 GB. You cannot store or serve that directly.
- Variable bandwidth. A user on 5G gets 100 Mbps; a user on rural LTE might see 500 kbps. The same file cannot serve both.
- Seeking. Users skip to the middle constantly. A naive “download from byte 0” approach makes every seek a full restart.
Every architectural decision in this article is a direct response to one of these three constraints.
2. Level 1 — Naive: Serve the Raw File
The simplest approach: users upload a video file to a server; other users download it over HTTP. A single nginx instance serving files from disk.
Where it fails immediately:
- A 4K video at 8 Mbps needs an 8 Mbps connection sustained to avoid buffering. Half of global internet users cannot guarantee this.
- Raw camera footage (ProRes, AVCHD) is not browser-playable. You would serve it and get zero playback.
- Seeking to minute 45 means the browser must have already downloaded 45 minutes of video, or the server must support HTTP range requests and the file must carry its seek index (the MP4 moov atom) up front — details a naive setup rarely gets right.
- Serving the whole audience from one place means roughly 42 million average concurrent viewers at an effective ~1.1 Mbps each — about 46 Tbps of egress. One machine has ~10 Gbps of NIC capacity. You need ~4,600 machines, all perfectly coordinated, serving the same files.
- Mobile users on 360p do not need — and cannot stream — the same file as desktop 4K viewers.
This fails on every axis: codec compatibility, variable bandwidth, seeking, and scale.
3. Level 2 — Video Transcoding Pipeline
“Netflix uses VMAF (Video Multi-Method Assessment Fusion), a perceptual quality metric, to tune its encoding ladders — the decision is not just bitrate, it’s perceptual quality.”
The first fix: never store or serve the raw upload. Feed every upload into a transcoding pipeline that produces multiple output variants in browser-native codecs (H.264, VP9, AV1).
Each resolution gets a different target bitrate tuned for that resolution’s pixel count:
| Quality | Resolution | Target Bitrate | Approx. file size (1h) |
|---|---|---|---|
| 4K | 3840 × 2160 | 8 Mbps | 3.6 GB |
| 1080p | 1920 × 1080 | 4 Mbps | 1.8 GB |
| 720p | 1280 × 720 | 2.5 Mbps | 1.1 GB |
| 480p | 854 × 480 | 1 Mbps | 450 MB |
| 360p | 640 × 360 | 0.5 Mbps | 225 MB |
| 240p | 426 × 240 | 0.25 Mbps | 112 MB |
| Audio only | — | 128 kbps | 57 MB |
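The file-size column follows directly from bitrate × duration. A quick sketch (ladder values taken from the table above; 1 GB = 10⁹ bytes):

```python
# Approximate 1-hour file size for each rung of the bitrate ladder.
# size_bytes = bitrate_bits_per_second * duration_seconds / 8
LADDER_MBPS = {"4K": 8.0, "1080p": 4.0, "720p": 2.5,
               "480p": 1.0, "360p": 0.5, "240p": 0.25}

def file_size_gb(bitrate_mbps: float, duration_s: int = 3600) -> float:
    """Bitrate (Mbps) x duration (s) -> size in GB."""
    return bitrate_mbps * 1e6 * duration_s / 8 / 1e9

for name, mbps in LADDER_MBPS.items():
    print(f"{name:>5}: {file_size_gb(mbps):.2f} GB/hour")
```

Running this reproduces the table: 8 Mbps × 3,600 s ÷ 8 = 3.6 GB for the 4K rung, and so on down the ladder.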
Transcoding is CPU-intensive. YouTube uses a farm of dedicated transcoding workers (horizontal scale). A 1-hour 4K video can take 10–30 minutes to transcode fully, which is why newly uploaded videos sometimes show only low quality initially.
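A minimal worker sketch of that fan-out, assuming ffmpeg with libx264 is available on the host (the flags are standard ffmpeg options; the chunk-level parallelism a real farm uses is omitted):

```python
import subprocess

# Ladder rungs to produce: (name, output height, target video bitrate)
VARIANTS = [("1080p", 1080, "4M"), ("720p", 720, "2500k"), ("480p", 480, "1M")]

def ffmpeg_cmd(src: str, out: str, height: int, vbr: str) -> list[str]:
    """Build one ffmpeg invocation for a single ladder rung."""
    return ["ffmpeg", "-y", "-i", src,
            "-vf", f"scale=-2:{height}",       # keep aspect, force even width
            "-c:v", "libx264", "-b:v", vbr,
            "-c:a", "aac", "-b:a", "128k",
            out]

def transcode(src: str) -> list[str]:
    """Run every rung serially; real farms fan chunks out across workers."""
    outputs = []
    for name, height, vbr in VARIANTS:
        out = f"{src.rsplit('.', 1)[0]}_{name}.mp4"
        subprocess.run(ffmpeg_cmd(src, out, height, vbr), check=True)
        outputs.append(out)
    return outputs
```

Serial transcoding is why a 1-hour upload can take tens of minutes; production pipelines split the source into chunks and transcode them in parallel across the worker farm.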
4. Level 3 — Chunked Streaming (HLS / DASH)
Even with multiple quality levels, serving a 1.8 GB 1080p file means the viewer must wait for the entire download before seeking or even starting playback. The solution: segment the video.
Split every quality variant into small chunks (typically 2–10 seconds each). Create a manifest file that lists all segments in order. The player downloads the manifest first, then fetches segments sequentially. After the first 2–3 segments arrive, playback begins.
Apple’s HLS (HTTP Live Streaming) uses .m3u8 manifest files:
```
#EXTM3U
#EXT-X-VERSION:3

# Master playlist — lists quality variants
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=3840x2160
4k/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4000000,RESOLUTION=1920x1080
1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/index.m3u8
```
```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0

# Each segment is exactly 6 seconds of video
#EXTINF:6.0,
segment_000.ts
#EXTINF:6.0,
segment_001.ts
#EXTINF:6.0,
segment_002.ts
# … 598 more segments for a 1-hour video …
#EXT-X-ENDLIST
```
Why this solves seeking: to jump to minute 45, the player calculates which segment contains that timestamp (45 * 60 / 6 = segment_450.ts) and fetches only that segment. No need to download anything before it. The server never needs to maintain state between requests — segments are just static files.
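The seek arithmetic is trivial to sketch (an illustrative helper, not a real player API):

```python
def segment_for(timestamp_s: float, segment_len_s: float = 6.0) -> str:
    """Map a playback timestamp to the HLS segment file containing it."""
    index = int(timestamp_s // segment_len_s)
    return f"segment_{index:03d}.ts"

# Jumping to minute 45 fetches only this file:
print(segment_for(45 * 60))   # segment_450.ts
```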
DASH (Dynamic Adaptive Streaming over HTTP, used by YouTube and Netflix) works identically but uses XML .mpd manifest files instead of .m3u8. The segment format and ABR algorithm differ, but the core concept is the same.
5. Level 4 — Adaptive Bitrate Streaming (ABR)
“The first 2–3 seconds of buffering before a video starts is almost entirely network round-trips to fetch the manifest and first segment — CDN edge nodes cut this to <200 ms.”
The master manifest lists multiple quality variants. The player monitors available bandwidth every few seconds and dynamically switches which quality tier it fetches next. On a fast connection it upgrades to 1080p. When the network degrades it drops to 360p — seamlessly, mid-playback.
The player maintains a buffer: a rolling window of pre-fetched segments ahead of the playhead. Buffer is the resilience reserve. As long as there is buffer, a temporary bandwidth drop does not cause a freeze. A good ABR algorithm tries to keep 15–30 seconds of buffer while targeting the highest quality the connection can sustain.
The algorithm that decides when to upgrade or downgrade is non-trivial. Too aggressive upgrading causes rebuffering when bandwidth drops suddenly. Too conservative means users on fast connections watch 480p unnecessarily. Modern players use model predictive control — estimating future bandwidth from recent history to pre-emptively switch.
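A toy decision rule capturing that trade-off: pick the highest rung whose bitrate fits under a safety fraction of measured throughput, and fall to the floor when the buffer nears empty. The thresholds here are illustrative assumptions, not values from any production player:

```python
LADDER_KBPS = [250, 500, 1000, 2500, 4000, 8000]   # 240p … 4K

def choose_bitrate(throughput_kbps: float, buffer_s: float) -> int:
    """Throughput-based pick with a buffer-based safety override."""
    if buffer_s < 5:                            # nearly stalled: take the floor
        return LADDER_KBPS[0]
    safety = 0.7 if buffer_s < 15 else 0.9      # thin buffer -> more headroom
    affordable = [b for b in LADDER_KBPS if b <= throughput_kbps * safety]
    return affordable[-1] if affordable else LADDER_KBPS[0]

print(choose_bitrate(6000, 25))   # healthy buffer, fast link -> 4000
print(choose_bitrate(6000, 3))    # buffer nearly empty -> 250
```

Real players replace the fixed `safety` factor with a bandwidth forecast, which is exactly where the model-predictive approaches mentioned above come in.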
6. Level 5 — CDN Architecture
With tens of millions of concurrent viewers, sending every request to origin servers is physically impossible: no single data centre can sustain ~46 Tbps of egress. The solution: Content Delivery Networks.
A CDN is a network of 200–2,000 geographically distributed Points of Presence (PoPs) — edge servers that cache content close to users. A viewer in Tokyo fetches segments from a Tokyo PoP, not from a US origin. Round-trip latency drops from ~180 ms to ~8 ms. For popular videos, cache hit rates exceed 99% — the origin never sees the request.
CDN cache strategy for video segments:
- Popular videos (top 5%): pre-warm CDN caches proactively. Push segments to all PoPs before the video goes viral.
- Long-tail videos (95%): cache on first access. First viewer triggers an origin fetch; subsequent viewers get the cached copy.
- Cache TTL: typically 24h for video segments (content never changes), shorter for manifests (may update for live streams).
- Invalidation: if a video is taken down or re-encoded, send invalidation commands to all PoPs simultaneously.
7. Level 6 — Upload Pipeline
“YouTube re-encodes every uploaded video with VP9 (their open-source codec) which achieves 50% better compression than H.264 at the same quality — that alone saves petabytes daily.”
The upload path is entirely separate from the playback path. It is an asynchronous data processing pipeline, not a synchronous API call: the raw file lands in object storage, an upload event is published to Kafka, and downstream consumers (transcoding, sprite sheets, CDN pre-warm) each react to it independently.
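One plausible shape for that pipeline, sketched with an in-memory queue standing in for the Kafka topic. The stage names and URI layout are assumptions for illustration, not YouTube's actual design:

```python
from queue import Queue

uploads = Queue()   # stand-in for a Kafka topic, e.g. "video.uploaded"

def on_upload(video_id: str, raw_uri: str) -> None:
    """API layer: persist raw bytes to object storage, then emit an event."""
    uploads.put({"video_id": video_id, "raw_uri": raw_uri})

def transcode_worker() -> dict:
    """Consumer: pull one event, fan out to the quality variants."""
    event = uploads.get()
    variants = {q: f"{event['video_id']}/{q}/index.m3u8"
                for q in ("1080p", "720p", "480p")}
    # …other consumers on the same topic handle thumbnails, sprite
    # sheets, and CDN pre-warming, each at their own pace…
    return variants

on_upload("abc123", "s3://raw-uploads/abc123.mp4")
print(transcode_worker())
```

Because every stage is decoupled by the queue, a slow transcode never blocks the upload response, and each consumer group scales independently.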
8. Level 7 — Thumbnail Generation & Hover Preview
YouTube’s scrubbing preview (hover over the progress bar → see a frame from that timestamp) is powered by sprite sheets — a single image file containing hundreds of thumbnail frames laid out in a grid.
The math:
- 1-hour video, 1 frame extracted every 10 seconds = 360 thumbnails
- Each thumbnail: 160 × 90 px (keep them small — hover previews are small)
- Grid layout: 20 columns × 18 rows = 360 cells
- Single sprite sheet: 3200 × 1620 px = ~500 KB (JPEG compressed)
Why a sprite sheet instead of 360 separate files?
360 HTTP requests vs 1 HTTP request. The player uses CSS background-position to show the right frame:
```css
/* Seek to 4m 30s = frame at 270s / 10 = frame #27    */
/* Grid: col = 27 % 20 = 7, row = floor(27 / 20) = 1  */
.preview-thumb {
  width: 160px;
  height: 90px;
  background-image: url(sprite_sheet.jpg);
  background-size: 3200px 1620px;
  background-position: -1120px -90px;  /* -(col × 160)px -(row × 90)px */
}
```
The sprite sheet generation is another Kafka consumer — a separate lightweight worker that extracts frames with FFmpeg (`ffmpeg -i input.mp4 -vf fps=0.1 -s 160x90 frame_%04d.jpg`, one frame every 10 seconds) and stitches them into a grid with ImageMagick.
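The timestamp-to-offset mapping the CSS above hard-codes can be computed generically (grid dimensions match the sprite sheet described in this section):

```python
COLS, THUMB_W, THUMB_H, INTERVAL_S = 20, 160, 90, 10

def sprite_offset(timestamp_s: float) -> tuple[int, int]:
    """Timestamp -> negative (x, y) background-position in the sprite grid."""
    frame = int(timestamp_s // INTERVAL_S)
    col, row = frame % COLS, frame // COLS
    return -col * THUMB_W, -row * THUMB_H

print(sprite_offset(270))   # 4m30s -> frame 27 -> (-1120, -90)
```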
9. Level 8 — Resumable Uploads
A creator uploads a 10 GB 4K video on a spotty connection. At 70% — 7 GB transferred — the connection drops. Without resumable uploads, they restart from zero.
The Google Resumable Upload API pattern:
```
// Step 1: Initiate — get a resumable upload URI
POST /upload/videos?uploadType=resumable
X-Upload-Content-Length: 10737418240        // 10 GB
X-Upload-Content-Type: video/mp4

// Response:
200 OK
Location: https://upload.example.com/upload/videos?upload_id=xa298sd

// Step 2: Upload in 10 MB chunks
PUT /upload/videos?upload_id=xa298sd
Content-Range: bytes 0-10485759/10737418240
// … chunk body …

// Response:
308 Resume Incomplete

// Step 3: After a connection drop — query the resume position
PUT /upload/videos?upload_id=xa298sd
Content-Range: bytes */10737418240

// Response:
308 Resume Incomplete
Range: bytes=0-7340031999
// Server confirms: "I have bytes 0–7 GB. Resume from byte 7,340,032,000."

// Step 4: Resume from the confirmed position
PUT /upload/videos?upload_id=xa298sd
Content-Range: bytes 7340032000-10737418239/10737418240
```
The server tracks received byte ranges per upload session in a fast key-value store (Redis). Session expires after 24 hours of inactivity. No need to re-upload confirmed chunks.
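Server-side bookkeeping in miniature. A dict stands in for Redis, and only a single contiguous prefix is tracked, which matches the protocol above (the server acknowledges one confirmed range from byte 0):

```python
# upload_id -> number of contiguous bytes confirmed from offset 0
_sessions: dict[str, int] = {}

def start_upload(upload_id: str, total_bytes: int) -> None:
    """Step 1: create the session (TTL/expiry omitted in this sketch)."""
    _sessions[upload_id] = 0

def receive_chunk(upload_id: str, start: int, data: bytes) -> int:
    """Accept a chunk only if it extends the confirmed prefix.
    Returns the next byte offset the client should send from."""
    confirmed = _sessions[upload_id]
    if start <= confirmed:                       # no gap: extend the prefix
        _sessions[upload_id] = max(confirmed, start + len(data))
    return _sessions[upload_id]                  # client resumes from here

start_upload("xa298sd", 10_737_418_240)
receive_chunk("xa298sd", 0, b"x" * 1024)
print(receive_chunk("xa298sd", 2048, b"x"))   # gap detected -> still 1024
```

The return value is exactly what the `Range:` header in step 3 communicates: the client queries it after a drop and resumes from that offset.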
10. Capacity Estimation
| Metric | Calculation | Value |
|---|---|---|
| Upload rate | Given | 500 h/min |
| Raw upload storage/hr | 500 h/min × 60 min × ~1 GB/h avg upload | 30 TB/hr |
| After transcoding (6 variants + overhead) | 30 TB × ~5× | 150 TB/hr |
| New storage per day | 150 TB × 24h | 3.6 PB/day |
| Concurrent viewers (daily avg) | 1B h/day ÷ 24 h | ~42M |
| CDN egress bandwidth | 42M streams × ~1.1 Mbps effective avg (quality mix) | ~46 Tbps |
| CDN egress per day (data) | 46 Tbps × 86,400 s ÷ 8 | ~500 PB/day |
| Transcoding workers needed | 500 h/min ÷ ~0.05 h/worker/min (1 h of 4K ≈ 20 min to transcode) | ~10,000 workers |
| Sprite sheets generated/day | 500 h/min × 1,440 min/day ÷ 0.5 h avg video | ~1.4M/day |
| Segments per 1h video (all variants) | 6 qualities × 600 segments | 3,600 S3 objects |
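The headline numbers are easy to sanity-check in a few lines, under the same assumptions as the table (1B watch-hours/day, ~1.1 Mbps effective average bitrate):

```python
HOURS_PER_DAY_WATCHED = 1e9
AVG_KBPS = 1100                      # effective average across the quality mix

concurrent = HOURS_PER_DAY_WATCHED / 24            # average concurrent streams
egress_tbps = concurrent * AVG_KBPS * 1e3 / 1e12   # aggregate bits per second
egress_pb_day = egress_tbps * 1e12 * 86_400 / 8 / 1e15

print(f"concurrent ≈ {concurrent / 1e6:.1f}M")     # ≈ 41.7M
print(f"egress ≈ {egress_tbps:.0f} Tbps")          # ≈ 46 Tbps
print(f"per day ≈ {egress_pb_day:.0f} PB")         # ≈ 495 PB
```

Note the cross-check: ~46 Tbps sustained for 86,400 seconds is ~500 PB, so the two egress rows agree with each other.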
11. Protocol Comparison: HLS vs DASH vs WebRTC
| Feature | HLS | DASH | WebRTC |
|---|---|---|---|
| Typical latency | 6–30 s | 6–30 s | <500 ms |
| Adaptive Bitrate | Yes | Yes | Limited |
| Segment format | .ts (MPEG-TS) | .mp4 (fMP4) | RTP packets |
| Manifest format | .m3u8 | .mpd (XML) | SDP (via signaling) |
| Apple native | Yes (Safari) | No (need MSE) | Yes |
| CDN-friendly | Yes (static files) | Yes (static files) | No (peer-to-peer) |
| DRM support | FairPlay | Widevine / PlayReady | DTLS-SRTP |
| Use case | VOD · Live streaming | VOD · Live streaming | Video calls · gaming |
| Who uses it | Apple TV+, Twitch | YouTube, Netflix | Meet, Zoom, Discord |
Low-latency variants: Both HLS and DASH have low-latency extensions (LL-HLS, LL-DASH) that achieve 1–3 s latency by using partial segments. Twitch uses LL-HLS for ~3 s live latency.
12. Summary — Levels at a Glance
| Level | Problem solved | Technique |
|---|---|---|
| 1 | Baseline | Serve raw file over HTTP |
| 2 | Codec compat + bandwidth tiers | Transcoding pipeline (FFmpeg, 6 variants) |
| 3 | Buffering, seeking | Chunked HLS/DASH segments + manifest |
| 4 | Variable bandwidth | Adaptive Bitrate (ABR) switching |
| 5 | Global scale, latency | CDN with 200+ PoPs, 99%+ cache hit |
| 6 | Async processing | Kafka-driven upload pipeline |
| 7 | Hover preview | Sprite sheet thumbnails (1 file = 360 frames) |
| 8 | Large file resilience | Resumable chunked uploads |
In an interview, walk through Levels 2–5 in order — transcoding, chunking, ABR, CDN. That covers 90% of what interviewers want. Then add the upload pipeline as a follow-up. Mention resumable uploads if asked about reliability. Always end with the capacity numbers — 500 PB/day egress signals that you understand the true scale.