System Design: Content Moderation Pipeline — Keeping Platforms Safe at Scale
Facebook’s content moderation workforce is largely contracted: roughly 15,000 contractors worldwide. These reviewers are exposed to the worst content daily. The psychological toll is severe; multiple lawsuits have been filed by traumatized moderators.
Design the content moderation system for a social media platform with 1 billion posts per day. The system must detect: spam, hate speech, nudity, violence, misinformation, and copyright violations. Balance speed (don’t make users wait), accuracy (minimize false positives on legitimate content), and scale.
The question: Design a content moderation pipeline for a platform processing 1 billion posts per day. Detect spam, hate speech, NSFW content, violence, misinformation, and CSAM. Balance latency, accuracy, and scale. How do you handle false positives? What happens when ML is uncertain?
1. The Moderation Challenge
Three goals are perpetually in tension: speed, accuracy, and scale.
These can’t all be maximized simultaneously. Design choices reflect platform values — and those values have consequences.
2. Scale & Numbers First
Key insight: at 12,000 posts per second, every millisecond of ML inference per post adds about 12 GPU-seconds of work per wall-clock second (1 ms × 12,000 posts/sec), which is roughly 12 GPUs kept busy full-time. The architecture must be ruthlessly efficient.
3. Content Types and Their Pipelines
Different content types require fundamentally different detection approaches:
| Content Type | Detection Method | Latency | Categories |
|---|---|---|---|
| Text posts | BERT toxicity classifier, keyword blocklist, n-gram spam detector | 50–100ms | Hate speech, spam, threats, misinformation triggers |
| Images | CNN NSFW classifier, PhotoDNA hash lookup, object detection | 80–200ms | Nudity, CSAM, violence, graphic gore |
| Videos | Frame-sampled image analysis + audio transcription → text pipeline | 500ms–5s | All image categories + audio-based hate speech |
| URLs | Domain reputation DB, phishing ML, SSRF-safe crawler for content | 10–50ms | Phishing, malware, misinformation domains, copyright |
4. Level 1 — Rule-Based (Fast, Dumb)
The first line of defense: pure keyword/hash matching.

    class RuleBasedFilter:
        def __init__(self):
            # Exact keyword blocklist, compiled to a trie for an O(n) scan
            self.blocklist = TrieSet(load_blocklist())
            # Known-bad URL hashes (MD5 of normalized domain)
            self.url_hashes = BloomFilter(load_bad_domains())

        def check(self, post):
            # Fast path: exact-match keyword in text
            for token in post.tokenize():
                if token in self.blocklist:
                    return Decision(action='REMOVE', reason='blocklist_match', score=1.0)
            # Fast path: URL domain in known-bad Bloom filter
            for url in post.extract_urls():
                if domain_hash(url) in self.url_hashes:
                    return Decision(action='REMOVE', reason='bad_url', score=1.0)
            return Decision(action='PASS', score=0.0)

Properties:
- Fast: O(n) text scan with a trie, < 1ms per post
- Deterministic: same input always same output — easy to audit
- Brittle: naive substring matching flags “gun” inside “begun” (a false positive), while “g.u.n” bypasses the blocklist entirely
- Use only as first-pass pre-filter. Never as the sole line of defense.
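The brittleness described above can be shown in a few lines. This is a minimal sketch, not the filter's actual implementation: the one-word `BLOCKLIST`, the `SEPARATORS` set, and all function names are assumptions made for illustration. It contrasts substring matching, whole-token matching, and token matching after collapsing single letters split by punctuation.

```python
import re

BLOCKLIST = {"gun"}      # hypothetical one-word blocklist, for illustration only
SEPARATORS = r"[.\-*]"   # assumed set of obfuscation characters

def substring_match(text: str) -> bool:
    """Naive substring scan: also fires on blocked words inside longer words."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)

def token_match(text: str) -> bool:
    """Whole-token match: split on non-letters and compare each token exactly."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BLOCKLIST for token in tokens)

def collapse_spaced_letters(text: str) -> str:
    """Join runs of single letters split by a separator, e.g. 'g.u.n' -> 'gun'."""
    pattern = r"\b(?:[a-z]" + SEPARATORS + r")+[a-z]\b"
    return re.sub(pattern, lambda m: re.sub(SEPARATORS, "", m.group()), text.lower())

def normalized_match(text: str) -> bool:
    """Token match applied after collapsing separator-obfuscated words."""
    return token_match(collapse_spaced_letters(text))

print(substring_match("I have begun"))      # substring scan flags "begun"
print(token_match("buy a g.u.n now"))       # plain tokens miss the obfuscation
print(normalized_match("buy a g.u.n now"))  # normalization recovers "gun"
```

Normalization only covers the obfuscations it anticipates (here three separator characters), which is one reason this layer stays a pre-filter rather than the sole defense.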
5. Level 2 — ML Classifiers
Trained models for each content category, running in parallel:

    import asyncio

    CLASSIFIER_NAMES = ['toxicity', 'spam', 'nsfw', 'url', 'csam_hash']

    async def run_ml_classifiers(post):
        # All classifiers run concurrently; total latency = max(individual latencies)
        results = await asyncio.gather(
            text_toxicity_score(post.text),     # BERT: 50–80ms
            spam_score(post),                   # Gradient boosted trees: 5ms
            image_nsfw_score(post.image),       # ResNet: 80–150ms
            url_reputation_score(post.urls),    # Lookup table: 5ms
            photo_dna_hash_check(post.image),   # Hash lookup: <1ms
            return_exceptions=True,
        )
        # With return_exceptions=True a failed classifier yields an exception object
        # instead of a score; record it as None so it is never mistaken for a result
        return {
            name: None if isinstance(result, BaseException) else result
            for name, result in zip(CLASSIFIER_NAMES, results)
        }

    def decide(scores):
        # CSAM: zero tolerance; hash match = immediate removal
        if scores['csam_hash']:
            return 'REMOVE', 1.0
        known = [scores[k] for k in ('toxicity', 'spam', 'nsfw', 'url')
                 if scores[k] is not None]
        max_score = max(known, default=0.0)
        # Any high-confidence signal = auto-remove
        if max_score > 0.85:
            return 'REMOVE', max_score
        # Uncertain, or a classifier failed (assumption: fail toward review): human queue
        if max_score > 0.40 or None in scores.values():
            return 'REVIEW', max_score
        # Below threshold: publish
        return 'PUBLISH', max_score

Classifier properties:
| Model | Architecture | Latency | Accuracy |
|---|---|---|---|
| Text toxicity | BERT-base fine-tuned | 50–80ms (GPU) | ~94% F1 |
| Image NSFW | ResNet-50 / EfficientNet | 80–150ms (GPU) | ~97% F1 |
| Spam detector | Gradient boosted trees (XGBoost) | 3–8ms (CPU) | ~99% F1 |
| URL reputation | Hash lookup + ML on domain features | 5–20ms | ~98% F1 |
| PhotoDNA CSAM | Perceptual hash matching | <1ms | Near-zero false positives |
6. The Moderation Pipeline Architecture
The Two Paths
1. Post is written to the Kafka content-submitted topic
2. Rule-based pre-filter (<1ms)
3. ML classifiers in parallel (50–200ms)
4. Decision engine applies thresholds
5. Content published / held / auto-removed
7. Larger/slower models (cross-modal, LLM-based)
8. Human review for uncertain cases
9. Retroactive removal if slow path catches something
10. Reviewer decisions feed back to retrain models
7. Human Review Queue
PhotoDNA was developed in 2009 by Microsoft in partnership with Hany Farid (Dartmouth); Microsoft later donated it to NCMEC. It’s now used by Facebook, Google, Twitter, and 200+ platforms. The NCMEC database contains 3M+ known CSAM hashes. Meta reported 27M CSAM pieces in 2022, the vast majority detected automatically.
When ML confidence falls in the uncertain range (score 0.40–0.85), content goes to human reviewers. The queue is prioritized: viral content first (to limit spread), borderline cases first within the same virality tier.
How the queue is structured:
- Priority ordering: viral posts (high share count) first — a post with 10,000 shares in review causes more harm per minute than a zero-share post
- Reviewer specialization: some reviewers handle hate speech, others CSAM, others misinformation — domain expertise matters
- Appeals path: users whose content was removed can appeal; a second reviewer re-evaluates cold (without seeing the first decision)
- Feedback loop: every approve/remove decision is a labeled training example — the queue is the data flywheel
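The priority ordering described above maps naturally onto a heap. The sketch below is one possible reading, not the platform's actual queue: `ReviewQueue`, `virality_tier`, the digit-count tiering, and the definition of "borderline" as closeness to the midpoint of the 0.40–0.85 range are all assumptions made for illustration.

```python
import heapq
import itertools

REVIEW_LOW, REVIEW_HIGH = 0.40, 0.85        # uncertain range from the text
MIDPOINT = (REVIEW_LOW + REVIEW_HIGH) / 2   # assumption: "borderline" = near the middle

def virality_tier(share_count: int) -> int:
    """Assumed tiering: one tier per decimal digit of the share count."""
    return len(str(max(share_count, 0)))

class ReviewQueue:
    """Min-heap keyed so that higher tiers, then more borderline scores, pop first."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: first in, first out

    def push(self, post_id: str, score: float, share_count: int) -> None:
        key = (-virality_tier(share_count), abs(score - MIDPOINT), next(self._counter))
        heapq.heappush(self._heap, (key, post_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)

queue = ReviewQueue()
queue.push("quiet-post", score=0.60, share_count=0)
queue.push("viral-clear", score=0.45, share_count=10_000)
queue.push("viral-borderline", score=0.62, share_count=12_000)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

Both viral posts land in the same tier, so the more borderline one is reviewed first, and the zero-share post waits until the viral tier is drained.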
8. PhotoDNA for CSAM
CSAM detection does not use ML classifiers. It uses perceptual hashing (PhotoDNA):

    // PhotoDNA: robust hash that survives re-encoding
    function photoDNA(image):
        greyscale = toGreyscale(image)
        resized = resize(greyscale, 144x144)
        // DCT-based perceptual hash (144 bytes)
        hash = dctHash(resized)
        return hash

    // Matching: Hamming distance, not exact equality
    function isMatch(hash, ncmecDatabase):
        for known_hash in ncmecDatabase:
            if hammingDistance(hash, known_hash) < THRESHOLD:
                return true  // match even if resized / re-compressed
        return false

Why hash-based, not ML-based?
- ML has false positives. PhotoDNA match = auto-remove with no human review, no exceptions. A false positive on CSAM detection means an innocent person’s content is deleted and possibly reported to authorities — unacceptable.
- Perceptual hashing survives re-encoding, resizing, and color shifts. ML models are easier to evade.
- The NCMEC database has 3M+ hashes. With locality-sensitive hashing, lookup is close to O(1) instead of the linear scan shown in the pseudocode above.
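To make the idea of "a hash that survives small changes" concrete, here is a minimal sketch using a simple average hash. This is not PhotoDNA, which is proprietary; average hashing is a swapped-in technique chosen because it fits in a few lines. The function names, the tiny 4x4 grids, and the threshold of 4 are assumptions for illustration, and input is assumed to be an already downscaled grayscale grid.

```python
from typing import Iterable, List

def average_hash(pixels: List[List[int]]) -> int:
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")

def is_match(candidate: int, known_hashes: Iterable[int], threshold: int = 4) -> bool:
    """Near-match test: distance below the threshold, not exact equality."""
    return any(hamming_distance(candidate, k) < threshold for k in known_hashes)

original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
brightened = [[p + 10 for p in row] for row in original]   # uniform brightness shift
inverted = [[255 - p for p in row] for row in original]    # a clearly different image

known = [average_hash(original)]
print(is_match(average_hash(brightened), known))
print(is_match(average_hash(inverted), known))
```

A uniform brightness shift moves every pixel and the mean together, so the bit pattern is unchanged, while the inverted image flips every bit and falls far outside the threshold.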
9. Account-Level Signals
Individual post analysis misses coordinated behavior. The other detection layer is account-level:

    class AccountSignals:
        def scrutiny_multiplier(self, account) -> float:
            multiplier = 1.0
            # New accounts: higher scrutiny
            age_hours = account.age_hours()
            if age_hours < 24:
                multiplier *= 2.5
            # Velocity check: posting rate anomaly
            posts_per_min = account.recent_post_rate()
            if posts_per_min > 10:
                multiplier *= 3.0
            # IP reputation: VPN / known bot ASN
            if is_proxy_ip(account.last_ip):
                multiplier *= 1.8
            # Coordinated behavior: same content from many accounts
            if account.in_coordinated_cluster():
                multiplier *= 4.0
            return multiplier

        def adjusted_score(self, base_score, account) -> float:
            # Multiply ML score by scrutiny multiplier; may push borderline to auto-remove
            return min(1.0, base_score * self.scrutiny_multiplier(account))

Spam ring detection: Graph analysis finds clusters of accounts that post identical or near-identical content at coordinated times. One flagged account surfaces the ring; the whole cluster gets elevated scrutiny.
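A much-simplified version of that clustering step can be sketched without a graph library. The code below is an illustration under stated assumptions, not the platform's detector: it only catches content that is identical after case and whitespace normalization, it uses fixed time buckets (so posts straddling a bucket boundary are missed), and the names `fingerprint` and `find_coordinated_clusters` plus the default window and cluster size are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_coordinated_clusters(
    posts: Iterable[Tuple[str, int, str]],
    window_seconds: int = 300,
    min_accounts: int = 3,
) -> List[Set[str]]:
    """posts holds (account_id, unix_timestamp, text) tuples.

    Returns sets of accounts that posted the same normalized text
    inside the same fixed time bucket.
    """
    buckets = defaultdict(set)
    for account_id, timestamp, text in posts:
        buckets[(fingerprint(text), timestamp // window_seconds)].add(account_id)
    return [accounts for accounts in buckets.values() if len(accounts) >= min_accounts]

sample_posts = [
    ("a", 10, "Win a FREE phone"),
    ("b", 20, "win a  free phone"),
    ("c", 30, " WIN A FREE PHONE "),
    ("d", 40, "hello world"),
]
clusters = find_coordinated_clusters(sample_posts)
print(clusters)
```

Near-identical (rather than identical) content would need a similarity-preserving fingerprint in place of SHA-256; that substitution is left out here to keep the sketch short.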
10. Cross-Platform Hash Sharing (GIFCT)
The ML moderation false positive problem is asymmetric: a false positive (removing legitimate content) is visible and generates complaints; a false negative (missing harmful content) often goes unnoticed. This asymmetry drives under-moderation: platforms optimize for what gets them bad press.
Once harmful content is identified on one platform, the hash can be shared across all member platforms via the GIFCT (Global Internet Forum to Counter Terrorism) hash database:

    def on_confirmed_removal(content, reason):
        if reason in ['terrorism', 'csam', 'violent_extremism']:
            # Compute perceptual hash
            p_hash = compute_perceptual_hash(content)
            # Add to our own blocklist immediately
            local_blocklist.add(p_hash)
            # GIFCT's shared database covers terrorist and violent extremist
            # content; CSAM hashes are shared through NCMEC instead
            if reason != 'csam':
                # Submit to GIFCT shared database
                gifct_api.submit_hash(
                    hash=p_hash,
                    category=reason,
                    platform='our_platform'
                )
                # All member platforms now block re-uploads automatically,
                # even if re-encoded, resized, or slightly modified

Effect: A terrorist recruitment video removed from YouTube is blocked on Facebook, Twitter, and 20+ other platforms within minutes — before it can be re-uploaded and gain traction.
11. Capacity Estimate
| Metric | Number | Notes |
|---|---|---|
| Posts / day | 1,000,000,000 | Given requirement |
| Posts / sec (average) | ~12,000 | 1B / 86,400s |
| Posts / sec (peak) | ~50,000 | ~4x average for peak hours |
| ML inference / sec | ~50,000 | Text + image in parallel per post |
| GPU servers (ML) | ~500 | Each handles ~100 inferences/sec |
| Posts routed to human review / day | ~10M | ~1% of all posts (0.4–0.85 range) |
| Human reviewers | ~10,000 | Each reviews ~1,000 items/day |
| PhotoDNA lookups / sec | ~12,000 | Bloom filter, <1ms each |
| Kafka throughput | ~180 GB/hr at peak | ~1KB per post × 50K/sec = ~50 MB/s at peak (~42 GB/hr at the 12K/sec average) |
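The estimates in the table can be re-derived with a few lines of arithmetic. This sketch only restates the table's inputs (post volume, peak rate, per-server throughput, review fraction, reviewer productivity, and ~1KB per post); the variable names are made up for illustration. One thing it makes visible: 1KB × 50K posts/sec works out to about 180 GB/hr at peak, while the average rate gives roughly 42 GB/hr.

```python
POSTS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_POSTS_PER_SEC = 50_000               # table value, about 4x the average
INFERENCES_PER_GPU_SERVER_PER_SEC = 100   # table value
REVIEW_PERCENT = 1                        # share of posts in the uncertain range
ITEMS_PER_REVIEWER_PER_DAY = 1_000
BYTES_PER_POST = 1_000                    # ~1KB

avg_posts_per_sec = POSTS_PER_DAY // SECONDS_PER_DAY
gpu_servers = PEAK_POSTS_PER_SEC // INFERENCES_PER_GPU_SERVER_PER_SEC
review_items_per_day = POSTS_PER_DAY * REVIEW_PERCENT // 100
reviewers = review_items_per_day // ITEMS_PER_REVIEWER_PER_DAY

# Kafka ingest: bytes/sec converted to GB/hr (decimal gigabytes)
peak_gb_per_hour = PEAK_POSTS_PER_SEC * BYTES_PER_POST * 3_600 / 1e9
avg_gb_per_hour = avg_posts_per_sec * BYTES_PER_POST * 3_600 / 1e9

print(f"average posts/sec:    {avg_posts_per_sec:,}")
print(f"GPU servers at peak:  {gpu_servers:,}")
print(f"review items per day: {review_items_per_day:,}")
print(f"human reviewers:      {reviewers:,}")
print(f"Kafka GB/hr (peak):   {peak_gb_per_hour:.0f}")
print(f"Kafka GB/hr (avg):    {avg_gb_per_hour:.1f}")
```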
12. Thresholds and the False Positive Problem
The threshold values (0.40, 0.85) are not fixed. They’re tuned by policy, not engineering:

    # Thresholds vary by content category and platform policy
    THRESHOLDS = {
        'csam':           {'auto_remove': 0.0,  'review': 0.0},   # hash-based, zero tolerance
        'terrorism':      {'auto_remove': 0.75, 'review': 0.40},  # aggressive
        'hate_speech':    {'auto_remove': 0.90, 'review': 0.50},  # careful: high FP rate
        'spam':           {'auto_remove': 0.85, 'review': 0.60},  # relatively safe to auto-remove
        'nsfw':           {'auto_remove': 0.92, 'review': 0.50},  # visual, clearer signal
        'misinformation': {'auto_remove': 0.95, 'review': 0.65},  # very conservative: high FP risk
    }
    # Lowering auto_remove threshold → fewer false negatives, MORE false positives
    # Raising auto_remove threshold → fewer false positives, MORE false negatives
    # There is no neutral setting. The threshold IS the policy.

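One way such a table might be applied is sketched below. The threshold values are copied from a subset of the categories above; `decide_category`, `decide_post`, and the "most severe action wins" rule are assumptions made for illustration. Comparisons are strict (`>`), which matters for the CSAM row: with both thresholds at 0.0, a non-match score of 0.0 publishes and any match removes.

```python
from typing import Dict

# Subset of the per-category thresholds, copied from the table above
THRESHOLDS = {
    "csam": {"auto_remove": 0.0, "review": 0.0},
    "terrorism": {"auto_remove": 0.75, "review": 0.40},
    "hate_speech": {"auto_remove": 0.90, "review": 0.50},
    "spam": {"auto_remove": 0.85, "review": 0.60},
}

SEVERITY = {"PUBLISH": 0, "REVIEW": 1, "REMOVE": 2}

def decide_category(category: str, score: float) -> str:
    """Map one category score to an action using that category's thresholds."""
    limits = THRESHOLDS[category]
    if score > limits["auto_remove"]:
        return "REMOVE"
    if score > limits["review"]:
        return "REVIEW"
    return "PUBLISH"

def decide_post(scores: Dict[str, float]) -> str:
    """Assumed combination rule: the most severe per-category action wins."""
    actions = [decide_category(category, score) for category, score in scores.items()]
    return max(actions, key=SEVERITY.__getitem__, default="PUBLISH")

# The same score leads to different actions depending on the category's policy
print(decide_category("terrorism", 0.88))
print(decide_category("hate_speech", 0.88))
print(decide_post({"spam": 0.10, "hate_speech": 0.60}))
```

The first two calls use an identical score of 0.88 yet produce different actions, which is the sense in which the threshold is the policy.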
13. The Full Architecture
Posts are written to the Kafka content-submitted topic. Kafka buffers peak load and fans out to multiple consumer groups.