System Design: Content Moderation Pipeline — Keeping Platforms Safe at Scale

Series: System Design: Web Scenarios

Facebook’s content moderation workforce is largely contracted: roughly 15,000 contractors worldwide. These reviewers are exposed to the worst content daily. The psychological toll is severe; multiple lawsuits have been filed by traumatized moderators.

Design the content moderation system for a social media platform with 1 billion posts per day. The system must detect: spam, hate speech, nudity, violence, misinformation, and copyright violations. Balance speed (don’t make users wait), accuracy (minimize false positives on legitimate content), and scale.

The question: Design a content moderation pipeline for a platform processing 1 billion posts per day. Detect spam, hate speech, NSFW content, violence, misinformation, and CSAM. Balance latency, accuracy, and scale. How do you handle false positives? What happens when ML is uncertain?

1. The Moderation Challenge

Three goals are perpetually in tension:

Goal 1: ⚡ Speed

Content should be visible immediately or within seconds. Users who post and see their content disappear into a "pending" void will churn. Latency = lost engagement.

Goal 2: 🛡 Safety

Harmful content must not reach users. CSAM, terrorist recruitment, and coordinated harassment cause real-world harm and legal liability if the platform is slow to act.

Goal 3: ⚖ Fairness

Legitimate content must not be suppressed. False positives silence users, especially marginalized communities whose speech patterns differ from the training majority.

These can’t all be maximized simultaneously. Design choices reflect platform values — and those values have consequences.

2. Scale & Numbers First

Metric	Value
Posts / day	1B
Posts / sec (avg)	~12K
Posts / sec (peak)	~50K
Human review / day	~10M
Human reviewers	~10K
Fast-path SLA	< 500ms

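Key insight: at roughly 12,000 posts per second, each millisecond of per-post ML inference adds about 12 GPU-seconds of work per wall-clock second (1 ms × 12,000 posts), which is 12 GPUs kept continuously busy before any batching. The architecture must be ruthlessly efficient.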

3. Content Types and Their Pipelines

Different content types require fundamentally different detection approaches:

Content Type	Detection Method	Latency	Categories
Text posts	BERT toxicity classifier, keyword blocklist, n-gram spam detector	50–100ms	Hate speech, spam, threats, misinformation triggers
Images	CNN NSFW classifier, PhotoDNA hash lookup, object detection	80–200ms	Nudity, CSAM, violence, graphic gore
Videos	Frame-sampled image analysis + audio transcription → text pipeline	500ms–5s	All image categories + audio-based hate speech
URLs	Domain reputation DB, phishing ML, SSRF-safe crawler for content	10–50ms	Phishing, malware, misinformation domains, copyright

Interview trap: Many candidates describe a single "content moderation model." The real answer is a portfolio of specialized detectors — each tuned for its modality — running in parallel, whose outputs are combined by a decision engine.

4. Level 1 — Rule-Based (Fast, Dumb)

The first line of defense: pure keyword/hash matching.

python

class RuleBasedFilter:
    def __init__(self):
        # Exact keyword blocklist — compiled to a trie for O(n) scan
        self.blocklist = TrieSet(load_blocklist())
        # Known-bad URL hashes (MD5 of normalized domain)
        self.url_hashes = BloomFilter(load_bad_domains())

    def check(self, post):
        # Fast path: exact-match keyword in text
        for token in post.tokenize():
            if token in self.blocklist:
                return Decision(action='REMOVE', reason='blocklist_match', score=1.0)

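        # Fast path: URL domain in known-bad bloom filter
        # (a Bloom filter can return false positives; a production system would
        # confirm a hit against the exact domain list before removing)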
        for url in post.extract_urls():
            if domain_hash(url) in self.url_hashes:
                return Decision(action='REMOVE', reason='bad_url', score=1.0)

        return Decision(action='PASS', score=0.0)

Properties:

Fast: O(n) text scan with a trie, < 1ms per post
Deterministic: same input always same output — easy to audit
Brittle: naive substring matching flags “gun” inside “begun” (a false positive), while “g.u.n” bypasses exact matching entirely
Use only as first-pass pre-filter. Never as the sole line of defense.
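
The two brittleness cases above can be reduced with normalization plus whole-token matching. The following is a minimal sketch, not the article's implementation: `BLOCKLIST`, the `LEET` substitution map, and the `SPACED` pattern are illustrative assumptions, and a real system would need far broader coverage.

```python
import re
import unicodedata

# Hypothetical one-word blocklist, for illustration only.
BLOCKLIST = {"gun"}

# Assumed, deliberately tiny map of common character substitutions.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

# Runs of single letters separated by punctuation or spaces, e.g. "g.u.n" or "g u n".
SPACED = re.compile(r"\b(?:[a-z][\W_]+){2,}[a-z]\b")


def normalize(text: str) -> str:
    """Lowercase, strip accents, undo simple substitutions, rejoin spaced-out letters."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().translate(LEET)
    return SPACED.sub(lambda m: re.sub(r"[\W_]+", "", m.group(0)), text)


def blocked_tokens(text: str) -> set[str]:
    """Return blocklisted words that appear as whole tokens after normalization."""
    tokens = re.findall(r"[a-z]+", normalize(text))
    return BLOCKLIST.intersection(tokens)
```

Matching on whole tokens leaves "begun" alone, and collapsing separated letters catches "g.u.n". Determined evaders will still find gaps, which is why this tier stays a pre-filter.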

5. Level 2 — ML Classifiers

Trained models for each content category, running in parallel:

python

import asyncio

async def run_ml_classifiers(post):
    # All classifiers run concurrently, so total latency = max(individual latencies)
    names = ['toxicity', 'spam', 'nsfw', 'url', 'csam_hash']
    results = await asyncio.gather(
        text_toxicity_score(post.text),       # BERT: 50–80ms
        spam_score(post),                    # Gradient boosted trees: 5ms
        image_nsfw_score(post.image),         # ResNet: 80–150ms
        url_reputation_score(post.urls),      # Lookup table: 5ms
        photo_dna_hash_check(post.image),     # Hash lookup: <1ms
        return_exceptions=True
    )
    # return_exceptions=True places exception objects in the results list;
    # map failures to None so decide() can handle them explicitly
    return {
        name: None if isinstance(result, BaseException) else result
        for name, result in zip(names, results)
    }

def decide(scores):
    # CSAM: zero tolerance, hash match = immediate removal
    if scores['csam_hash']:
        return 'REMOVE', 1.0

    graded = [scores[k] for k in ('toxicity', 'spam', 'nsfw', 'url')]
    available = [s for s in graded if s is not None]
    max_score = max(available, default=0.0)

    # Any high-confidence signal = auto-remove
    if max_score > 0.85:
        return 'REMOVE', max_score

    # Uncertain, or a classifier failed (assumption: fail closed): human review
    failed = len(available) < len(graded) or scores['csam_hash'] is None
    if max_score > 0.40 or failed:
        return 'REVIEW', max_score

    # Below threshold: publish
    return 'PUBLISH', max_score

Classifier properties:

Model	Architecture	Latency	Accuracy
Text toxicity	BERT-base fine-tuned	50–80ms (GPU)	~94% F1
Image NSFW	ResNet-50 / EfficientNet	80–150ms (GPU)	~97% F1
Spam detector	Gradient boosted trees (XGBoost)	3–8ms (CPU)	~99% F1
URL reputation	Hash lookup + ML on domain features	5–20ms	~98% F1
PhotoDNA CSAM	Perceptual hash matching	<1ms	Near-zero false positives

6. The Moderation Pipeline Architecture


The Two Paths

Fast Path (synchronous, < 500ms)

1. Post submitted → Kafka content-submitted
2. Rule-based pre-filter (<1ms)
3. ML classifiers in parallel (50–200ms)
4. Decision engine applies thresholds
5. Content published / held / auto-removed

Slow Path (asynchronous, seconds to minutes)

6. All content queued for deeper analysis
7. Larger/slower models (cross-modal, LLM-based)
8. Human review for uncertain cases
9. Retroactive removal if slow path catches something
10. Reviewer decisions feed back to retrain models

Key insight: The fast path optimistically publishes content. The slow path can retroactively remove it. This means a post might be live for seconds to minutes before removal, and that tradeoff is deliberate: most harmful content does not go viral within that window.
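
The interaction between the two paths can be sketched as a small state machine. This is an illustrative sketch under assumptions: the `Post` class, the status names, and the reuse of the 0.40 / 0.85 thresholds for both paths are hypothetical choices, not something the pipeline prescribes.

```python
from dataclasses import dataclass, field

REMOVE_AT = 0.85   # thresholds reused from the decision engine in the text
REVIEW_AT = 0.40


@dataclass
class Post:
    post_id: str
    status: str = "PENDING"
    history: list = field(default_factory=list)

    def set_status(self, status: str) -> None:
        self.status = status
        self.history.append(status)


def fast_path(post: Post, fast_score: float) -> str:
    """Synchronous decision: publish optimistically unless the score is high."""
    if fast_score > REMOVE_AT:
        post.set_status("REMOVED")
    elif fast_score > REVIEW_AT:
        post.set_status("HELD")        # waits for human review
    else:
        post.set_status("PUBLISHED")
    return post.status


def slow_path(post: Post, deep_score: float) -> str:
    """Asynchronous re-check: retroactively remove content that is already live."""
    if post.status == "PUBLISHED" and deep_score > REMOVE_AT:
        post.set_status("REMOVED")
    return post.status                 # held posts stay held for reviewers
```

The `history` list makes the exposure window explicit: a post that goes `PUBLISHED` and then `REMOVED` was visible for however long the slow path took.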

7. Human Review Queue

PhotoDNA was developed by Microsoft in collaboration with Hany Farid (Dartmouth) in 2009 and donated to NCMEC. It’s now used by Facebook, Google, Twitter, and 200+ platforms. The NCMEC database contains 3M+ known CSAM hashes. Meta reported 27M CSAM pieces in 2022, the vast majority detected automatically.

When ML confidence falls in the uncertain range (score 0.40–0.85), content goes to human reviewers. The queue is prioritized: viral content first (to limit spread), borderline cases first within the same virality tier.


How the queue is structured:

Priority ordering: viral posts (high share count) first — a post with 10,000 shares in review causes more harm per minute than a zero-share post
Reviewer specialization: some reviewers handle hate speech, others CSAM, others misinformation — domain expertise matters
Appeals path: removed users can appeal; a second reviewer re-evaluates cold (without seeing the first decision)
Feedback loop: every approve/remove decision is a labeled training example — the queue is the data flywheel
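
The priority ordering can be sketched with a binary heap. This is a minimal sketch under assumptions: the share-count tiers (one tier per order of magnitude) and the definition of "borderline" as distance from the midpoint of the 0.40–0.85 band are illustrative choices, and `ReviewQueue` is a hypothetical name.

```python
import heapq
import itertools

REVIEW_LOW, REVIEW_HIGH = 0.40, 0.85          # uncertain band from the text
MIDPOINT = (REVIEW_LOW + REVIEW_HIGH) / 2


class ReviewQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()     # tie-breaker: insertion order

    @staticmethod
    def virality_tier(shares: int) -> int:
        # Assumed tiering: 0-9 shares -> 0, 10-99 -> 1, 100-999 -> 2, ...
        return len(str(max(shares, 0))) - 1

    def push(self, post_id: str, shares: int, score: float) -> None:
        # heapq pops the smallest key, so negate the tier to serve viral posts first;
        # within a tier, scores closest to the band midpoint come out first.
        key = (-self.virality_tier(shares), abs(score - MIDPOINT), next(self._counter))
        heapq.heappush(self._heap, (key, post_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)
```

Tiering by order of magnitude keeps a 12,000-share post and a 10,000-share post in the same bucket, so the borderline ordering actually gets a chance to matter.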

8. PhotoDNA for CSAM

Detection of known CSAM does not rely on ML classifiers. It uses perceptual hashing (PhotoDNA):

pseudocode

// Simplified illustration of a perceptual hash that survives re-encoding.
// PhotoDNA itself is proprietary; the steps and sizes below are illustrative.
function photoDNA(image):
    greyscale  = toGreyscale(image)
    resized    = resize(greyscale, 144x144)
    // DCT-based perceptual hash (144 bytes)
    hash       = dctHash(resized)
    return hash

// Matching: Hamming distance, not exact equality
function isMatch(hash, ncmecDatabase):
    for known_hash in ncmecDatabase:
        if hammingDistance(hash, known_hash) < THRESHOLD:
            return true   // match even if resized / re-compressed
    return false

Why hash-based, not ML-based?

ML has false positives. PhotoDNA match = auto-remove with no human review, no exceptions. A false positive on CSAM detection means an innocent person’s content is deleted and possibly reported to authorities — unacceptable.
Perceptual hashing survives re-encoding, resizing, and color shifts. ML models are easier to evade.
The NCMEC database has 3M+ hashes. The linear scan in the pseudocode is for illustration; with a locality-sensitive hashing index, lookup stays close to constant time.
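
To show how near-duplicate lookup can avoid scanning every known hash, here is a minimal sketch of a banded index in the spirit of locality-sensitive hashing. It is not PhotoDNA: the 64-bit integer hashes, the four 16-bit bands, and the `HashIndex` class are assumptions made for illustration.

```python
from collections import defaultdict

HASH_BITS = 64
BANDS = 4                          # four bands of 16 bits each
BAND_BITS = HASH_BITS // BANDS
THRESHOLD = 4                      # match when fewer than 4 bits differ


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class HashIndex:
    """Buckets each hash by its bands so lookups only compare likely candidates."""

    def __init__(self):
        self._buckets = defaultdict(set)

    def _band_keys(self, h: int):
        mask = (1 << BAND_BITS) - 1
        return [(i, (h >> (i * BAND_BITS)) & mask) for i in range(BANDS)]

    def add(self, h: int) -> None:
        for key in self._band_keys(h):
            self._buckets[key].add(h)

    def is_match(self, h: int) -> bool:
        candidates = set()
        for key in self._band_keys(h):
            candidates |= self._buckets.get(key, set())
        # Exact Hamming check only on the candidates that share a band.
        return any(hamming(h, c) < THRESHOLD for c in candidates)
```

With four bands and a threshold of fewer than four differing bits, the pigeonhole principle guarantees that any true match shares at least one identical band, so the index never misses a hash the linear scan would have found.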

Legal requirement: In the US, 18 U.S.C. § 2258A requires electronic service providers to report apparent CSAM to NCMEC once they become aware of it. The statute does not itself require proactive scanning, but reporting what is found is not optional.

9. Account-Level Signals

Individual post analysis misses coordinated behavior. The other detection layer is account-level:

python

class AccountSignals:
    def scrutiny_multiplier(self, account) -> float:
        multiplier = 1.0

        # New accounts: higher scrutiny
        age_hours = account.age_hours()
        if age_hours < 24:
            multiplier *= 2.5

        # Velocity check: posting rate anomaly
        posts_per_min = account.recent_post_rate()
        if posts_per_min > 10:
            multiplier *= 3.0

        # IP reputation: VPN / known bot ASN
        if is_proxy_ip(account.last_ip):
            multiplier *= 1.8

        # Coordinated behavior: same content from many accounts
        if account.in_coordinated_cluster():
            multiplier *= 4.0

        return multiplier

    def adjusted_score(self, base_score, account) -> float:
        # Multiply ML score by scrutiny multiplier — may push borderline to auto-remove
        return min(1.0, base_score * self.scrutiny_multiplier(account))

Spam ring detection: Graph analysis finds clusters of accounts that post identical or near-identical content at coordinated times. One flagged account surfaces the ring; the whole cluster gets elevated scrutiny.
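
A simplified stand-in for the graph analysis is to group posts by normalized content and look for many distinct accounts inside a short time window. This sketch assumes exact matching after normalization (real systems would use near-duplicate hashing), and `find_rings`, `window_seconds`, and `min_accounts` are hypothetical names and defaults.

```python
import hashlib
import re
from collections import defaultdict


def content_key(text: str) -> str:
    """Hash of the text after lowercasing and collapsing punctuation/whitespace."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()


def find_rings(posts, window_seconds=300, min_accounts=3):
    """posts: iterable of (account_id, timestamp_seconds, text) tuples.

    Returns a list of account-id sets that posted the same content
    within one sliding time window.
    """
    by_content = defaultdict(list)
    for account_id, ts, text in posts:
        by_content[content_key(text)].append((ts, account_id))

    rings = []
    for entries in by_content.values():
        entries.sort()
        start, best = 0, set()
        for end in range(len(entries)):
            # Shrink the window until it spans at most window_seconds.
            while entries[end][0] - entries[start][0] > window_seconds:
                start += 1
            accounts = {acc for _, acc in entries[start:end + 1]}
            if len(accounts) > len(best):
                best = accounts
        if len(best) >= min_accounts:
            rings.append(best)
    return rings
```

Every account in a returned set could then have `in_coordinated_cluster()` report true, which feeds the 4.0 multiplier in the scrutiny code.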

The ML moderation false positive problem is asymmetric: a false positive (removing legitimate content) is visible and generates complaints; a false negative (missing harmful content) often goes unnoticed. This asymmetry drives under-moderation: platforms optimize for what gets them bad press.

10. Cross-Platform Hash Sharing (GIFCT)

Once harmful content is identified on one platform, its hash can be shared with member platforms via the GIFCT (Global Internet Forum to Counter Terrorism) hash database, which covers terrorist and violent extremist content:

python

def on_confirmed_removal(content, reason):
    # GIFCT's shared database covers terrorist and violent extremist content;
    # CSAM goes through NCMEC reporting instead
    if reason in ['terrorism', 'violent_extremism']:
        # Compute perceptual hash
        p_hash = compute_perceptual_hash(content)

        # Add to our own blocklist immediately
        local_blocklist.add(p_hash)

        # Submit to GIFCT shared database (gifct_api is a hypothetical client)
        gifct_api.submit_hash(
            hash=p_hash,
            category=reason,
            platform='our_platform'
        )

        # Member platforms can now match re-uploads against the shared hash
        # even if re-encoded, resized, or slightly modified

Effect: A terrorist recruitment video removed from YouTube can be matched and blocked on Facebook, Twitter, and 20+ other platforms within minutes, before it can be re-uploaded and gain traction.

11. Capacity Estimate

Metric	Number	Notes
Posts / day	1,000,000,000	Given requirement
Posts / sec (average)	~12,000	1B / 86,400s
Posts / sec (peak)	~50,000	~4x average for peak hours
ML inference / sec	~50,000	Text + image in parallel per post
GPU servers (ML)	~500	Each handles ~100 inferences/sec
Posts routed to human review / day	~10M	~1% of all posts (0.4–0.85 range)
Human reviewers	~10,000	Each reviews ~1,000 items/day
PhotoDNA lookups / sec	~12,000	LSH index lookup, <1ms each
Kafka throughput	~180 GB/hr	~1KB per post × 50K/sec = ~50 MB/s at peak
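
The rows of this table follow from a handful of inputs, and recomputing them is a quick sanity check. The sketch below takes the table's stated assumptions as parameters; the function and parameter names are hypothetical, and one inference per post at peak is assumed.

```python
import math


def capacity_estimate(
    posts_per_day: int = 1_000_000_000,
    peak_posts_per_sec: int = 50_000,
    inferences_per_gpu_server: int = 100,
    review_percent: int = 1,
    items_per_reviewer_per_day: int = 1_000,
    bytes_per_post: int = 1_000,
) -> dict:
    """Derive the capacity figures from the stated per-unit assumptions."""
    avg_posts_per_sec = posts_per_day / 86_400
    review_items_per_day = posts_per_day * review_percent // 100
    peak_bytes_per_sec = peak_posts_per_sec * bytes_per_post
    return {
        "avg_posts_per_sec": round(avg_posts_per_sec),
        "gpu_servers": math.ceil(peak_posts_per_sec / inferences_per_gpu_server),
        "review_items_per_day": review_items_per_day,
        "reviewers": math.ceil(review_items_per_day / items_per_reviewer_per_day),
        "kafka_mb_per_sec": peak_bytes_per_sec / 1e6,
        "kafka_gb_per_hour": peak_bytes_per_sec * 3600 / 1e9,
    }
```

Working in bytes, 1 KB × 50K posts/sec is 50 MB/s, which comes to 180 GB over an hour of sustained peak load.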

12. Thresholds and the False Positive Problem

The threshold values (0.40, 0.85) are not fixed. They’re tuned by policy, not engineering:

python

# Thresholds vary by content category and platform policy
THRESHOLDS = {
    'csam':       { 'auto_remove': 0.0,  'review': 0.0  },  # hash-based, zero tolerance
    'terrorism':  { 'auto_remove': 0.75, 'review': 0.40 },  # aggressive
    'hate_speech':{ 'auto_remove': 0.90, 'review': 0.50 },  # careful — high FP rate
    'spam':       { 'auto_remove': 0.85, 'review': 0.60 },  # relatively safe to auto-remove
    'nsfw':       { 'auto_remove': 0.92, 'review': 0.50 },  # visual, clearer signal
    'misinformation':{ 'auto_remove': 0.95, 'review': 0.65 },  # very conservative — high FP risk
}

# Lowering auto_remove threshold → fewer false negatives, MORE false positives
# Raising auto_remove threshold → fewer false positives, MORE false negatives
# There is no neutral setting. The threshold IS the policy.
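
To show how per-category thresholds change the outcome for the same score, here is a minimal sketch of a decision function over a table like the one above. The `decide_category` and `decide_post` helpers are hypothetical; `csam` is left out because it is a hash match rather than a score, and taking the most severe action across categories is an assumed policy.

```python
# Subset of the per-category thresholds, copied from the table above.
THRESHOLDS = {
    'terrorism':      {'auto_remove': 0.75, 'review': 0.40},
    'hate_speech':    {'auto_remove': 0.90, 'review': 0.50},
    'spam':           {'auto_remove': 0.85, 'review': 0.60},
    'nsfw':           {'auto_remove': 0.92, 'review': 0.50},
    'misinformation': {'auto_remove': 0.95, 'review': 0.65},
}

SEVERITY = {'PUBLISH': 0, 'REVIEW': 1, 'REMOVE': 2}


def decide_category(category: str, score: float) -> str:
    """Apply one category's thresholds to one classifier score."""
    limits = THRESHOLDS[category]
    if score > limits['auto_remove']:
        return 'REMOVE'
    if score > limits['review']:
        return 'REVIEW'
    return 'PUBLISH'


def decide_post(scores: dict) -> tuple:
    """Return the most severe action across categories, plus the per-category actions."""
    actions = {cat: decide_category(cat, s) for cat, s in scores.items()}
    overall = max(actions.values(), key=SEVERITY.get, default='PUBLISH')
    return overall, actions
```

A score of 0.80 is an automatic removal under the terrorism policy but only a review under the misinformation policy, which is the sense in which the threshold is the policy.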

Interview signal: The best candidates recognize that threshold tuning is a values question disguised as an engineering question. "What's the right threshold?" cannot be answered without knowing platform policy on speech, legal exposure, and business priorities.

13. The Full Architecture

System components

→ Ingestion

API Gateway → Kafka content-submitted topic. Kafka buffers peak load and fans out to multiple consumer groups.

⚖ Rule Engine

Trie-based keyword blocklist + Bloom filter URL check. Runs in-process, <1ms. Immediate removals bypass ML entirely.

🤖 ML Inference Fleet

GPU cluster running specialized models. Text, image, URL classifiers in parallel. TorchServe / Triton for serving. Results aggregated by decision engine.

🛡 PhotoDNA Service

Dedicated microservice. Computes perceptual hash, checks against NCMEC database (locality-sensitive hashing). Match → auto-remove + NCMEC report.

⚖ Decision Engine

Combines all signals. Applies category-specific thresholds. Routes to: PUBLISH, REVIEW queue, or AUTO-REMOVE. Records decision + scores in audit log.

👨 Human Review System

Priority queue (viral-first). Reviewer UI with ML score explanations. Approve / Remove / Escalate. Appeals workflow. All decisions logged as training data.

🅆 Training Flywheel

Reviewer decisions → labeled dataset. Periodic model retraining. A/B testing of new model versions. Shadow mode deployment before cutover.

📋 Audit & Appeals

Immutable audit log of every moderation decision + model scores. User appeals routed to second reviewer. Regulatory compliance reporting.

14. What Interviewers Actually Want to Hear

The three-tier pipeline: Rule-based (fast/dumb) → ML classifiers (parallel, probabilistic) → human review (uncertain cases). Each tier handles what the previous couldn't. Describing only one tier is a failing answer.

The CSAM exception: Mentioning PhotoDNA, perceptual hashing, and NCMEC reporting as a separate non-ML path signals you understand the real constraints. Reporting detected CSAM to NCMEC is legally mandated, not optional.

The feedback loop: The system improves over time because reviewer decisions become training data. Without this loop, ML models drift as content evolves. A static model is a decaying model.
