System Design: Content Moderation Pipeline — Keeping Platforms Safe at Scale

Series System Design: Web Scenarios

Facebook’s content moderation workforce is largely contracted: ~15,000 contractors worldwide. These reviewers are exposed to the worst content daily. The psychological toll is severe; multiple lawsuits have been filed by traumatized moderators.

Design the content moderation system for a social media platform with 1 billion posts per day. The system must detect: spam, hate speech, nudity, violence, misinformation, and copyright violations. Balance speed (don’t make users wait), accuracy (minimize false positives on legitimate content), and scale.

The question: Design a content moderation pipeline for a platform processing 1 billion posts per day. Detect spam, hate speech, NSFW content, violence, misinformation, and CSAM. Balance latency, accuracy, and scale. How do you handle false positives? What happens when ML is uncertain?


1. The Moderation Challenge

Three goals are perpetually in tension:

  • Goal 1: ⚡ Speed. Content should be visible immediately or within seconds. Users who post and see their content disappear into a "pending" void will churn. Latency = lost engagement.
  • Goal 2: 🛡 Safety. Harmful content must not reach users. CSAM, terrorist recruitment, and coordinated harassment cause real-world harm and legal liability if the platform is slow to act.
  • Goal 3: ⚖ Fairness. Legitimate content must not be suppressed. False positives silence users, especially marginalized communities whose speech patterns differ from the training majority.

These can’t all be maximized simultaneously. Design choices reflect platform values — and those values have consequences.


2. Scale & Numbers First

  • Posts / day: 1B
  • Posts / sec (avg): ~12K
  • Posts / sec (peak): ~50K
  • Human review / day: ~10M
  • Human reviewers: ~10K
  • Fast-path SLA: < 500ms

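Key insight: at 12,000 posts per second, every millisecond of per-post ML inference costs 12 GPU-seconds for each second of traffic (1 ms × 12,000 posts), which is roughly 12 GPUs kept permanently busy before any batching gains. The architecture must be ruthlessly efficient.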


3. Content Types and Their Pipelines

Different content types require fundamentally different detection approaches:

Content Type | Detection Method | Latency | Categories
Text posts | BERT toxicity classifier, keyword blocklist, n-gram spam detector | 50–100ms | Hate speech, spam, threats, misinformation triggers
Images | CNN NSFW classifier, PhotoDNA hash lookup, object detection | 80–200ms | Nudity, CSAM, violence, graphic gore
Videos | Frame-sampled image analysis + audio transcription → text pipeline | 500ms–5s | All image categories + audio-based hate speech
URLs | Domain reputation DB, phishing ML, SSRF-safe crawler for content | 10–50ms | Phishing, malware, misinformation domains, copyright

Interview trap: Many candidates describe a single "content moderation model." The real answer is a portfolio of specialized detectors, each tuned for its modality, running in parallel, whose outputs are combined by a decision engine.

4. Level 1 — Rule-Based (Fast, Dumb)

The first line of defense: pure keyword/hash matching.

python
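import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Decision:
    action: str
    score: float
    reason: str = ''

def domain_hash(domain):
    # MD5 of the normalized (lowercased) domain
    return hashlib.md5(domain.lower().encode()).hexdigest()

class RuleBasedFilter:
    def __init__(self, blocklist, bad_domains):
        # Exact keyword blocklist; a set stands in for the production trie
        self.blocklist = set(blocklist)
        # Known-bad domain hashes; a set stands in for the production Bloom filter.
        # A real Bloom filter can return false positives, so confirm hits before removing.
        self.url_hashes = {domain_hash(d) for d in bad_domains}

    def check(self, post):
        # Fast path: exact-match keyword in text
        for token in post.tokenize():
            if token in self.blocklist:
                return Decision(action='REMOVE', reason='blocklist_match', score=1.0)

        # Fast path: URL domain in the known-bad set
        for url in post.extract_urls():
            if domain_hash(urlparse(url).hostname or '') in self.url_hashes:
                return Decision(action='REMOVE', reason='bad_url', score=1.0)

        return Decision(action='PASS', score=0.0)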

Properties:

  • Fast: O(n) text scan with a trie, < 1ms per post
  • Deterministic: same input always same output — easy to audit
  • Brittle: “gun” in “begun” is a false positive; “g.u.n” bypasses it entirely
  • Use only as first-pass pre-filter. Never as the sole line of defense.
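
The brittleness described above is usually reduced, though never eliminated, by normalizing text before matching and by matching whole tokens instead of substrings. The following is a minimal sketch under stated assumptions: `normalize`, `blocked_tokens`, the tiny leetspeak map, and the separator set (`.`, `-`, `*`) are all hypothetical choices for illustration, not a production list.

```python
import re
import unicodedata

# Hypothetical leetspeak map; a real deployment would maintain a much larger one.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

# Single letters joined by separators to dodge filters, e.g. "g.u.n" or "g-u-n".
SPACED_LETTERS = re.compile(r"\b(?:\w[.\-*]){2,}\w\b")


def normalize(text: str) -> str:
    """Lowercase, fold Unicode look-alikes, undo leetspeak, rejoin separated letters."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(LEET_MAP)
    # Collapse "g.u.n" to "gun" by deleting the separators inside each matched run.
    return SPACED_LETTERS.sub(lambda m: re.sub(r"[.\-*]", "", m.group(0)), text)


def blocked_tokens(text: str, blocklist: set[str]) -> list[str]:
    """Return blocklisted words that appear as whole tokens in the normalized text."""
    tokens = re.findall(r"[a-z]+", normalize(text))
    return [t for t in tokens if t in blocklist]


if __name__ == "__main__":
    print(blocked_tokens("He has begun", {"gun"}))      # whole-token match avoids "begun"
    print(blocked_tokens("buy a g.u.n now", {"gun"}))   # separators are removed first
```

Whole-token matching removes the "begun" false positive, and normalization catches the simplest obfuscations. Determined evaders will still find gaps (spaces between letters are deliberately not collapsed here, to avoid merging ordinary words), which is why this tier stays a pre-filter.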

5. Level 2 — ML Classifiers

Trained models for each content category, running in parallel:

python
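import asyncio

CLASSIFIERS = ('toxicity', 'spam', 'nsfw', 'url', 'csam_hash')

async def run_ml_classifiers(post):
    # All classifiers run concurrently, so total latency = max(individual latencies)
    results = await asyncio.gather(
        text_toxicity_score(post.text),       # BERT: 50–80ms
        spam_score(post),                     # Gradient boosted trees: 5ms
        image_nsfw_score(post.image),         # ResNet: 80–150ms
        url_reputation_score(post.urls),      # Lookup table: 5ms
        photo_dna_hash_check(post.image),     # Hash lookup: <1ms
        return_exceptions=True,               # one failing model must not sink the rest
    )
    # With return_exceptions=True a failed classifier yields an Exception object.
    # Map it to None so decide() never compares an exception with a float.
    return {
        name: (None if isinstance(result, BaseException) else result)
        for name, result in zip(CLASSIFIERS, results)
    }

def decide(scores):
    # CSAM: zero tolerance. A hash match means immediate removal.
    if scores['csam_hash']:
        return 'REMOVE', 1.0

    available = [scores[k] for k in ('toxicity', 'spam', 'nsfw', 'url')
                 if scores[k] is not None]
    max_score = max(available, default=0.0)
    classifier_failed = any(value is None for value in scores.values())

    # Any high-confidence signal = auto-remove
    if max_score > 0.85:
        return 'REMOVE', max_score

    # Uncertain, or a classifier failed (fail closed): route to human review queue.
    # In production a retry on the slow path would likely come first.
    if max_score > 0.40 or classifier_failed:
        return 'REVIEW', max_score

    # Below threshold: publish
    return 'PUBLISH', max_score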

Classifier properties:

Model | Architecture | Latency | Accuracy
Text toxicity | BERT-base fine-tuned | 50–80ms (GPU) | ~94% F1
Image NSFW | ResNet-50 / EfficientNet | 80–150ms (GPU) | ~97% F1
Spam detector | Gradient boosted trees (XGBoost) | 3–8ms (CPU) | ~99% F1
URL reputation | Hash lookup + ML on domain features | 5–20ms | ~98% F1
PhotoDNA CSAM | Perceptual hash matching | <1ms | Near-zero false positives

6. The Moderation Pipeline Architecture


The Two Paths

Fast Path (synchronous, < 500ms)
1. Post submitted → Kafka content-submitted
2. Rule-based pre-filter (<1ms)
3. ML classifiers in parallel (50–200ms)
4. Decision engine applies thresholds
5. Content published / held / auto-removed

Slow Path (asynchronous, seconds to minutes)
6. All content queued for deeper analysis
7. Larger/slower models (cross-modal, LLM-based)
8. Human review for uncertain cases
9. Retroactive removal if slow path catches something
10. Reviewer decisions feed back to retrain models

Key insight: The fast path optimistically publishes content. The slow path can retroactively remove it. This means a post might be live for seconds to minutes before removal, and that tradeoff is deliberate. Most harmful content is not viral in the first 500ms.
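
The interaction between the two paths can be sketched as a small state machine. This is an illustrative toy, assuming a hypothetical `Post` record that carries one score from the fast models and one from the slower models, and reusing the 0.40 / 0.85 thresholds from the earlier section; the status names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    post_id: str
    fast_score: float   # score from the cheap synchronous models
    slow_score: float   # score from the larger asynchronous models
    status: str = "PENDING"
    history: list = field(default_factory=list)


def fast_path(post: Post, remove_at: float = 0.85, review_at: float = 0.40) -> str:
    """Synchronous decision: publish optimistically unless the fast models object."""
    if post.fast_score > remove_at:
        post.status = "REMOVED"
    elif post.fast_score > review_at:
        post.status = "HELD_FOR_REVIEW"
    else:
        post.status = "PUBLISHED"
    post.history.append(("fast", post.status))
    return post.status


def slow_path(post: Post, remove_at: float = 0.85) -> str:
    """Asynchronous re-check: may retroactively remove content that is already live."""
    if post.status == "PUBLISHED" and post.slow_score > remove_at:
        post.status = "REMOVED"
        post.history.append(("slow", "RETROACTIVE_REMOVAL"))
    return post.status


if __name__ == "__main__":
    post = Post("p1", fast_score=0.10, slow_score=0.93)
    fast_path(post)   # goes live immediately
    slow_path(post)   # taken down once the deeper models finish
    print(post.history)
```

Keeping the history on the record mirrors the audit requirement: a retroactive removal should be distinguishable from a removal made at submission time.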

7. Human Review Queue

PhotoDNA was developed in 2009 by Microsoft in collaboration with Hany Farid (Dartmouth), and Microsoft donated it to NCMEC. It’s now used by Facebook, Google, Twitter, and 200+ platforms. The NCMEC database contains 3M+ known CSAM hashes. Meta reported 27M CSAM pieces in 2022, the vast majority detected automatically.

When ML confidence falls in the uncertain range (score 0.40–0.85), content goes to human reviewers. The queue is prioritized: viral content first (to limit spread), borderline cases first within the same virality tier.


How the queue is structured:

  • Priority ordering: viral posts (high share count) first — a post with 10,000 shares in review causes more harm per minute than a zero-share post
  • Reviewer specialization: some reviewers handle hate speech, others CSAM, others misinformation — domain expertise matters
  • Appeals path: removed users can appeal; a second reviewer re-evaluates cold (without seeing the first decision)
  • Feedback loop: every approve/remove decision is a labeled training example — the queue is the data flywheel
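
The priority ordering above can be sketched with a binary heap. This is a minimal illustration, not a description of any real platform's queue: `ReviewItem`, the order-of-magnitude "virality tier", and the definition of "borderline" as closeness to the midpoint of the 0.40–0.85 band are all assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewItem:
    post_id: str
    shares: int
    score: float  # ML score inside the uncertain band


def virality_tier(shares: int) -> int:
    """Bucket share counts by order of magnitude: 0-9 -> 0, 10-99 -> 1, and so on."""
    return len(str(shares)) - 1 if shares > 0 else 0


def borderline_distance(score: float, low: float = 0.40, high: float = 0.85) -> float:
    """Distance from the middle of the uncertain band; smaller means more borderline."""
    return abs(score - (low + high) / 2)


class ReviewQueue:
    def __init__(self) -> None:
        self._heap: list = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable

    def push(self, item: ReviewItem) -> None:
        # heapq is a min-heap, so the tier is negated to pop the most viral tier first.
        key = (-virality_tier(item.shares), borderline_distance(item.score), next(self._counter))
        heapq.heappush(self._heap, (key, item))

    def pop(self) -> ReviewItem:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = ReviewQueue()
    queue.push(ReviewItem("a", shares=3, score=0.62))
    queue.push(ReviewItem("b", shares=12_000, score=0.45))
    queue.push(ReviewItem("c", shares=15_000, score=0.60))
    while queue:
        print(queue.pop().post_id)
```

Bucketing shares into tiers, instead of sorting on the raw count, is what lets the second key matter: within a tier the most ambiguous items surface first.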

8. PhotoDNA for CSAM

CSAM detection does not use ML classifiers. It uses perceptual hashing (PhotoDNA):

pseudocode
// PhotoDNA: robust hash that survives re-encoding
function photoDNA(image):
    greyscale  = toGreyscale(image)
    resized    = resize(greyscale, 144x144)
    // DCT-based perceptual hash (144 bytes)
    hash       = dctHash(resized)
    return hash

// Matching: Hamming distance, not exact equality
function isMatch(hash, ncmecDatabase):
    for known_hash in ncmecDatabase:
        if hammingDistance(hash, known_hash) < THRESHOLD:
            return true   // match even if resized / re-compressed
    return false

Why hash-based, not ML-based?

  • ML has false positives. PhotoDNA match = auto-remove with no human review, no exceptions. A false positive on CSAM detection means an innocent person’s content is deleted and possibly reported to authorities — unacceptable.
  • Perceptual hashing survives re-encoding, resizing, and color shifts. ML models are easier to evade.
  • The NCMEC database has 3M+ hashes. Lookup is O(1) with locality-sensitive hashing.

Legal requirement: In the US, 18 U.S.C. § 2258A requires electronic service providers to report apparent CSAM to NCMEC once they become aware of it. Reporting is not optional. The statute does not itself require proactive scanning, but large platforms scan voluntarily using hash matching.
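
To make the matching step concrete, here is a toy sketch of near-duplicate lookup by Hamming distance with a banded index. It does not implement PhotoDNA, whose algorithm is proprietary; the 64-bit integer hashes, the `HashIndex` class, and the band count are assumptions chosen for illustration. Splitting each hash into 4 bands gives an exact guarantee by the pigeonhole principle: two hashes that differ in at most 3 bits must agree on at least one whole band, so candidates can be found with dictionary lookups instead of a full scan.

```python
from collections import defaultdict

HASH_BITS = 64
BANDS = 4
BAND_BITS = HASH_BITS // BANDS
MAX_DISTANCE = BANDS - 1  # the pigeonhole guarantee holds up to BANDS - 1 differing bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def bands(h: int) -> list[tuple[int, int]]:
    """Split a hash into (band_index, band_value) keys."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (h >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class HashIndex:
    def __init__(self) -> None:
        self._buckets: dict[tuple[int, int], set[int]] = defaultdict(set)

    def add(self, known_hash: int) -> None:
        for key in bands(known_hash):
            self._buckets[key].add(known_hash)

    def is_match(self, query: int) -> bool:
        # Only hashes sharing at least one band with the query are candidates.
        # .get avoids creating empty buckets in the defaultdict on lookups.
        candidates: set[int] = set()
        for key in bands(query):
            candidates |= self._buckets.get(key, set())
        return any(hamming(query, c) <= MAX_DISTANCE for c in candidates)


if __name__ == "__main__":
    index = HashIndex()
    known = 0xDEADBEEFCAFEF00D
    index.add(known)
    print(index.is_match(known ^ 0b101))    # two bits flipped: still a match
    print(index.is_match(known ^ 0b1111))   # four bits flipped: over the threshold
```

The cost per query is a handful of dictionary lookups plus a distance check on a small candidate set, which is the sense in which lookup stays roughly constant as the database grows.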

9. Account-Level Signals

Individual post analysis misses coordinated behavior. The other detection layer is account-level:

python
class AccountSignals:
    def scrutiny_multiplier(self, account) -> float:
        multiplier = 1.0

        # New accounts: higher scrutiny
        age_hours = account.age_hours()
        if age_hours < 24:
            multiplier *= 2.5

        # Velocity check: posting rate anomaly
        posts_per_min = account.recent_post_rate()
        if posts_per_min > 10:
            multiplier *= 3.0

        # IP reputation: VPN / known bot ASN
        if is_proxy_ip(account.last_ip):
            multiplier *= 1.8

        # Coordinated behavior: same content from many accounts
        if account.in_coordinated_cluster():
            multiplier *= 4.0

        return multiplier

    def adjusted_score(self, base_score, account) -> float:
        # Multiply ML score by scrutiny multiplier — may push borderline to auto-remove
        return min(1.0, base_score * self.scrutiny_multiplier(account))

Spam ring detection: Graph analysis finds clusters of accounts that post identical or near-identical content at coordinated times. One flagged account surfaces the ring; the whole cluster gets elevated scrutiny.
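
A minimal sketch of the grouping step behind spam ring detection, assuming posts arrive as `(account_id, text, timestamp_seconds)` tuples and that "near-identical" means equal after simple normalization. A real system would more likely use MinHash or embeddings plus a graph store; the function names, the 10-minute window, and the 3-account minimum are hypothetical.

```python
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Normalize case, URLs, digits and punctuation so trivial variations collide."""
    text = re.sub(r"https?://\S+", "<url>", text.lower())
    text = re.sub(r"\d+", "<num>", text)
    return " ".join(re.findall(r"[a-z<>]+", text))


def find_coordinated_clusters(posts, window_seconds=600, min_accounts=3):
    """Return sets of accounts that posted the same fingerprint within one time window."""
    by_fingerprint = defaultdict(list)
    for account_id, text, timestamp in posts:
        by_fingerprint[fingerprint(text)].append((timestamp, account_id))

    clusters = []
    for events in by_fingerprint.values():
        events.sort()
        start = 0
        best: set = set()
        # Sliding window over timestamps; keep the largest group of distinct accounts.
        for end in range(len(events)):
            while events[end][0] - events[start][0] > window_seconds:
                start += 1
            accounts = {account for _, account in events[start:end + 1]}
            if len(accounts) > len(best):
                best = accounts
        if len(best) >= min_accounts:
            clusters.append(best)
    return clusters


if __name__ == "__main__":
    sample = [
        ("a1", "Win $500 NOW at http://x.co/a", 100),
        ("a2", "win $900 now at http://y.co/b", 160),
        ("a3", "WIN $5 now at https://z.io", 220),
        ("a4", "Lovely sunset today", 130),
    ]
    print(find_coordinated_clusters(sample))
```

Accounts returned here would feed `in_coordinated_cluster()` in the scoring code above, so that one flagged member raises scrutiny on the whole group.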


10. Cross-Platform Hash Sharing (GIFCT)

The ML moderation false positive problem is asymmetric: a false positive (removing legitimate content) is visible and generates complaints; a false negative (missing harmful content) often goes unnoticed. This asymmetry drives under-moderation: platforms optimize for what gets them bad press.

Once harmful content is identified on one platform, the hash can be shared across all member platforms via the GIFCT (Global Internet Forum to Counter Terrorism) hash database:

python
def on_confirmed_removal(content, reason):
    # GIFCT covers terrorist and violent extremist content.
    # CSAM is reported to NCMEC instead (see the PhotoDNA section).
    if reason in ('terrorism', 'violent_extremism'):
        # Compute perceptual hash
        p_hash = compute_perceptual_hash(content)

        # Add to our own blocklist immediately
        local_blocklist.add(p_hash)

        # Submit to GIFCT shared database
        # (gifct_api is a placeholder client, not a real SDK)
        gifct_api.submit_hash(
            hash=p_hash,
            category=reason,
            platform='our_platform'
        )

        # Member platforms that consume the database can now match re-uploads,
        # even if re-encoded, resized, or slightly modified

Effect: A terrorist recruitment video removed from YouTube is blocked on Facebook, Twitter, and 20+ other platforms within minutes — before it can be re-uploaded and gain traction.


11. Capacity Estimate

Metric | Number | Notes
Posts / day | 1,000,000,000 | Given requirement
Posts / sec (average) | ~12,000 | 1B / 86,400s
Posts / sec (peak) | ~50,000 | ~4x average for peak hours
ML inference / sec | ~50,000 | Text + image in parallel per post
GPU servers (ML) | ~500 | Each handles ~100 inferences/sec
Posts routed to human review / day | ~10M | ~1% of all posts (0.4–0.85 range)
Human reviewers | ~10,000 | Each reviews ~1,000 items/day
PhotoDNA lookups / sec | ~12,000 | Locality-sensitive hash index, <1ms each
Kafka throughput | ~50 MB/s at peak (~180 GB/hr) | ~1KB per post × 50K/sec at peak
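
The arithmetic behind these estimates is easy to reproduce. The sketch below uses the same inputs as the table (4x peak factor, 100 inferences per second per GPU server, 1% review rate, 1,000 reviews per reviewer per day, 1 KB per post); all of them are planning assumptions, not measurements. Because it works from the unrounded average, it lands slightly below the rounded figures: about 11.6K posts/sec on average and about 46K at peak. Note that 1 KB per post at roughly 50K posts/sec is about 50 MB/s, which is on the order of 170 to 180 GB per hour of sustained peak traffic.

```python
POSTS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 4                      # assumption: peak is about 4x the average rate
INFERENCES_PER_GPU_SERVER = 100      # assumption: per-server throughput
REVIEW_FRACTION = 0.01               # assumption: ~1% of posts score in the uncertain band
ITEMS_PER_REVIEWER_PER_DAY = 1_000   # assumption: reviewer throughput
BYTES_PER_POST = 1_000               # assumption: ~1 KB per Kafka message

avg_posts_per_sec = POSTS_PER_DAY / SECONDS_PER_DAY
peak_posts_per_sec = avg_posts_per_sec * PEAK_FACTOR
gpu_servers = peak_posts_per_sec / INFERENCES_PER_GPU_SERVER
reviews_per_day = POSTS_PER_DAY * REVIEW_FRACTION
reviewers = reviews_per_day / ITEMS_PER_REVIEWER_PER_DAY
kafka_mb_per_sec_peak = peak_posts_per_sec * BYTES_PER_POST / 1e6
kafka_gb_per_hour_peak = kafka_mb_per_sec_peak * 3600 / 1e3

print(f"avg posts/sec:        {avg_posts_per_sec:,.0f}")
print(f"peak posts/sec:       {peak_posts_per_sec:,.0f}")
print(f"GPU servers at peak:  {gpu_servers:,.0f}")
print(f"human reviews/day:    {reviews_per_day:,.0f}")
print(f"reviewers needed:     {reviewers:,.0f}")
print(f"Kafka at peak:        {kafka_mb_per_sec_peak:,.1f} MB/s ({kafka_gb_per_hour_peak:,.0f} GB/hr)")
```

Changing one constant and rerunning is a quick way to see which assumption dominates: halving per-server inference throughput doubles the GPU fleet, while the reviewer count depends only on the review fraction and reviewer throughput.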

12. Thresholds and the False Positive Problem

The threshold values (0.40, 0.85) are not fixed. They’re tuned by policy, not engineering:

python
# Thresholds vary by content category and platform policy
THRESHOLDS = {
    'csam':       { 'auto_remove': 0.0,  'review': 0.0  },  # hash-based, zero tolerance
    'terrorism':  { 'auto_remove': 0.75, 'review': 0.40 },  # aggressive
    'hate_speech':{ 'auto_remove': 0.90, 'review': 0.50 },  # careful — high FP rate
    'spam':       { 'auto_remove': 0.85, 'review': 0.60 },  # relatively safe to auto-remove
    'nsfw':       { 'auto_remove': 0.92, 'review': 0.50 },  # visual, clearer signal
    'misinformation':{ 'auto_remove': 0.95, 'review': 0.65 },  # very conservative — high FP risk
}

# Lowering auto_remove threshold → fewer false negatives, MORE false positives
# Raising auto_remove threshold → fewer false positives, MORE false negatives
# There is no neutral setting. The threshold IS the policy.

Interview signal: The best candidates recognize that threshold tuning is a values question disguised as an engineering question. "What's the right threshold?" cannot be answered without knowing platform policy on speech, legal exposure, and business priorities.
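
To show how per-category thresholds change outcomes, here is a small sketch of a decision function that applies them. The threshold values are copied from the table above; `decide_category`, `decide_post`, and the "most severe action wins" rule are assumptions for illustration. CSAM is left out because it is handled by hash matching, not by a score.

```python
from typing import Optional

THRESHOLDS = {
    "terrorism":      {"auto_remove": 0.75, "review": 0.40},
    "hate_speech":    {"auto_remove": 0.90, "review": 0.50},
    "spam":           {"auto_remove": 0.85, "review": 0.60},
    "nsfw":           {"auto_remove": 0.92, "review": 0.50},
    "misinformation": {"auto_remove": 0.95, "review": 0.65},
}

SEVERITY = {"PUBLISH": 0, "REVIEW": 1, "REMOVE": 2}


def decide_category(category: str, score: float) -> str:
    """Apply one category's thresholds to one score."""
    limits = THRESHOLDS[category]
    if score > limits["auto_remove"]:
        return "REMOVE"
    if score > limits["review"]:
        return "REVIEW"
    return "PUBLISH"


def decide_post(scores: dict) -> tuple:
    """Return the most severe action across categories and the category that caused it."""
    action: str = "PUBLISH"
    cause: Optional[str] = None
    for category, score in scores.items():
        candidate = decide_category(category, score)
        if SEVERITY[candidate] > SEVERITY[action]:
            action, cause = candidate, category
    return action, cause


if __name__ == "__main__":
    # The same 0.80 score is removed as terrorism but only reviewed as misinformation.
    print(decide_category("terrorism", 0.80))
    print(decide_category("misinformation", 0.80))
```

The identical score producing different actions is the point of the section: the numbers in `THRESHOLDS` encode how much each kind of error costs the platform.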

13. The Full Architecture

System components

  • Ingestion: API Gateway → Kafka content-submitted topic. Kafka buffers peak load and fans out to multiple consumer groups.
  • Rule Engine: Trie-based keyword blocklist + Bloom filter URL check. Runs in-process, <1ms. Immediate removals bypass ML entirely.
  • ML Inference Fleet: GPU cluster running specialized models. Text, image, URL classifiers in parallel. TorchServe / Triton for serving. Results aggregated by decision engine.
  • PhotoDNA Service: Dedicated microservice. Computes perceptual hash, checks against NCMEC database (locality-sensitive hashing). Match → auto-remove + NCMEC report.
  • Decision Engine: Combines all signals. Applies category-specific thresholds. Routes to: PUBLISH, REVIEW queue, or AUTO-REMOVE. Records decision + scores in audit log.
  • Human Review System: Priority queue (viral-first). Reviewer UI with ML score explanations. Approve / Remove / Escalate. Appeals workflow. All decisions logged as training data.
  • Training Flywheel: Reviewer decisions → labeled dataset. Periodic model retraining. A/B testing of new model versions. Shadow mode deployment before cutover.
  • Audit & Appeals: Immutable audit log of every moderation decision + model scores. User appeals routed to second reviewer. Regulatory compliance reporting.
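
Shadow mode, mentioned under the training flywheel, can be illustrated with a short comparison report. This is a sketch under assumptions: a candidate model scores the same traffic as production, only the production decision is enforced, and the 0.40 / 0.85 thresholds from earlier apply to both. `shadow_report` and its output shape are hypothetical.

```python
from collections import Counter


def to_action(score: float, remove_at: float = 0.85, review_at: float = 0.40) -> str:
    """Map a score to an action using the article's default thresholds."""
    if score > remove_at:
        return "REMOVE"
    if score > review_at:
        return "REVIEW"
    return "PUBLISH"


def shadow_report(scored_posts) -> dict:
    """Compare production and candidate decisions on the same traffic.

    scored_posts: iterable of (post_id, production_score, candidate_score).
    The candidate's decision is only counted here, never enforced.
    """
    transitions: Counter = Counter()
    total = 0
    for _post_id, prod_score, cand_score in scored_posts:
        transitions[(to_action(prod_score), to_action(cand_score))] += 1
        total += 1
    agree = sum(n for (prod, cand), n in transitions.items() if prod == cand)
    return {
        "agreement_rate": agree / total if total else 1.0,
        "transitions": dict(transitions),
    }


if __name__ == "__main__":
    traffic = [("a", 0.10, 0.20), ("b", 0.50, 0.90), ("c", 0.90, 0.95), ("d", 0.30, 0.50)]
    print(shadow_report(traffic))
```

The transition counts matter more than the headline agreement rate: a candidate that moves many posts from PUBLISH to REMOVE deserves a closer look before cutover than one that mostly shifts REVIEW to PUBLISH.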

14. What Interviewers Actually Want to Hear

The three-tier pipeline: Rule-based (fast/dumb) → ML classifiers (parallel, probabilistic) → human review (uncertain cases). Each tier handles what the previous couldn't. Describing only one tier is a failing answer.
The CSAM exception: Mentioning PhotoDNA, perceptual hashing, and NCMEC reporting as a separate non-ML path signals you understand the real constraints. In the US, reporting apparent CSAM to NCMEC is legally mandated, not optional.
The feedback loop: The system improves over time because reviewer decisions become training data. Without this loop, ML models drift as content evolves. A static model is a decaying model.