System Design: Search Engine — Inverted Index, PageRank, and Query Processing at Scale

Series Finale System Design Interview Prep — #15 of 15

Google’s original PageRank
paper (1998) titled
“The Anatomy of a
Large-Scale Hypertextual
Web Search Engine”
described an index of
only 25 million pages.
Today: 100 billion.

Design Google Search. Users type queries; you must return the 10 most relevant web pages from an index of 100 billion pages in under 200ms. This is arguably the hardest system design problem in the canon — it touches nearly every concept we have covered in this series.

The question: Design a search engine like Google. Users type queries; return the 10 most relevant results from 100 billion indexed pages in under 200ms.

1. Scale & Constraints

Before designing anything, nail the numbers. Interviewers reward candidates who reason through scale before jumping to solutions.

8.5B

Searches / day

~100K

QPS peak

100B

Indexed pages

~1 PB

Raw crawled content

~100 TB

Compressed index

<200ms

Latency target p99

The constraints immediately rule out any naive approach. At 100K QPS and 100 billion pages, even a single microsecond per document would require 10¹¹ µs — about 27 hours — to scan everything for one query.

2. The Naïve Approach — Level 1

The instinct of every beginner: scan every document, check if it contains the query terms, return the matches.

python

def linear_search(query, documents):
    results = []
    for doc in documents:          # iterate ALL 100 billion docs
        if query in doc.text:
            results.append(doc)
    return results

Why it fails: O(N) per query. At 100 billion documents and 100K QPS, you need 10¹⁶ operations per second. Even distributed across 10,000 servers you would need each server to scan 10¹² docs/second — physically impossible.

The correct answer reframes the problem: precompute the mapping from words to documents at index time, then answer queries in O(1) lookups at query time.

3. Inverted Index — Level 2

The inverted index is the core data structure of every search engine. Instead of document → words, it stores word → list of documents containing that word.

pseudocode

# Forward index (what we want to avoid querying)
doc1 → ["the", "quick", "brown", "fox"]
doc2 → ["the", "lazy", "dog", "sleeps"]

# Inverted index (what we build at index time)
"quick"  → [(doc1, freq:1, pos:[2]), (doc4, freq:1, pos:[1])]
"brown"  → [(doc1, freq:1, pos:[3]), (doc3, freq:1, pos:[1])]
"dog"    → [(doc2, freq:1, pos:[3]), (doc4, freq:1, pos:[2])]
"fox"    → [(doc1, freq:1, pos:[4])]
"lazy"   → [(doc2, freq:1, pos:[2])]

Each entry in the posting list stores:

docId — which document
term frequency — how often the term appears (relevance signal)
positions — where in the document (enables phrase queries like "quick brown fox")

Multi-word query intersection: Query “quick brown” → fetch posting list for “quick”, fetch posting list for “brown”, intersect by docId. Documents in both lists are candidates. This is O(k) where k is the size of the smaller posting list — typically tiny compared to the full corpus.

▶ Interactive Inverted Index Builder

doc1

The quick brown fox

doc2

The lazy dog sleeps

doc3

Brown bears eat fish

doc4

Quick dogs run fast

4. TF-IDF Ranking — Level 3

An inverted index tells you which documents contain a term. It does not tell you how relevant each document is. We need a scoring function.

TF-IDF was the dominant
ranking signal from
the 1970s through
the mid-2000s. Modern
engines layer PageRank,
click-through data,
and neural re-rankers
on top — but TF-IDF
remains in the mix.

TF (Term Frequency) = how often the term appears in the document, normalized by document length:

formula

TF(t, d) = count(t in d) / total_words(d)

IDF(t)   = log( N / df(t) )
           where N = total documents, df(t) = documents containing t

Score(t, d) = TF(t, d) × IDF(t)

Intuition: A term that appears 20 times in a 100-word document (TF = 0.2) beats one that appears once in a 1,000-word document (TF = 0.001). But if the term appears in every document, IDF ≈ 0 and it contributes nothing — “the” has zero discriminating power.

▶ Interactive TF-IDF Calculator

Select term:

5. PageRank — Level 4

TF-IDF ranks documents by term relevance, but misses a crucial signal: authority. A page linked to by thousands of trusted sites is more authoritative than an obscure page that happens to use the query term 50 times.

PageRank models a random surfer: someone who clicks random links forever. The fraction of time the surfer spends on a page is that page’s rank.

formula

PR(A) = (1 - d) / N  +  d × Σ( PR(Bi) / OutLinks(Bi) )

where:
  d     = damping factor (0.85) — probability surfer follows a link
  N     = total pages
  Bi    = pages that link TO page A
  1-d   = probability surfer jumps to a random page

Elasticsearch uses Lucene
under the hood. Lucene’s
inverted index is built
from immutable “segments”.
Updates write new segments;
a background merge process
compacts small segments
into larger ones —
same idea as LSM trees.

The formula is iterated until convergence (typically 50–100 iterations on the full web graph). The key insight: a link from a high-PR page is worth more than a link from a low-PR page, creating a recursive, self-referencing quality signal.

▶ Interactive PageRank Visualizer

Click a node, then click another to add a link. Then re-run.

6. Web Crawler — Level 5

Before you can index anything, you need to fetch the web. A crawler is a distributed system in its own right.

Crawl pipeline:

Seed URLs → initial set of known good pages
Frontier queue → priority queue of URLs to fetch (prioritised by PageRank estimate, freshness)
Fetcher → HTTP request, store raw HTML to object storage
Parser → extract text content, outbound links
Deduplicator → SimHash/MinHash to detect near-duplicate pages
URL normaliser → canonicalise URLs, check robots.txt
Enqueue → add new URLs back to frontier

Politeness constraints: Respect robots.txt, enforce per-domain crawl delays (e.g. 1 req/sec per domain), rotate IPs, set proper User-Agent.

Deduplication at scale: A Bloom filter with 100 billion slots uses ~125 GB of memory — feasible on a cluster. SimHash reduces an entire document to a 64-bit fingerprint; Hamming distance < 3 means near-duplicate.

▶ Crawl Simulator (BFS from seed node)

Frontier Queue

Visited Set

Skipped (robots.txt)

7. Query Processing Pipeline — Level 6

The “two-index” trick —
large stable index plus
small fresh delta index —
was described in Google’s
Bigtable paper as “tablet
compaction”. Keep the
fast path fast while
absorbing writes in
a smaller mutable index.

A user’s keypress triggers a 7-stage pipeline, all completing in under 200ms. Click each stage to expand its details:

1Query
Parsing

›

2Query
Expansion

›

3Index
Lookup

›

4Scoring
& Merge

›

5Re-
Ranking

›

6Snippet
Gen

›

7Response

8. Distributed Index Sharding

A 100 TB index cannot live on one machine. We shard it — and the choice of sharding strategy matters enormously.

Document partitioning (Google’s approach): each shard holds a contiguous range of document IDs. A query fans out to all shards in parallel; each returns its local top-K; a coordinator merges them.

Term partitioning: each shard owns a set of terms. Routing is complex (a multi-term query may need 3 different shards), but reduces fan-out for popular terms.

Google uses document partitioning because it simplifies fan-out and lets each shard independently maintain its own BM25 statistics.

Query fan-out across 4 index shards

Query Server

⬇

Shard

docId 0–25B

Shard

docId 25B–50B

Shard

docId 50B–75B

Shard

docId 75B–100B

Each returns local top-K → coordinator merges → global top-10

Replication: each shard is replicated 3×. At 100 TB index × 3 replicas = 300 TB storage. Each shard replica can serve read queries independently, distributing the 100K QPS load.

9. Handling Updates & Freshness

A breaking news story published 30 seconds ago should appear in search results. But rebuilding a 100 TB index takes days. How do you bridge these?

Strategy	Latency to Index	Complexity	Used By
Batch rebuild	Hours–days	Low	Early Google
Delta index	Minutes	Medium	Google Caffeine
Real-time stream	Seconds	High	Twitter, Elasticsearch

Google’s Caffeine (2010): A large, stable base index handles the bulk of queries. A small, fast delta index absorbs new and updated documents. At query time, results from both are merged. Periodically, the delta is compacted into the base index. This is exactly the LSM tree approach applied to search.

Real-time indexing pipeline:

pipeline

Crawler → Kafka topic "new-docs"
  → Flink job: parse, tokenise, compute BM25 stats
  → Elasticsearch write (inverted index segment)
  → Segment available to queries within ~2–5 seconds

10. Autocomplete

As users type, the search box shows suggestions. This is a latency-critical feature: suggestions must appear within 50–100ms of each keystroke.

Data structure: A Trie (prefix tree) where each node stores the top-K queries that pass through that prefix. Lookup is O(L) where L is the prefix length — effectively O(1) for practical query lengths.

At scale: Google processes 8.5 billion queries/day. The top-K suggestions at each prefix node are derived from actual query frequency logs, refreshed periodically. The Trie is sharded by first letter or first two letters.

▶ Interactive Trie Autocomplete

11. Capacity Estimation

Component	Size / Rate	Notes
Index size (compressed)	~100 TB	After inverted index + compression
Daily crawl volume	~1B pages/day	New + updated pages
Query QPS	~100,000 / s	8.5B searches / 86,400s
Index shards	1,000 shards	~100 GB each at 100 TB total
Replication factor	3×	300 TB total storage
Query servers	~500	200 QPS per server × 500 = 100K QPS
Latency target	< 200ms p99	End-to-end including re-ranking
Crawler fetchers	~1,000	~12,000 pages/sec sustained

12. Architecture Summary

The complete system has two major subsystems: the online query path (latency-critical) and the offline index build path (throughput-critical).

Architecture Overview

User

→

Load Balancer

→

Query Servers

→

Result Assembler

→

User

Query Servers fan out to:

Index Shards (×1,000)

PageRank Store

ML Re-Ranker (BERT)

Autocomplete Trie

Offline Index Build Path

Crawler Fleet

→

Object Storage (raw HTML)

→

Kafka + Flink

→

Index Writer → Shards

13. Trade-offs & Interview Tips

What interviewers want to hear: Start with the inverted index. Explain TF-IDF scoring. Then add PageRank. Then discuss fan-out sharding and the coordinator merge. Then tackle freshness. Each step earns points; you do not need to cover all of them.

The BERT re-ranking trick: A common follow-up is "how do you use ML for ranking?" The correct answer is a two-stage pipeline: cheap BM25 retrieval narrows from 100B to top-100 candidates, then expensive BERT scoring runs only on those 100. This makes the neural model feasible at search-engine scale.

Common mistakes: Saying "shard by URL hash" without explaining that fan-out requires querying all shards anyway. Forgetting that the index must be kept fresh. Proposing a single index server for 100 TB of data.

BERT-based re-ranking runs
on only the top-100
candidates from classical
retrieval — not all 100B
documents. The cheap
index lookup first,
then the expensive
neural model on a
tiny candidate set.
Two-stage pipelines
are the key pattern.

14. Key Takeaways

Inverted index transforms an O(N) scan into an O(1) lookup — the single most important data structure in search.
TF-IDF + BM25 scores by term relevance within a document.
PageRank adds inter-document authority via the link graph.
Two-stage ranking: fast classical retrieval → expensive neural re-ranking on a small candidate set.
Document-partitioned sharding with coordinator merge is the practical approach at 100 billion pages.
Delta index (Caffeine pattern) decouples freshness latency from the cost of rebuilding a 100 TB base index.

Series Finale — #15 of 15

This is article #15 of 15 in the System Design Interview Series. We've covered everything from Bloom Filters to distributed transactions to building a search engine from scratch — and the common thread running through all of them is the same principle: start simple, measure, then scale only what hurts.

The inverted index is a Bloom filter applied to retrieval. PageRank is consistent hashing applied to authority. The delta index is the LSM tree applied to freshness. Every hard problem in distributed systems is a variation of a pattern you have already seen.

Start from the beginning of the series: Article #1 — How Gmail Checks Billions of Emails for Uniqueness (Bloom Filters)