System Design: Search Engine — Inverted Index, PageRank, and Query Processing at Scale
Google’s original PageRank
paper (1998) titled
“The Anatomy of a
Large-Scale Hypertextual
Web Search Engine”
described an index of
only 25 million pages.
Today: 100 billion.
Design Google Search. Users type queries; you must return the 10 most relevant web pages from an index of 100 billion pages in under 200ms. This is arguably the hardest system design problem in the canon — it touches nearly every concept we have covered in this series.
The question: Design a search engine like Google. Users type queries; return the 10 most relevant results from 100 billion indexed pages in under 200ms.
1. Scale & Constraints
Before designing anything, nail the numbers. Interviewers reward candidates who reason through scale before jumping to solutions.
The constraints immediately rule out any naive approach. At 100K QPS and 100 billion pages, even a single microsecond per document would require 1011 µs — about 27 hours — to scan everything for one query.
2. The Naïve Approach — Level 1
The instinct of every beginner: scan every document, check if it contains the query terms, return the matches.
def linear_search(query, documents): results = [] for doc in documents: # iterate ALL 100 billion docs if query in doc.text: results.append(doc) return results
Why it fails: O(N) per query. At 100 billion documents and 100K QPS, you need 1016 operations per second. Even distributed across 10,000 servers you would need each server to scan 1012 docs/second — physically impossible.
The correct answer reframes the problem: precompute the mapping from words to documents at index time, then answer queries in O(1) lookups at query time.
3. Inverted Index — Level 2
The inverted index is the core data structure of every search engine. Instead of document → words, it stores word → list of documents containing that word.
# Forward index (what we want to avoid querying) doc1 → ["the", "quick", "brown", "fox"] doc2 → ["the", "lazy", "dog", "sleeps"] # Inverted index (what we build at index time) "quick" → [(doc1, freq:1, pos:[2]), (doc4, freq:1, pos:[1])] "brown" → [(doc1, freq:1, pos:[3]), (doc3, freq:1, pos:[1])] "dog" → [(doc2, freq:1, pos:[3]), (doc4, freq:1, pos:[2])] "fox" → [(doc1, freq:1, pos:[4])] "lazy" → [(doc2, freq:1, pos:[2])]
Each entry in the posting list stores:
- docId — which document
- term frequency — how often the term appears (relevance signal)
- positions — where in the document (enables phrase queries like
"quick brown fox")
Multi-word query intersection: Query “quick brown” → fetch posting list for “quick”, fetch posting list for “brown”, intersect by docId. Documents in both lists are candidates. This is O(k) where k is the size of the smaller posting list — typically tiny compared to the full corpus.
4. TF-IDF Ranking — Level 3
An inverted index tells you which documents contain a term. It does not tell you how relevant each document is. We need a scoring function.
TF-IDF was the dominant
ranking signal from
the 1970s through
the mid-2000s. Modern
engines layer PageRank,
click-through data,
and neural re-rankers
on top — but TF-IDF
remains in the mix.
TF (Term Frequency) = how often the term appears in the document, normalized by document length:
TF(t, d) = count(t in d) / total_words(d)
IDF(t) = log( N / df(t) )
where N = total documents, df(t) = documents containing t
Score(t, d) = TF(t, d) × IDF(t)
Intuition: A term that appears 20 times in a 100-word document (TF = 0.2) beats one that appears once in a 1,000-word document (TF = 0.001). But if the term appears in every document, IDF ≈ 0 and it contributes nothing — “the” has zero discriminating power.
5. PageRank — Level 4
TF-IDF ranks documents by term relevance, but misses a crucial signal: authority. A page linked to by thousands of trusted sites is more authoritative than an obscure page that happens to use the query term 50 times.
PageRank models a random surfer: someone who clicks random links forever. The fraction of time the surfer spends on a page is that page’s rank.
PR(A) = (1 - d) / N + d × Σ( PR(Bi) / OutLinks(Bi) ) where: d = damping factor (0.85) — probability surfer follows a link N = total pages Bi = pages that link TO page A 1-d = probability surfer jumps to a random page
Elasticsearch uses Lucene
under the hood. Lucene’s
inverted index is built
from immutable “segments”.
Updates write new segments;
a background merge process
compacts small segments
into larger ones —
same idea as LSM trees.
The formula is iterated until convergence (typically 50–100 iterations on the full web graph). The key insight: a link from a high-PR page is worth more than a link from a low-PR page, creating a recursive, self-referencing quality signal.
6. Web Crawler — Level 5
Before you can index anything, you need to fetch the web. A crawler is a distributed system in its own right.
Crawl pipeline:
- Seed URLs → initial set of known good pages
- Frontier queue → priority queue of URLs to fetch (prioritised by PageRank estimate, freshness)
- Fetcher → HTTP request, store raw HTML to object storage
- Parser → extract text content, outbound links
- Deduplicator → SimHash/MinHash to detect near-duplicate pages
- URL normaliser → canonicalise URLs, check robots.txt
- Enqueue → add new URLs back to frontier
Politeness constraints: Respect robots.txt, enforce per-domain crawl delays (e.g. 1 req/sec per domain), rotate IPs, set proper User-Agent.
Deduplication at scale: A Bloom filter with 100 billion slots uses ~125 GB of memory — feasible on a cluster. SimHash reduces an entire document to a 64-bit fingerprint; Hamming distance < 3 means near-duplicate.
7. Query Processing Pipeline — Level 6
The “two-index” trick —
large stable index plus
small fresh delta index —
was described in Google’s
Bigtable paper as “tablet
compaction”. Keep the
fast path fast while
absorbing writes in
a smaller mutable index.
A user’s keypress triggers a 7-stage pipeline, all completing in under 200ms. Click each stage to expand its details:
Parsing
Expansion
Lookup
& Merge
Ranking
Gen
8. Distributed Index Sharding
A 100 TB index cannot live on one machine. We shard it — and the choice of sharding strategy matters enormously.
Document partitioning (Google’s approach): each shard holds a contiguous range of document IDs. A query fans out to all shards in parallel; each returns its local top-K; a coordinator merges them.
Term partitioning: each shard owns a set of terms. Routing is complex (a multi-term query may need 3 different shards), but reduces fan-out for popular terms.
Google uses document partitioning because it simplifies fan-out and lets each shard independently maintain its own BM25 statistics.
Replication: each shard is replicated 3×. At 100 TB index × 3 replicas = 300 TB storage. Each shard replica can serve read queries independently, distributing the 100K QPS load.
9. Handling Updates & Freshness
A breaking news story published 30 seconds ago should appear in search results. But rebuilding a 100 TB index takes days. How do you bridge these?
| Strategy | Latency to Index | Complexity | Used By |
|---|---|---|---|
| Batch rebuild | Hours–days | Low | Early Google |
| Delta index | Minutes | Medium | Google Caffeine |
| Real-time stream | Seconds | High | Twitter, Elasticsearch |
Google’s Caffeine (2010): A large, stable base index handles the bulk of queries. A small, fast delta index absorbs new and updated documents. At query time, results from both are merged. Periodically, the delta is compacted into the base index. This is exactly the LSM tree approach applied to search.
Real-time indexing pipeline:
Crawler → Kafka topic "new-docs" → Flink job: parse, tokenise, compute BM25 stats → Elasticsearch write (inverted index segment) → Segment available to queries within ~2–5 seconds
10. Autocomplete
As users type, the search box shows suggestions. This is a latency-critical feature: suggestions must appear within 50–100ms of each keystroke.
Data structure: A Trie (prefix tree) where each node stores the top-K queries that pass through that prefix. Lookup is O(L) where L is the prefix length — effectively O(1) for practical query lengths.
At scale: Google processes 8.5 billion queries/day. The top-K suggestions at each prefix node are derived from actual query frequency logs, refreshed periodically. The Trie is sharded by first letter or first two letters.
11. Capacity Estimation
| Component | Size / Rate | Notes |
|---|---|---|
| Index size (compressed) | ~100 TB | After inverted index + compression |
| Daily crawl volume | ~1B pages/day | New + updated pages |
| Query QPS | ~100,000 / s | 8.5B searches / 86,400s |
| Index shards | 1,000 shards | ~100 GB each at 100 TB total |
| Replication factor | 3× | 300 TB total storage |
| Query servers | ~500 | 200 QPS per server × 500 = 100K QPS |
| Latency target | < 200ms p99 | End-to-end including re-ranking |
| Crawler fetchers | ~1,000 | ~12,000 pages/sec sustained |
12. Architecture Summary
The complete system has two major subsystems: the online query path (latency-critical) and the offline index build path (throughput-critical).
13. Trade-offs & Interview Tips
BERT-based re-ranking runs
on only the top-100
candidates from classical
retrieval — not all 100B
documents. The cheap
index lookup first,
then the expensive
neural model on a
tiny candidate set.
Two-stage pipelines
are the key pattern.
14. Key Takeaways
- Inverted index transforms an O(N) scan into an O(1) lookup — the single most important data structure in search.
- TF-IDF + BM25 scores by term relevance within a document.
- PageRank adds inter-document authority via the link graph.
- Two-stage ranking: fast classical retrieval → expensive neural re-ranking on a small candidate set.
- Document-partitioned sharding with coordinator merge is the practical approach at 100 billion pages.
- Delta index (Caffeine pattern) decouples freshness latency from the cost of rebuilding a 100 TB base index.
This is article #15 of 15 in the System Design Interview Series. We've covered everything from Bloom Filters to distributed transactions to building a search engine from scratch — and the common thread running through all of them is the same principle: start simple, measure, then scale only what hurts.
The inverted index is a Bloom filter applied to retrieval. PageRank is consistent hashing applied to authority. The delta index is the LSM tree applied to freshness. Every hard problem in distributed systems is a variation of a pattern you have already seen.
Start from the beginning of the series: Article #1 — How Gmail Checks Billions of Emails for Uniqueness (Bloom Filters)