System Design: Search Autocomplete — From Trie to Google-Scale
Series #6 of 15 in System Design Interview deep-dives. Each post stands alone, but they build on each other.
The question: Design a search autocomplete system like Google’s search bar. When a user types “sys”, show the top 5 suggestions in under 100ms. Handle 10M unique queries/day and 500M users.
This prompt hides a beautiful ladder of trade-offs — from a one-liner SQL query to a distributed, personalized, multi-layer system serving millions of keystrokes per second. Let’s climb every rung.
1. Why Autocomplete Is Hard
On the surface, autocomplete looks trivial: the user types a few characters, you return matching words. But at Google’s scale every requirement becomes a constraint:
- Speed — sub-100ms feels instant; anything over 300ms feels broken. Every character triggers a new request.
- Relevance — alphabetical order is useless. Return the most searched completions, not the lexicographically first ones.
- Personalization — your autocomplete should differ from a stranger’s. Your location, language, and history matter.
- Scale — Google processes ~63,000 searches per second. Every keystroke in every search box is a request. That’s potentially hundreds of thousands of autocomplete calls per second worldwide.
2. Level 1 — SQL LIKE Query
The simplest possible solution: a table of queries and their hit counts, queried with a LIKE prefix match.
```sql
SELECT query, count
FROM search_queries
WHERE query LIKE 'sys%'
ORDER BY count DESC
LIMIT 5;
```
With a B-tree index on query this is fast for small datasets — a prefix scan on a sorted index is O(log N + K), where K is the number of matching rows. At 10 million unique queries the index still helps. But at 1 billion rows a popular prefix can match millions of index entries, and sorting all K matches by count costs O(K log K) on every keystroke. Database CPU spikes; p99 latency balloons past a second.
When SQL is fine vs when it breaks
| Scale | Index scan rows | p99 latency | Verdict |
|---|---|---|---|
| 1M queries | ~100 | < 5ms | ✓ Fine |
| 100M queries | ~10,000 | ~50ms | ~ Borderline |
| 1B queries | ~100,000 | > 500ms | ✗ Too slow |
3. Level 2 — Trie (Prefix Tree)
A trie (from "retrieval") is a tree where each path from the root spells out a prefix. Each node stores one character and a flag indicating whether that node marks the end of a complete word.
```javascript
class TrieNode {
  constructor() {
    this.children = {};   // char → TrieNode
    this.isEnd = false;   // marks complete word
    this.frequency = 0;   // how often this query was searched
    this.topK = [];       // cached top-5 completions (Level 5)
  }
}

class Trie {
  constructor() {
    this.root = new TrieNode();
  }

  insert(word, freq = 1) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children[ch]) node.children[ch] = new TrieNode();
      node = node.children[ch];
    }
    node.isEnd = true;
    node.frequency = freq;
  }

  search(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      if (!node.children[ch]) return []; // prefix not found
      node = node.children[ch];
    }
    // DFS from this node, collecting all complete words
    const results = [];
    this._dfs(node, prefix, results);
    return results.sort((a, b) => b.freq - a.freq).slice(0, 5);
  }

  _dfs(node, current, results) {
    if (node.isEnd) results.push({ word: current, freq: node.frequency });
    for (const [ch, child] of Object.entries(node.children))
      this._dfs(child, current + ch, results);
  }
}
```
Lookup time: O(P + K) where P = prefix length, K = number of completions under that prefix. Completely independent of dictionary size — no table scan.
4. Level 3 — Trie with Frequency Ranking
Augment the trie to store a frequency score at each terminal node. The score represents how many times that exact query was searched. On every autocomplete:
- Traverse trie along the prefix — O(P)
- DFS from the prefix node collecting all complete words — O(subtree size)
- Sort collected words by frequency descending — O(K log K)
- Return top 5
```python
def autocomplete(trie_root, prefix: str, k: int = 5) -> list:
    # 1. Traverse to prefix node
    node = trie_root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]

    # 2. DFS collecting all complete words under this node
    results = []
    def dfs(n, current):
        if n.is_end:
            results.append((current, n.frequency))
        for ch, child in n.children.items():
            dfs(child, current + ch)
    dfs(node, prefix)

    # 3. Sort by frequency, return top-k
    results.sort(key=lambda x: x[1], reverse=True)
    return [word for word, _ in results[:k]]
```
Problem: if the subtree under a popular prefix contains millions of words (think: every query starting with “a”), the DFS visits all of them. That’s slow.
Fix: cache top-K at every node (Level 5 below).
5. Level 4 — Compressed Trie (Radix Tree)
A standard trie wastes memory when words have long unique tails. “system”, “systems”, and “syslog” share “sys”, then diverge. The chains “t-e-m” and “l-o-g” never branch — each node in them has at most one child.
Radix tree / Patricia trie merges single-child chains into a single edge labelled with the full substring:
Standard Trie vs Compressed Radix Tree
STANDARD TRIE — "system", "systems", "syslog"
root
└─ s
   └─ y
      └─ s
         ├─ t
         │  └─ e
         │     └─ m ★ freq:92k
         │        └─ s ★ freq:110k
         └─ l
            └─ o
               └─ g ★ freq:44k
RADIX TREE — same words
root
└─ "sys"
   ├─ "tem" ★ freq:92k
   │  └─ "s" ★ freq:110k
   └─ "log" ★ freq:44k
The radix tree collapses 10 nodes into 4. For a dictionary with millions of long-tail queries, memory savings can reach 60–80%.
Memory comparison: standard trie stores one node per character. English has average word length ≈ 7 chars → 7 nodes per word. Radix tree stores one node per branching point — often just 2-3 nodes per word.
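To make the savings concrete, here is a minimal counting sketch (illustrative, not the production structure) for the three words above. It uses the fact that path compression keeps exactly the nodes that end a word or have two or more children:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.end = False

def build(words):
    """Build a standard one-char-per-node trie."""
    root = Node()
    for w in words:
        n = root
        for ch in w:
            n = n.children.setdefault(ch, Node())
        n.end = True
    return root

def trie_nodes(n):
    # One node per character of every distinct prefix
    return sum(1 + trie_nodes(c) for c in n.children.values())

def radix_nodes(n):
    # A node survives compression only if it ends a word or branches;
    # single-child, non-terminal nodes merge into their parent's edge.
    total = 0
    for c in n.children.values():
        total += (1 if (c.end or len(c.children) > 1) else 0) + radix_nodes(c)
    return total

root = build(["system", "systems", "syslog"])
print(trie_nodes(root), radix_nodes(root))  # → 10 4
```

This reproduces the 10-vs-4 node counts from the diagram above.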
6. Level 5 — Top-K Caching at Every Node
“The top-k cache at each trie node is a classic space-time trade-off: use more memory but turn an O(n) subtree traversal into O(1).”
Instead of traversing the subtree on every query, cache the top-5 completions at every single node. A query for prefix “sys” simply reads the cache at the “s → y → s” node — O(P) and nothing more.
```python
def insert_with_cache(root, word: str, freq: int, k: int = 5):
    # Assumes each word is inserted once per rebuild (no duplicates)
    path = []  # track nodes visited so we can update their caches
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        path.append(node)
        node = node.children[ch]
    node.is_end = True
    node.frequency = freq
    path.append(node)

    # Walk back up, updating the top-k cache at every ancestor
    for ancestor in path:
        candidate = (word, freq)
        if len(ancestor.top_k) < k:
            ancestor.top_k.append(candidate)
        elif freq > ancestor.top_k[-1][1]:
            # Word beats the worst cached entry — replace it
            ancestor.top_k[-1] = candidate
        else:
            continue
        # Keep the cache sorted so top_k[-1] is always the worst entry
        ancestor.top_k.sort(key=lambda x: x[1], reverse=True)

def fast_autocomplete(root, prefix: str) -> list:
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [w for w, _ in node.top_k]  # O(k) read — no subtree traversal!
```
Complexity: O(P) lookup — prefix length only, completely independent of how many words are in the trie.
Memory overhead: each node stores up to 5 (word, freq) entries. For 100M nodes (a large trie), at roughly 16 bytes per entry (a word reference plus a counter): 100M × 5 × 16 bytes ≈ 8 GB extra. Acceptable for an in-memory service on a beefy machine.
7. Level 6 — Production Architecture
Key design decisions:
Trie updated hourly
The main trie is rebuilt from aggregated logs every hour using an offline Spark/Hadoop job. The new trie is snapshotted to object storage, then loaded by the autocomplete service via a blue-green swap — zero downtime.
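A sketch of that swap in application code (class and method names are illustrative): load the new snapshot off to the side, then atomically repoint the reference the request handlers read.

```python
import threading

# Illustrative blue-green swap: requests keep reading the old trie until
# the new one is fully loaded, then a single reference assignment flips
# traffic over. The lock only serializes concurrent rebuilds; readers
# never block (rebinding a name is atomic in CPython — an assumption
# that holds for a single-process service).
class TrieHolder:
    def __init__(self, trie):
        self._trie = trie
        self._swap_lock = threading.Lock()

    def current(self):
        return self._trie  # readers grab whichever snapshot is live

    def swap(self, new_trie):
        with self._swap_lock:
            self._trie = new_trie
```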
Real-time trending via Redis
Very recent queries (last 60 minutes) are tracked in Redis sorted sets. On each autocomplete request, Redis top-K is merged with the trie top-K. This captures breaking news and viral topics without rebuilding the trie.
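The merge itself can be as simple as blending the two score sources. In this sketch `trending` stands in for a Redis sorted-set read, and the 2.0 boost weight is an illustrative assumption:

```python
# Blend the hourly trie's cached top-k with boosted last-hour counts.
def merge_topk(trie_topk, trending, k=5, boost=2.0):
    """trie_topk: [(query, freq)]; trending: {query: recent_count}."""
    scores = dict(trie_topk)
    for q, recent in trending.items():
        # Recent queries get boosted so breaking topics can outrank
        # long-standing popular queries
        scores[q] = scores.get(q, 0) + boost * recent
    return sorted(scores, key=scores.get, reverse=True)[:k]
```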
CDN for popular prefixes
The top 10,000 most-queried prefixes are cached at the CDN edge with a 5-minute TTL. These cover the vast majority of traffic (power law distribution). Cache hit rate for autocomplete is typically 80–90%.
Filter service
A blocking list and ML classifier sit between the trie results and the user. Adult content, illegal queries, and defamatory autocompletes are removed. This runs in microseconds using a separate in-memory set.
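A minimal sketch of the blocklist half of that filter (the ML classifier is out of scope here; the blocklist entry is hypothetical). Set membership keeps the check O(1) per result:

```python
# In-memory blocklist: filtering five suggestions costs five hash lookups.
BLOCKED = {"blocked query"}  # hypothetical entries

def filter_suggestions(suggestions):
    return [s for s in suggestions if s.lower() not in BLOCKED]
```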
8. Level 7 — Personalization Layer
After retrieving the global top-5 from the trie, a personalization service re-ranks suggestions using per-user signals. The final score for a suggestion is a weighted blend:
```python
import math

def personal_score(suggestion, user_ctx) -> float:
    # Base: global frequency (log-scaled to prevent dominance)
    base = math.log1p(suggestion.global_freq)
    # Boost: user searched this exact query in the last 30 days
    hist = 3.0 if suggestion.query in user_ctx.recent_queries else 1.0
    # Boost: query matches user's primary language
    lang = 1.5 if suggestion.language == user_ctx.language else 0.8
    # Boost: query is trending in user's geographic region
    geo = 1.3 if suggestion.region == user_ctx.region else 1.0
    return base * hist * lang * geo
```
“Personalization is the reason your autocomplete differs from your colleague’s — the global trie is identical, but the re-ranking layer is per-user.”
User history is stored as a Redis list (per user, capped at 100 recent queries). This lookup adds ~1ms of latency — acceptable because it runs in parallel with the trie lookup.
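A pure-Python stand-in for that capped list (in Redis this would be an LPUSH followed by an LTRIM to keep indices 0–99):

```python
from collections import deque

# In-process stand-in for the per-user Redis history list: a deque with
# maxlen drops the oldest entry automatically once the cap is reached.
class RecentQueries:
    def __init__(self, cap: int = 100):
        self._q = deque(maxlen=cap)

    def record(self, query: str):
        self._q.appendleft(query)  # newest first, like LPUSH

    def __contains__(self, query: str) -> bool:
        return query in self._q  # O(cap) scan; fine for cap=100
```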
The re-ranking step takes the global top-15 (not top-5), re-scores each, and returns the best 5. Expanding from 5 to 15 at the trie stage increases the chance that a personalized result bubbles up to the final list.
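Putting the two steps together — with self-contained stand-ins for the suggestion and user-context objects so the sketch runs on its own (the dataclass fields and weights mirror the personal_score example above):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    query: str
    global_freq: int
    language: str = "en"
    region: str = "us"

@dataclass
class UserCtx:
    recent_queries: set = field(default_factory=set)
    language: str = "en"
    region: str = "us"

def personal_score(s: Suggestion, u: UserCtx) -> float:
    base = math.log1p(s.global_freq)
    hist = 3.0 if s.query in u.recent_queries else 1.0
    lang = 1.5 if s.language == u.language else 0.8
    geo = 1.3 if s.region == u.region else 1.0
    return base * hist * lang * geo

def rerank(global_top15, user_ctx, k=5):
    # Widen to 15 candidates, re-score per-user, keep the best 5
    return sorted(global_top15,
                  key=lambda s: personal_score(s, user_ctx),
                  reverse=True)[:k]
```

Note how a query the user searched recently can overtake a globally more popular one: the 3× history boost outweighs the log-scaled frequency gap.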
9. Level 8 — Distributed Trie Sharding
A single trie for all languages and all queries cannot fit in one machine’s RAM. English alone has hundreds of millions of unique search queries.
Sharding strategies:
Shard-by-first-character (simplest)
26 shards for a-z, plus one for digits/symbols. A query "system" goes to shard S. Shard S holds all tries for prefixes starting with 's'. Simple, predictable routing. Problem: letters are not uniformly distributed — 's' and 'c' get far more traffic than 'x' and 'q'. Requires over-provisioning hot shards or secondary sharding by prefix length (s1–s3, s4–s6, ...).
Shard-by-language (recommended)
Each language gets its own trie cluster: English, Chinese, Spanish, Arabic, Japanese, and so on. The user's Accept-Language header routes to the right cluster. This also makes personalization easier (language is already known). Drawback: low-traffic languages risk being under-provisioned.
Consistent hashing by prefix
Hash the first 3 characters of the prefix to select a node. Virtual nodes handle uneven distribution. Adding/removing nodes causes minimal reshuffling. This is the most flexible approach but requires a coordination layer (ZooKeeper or etcd) to maintain the ring mapping.
Each shard runs as a separate service. The autocomplete API gateway fans out to 1 shard (deterministic routing), so there is no fan-out latency problem. The shard count can scale independently as the query volume grows.
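Deterministic routing for the hash-by-prefix variant can look like this (MD5 and 16 shards are illustrative choices; a production ring would add virtual nodes and a coordination layer):

```python
import hashlib

# Stable shard routing on the first 3 characters of the prefix.
# hashlib gives a process-independent digest, unlike Python's built-in
# hash(), which is randomized per process and would break routing.
def shard_for(prefix: str, num_shards: int = 16) -> int:
    key = prefix[:3].lower()
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

Every prefix extension of the same first three characters routes to the same shard, so a user typing "sys", "syst", "syste" hits one node throughout.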
10. Latency vs Freshness Trade-off
Update Frequency Comparison
| Update Frequency | Freshness | Memory Overhead | Complexity | Used By |
|---|---|---|---|---|
| Real-time | Perfect | Very High | High | Twitter trending |
| Every minute | Near-real-time | Medium | Medium | News sites |
| Every hour (recommended) | Slightly stale | Low | Easy | Google, Bing |
| Daily rebuild | Stale | Minimal | Simplest | Internal tools |
The hybrid approach — hourly trie + real-time Redis layer — gives near-real-time freshness for trending topics while keeping the main trie simple and cheap to operate.
Summary — The Escalation Ladder
| Level | Approach | Lookup | Memory | Freshness |
|---|---|---|---|---|
| 1 | SQL LIKE | O(log N + K) | DB | Real-time |
| 2 | In-memory Trie | O(P + subtree) | ~1 GB/100M words | Hourly |
| 3 | Trie + frequency | O(P + subtree + K log K) | ~1 GB | Hourly |
| 4 | Radix tree | O(P + subtree) | ~400 MB | Hourly |
| 5 | Top-K cache | O(P) | ~9 GB | Hourly |
| 6 | Distributed | O(P) | Horizontal | Hourly + Redis |
| 7 | Personalized | O(P) + 1ms re-rank | Per-user Redis | Hourly + real-time |
The interview answer is almost never “Level 1”. Start at Level 3 (trie with frequency), explain why, then escalate only when the interviewer pushes on scale.