System Design: Search Autocomplete — From Trie to Google-Scale
Series #6 of 15 in System Design Interview deep-dives. Each post stands alone, but they build on each other.
The question: Design a search autocomplete system like Google’s search bar. When a user types “sys”, show the top 5 suggestions in under 100ms. Handle 10M unique queries/day and 500M users.
This prompt hides a beautiful ladder of trade-offs — from a one-liner SQL query to a distributed, personalized, multi-layer system serving millions of keystrokes per second. Let’s climb every rung.
1. Why Autocomplete Is Hard
On the surface, autocomplete looks trivial: the user types a few characters, you return matching words. But at Google’s scale every requirement becomes a constraint:
- Speed — sub-100ms feels instant; anything over 300ms feels broken. Every character triggers a new request.
- Relevance — alphabetical order is useless. Return the most searched completions, not the lexicographically first ones.
- Personalization — your autocomplete should differ from a stranger’s. Your location, language, and history matter.
- Scale — Google processes ~63,000 searches per second. Every keystroke in every search box is a request. That’s potentially hundreds of thousands of autocomplete calls per second worldwide.
2. Level 1 — SQL LIKE Query
The simplest possible solution: a table of queries and their hit counts, queried with a LIKE prefix match.
```sql
SELECT query, count
FROM search_queries
WHERE query LIKE 'sys%'
ORDER BY count DESC
LIMIT 5;
```
With a B-tree index on query this is fast for small datasets — a prefix scan on a sorted index is O(log N + K), where K is the number of matching rows. At 10 million unique queries the index still helps. But at 1 billion rows a popular prefix can match millions of index entries, and sorting all K matches by count costs O(K log K) on every keystroke. Database CPU spikes; p99 latency balloons past a second.
When SQL is fine vs when it breaks
| Scale | Index scan rows | p99 latency | Verdict |
|---|---|---|---|
| 1M queries | ~100 | < 5ms | ✓ Fine |
| 100M queries | ~10,000 | ~50ms | ~ Borderline |
| 1B queries | ~100,000 | > 500ms | ✗ Too slow |
3. Level 2 — Trie (Prefix Tree)
A trie (from "retrieval") is a tree where each path from the root spells out a prefix. Each node stores one character and a flag indicating whether that node marks the end of a complete word.
```javascript
class TrieNode {
  constructor() {
    this.children = {};   // char → TrieNode
    this.isEnd = false;   // marks complete word
    this.frequency = 0;   // how often this query was searched
    this.topK = [];       // cached top-5 completions (Level 5)
  }
}

class Trie {
  constructor() {
    this.root = new TrieNode();
  }

  insert(word, freq = 1) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children[ch]) node.children[ch] = new TrieNode();
      node = node.children[ch];
    }
    node.isEnd = true;
    node.frequency = freq;
  }

  search(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      if (!node.children[ch]) return []; // prefix not found
      node = node.children[ch];
    }
    // DFS from this node, collecting all complete words
    const results = [];
    this._dfs(node, prefix, results);
    return results.sort((a, b) => b.freq - a.freq).slice(0, 5);
  }

  _dfs(node, current, results) {
    if (node.isEnd) results.push({ word: current, freq: node.frequency });
    for (const [ch, child] of Object.entries(node.children))
      this._dfs(child, current + ch, results);
  }
}
```
Lookup time: O(P + K) where P = prefix length, K = number of completions under that prefix. Completely independent of dictionary size — no table scan.
4. Level 3 — Trie with Frequency Ranking
Augment the trie to store a frequency score at each terminal node. The score represents how many times that exact query was searched. On every autocomplete:
- Traverse trie along the prefix — O(P)
- DFS from the prefix node collecting all complete words — O(subtree size)
- Sort collected words by frequency descending — O(K log K)
- Return top 5
```python
def autocomplete(trie_root, prefix: str, k: int = 5) -> list:
    # 1. Traverse to prefix node
    node = trie_root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]

    # 2. DFS collecting all complete words under this node
    results = []
    def dfs(n, current):
        if n.is_end:
            results.append((current, n.frequency))
        for ch, child in n.children.items():
            dfs(child, current + ch)
    dfs(node, prefix)

    # 3. Sort by frequency, return top-k
    results.sort(key=lambda x: x[1], reverse=True)
    return [word for word, _ in results[:k]]
```
Problem: if the subtree under a popular prefix contains millions of words (think: every query starting with “a”), the DFS visits all of them. That’s slow.
Fix: cache top-K at every node (Level 5 below).
5. Level 4 — Compressed Trie (Radix Tree)
A standard trie wastes memory when words have long unique tails. “system”, “systems”, and “syslog” share “sys”, then diverge. The chains “t-e-m” and “l-o-g” never branch — each node in them has at most one child.
Radix tree / Patricia trie merges single-child chains into a single edge labelled with the full substring:
Standard Trie vs Compressed Radix Tree
STANDARD TRIE — "system", "systems", "syslog"
root
└─ s
   └─ y
      └─ s
         ├─ t
         │  └─ e
         │     └─ m ★ freq:92k
         │        └─ s ★ freq:110k
         └─ l
            └─ o
               └─ g ★ freq:44k
RADIX TREE — same words
root
└─ "sys"
   ├─ "tem" ★ freq:92k
   │  └─ "s" ★ freq:110k
   └─ "log" ★ freq:44k
The radix tree collapses 10 nodes into 4. For a dictionary with millions of long-tail queries, memory savings can reach 60–80%.
Memory comparison: standard trie stores one node per character. English has average word length ≈ 7 chars → 7 nodes per word. Radix tree stores one node per branching point — often just 2-3 nodes per word.
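To make the savings concrete, here is a minimal counting sketch (illustrative, not the production structure) for the three words above. It uses the fact that path compression keeps exactly the nodes that end a word or have two or more children:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.end = False

def build(words):
    """Build a standard one-char-per-node trie."""
    root = Node()
    for w in words:
        n = root
        for ch in w:
            n = n.children.setdefault(ch, Node())
        n.end = True
    return root

def trie_nodes(n):
    # One node per character of every distinct prefix
    return sum(1 + trie_nodes(c) for c in n.children.values())

def radix_nodes(n):
    # A node survives compression only if it ends a word or branches;
    # single-child, non-terminal nodes merge into their parent's edge.
    total = 0
    for c in n.children.values():
        total += (1 if (c.end or len(c.children) > 1) else 0) + radix_nodes(c)
    return total

root = build(["system", "systems", "syslog"])
print(trie_nodes(root), radix_nodes(root))  # → 10 4
```

This reproduces the 10-vs-4 node counts from the diagram above.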
6. Level 5 — Top-K Caching at Every Node
“The top-k cache at each trie node is a classic space-time trade-off: use more memory but turn an O(n) subtree traversal into O(1).”
Instead of traversing the subtree on every query, cache the top-5 completions at every single node. A query for prefix “sys” simply reads the cache at the “s → y → s” node — O(P) and nothing more.
```python
def insert_with_cache(root, word: str, freq: int, k: int = 5):
    # Assumes each word is inserted once per rebuild (no duplicates)
    path = []  # track nodes visited so we can update their caches
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        path.append(node)
        node = node.children[ch]
    node.is_end = True
    node.frequency = freq
    path.append(node)

    # Walk back up, updating the top-k cache at every ancestor
    for ancestor in path:
        candidate = (word, freq)
        if len(ancestor.top_k) < k:
            ancestor.top_k.append(candidate)
        elif freq > ancestor.top_k[-1][1]:
            # Word beats the worst cached entry — replace it
            ancestor.top_k[-1] = candidate
        else:
            continue
        # Keep the cache sorted so top_k[-1] is always the worst entry
        ancestor.top_k.sort(key=lambda x: x[1], reverse=True)

def fast_autocomplete(root, prefix: str) -> list:
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [w for w, _ in node.top_k]  # O(k) read — no subtree traversal!
```
Complexity: O(P) lookup — prefix length only, completely independent of how many words are in the trie.
Memory overhead: each node stores up to 5 (word, freq) entries. For 100M nodes (a large trie), at roughly 16 bytes per entry (a word reference plus a counter): 100M × 5 × 16 bytes ≈ 8 GB extra. Acceptable for an in-memory service on a beefy machine.
7. Level 6 — Production Architecture
Key design decisions:
Trie updated hourly
The main trie is rebuilt from aggregated logs every hour using an offline Spark/Hadoop job. The new trie is snapshotted to object storage, then loaded by the autocomplete service via a blue-green swap — zero downtime.
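A sketch of that swap in application code (class and method names are illustrative): load the new snapshot off to the side, then atomically repoint the reference the request handlers read.

```python
import threading

# Illustrative blue-green swap: requests keep reading the old trie until
# the new one is fully loaded, then a single reference assignment flips
# traffic over. The lock only serializes concurrent rebuilds; readers
# never block (rebinding a name is atomic in CPython — an assumption
# that holds for a single-process service).
class TrieHolder:
    def __init__(self, trie):
        self._trie = trie
        self._swap_lock = threading.Lock()

    def current(self):
        return self._trie  # readers grab whichever snapshot is live

    def swap(self, new_trie):
        with self._swap_lock:
            self._trie = new_trie
```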
Real-time trending via Redis
Very recent queries (last 60 minutes) are tracked in Redis sorted sets. On each autocomplete request, Redis top-K is merged with the trie top-K. This captures breaking news and viral topics without rebuilding the trie.
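The merge itself can be as simple as blending the two score sources. In this sketch `trending` stands in for a Redis sorted-set read, and the 2.0 boost weight is an illustrative assumption:

```python
# Blend the hourly trie's cached top-k with boosted last-hour counts.
def merge_topk(trie_topk, trending, k=5, boost=2.0):
    """trie_topk: [(query, freq)]; trending: {query: recent_count}."""
    scores = dict(trie_topk)
    for q, recent in trending.items():
        # Recent queries get boosted so breaking topics can outrank
        # long-standing popular queries
        scores[q] = scores.get(q, 0) + boost * recent
    return sorted(scores, key=scores.get, reverse=True)[:k]
```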
CDN for popular prefixes
The top 10,000 most-queried prefixes are cached at the CDN edge with a 5-minute TTL. These cover the vast majority of traffic (power law distribution). Cache hit rate for autocomplete is typically 80–90%.
Filter service
A blocking list and ML classifier sit between the trie results and the user. Adult content, illegal queries, and defamatory autocompletes are removed. This runs in microseconds using a separate in-memory set.
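A minimal sketch of the blocklist half of that filter (the ML classifier is out of scope here; the blocklist entry is hypothetical). Set membership keeps the check O(1) per result:

```python
# In-memory blocklist: filtering five suggestions costs five hash lookups.
BLOCKED = {"blocked query"}  # hypothetical entries

def filter_suggestions(suggestions):
    return [s for s in suggestions if s.lower() not in BLOCKED]
```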
8. Level 7 — Personalization Layer
After retrieving the global top-5 from the trie, a personalization service re-ranks suggestions using per-user signals. The final score for a suggestion is a weighted blend:
```python
import math

def personal_score(suggestion, user_ctx) -> float:
    # Base: global frequency (log-scaled to prevent dominance)
    base = math.log1p(suggestion.global_freq)
    # Boost: user searched this exact query in the last 30 days
    hist = 3.0 if suggestion.query in user_ctx.recent_queries else 1.0
    # Boost: query matches user's primary language
    lang = 1.5 if suggestion.language == user_ctx.language else 0.8
    # Boost: query is trending in user's geographic region
    geo = 1.3 if suggestion.region == user_ctx.region else 1.0
    return base * hist * lang * geo
```
“Personalization is the reason your autocomplete differs from your colleague’s — the global trie is identical, but the re-ranking layer is per-user.”
User history is stored as a Redis list (per user, capped at 100 recent queries). This lookup adds ~1ms of latency — acceptable because it runs in parallel with the trie lookup.
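A pure-Python stand-in for that capped list (in Redis this would be an LPUSH followed by an LTRIM to keep indices 0–99):

```python
from collections import deque

# In-process stand-in for the per-user Redis history list: a deque with
# maxlen drops the oldest entry automatically once the cap is reached.
class RecentQueries:
    def __init__(self, cap: int = 100):
        self._q = deque(maxlen=cap)

    def record(self, query: str):
        self._q.appendleft(query)  # newest first, like LPUSH

    def __contains__(self, query: str) -> bool:
        return query in self._q  # O(cap) scan; fine for cap=100
```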
The re-ranking step takes the global top-15 (not top-5), re-scores each, and returns the best 5. Expanding from 5 to 15 at the trie stage increases the chance that a personalized result bubbles up to the final list.
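Putting the two steps together — with self-contained stand-ins for the suggestion and user-context objects so the sketch runs on its own (the dataclass fields and weights mirror the personal_score example above):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    query: str
    global_freq: int
    language: str = "en"
    region: str = "us"

@dataclass
class UserCtx:
    recent_queries: set = field(default_factory=set)
    language: str = "en"
    region: str = "us"

def personal_score(s: Suggestion, u: UserCtx) -> float:
    base = math.log1p(s.global_freq)
    hist = 3.0 if s.query in u.recent_queries else 1.0
    lang = 1.5 if s.language == u.language else 0.8
    geo = 1.3 if s.region == u.region else 1.0
    return base * hist * lang * geo

def rerank(global_top15, user_ctx, k=5):
    # Widen to 15 candidates, re-score per-user, keep the best 5
    return sorted(global_top15,
                  key=lambda s: personal_score(s, user_ctx),
                  reverse=True)[:k]
```

Note how a query the user searched recently can overtake a globally more popular one: the 3× history boost outweighs the log-scaled frequency gap.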
9. Level 8 — Distributed Trie Sharding
A single trie for all languages and all queries cannot fit in one machine’s RAM. English alone has hundreds of millions of unique search queries.
Sharding strategies:
Shard-by-first-character (simplest)
26 shards for a-z, plus one for digits/symbols. A query "system" goes to shard S. Shard S holds all tries for prefixes starting with 's'. Simple, predictable routing. Problem: letters are not uniformly distributed — 's' and 'c' get far more traffic than 'x' and 'q'. Requires over-provisioning hot shards or secondary sharding by prefix length (s1–s3, s4–s6, ...).
Shard-by-language (recommended)
Each language gets its own trie cluster: English, Chinese, Spanish, Arabic, Japanese, and so on. The user's Accept-Language header routes to the right cluster. This also makes personalization easier (language is already known). Drawback: low-traffic languages risk being under-provisioned.
Consistent hashing by prefix
Hash the first 3 characters of the prefix to select a node. Virtual nodes handle uneven distribution. Adding/removing nodes causes minimal reshuffling. This is the most flexible approach but requires a coordination layer (ZooKeeper or etcd) to maintain the ring mapping.
Each shard runs as a separate service. The autocomplete API gateway fans out to 1 shard (deterministic routing), so there is no fan-out latency problem. The shard count can scale independently as the query volume grows.
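Deterministic routing for the hash-by-prefix variant can look like this (MD5 and 16 shards are illustrative choices; a production ring would add virtual nodes and a coordination layer):

```python
import hashlib

# Stable shard routing on the first 3 characters of the prefix.
# hashlib gives a process-independent digest, unlike Python's built-in
# hash(), which is randomized per process and would break routing.
def shard_for(prefix: str, num_shards: int = 16) -> int:
    key = prefix[:3].lower()
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

Every prefix extension of the same first three characters routes to the same shard, so a user typing "sys", "syst", "syste" hits one node throughout.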
10. Latency vs Freshness Trade-off
Update Frequency Comparison
| Update Frequency | Freshness | Memory Overhead | Complexity | Used By |
|---|---|---|---|---|
| Real-time | Perfect | Very High | High | Twitter trending |
| Every minute | Near-real-time | Medium | Medium | News sites |
| Every hour (recommended) | Slightly stale | Low | Easy | Google, Bing |
| Daily rebuild | Stale | Minimal | Simplest | Internal tools |
The hybrid approach — hourly trie + real-time Redis layer — gives near-real-time freshness for trending topics while keeping the main trie simple and cheap to operate.
Summary — The Escalation Ladder
| Level | Approach | Lookup | Memory | Freshness |
|---|---|---|---|---|
| 1 | SQL LIKE | O(log N + K) | DB | Real-time |
| 2 | In-memory Trie | O(P + subtree) | ~1 GB/100M words | Hourly |
| 3 | Trie + frequency | O(P + subtree + K log K) | ~1 GB | Hourly |
| 4 | Radix tree | O(P + subtree) | ~400 MB | Hourly |
| 5 | Top-K cache | O(P) | ~9 GB | Hourly |
| 6 | Distributed | O(P) | Horizontal | Hourly + Redis |
| 7 | Personalized | O(P) + 1ms re-rank | Per-user Redis | Hourly + real-time |
The interview answer is almost never “Level 1”. Start at Level 3 (trie with frequency), explain why, then escalate only when the interviewer pushes on scale.