System Design: How Gmail Checks Billions of Emails for Uniqueness
The interview question sounds simple: “How would you design Gmail’s email uniqueness check?”
Every time someone tries to create a new Gmail account, Google must answer one question in under 100 milliseconds: is this email address already taken? With over 1.8 billion active Gmail accounts — and billions of historical ones — this is far from trivial.
Most candidates start with “just query the database.” That answer survives the first 30 seconds. What follows separates junior engineers from senior ones.
The Problem at Scale
📊 Gmail processes roughly 20 billion email uniqueness checks per month. Every millisecond of latency is felt at the signup form.
Let’s nail the requirements:
- Scale: 1.8 billion existing email addresses, growing by millions per day
- Latency: sub-100ms response (user is watching a spinner at the signup form)
- Availability: 99.999% uptime — this blocks account creation, a revenue-critical path
- Correctness asymmetry: A false negative (saying an email is free when it isn’t) lets two people share an account — a catastrophic security failure. A false positive (saying it’s taken when it isn’t) is just annoying.
Level 1: Naive SQL — The First Instinct
Full Table Scan
```sql
-- The naive approach
SELECT COUNT(*) FROM users WHERE email = 'alice@gmail.com';

-- Slightly better: short-circuits on first match
SELECT EXISTS(
  SELECT 1 FROM users WHERE email = 'alice@gmail.com'
);
```

**For 10,000 users:** instant. A warm buffer pool holds the entire heap; PostgreSQL scans it in microseconds.

**For 1.8 billion users:** A full table scan over a 200 GB table — reading at 3 GB/s on a fast NVMe — takes ~67 seconds. Completely unusable.
🔍 A B-tree index on email reduces the lookup from O(n) to O(log n) — roughly 31 comparisons for 2 billion rows. Each might be a separate page read.
You’d immediately add an index:
```sql
-- Add a unique index (enforces uniqueness at write time too)
CREATE UNIQUE INDEX idx_users_email ON users(email);

-- With the index, this is now O(log n)
SELECT EXISTS(
  SELECT 1 FROM users WHERE email = 'alice@gmail.com'
);
```
O(log n) for 2 billion rows is about 31 comparisons. On NVMe that’s fast. But under high concurrency — thousands of simultaneous signups — you hit hard limits:
- Working set: A B-tree index on 1.8B email strings (avg ~22 bytes each) consumes ~55 GB of memory to stay hot. That’s an enormous, expensive working set.
- Write amplification: Every new signup modifies index pages, causing lock contention and write stalls under peak load.
- No horizontal scaling: One database, one index. You can read-replica it, but you can’t shard the uniqueness check trivially.
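A back-of-envelope check of that working-set estimate (the per-entry overhead here is an assumed figure for illustration, not a measured PostgreSQL number):

```javascript
// Rough B-tree index memory estimate for 1.8B email keys.
const rows = 1.8e9;
const avgKeyBytes = 22;      // average email address length
const perEntryOverhead = 8;  // assumed per-entry pointer/page overhead
const totalBytes = rows * (avgKeyBytes + perEntryOverhead);
console.log((totalBytes / 1e9).toFixed(0) + ' GB'); // "54 GB"
```

In the same ballpark as the ~55 GB figure quoted above; the point is that the working set grows linearly with the user base.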
Level 2: Indexed Lookup + Redis Cache
Hash Index + Cache Layer
```javascript
async function isEmailTaken(email) {
  // L1: Check Redis cache — ~0.5ms
  const cached = await redis.get('email:' + email);
  if (cached !== null) return cached === '1';

  // L2: Cache miss — query the database
  const exists = await db.query(
    'SELECT EXISTS(SELECT 1 FROM users WHERE email = ?)',
    [email]
  );

  // Write-back with 1-hour TTL
  await redis.setex('email:' + email, 3600, exists ? '1' : '0');
  return exists;
}
```
Better. A Redis GET handles millions of requests per second at ~0.5ms latency. Hot emails (common names, frequently checked) stay cached and the database barely sees them.
But three deep problems remain:
- Cache misses dominate the signup path. When a real user tries a new email they’ve never tried before, you always miss the cache. Every single signup attempt hits the database.
- TTL invalidation is dangerous. If you cache “email is free” with a 1-hour TTL and someone claims it during that hour, the next check returns stale data.
- Memory still scales linearly. Storing all 1.8B emails in Redis requires roughly 80 bytes per entry: ~144 GB of RAM just for this cache.
💾 Redis uses ~50–100 bytes per string key plus internal struct overhead. 1.8B entries at ~80 bytes = 144 GB. At $10/GB/month on a cloud provider, that’s $1,440/month for this single cache.
Level 3: Bloom Filters — The Key Insight
Probabilistic Membership Testing
How a Bloom Filter Works
Start with a bit array of m bits, all zero. Choose k independent hash functions.
Adding alice@gmail.com:
- Compute k hash positions: h₁(“alice@gmail.com”), h₂(“alice@gmail.com”), …, hₖ(“alice@gmail.com”)
- Set those k bit positions to 1
Querying bob@gmail.com:
- Compute the same k hash positions for “bob@gmail.com”
- If any bit is 0 → definitely NOT in set (cannot be a false negative)
- If all bits are 1 → probably in set (could be a false positive)
The false positive arises because different emails can coincidentally set the same bits. If “alice”, “charlie”, and “dave” together set bits 3, 7, and 12 — and “bob” also maps to positions 3, 7, and 12 — the filter falsely reports bob as present.
```javascript
class BloomFilter {
  constructor(m, k) {
    this.m = m; // bit array size
    this.k = k; // number of hash functions
    this.bits = new Uint8Array(m);
  }
  add(item) {
    for (let i = 0; i < this.k; i++) {
      const pos = this._hash(item, i) % this.m;
      this.bits[pos] = 1;
    }
  }
  mightContain(item) {
    for (let i = 0; i < this.k; i++) {
      const pos = this._hash(item, i) % this.m;
      if (!this.bits[pos]) return false; // guaranteed NOT present
    }
    return true; // probably present
  }
  _hash(str, seed) {
    let h = (seed + 1) * 2654435761;
    for (let j = 0; j < str.length; j++) {
      h = ((h * 31) + str.charCodeAt(j)) & 0x7fffffff;
    }
    return h;
  }
}
```
Interactive Bloom Filter Visualizer

🔬 Try it: add alice@gmail.com, then query bob@gmail.com → “Definitely not”. Add alice, then query alice → “Probably in set”. Try to trigger a false positive!
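The same experiment runs in plain code; this standalone sketch repeats the toy BloomFilter from above so it executes on its own:

```javascript
// Standalone re-run of the visualizer experiment, using the same toy
// hash scheme as the BloomFilter class above.
class BloomFilter {
  constructor(m, k) { this.m = m; this.k = k; this.bits = new Uint8Array(m); }
  _hash(str, seed) {
    let h = (seed + 1) * 2654435761;
    for (let j = 0; j < str.length; j++) h = ((h * 31) + str.charCodeAt(j)) & 0x7fffffff;
    return h;
  }
  add(item) {
    for (let i = 0; i < this.k; i++) this.bits[this._hash(item, i) % this.m] = 1;
  }
  mightContain(item) {
    for (let i = 0; i < this.k; i++) {
      if (!this.bits[this._hash(item, i) % this.m]) return false;
    }
    return true;
  }
}

const bf = new BloomFilter(64, 3); // deliberately tiny, so false positives are easier to trigger
bf.add('alice@gmail.com');
console.log(bf.mightContain('alice@gmail.com')); // true: an added item is always found
console.log(bf.mightContain('bob@gmail.com'));   // usually false; a tiny filter can false-positive
```

With only 64 bits, adding a handful more emails quickly saturates the filter, which is exactly the effect the visualizer demonstrates.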
Level 4: Scalable Bloom Filter — Sharding
🗂️ A Bloom filter must fit in one process’s memory. Sharding lets you spread the space across machines and add replicas for fault tolerance.
At a 1% false positive rate, a Bloom filter for 10 billion emails needs roughly 12 GB and still fits comfortably on one machine. But it’s a single point of failure, it can’t scale reads beyond one node’s CPU, and rebuilding it on restart takes minutes.
The solution: partition the email space across multiple Bloom filter nodes using consistent hashing.
```javascript
function routeToShard(email) {
  // Deterministic routing: same email always hits the same node.
  // (Simplified modulo routing for clarity; a real deployment would use a
  // consistent-hash ring so adding or removing nodes remaps only ~1/n of keys.)
  const idx = fnv1a(email) % bloomNodes.length;
  return bloomNodes[idx];
}

async function checkEmail(email) {
  const node = routeToShard(email);
  if (!node.healthy) {
    // Node down: degrade gracefully to DB lookup
    return db.emailExists(email);
  }
  return node.mightContain(email);
}
```
Properties:
- Each node handles ~1/n of the email space independently
- A failed node degrades gracefully to the L2 database for its shard only
- Each node can be replicated for read scaling and high availability
- Nodes can be pre-warmed from a snapshot, minimising cold-start impact
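To illustrate why a consistent-hash ring is preferred over plain modulo routing, here is a minimal sketch; the HashRing class, the node names, and the fnv1a helper are illustrative, not from any production system:

```javascript
// Minimal consistent-hash ring. Each node places several virtual points on a
// 32-bit ring; a key routes to the first node point clockwise from its hash.
// Adding or removing a node then remaps only the keys near its points,
// instead of reshuffling ~everything as modulo routing does.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  constructor(nodes, vnodes = 100) {
    this.points = [];
    for (const node of nodes) {
      for (let v = 0; v < vnodes; v++) {
        this.points.push({ hash: fnv1a(node + '#' + v), node });
      }
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }
  route(key) {
    const h = fnv1a(key);
    // First point clockwise from h, wrapping to the start of the ring.
    for (const p of this.points) if (p.hash >= h) return p.node;
    return this.points[0].node;
  }
}

const ring = new HashRing(['bf-node-1', 'bf-node-2', 'bf-node-3']);
const shard = ring.route('alice@gmail.com'); // always the same node for this email
```

Virtual nodes (vnodes) smooth out the load so each physical node owns roughly 1/n of the key space.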
Level 5: Counting Bloom Filter — Handling Deletions
🗑️ Account deletions happen millions of times per day. A standard Bloom filter can never “unset” a bit — once set, always set, even after deletion.
A classic Bloom filter has one critical limitation: you cannot delete elements. You can’t un-set a bit because it might have been set by a completely different email address.
Why this matters: When users delete their Gmail accounts, those email addresses should eventually become available again. With a regular Bloom filter, a deleted email stays forever “probably taken.”
Solution: Replace each bit with a small counter (typically 4 bits, values 0–15).
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Counter | 0 | 0 | 2 | 0 | 1 | 0 | 3 | 0 | 1 | 0 | 2 | 0 |
Add increments the k counters. Delete decrements them. An element “exists” if all its k counters are greater than zero.
```javascript
class CountingBloomFilter {
  constructor(m, k) {
    this.m = m;
    this.k = k;
    // One byte per counter, clamped at 15 to emulate 4-bit counters
    this.counters = new Uint8Array(m);
  }
  add(item) {
    for (let i = 0; i < this.k; i++) {
      const p = this._hash(item, i) % this.m;
      if (this.counters[p] < 15) this.counters[p]++;
    }
  }
  remove(item) {
    if (!this.mightContain(item)) return; // never decrement for an absent item
    for (let i = 0; i < this.k; i++) {
      const p = this._hash(item, i) % this.m;
      if (this.counters[p] > 0) this.counters[p]--;
    }
  }
  mightContain(item) {
    for (let i = 0; i < this.k; i++) {
      if (this.counters[this._hash(item, i) % this.m] === 0) return false;
    }
    return true;
  }
  _hash(str, seed) {
    // Same toy hash as the BloomFilter above
    let h = (seed + 1) * 2654435761;
    for (let j = 0; j < str.length; j++) {
      h = ((h * 31) + str.charCodeAt(j)) & 0x7fffffff;
    }
    return h;
  }
}
```
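To sanity-check the delete semantics, here is a compact standalone version of the same idea (re-implemented with shortened names so the snippet runs on its own):

```javascript
// Compact counting Bloom filter, just to demonstrate deletion semantics.
class CBF {
  constructor(m, k) { this.m = m; this.k = k; this.c = new Uint8Array(m); }
  _h(s, seed) {
    let h = (seed + 1) * 2654435761;
    for (let j = 0; j < s.length; j++) h = ((h * 31) + s.charCodeAt(j)) & 0x7fffffff;
    return h % this.m;
  }
  add(x)    { for (let i = 0; i < this.k; i++) { const p = this._h(x, i); if (this.c[p] < 15) this.c[p]++; } }
  remove(x) { if (!this.has(x)) return; for (let i = 0; i < this.k; i++) { const p = this._h(x, i); if (this.c[p] > 0) this.c[p]--; } }
  has(x)    { for (let i = 0; i < this.k; i++) if (this.c[this._h(x, i)] === 0) return false; return true; }
}

const f = new CBF(1024, 3);
f.add('old-account@gmail.com');
console.log(f.has('old-account@gmail.com')); // true
f.remove('old-account@gmail.com');           // account deleted: counters decrement
console.log(f.has('old-account@gmail.com')); // false: the address is available again
```

A standard Bloom filter would still answer “probably present” after the delete; the counters are what make the second query return false.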
Trade-off: 4× the memory of a standard Bloom filter (4-bit counters vs 1-bit flags). For Gmail’s scale, this is still 50–80× cheaper than storing full email strings. Well worth it for deletion support.
Level 6: Production Reality — Two-Phase Check
⚡ Google’s actual architecture is not public, but the two-phase layered approach is a well-known pattern across major storage systems.
No production system trusts a Bloom filter alone for a critical write path. The correct architecture is a two-phase check: the Bloom filter acts as a fast probabilistic pre-filter; an exact lookup resolves ambiguity.
The decision logic:
- BF says NO → Email is definitely free. Skip the DB entirely. Return “available” immediately. This is the fast path and handles the overwhelming majority of new-email attempts.
- BF says YES → Email is probably taken. Do the exact DB lookup to confirm.
  - DB says NO (false positive) → BF was wrong; email is actually free. Rare (~1% of BF hits with good tuning). Log it so you can monitor filter saturation.
  - DB says YES → Email is genuinely taken. Return “email already in use.”
```javascript
async function checkEmailAvailability(email) {
  // Phase 1: Bloom filter — O(k), in-memory, ~0.1ms
  const mightExist = bloomFilter.mightContain(email);

  if (!mightExist) {
    // Guaranteed free — no DB hit needed
    metrics.increment('bf.definite_miss');
    return { available: true, latencyMs: 0.1, source: 'bloom' };
  }

  // Phase 2: Exact DB lookup — O(log n), ~10ms
  const inDb = await db.emailExists(email);
  if (!inDb) {
    // False positive — BF was wrong, email is actually free
    metrics.increment('bf.false_positive');
  }
  return { available: !inDb, source: 'db' };
}
```
Track the ratio bf.false_positive / bf.total_hits in your metrics. If it climbs above your target (e.g. 2%), the filter is becoming saturated — rebuild it with a larger m or higher k. This is a normal operational concern, not a sign of a broken design.
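As a sketch, the rebuild trigger can be as simple as a ratio check (the metric field names here are illustrative, not from a real monitoring API):

```javascript
// Saturation = observed false-positive rate among Bloom filter hits.
function filterSaturation(metrics) {
  const { falsePositives, totalBloomHits } = metrics;
  return totalBloomHits === 0 ? 0 : falsePositives / totalBloomHits;
}

// Fire a rebuild when the observed rate exceeds the target (default 2%).
function shouldRebuild(metrics, threshold = 0.02) {
  return filterSaturation(metrics) > threshold;
}

console.log(shouldRebuild({ falsePositives: 150, totalBloomHits: 10000 })); // false (1.5%)
console.log(shouldRebuild({ falsePositives: 300, totalBloomHits: 10000 })); // true  (3%)
```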
Complexity Comparison
| Approach | Space | Query Time | False Positive | False Negative | Deletions |
|---|---|---|---|---|---|
| Naive SQL scan | O(n) | O(n) | 0% | 0% | Yes |
| B-tree index | O(n) | O(log n) | 0% | 0% | Yes |
| Redis cache + index | O(n) | O(1) hot | 0% | 0% | TTL risk |
| Bloom Filter | O(m) fixed | O(k) | ~1% | 0% | No |
| Counting BF | O(4m) fixed | O(k) | ~1% | 0% | Yes |
| BF + DB (two-phase) | O(m) + O(n) | O(k) + rare DB | 0% effective | 0% | Yes |
📐 The formula for optimal k given m and n is k = (m/n) × ln 2. At n = 1 billion and m = 9.6 billion bits (about 1.2 GB), that gives k ≈ 6.65, so use k = 7 hash functions for the best false positive rate.
False Positive Rate Calculator
🧮 A Bloom filter of roughly 1.2 GB can represent 1 billion emails at ~1% false positive rate — that’s the magic of probabilistic data structures. The formula assumes ideal, independent hash functions.
The false positive probability for a Bloom filter with m bits, k hash functions, and n inserted items is approximately:

p ≈ (1 − e^(−kn/m))^k
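Both the false positive approximation and the optimal-k formula can be checked numerically (assuming ideal, independent hash functions):

```javascript
// Standard Bloom filter sizing approximations.
function falsePositiveRate(m, k, n) {
  // p ≈ (1 − e^(−kn/m))^k
  return Math.pow(1 - Math.exp(-k * n / m), k);
}

function optimalK(m, n) {
  // k = (m/n) · ln 2, rounded to the nearest integer number of hashes
  return Math.round((m / n) * Math.LN2);
}

const m = 9.6e9; // 9.6 billion bits, about 1.2 GB
const n = 1e9;   // 1 billion emails
console.log(optimalK(m, n));                        // 7
console.log(falsePositiveRate(m, 7, n).toFixed(4)); // "0.0100", i.e. about 1%
```

This is where the headline numbers come from: about 9.6 bits per element buys a ~1% false positive rate, regardless of how long the email strings are.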
Interview Tips
🎯 The best answers mention monitoring: “I’d track the FP rate in real-time and trigger a filter rebuild when it exceeds 2%.” That shows operational maturity beyond just the theory.
Here’s what separates a good answer from a great one in the room:
Walk up the levels deliberately
Don't jump to Bloom filters. Start with SQL, explain why it fails, add an index, explain its limits, add a cache, explain its memory problem. Then motivate the probabilistic approach. Show your thinking.
Name the asymmetry explicitly
Say: "false negatives are catastrophic, false positives are acceptable and recoverable." This shows you think about system correctness relative to business impact, not just data structure properties.
Propose the two-phase approach
Never suggest a Bloom filter alone. Say: "we use it as a fast pre-filter — on a probable-hit we still do an exact DB lookup to eliminate false positives. False positives just add a rare extra DB query."
Bring up sharding unprompted
"A 10 GB Bloom filter must live on one machine, so for horizontal scale and HA we partition the email space with consistent hashing across multiple BF nodes, each independently replicated."
Volunteer the deletion problem
"One key limitation: standard Bloom filters can't delete. When accounts are removed, we'd need a Counting Bloom Filter — 4-bit counters instead of single bits, so we can decrement on delete."
Quantify the memory savings
Interviewers love numbers. "Storing 1.8B emails as strings needs ~180 GB. A Bloom filter for the same at 1% FP rate needs about 2.2 GB — roughly 80× less memory."
🔮 Bloom filters are everywhere: Google Chrome uses one to check if a URL is malicious before making a network request — it ships a pre-built filter of millions of known bad URLs baked into the binary.
Next in the series: Article #2 — Designing a Rate Limiter — token bucket vs. sliding window log vs. sliding window counter, and why distributed rate limiting is harder than it sounds.