System Design: Distributed Cache — Consistent Hashing, Eviction, and Beyond
Series #4 of 15 in System Design Interview deep-dives. Each post stands alone, but they build on each other.
The question: Design a distributed cache like Memcached/Redis that serves 1 million requests per second with sub-millisecond latency and handles node failures gracefully.
This is a layered problem. You start with a HashMap in memory, quickly discover why that breaks, then build up through consistent hashing, eviction policies, and stampede prevention until you have a system that could actually run in production.
1. Why Cache? The Latency Numbers
The fundamental asymmetry that makes caching valuable: memory is 50,000× faster than a database query. That’s not an exaggeration — it’s physics.
| Storage Layer | Typical Latency | Relative to L1 Cache |
|---|---|---|
| L1 CPU Cache | ~1 ns | 1× |
| L2/L3 CPU Cache | ~10 ns | ~10× |
| RAM / In-Process Cache | ~100 ns | ~100× |
| SSD Random Read | ~100 µs | ~100,000× |
| Redis (network + RAM) | ~0.3–1 ms | ~300,000–1,000,000× |
| Database Query (indexed) | ~5–50 ms | ~5M–50M× |
| Cross-datacenter Network | ~150 ms | ~150M× |
The 80/20 rule of data access holds remarkably well in production: roughly 20% of your data receives 80% of your reads. Cache that hot 20% in RAM and you can absorb the overwhelming majority of traffic before it ever reaches your database.
A PostgreSQL server optimised for read-heavy workloads handles around 10,000–50,000 queries per second. A Redis instance on the same hardware can serve on the order of 1,000,000 simple operations per second (with pipelining). That factor of 20–100× is why caching exists.
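The same asymmetry can be put into a formula: effective read latency is a hit-rate-weighted average of the cache and database latencies. A quick sketch (the 0.5 ms and 20 ms figures are illustrative assumptions, not measurements):

```python
def effective_read_latency_ms(hit_rate: float,
                              cache_ms: float = 0.5,
                              db_ms: float = 20.0) -> float:
    """Average read latency when a fraction hit_rate is served from cache."""
    return hit_rate * cache_ms + (1.0 - hit_rate) * db_ms

# With a 95% hit rate, average latency drops from 20 ms to ~1.5 ms,
# and the database sees only 5% of read traffic.
print(effective_read_latency_ms(0.95))
```

Note how the database term dominates: pushing the hit rate from 95% to 99% halves the DB load again, which is why capacity planning obsesses over hit rate.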
2. Level 1: Single-Node Cache
The simplest implementation is a HashMap in memory. Java’s LinkedHashMap with access-order tracking gives us LRU eviction essentially for free:
```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LRUCache {
    private final int capacity;
    private final LinkedHashMap<String, String> store;

    public LRUCache(int capacity) {
        this.capacity = capacity;
        // accessOrder=true: get() moves the entry to the back (most-recent end)
        this.store = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    public synchronized String get(String key) {
        return store.getOrDefault(key, null); // null = cache miss
    }

    public synchronized void set(String key, String value) {
        store.put(key, value);
    }
}
```
This works beautifully for a single-server deployment. Gets are a HashMap lookup — sub-microsecond. Eviction is automatic. Memory is bounded.
This breaks the moment you need any of:
- Fault tolerance — server crashes → entire cache gone → thundering herd on DB
- Scale-out — data is too large for one machine’s RAM
- Multiple app servers — server A has `user:42` cached, server B doesn't; the same key produces inconsistent results across servers
3. Level 2: Naive Sharding — The Modulo Trap
The natural extension is distributing keys across N cache nodes using modulo hashing:
```
node_index = hash(key) % N
// e.g. hash("user:42") % 3 = 1 → Node 1
// This is deterministic: the same key always routes to the same node
// Works perfectly... until N changes
```
```
// 3-node cluster: hash("user:42") % 3 = 1 → Node 1 ✓ (cache hit)
// After adding Node 4 (N=4):
//   hash("user:42") % 4 = 3 → Node 3 ✗ (cache miss — data is on Node 1!)

// Fraction of keys remapped when going from N_old to N_new nodes:
//   keys_remapped ≈ 1 − min(N_old, N_new) / lcm(N_old, N_new)
// Adding 1 node to a 3-node cluster (3 → 4):  1 − 3/12 = 75%
// Removing 1 node from 3 (3 → 2):             1 − 2/6  ≈ 67%
// Removing 1 node from 10 (10 → 9):           1 − 9/90 = 90%
```
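Those remap fractions are easy to verify empirically; a minimal sketch, hashing synthetic keys with MD5:

```python
import hashlib

def modulo_remap_fraction(n_old: int, n_new: int, samples: int = 100_000) -> float:
    """Fraction of keys whose node assignment changes when N changes."""
    moved = 0
    for i in range(samples):
        h = int(hashlib.md5(f"key-{i}".encode()).hexdigest(), 16)
        if h % n_old != h % n_new:
            moved += 1
    return moved / samples

print(f"3 -> 4 nodes:  {modulo_remap_fraction(3, 4):.0%}")   # ~75%
print(f"10 -> 9 nodes: {modulo_remap_fraction(10, 9):.0%}")  # ~90%
```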
When 75% of your cache keys suddenly miss simultaneously, every one of those requests falls through to your database. Your DB, sized for perhaps 5% of traffic, receives 100% of traffic in an instant. This is a cache stampede — it has taken down production systems at Twitter, Reddit, and GitHub.
4. Level 3: Consistent Hashing
The consistent hashing ring is one of the most elegant algorithms in distributed systems — adding a node to a 10-node cluster only moves ~10% of keys, not 90%. Invented by Karger et al. at MIT in 1997, it underpins virtually every distributed cache and DHT today.
Consistent hashing maps both nodes and keys onto a shared circular space — the ring — typically spanning 0 to 2³²−1 (any fixed hash range works; the Python example below uses MD5's full 128-bit output). A key is served by the first node you encounter travelling clockwise from the key's position.
The magic property: when you add or remove one node from an N-node cluster, only ~1/N of keys need to relocate. The rest stay put.
```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, vnodes_per_node=150):
        self.vnodes_per_node = vnodes_per_node
        self.ring = {}          # hash position → node name
        self.sorted_keys = []   # sorted list of hash positions

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.vnodes_per_node):
            h = self._hash(f"{node}-vnode-{i}")
            self.ring[h] = node
            bisect.insort(self.sorted_keys, h)

    def remove_node(self, node: str):
        for i in range(self.vnodes_per_node):
            h = self._hash(f"{node}-vnode-{i}")
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        h = self._hash(key)
        # first virtual node clockwise from the key's position (wraps around)
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
```
Virtual nodes solve uneven distribution. Without them, three nodes at positions 30°, 32°, and 35° would leave a 320° arc unbalanced. Each physical node is represented by 100–200 virtual positions spread across the ring, making load distribution nearly uniform.
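To see the ~1/N property concretely, here is a self-contained miniature of the ring (helper names are illustrative) that measures how many keys move when a fourth node joins a three-node cluster:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=150):
    points = sorted((_h(f"{n}-vnode-{i}"), n)
                    for n in nodes for i in range(vnodes))
    return [h for h, _ in points], [n for _, n in points]

def owner(hashes, owners, key):
    # first virtual node clockwise from the key's position (wraps around)
    idx = bisect.bisect(hashes, _h(key)) % len(hashes)
    return owners[idx]

keys = [f"user:{i}" for i in range(10_000)]
h3, o3 = build_ring(["A", "B", "C"])
h4, o4 = build_ring(["A", "B", "C", "D"])
moved = sum(owner(h3, o3, k) != owner(h4, o4, k) for k in keys)
print(f"keys moved: {moved / len(keys):.0%}")  # close to 1/4, not 75%
```

Only keys whose clockwise successor becomes one of D's virtual nodes move; every vnode of A, B, and C stays exactly where it was.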
5. Level 4: Eviction Policies
LFU sounds better than LRU in theory — why evict something recently used when you could evict the least popular item? But LFU is harder to implement correctly: old-but-popular items accumulate high counts and become "immortal" unless the counts decay. Redis's LRU eviction is approximated (it samples a handful of keys per eviction rather than tracking exact recency); LFU was only added in Redis 4.0.
When the cache is full and a new item must be inserted, which existing item gets evicted? Four common strategies:
| Policy | Evicts | Best for | Weakness |
|---|---|---|---|
| LRU — Least Recently Used | The item not accessed for the longest time | General-purpose workloads | Poor for scan-heavy workloads (one full table scan evicts your entire working set) |
| LFU — Least Frequently Used | The item accessed the fewest times | Hot-item workloads with a stable access pattern | Doesn’t adapt quickly to changing access patterns |
| FIFO — First In, First Out | The oldest-inserted item | Streaming pipelines | Ignores access frequency entirely |
| Random | A random item | Simple caches, approximations | Unpredictable; occasionally evicts hot items |
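The scan weakness called out for LRU in the table is easy to reproduce. A minimal LRU on top of `OrderedDict` (a sketch, not the Redis implementation):

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def set(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRU(capacity=3)
for k in ["hot1", "hot2", "hot3"]:
    cache.set(k, "v")
# A single scan over cold keys evicts the entire hot working set:
for k in ["scan1", "scan2", "scan3"]:
    cache.set(k, "v")
print(cache.get("hot1"))  # None — hot key evicted by the scan
```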
6. Level 5: Write Policies
Reads are the easy part. The difficult question is: when the application writes new data, what happens to the cache?
| Policy | Write path | Consistency | Perf | Risk |
|---|---|---|---|---|
| Write-through | Write to cache AND DB in the same request (synchronous) | Strong | Medium | Latency added to every write |
| Write-behind (write-back) | Write to cache immediately; flush to DB asynchronously | Eventual | Fast | Data loss if cache node crashes before flush |
| Write-around | Write directly to DB, skip cache entirely | Strong | Medium | Cache miss on next read; cold-start penalty |
| Cache-aside (lazy loading) | Invalidate cache on write; populate on next cache miss | Near-strong | Medium | Brief window of stale data; thundering herd on cold start |
For most production systems: write-through for critical user data, write-behind for analytics and counters. The latency overhead of write-through (one extra network round-trip to the cache) is usually acceptable compared to the risk of data loss in write-behind.
```python
import json

def update_user_profile(user_id: str, data: dict):
    # 1. Write to the DB first (source of truth)
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # 2. Immediately update the cache (synchronous — same request)
    cache_key = "user:" + user_id
    redis.setex(cache_key, 3600, json.dumps(data))
    # 3. Both DB and cache are now consistent;
    #    the next read hits the cache with no stale-data window
```
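For contrast, write-behind acknowledges the write as soon as the cache is updated and flushes to the database from a background worker. A minimal sketch using an in-memory queue (`db` and `cache` here are plain dicts standing in for real clients):

```python
import queue
import threading

db = {}     # stands in for the real database
cache = {}  # stands in for Redis

write_queue = queue.Queue()

def write_behind_set(key: str, value: str) -> None:
    cache[key] = value             # fast path: update the cache only
    write_queue.put((key, value))  # defer the durable write

def flush_worker() -> None:
    while True:
        key, value = write_queue.get()
        db[key] = value            # the slow, durable write happens here
        write_queue.task_done()

threading.Thread(target=flush_worker, daemon=True).start()
write_behind_set("user:42", '{"name": "Ada"}')
write_queue.join()  # demo only; production code would not block on the flush
print(db["user:42"])
```

The risk named in the table is visible here: anything still sitting in `write_queue` when the process dies is lost.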
7. Level 6: Cache Invalidation
> "There are only two hard things in Computer Science: cache invalidation and naming things."
> — Phil Karlton, Netscape, ~1996
Cache invalidation is hard because the cache and the database are two different systems that can diverge. Three main strategies:
TTL-based (Time-To-Live) — Every cache entry expires after N seconds. Simple, always eventually consistent. Drawback: stale data can exist for up to TTL seconds; setting TTL too short defeats the purpose of caching.
Event-based — DB changes trigger cache invalidation via events (DB triggers, CDC/Debezium, application-level pub/sub). Cache is updated in near real-time. Complex to implement; requires reliable event delivery — what if the invalidation event is lost?
Version-based — Embed a version token in the cache key: user:42:v7. When the object changes, increment the version. Old cache entries become unreachable (and eventually expire by TTL). Simple, correct, but leaves dead entries in memory until they TTL out.
```python
import json

def get_user(user_id: str) -> dict:
    raw = redis.get("user:" + user_id + ":version")
    # default to "0" so that the first incr (to 1) changes the key
    version = raw.decode() if raw else "0"
    cache_key = "user:" + user_id + ":v" + version
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    # cache miss — fetch from the DB and populate
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(cache_key, 3600, json.dumps(user))
    return user

def update_user(user_id: str, data: dict):
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # increment the version — the old cache entry becomes unreachable immediately
    redis.incr("user:" + user_id + ":version")
```
8. Level 7: Cache Stampede Prevention
Cache stampede is a real production incident waiting to happen. Every major site has been taken down by it at some point. When a popular item’s TTL expires, 10,000 simultaneous requests hit the DB. The DB slows under load, slowing the cache-fill, which means more requests continue hitting the DB, causing further slowdown — a death spiral.
The cache stampede (also called thundering herd or dog-pile effect) happens when a popular cached item expires and many concurrent requests all experience a miss simultaneously, all querying the database at once.
Three solutions, in order of elegance:
A — Mutex Lock: The first request acquires a lock and regenerates the cache. All other requests wait.
```python
import time

def get_with_lock(key: str) -> str:
    value = redis.get(key)
    if value:
        return value

    lock_key = key + ":lock"
    # NX = set only if the key does Not eXist; EX = expire the lock in 10 s
    acquired = redis.set(lock_key, "1", nx=True, ex=10)
    if acquired:
        value = expensive_db_query(key)
        redis.setex(key, 3600, value)
        redis.delete(lock_key)
        return value
    else:
        # another process is regenerating; wait briefly, then serve what's there
        time.sleep(0.05)
        return redis.get(key) or fallback(key)
```
B — Early Recomputation: Refresh the cache shortly before it expires. Set TTL to 1 hour, but trigger background recomputation when 50 minutes have elapsed.
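A sketch of this approach, with an illustrative `REFRESH_AT` threshold (`cache` here is a dict standing in for Redis; in production you would pair the background refresh with a lock so only one instance refreshes):

```python
import threading
import time

cache = {}        # key → (value, stored_at)
TTL = 3600        # seconds
REFRESH_AT = 0.8  # start refreshing once 80% of the TTL has elapsed

def expensive_db_query(key: str) -> str:
    return f"value-for-{key}"  # stand-in for the real DB call

def _refresh(key: str) -> None:
    cache[key] = (expensive_db_query(key), time.time())

def get_with_early_refresh(key: str) -> str:
    entry = cache.get(key)
    if entry is None:
        _refresh(key)  # cold miss: fill synchronously
        return cache[key][0]
    value, stored_at = entry
    if time.time() - stored_at > TTL * REFRESH_AT:
        # serve the still-valid cached value; refresh in the background
        threading.Thread(target=_refresh, args=(key,), daemon=True).start()
    return value
```

Because the refresh happens before expiry, readers never see a miss on a hot key; the item only expires if no one requests it during the final 20% of its TTL.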
C — XFetch Algorithm (probabilistic early expiration): A mathematically elegant approach from a 2015 VLDB paper on optimal probabilistic cache stampede prevention. Each request independently decides — with increasing probability as expiry approaches — whether to recompute. No locks, no coordination.
```python
import math
import random
import time

def xfetch_get(key: str, beta: float = 1.0) -> str:
    """
    beta > 1.0 → eager recomputation (less stampede risk)
    beta = 1.0 → balanced (default)
    beta < 1.0 → lazy (more stampede risk)
    Assumes a redis client created with decode_responses=True.
    """
    cached = redis.hgetall(key)  # stores: value, delta, expiry
    if cached:
        value = cached["value"]
        delta = float(cached["delta"])    # duration of the last recompute (seconds)
        expiry = float(cached["expiry"])  # absolute unix expiry time
        # XFetch decision: serve the cached value unless the inequality fails
        now = time.time()
        if now - beta * delta * math.log(random.random()) < expiry:
            return value  # still "fresh enough"

    # recompute — this request wins the probabilistic race
    t0 = time.time()
    value = expensive_db_query(key)
    delta = time.time() - t0
    ttl = 3600
    redis.hset(key, mapping={
        "value": value,
        "delta": str(delta),
        "expiry": str(time.time() + ttl),
    })
    redis.expire(key, ttl)
    return value
```
The insight behind XFetch: the probability of recomputing increases exponentially as the expiry approaches, weighted by how long recomputation takes (delta). A query that takes 2 seconds to recompute starts early-recomputing much sooner than a 10ms query.
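This has a closed form: a request recomputes when `now - beta * delta * log(rand) >= expiry`, and because `-log(rand)` is exponentially distributed, the probability of recomputing at a given gap before expiry is `exp(-gap / (beta * delta))`. A sketch:

```python
import math

def recompute_probability(gap_s: float, delta_s: float, beta: float = 1.0) -> float:
    """Probability that one request chooses to recompute, gap_s seconds before expiry."""
    return math.exp(-gap_s / (beta * delta_s))

# A 2-second query (delta=2) vs a 10 ms query (delta=0.01), 5 s before expiry:
print(recompute_probability(5.0, 2.0))   # noticeable head start
print(recompute_probability(5.0, 0.01))  # effectively zero until the last instant
```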
9. Level 8: Two-Tier Cache (L1 + L2)
For ultra-high-throughput systems, even Redis at 0.3ms introduces too much network latency. The solution is a two-tier hierarchy:
App server (L1: in-process HashMap, ~100 ns) → Redis Cluster (L2: 3 shards + replicas, ~0.3 ms) → Database (primary + read replicas)
| Tier | Size | Latency | Scope | Eviction |
|---|---|---|---|---|
| L1 in-process | 100–500 MB | ~100 ns | Per-instance | LRU (Caffeine, Guava) |
| L2 Redis Cluster | 50–500 GB | ~0.3 ms | Shared across all instances | LRU or LFU |
| Database | Unlimited | 5–50 ms | Ground truth | N/A |
The critical challenge with L1: when data changes, you must invalidate L1 on all app server instances, not just the one handling the write. Solutions:
- Short TTL on L1 (5–30 seconds) — accept brief staleness, simple to implement
- Pub/Sub invalidation — on write, publish an invalidation event to a Redis channel; all app servers subscribe and clear their local L1 entry
- Version-tagged keys — L1 key includes a version number; stale entries are simply never hit
```python
import time

class TieredCache:
    def __init__(self):
        self.l1 = {}       # in-process dict (tiny, fast)
        self.l1_ttl = {}   # per-key expiry timestamps
        self.redis = RedisCluster(...)
        self.pubsub = self.redis.pubsub()
        # register a handler for the invalidation channel and
        # dispatch messages from a background thread
        self.pubsub.subscribe(**{"cache:invalidate": self._on_invalidate})
        self.pubsub.run_in_thread(daemon=True)

    def get(self, key: str):
        # L1 check (in-process, no network)
        if key in self.l1 and time.time() < self.l1_ttl.get(key, 0):
            return self.l1[key]

        # L2 check (Redis, one network hop)
        value = self.redis.get(key)
        if value:
            self.l1[key] = value
            self.l1_ttl[key] = time.time() + 30  # 30 s L1 TTL
            return value

        # DB fallback
        value = db.query(key)
        self.redis.setex(key, 3600, value)
        self.l1[key] = value
        self.l1_ttl[key] = time.time() + 30
        return value

    def invalidate(self, key: str):
        self.redis.delete(key)
        # notify ALL app server instances to drop their L1 entry
        self.redis.publish("cache:invalidate", key)

    def _on_invalidate(self, msg):
        key = msg["data"]
        self.l1.pop(key, None)
        self.l1_ttl.pop(key, None)
```
10. Interactive Cache Simulator
11. Capacity Planning
How much RAM does your distributed cache actually need? The formula:
```
total_ram_bytes = num_keys
                × (avg_key_size + avg_value_size + overhead_per_entry)
                × overhead_factor
                × replication_factor   // each replica is a full copy

// overhead_per_entry: Redis ~90 bytes/key (dict entry + object header + expiry)
//                     Memcached ~60 bytes/key
// overhead_factor ≈ 1.2–1.5 (memory fragmentation, hash table load factor)
```
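Plugging in concrete numbers (10 million objects, 40-byte keys, 512-byte values, Redis per-key overhead, 2 replicas):

```python
def cache_ram_gb(num_keys: int, key_b: int, value_b: int,
                 per_entry_overhead_b: int = 90,  # Redis: ~90 bytes/key
                 overhead_factor: float = 1.3,    # fragmentation, load factor
                 replicas: int = 2) -> float:
    """Back-of-envelope RAM estimate for a replicated cache, in GB."""
    total = num_keys * (key_b + value_b + per_entry_overhead_b)
    return total * overhead_factor * replicas / 1e9

print(f"{cache_ram_gb(10_000_000, 40, 512):.1f} GB")  # ≈ 16.7 GB
```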
The Interview Answer
If asked “Design a distributed cache” in a system design interview, here is the progression that will impress:
1. Clarify — read-heavy or write-heavy? What consistency guarantees are needed? What's the hot-data size? Single-region or multi-region?
2. Start simple — an in-process HashMap with LRU eviction. Acknowledge limitations immediately: not distributed, not fault-tolerant.
3. Introduce sharding — explain modulo hashing, then immediately explain why it breaks on node changes.
4. Introduce consistent hashing — walk through the ring, virtual nodes, and the ~1/N remapping property. This is the core concept.
5. Eviction — LRU for general-purpose workloads, LFU for stable hot-item workloads. Mention Redis's approximated LRU.
6. Write policy — write-through for consistency, write-behind for throughput. Explain the tradeoff.
7. Stampede prevention — a mutex lock is the obvious answer; XFetch is the impressive one. Show you know about probabilistic early expiration.
8. Two-tier — mention L1 (in-process) + L2 (Redis Cluster) for latency-critical paths. Discuss invalidation.
9. Capacity — back-of-envelope: "10M objects × (40 + 512 + 90 bytes) × 1.3 fragmentation × 2 replicas ≈ 17 GB of Redis RAM".
The candidate who walks through all nine levels — and explains why each escalation is needed — will stand out from the majority who jump straight to “use Redis”.