System Design: Distributed Cache — Consistent Hashing, Eviction, and Beyond

Series #4 of 15 in System Design Interview deep-dives. Each post stands alone, but they build on each other.

The question: Design a distributed cache like Memcached/Redis that serves 1 million requests per second with sub-millisecond latency and handles node failures gracefully.

This is a layered problem. You start with a HashMap in memory, quickly discover why that breaks, then build up through consistent hashing, eviction policies, and stampede prevention until you have a system that could actually run in production.



1. Why Cache? The Latency Numbers

The fundamental asymmetry that makes caching valuable: reading from local RAM is roughly 50,000× faster than an indexed database query (~100 ns vs ~5 ms). That's not an exaggeration; it's physics.

Storage Layer              Typical Latency   Relative to L1
L1 CPU cache               ~1 ns             1×
L2/L3 CPU cache            ~10 ns            ~10×
RAM / in-process cache     ~100 ns           ~100×
SSD random read            ~100 µs           ~100,000×
Redis (network + RAM)      ~0.3–1 ms         ~500,000×
Database query (indexed)   ~5–50 ms          ~10,000,000×
Cross-datacenter network   ~150 ms           ~150,000,000×

The 80/20 rule of data access holds remarkably well in production: roughly 20% of your data receives 80% of your reads. Cache that hot 20% in RAM and you can absorb the overwhelming majority of traffic before it ever reaches your database.

A PostgreSQL server optimised for read-heavy workloads handles around 10,000–50,000 queries per second. Redis on comparable hardware handles on the order of 1,000,000 simple operations per second (with pipelining). That factor of 20–100× is why caching exists.
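The payoff compounds: put the two latency numbers together with a realistic hit rate and both expected latency and DB offload fall out directly. A back-of-envelope sketch (the 0.5 ms / 20 ms figures are illustrative midpoints from the table above):

```python
# Expected read latency with a cache in front of the database.
cache_latency_ms = 0.5    # Redis hit (network + RAM)
db_latency_ms    = 20.0   # indexed query on a loaded DB
hit_rate         = 0.95   # achievable when the hot 20% fits in RAM

expected_ms = hit_rate * cache_latency_ms + (1 - hit_rate) * db_latency_ms
db_qps_seen = (1 - hit_rate) * 1_000_000   # DB load at 1M req/s overall

print(f"expected read latency: ~{expected_ms:.1f} ms")
print(f"DB sees only ~{db_qps_seen:,.0f} of 1,000,000 req/s")
```

A 95% hit rate turns a 20 ms average read into roughly 1.5 ms and shields the database from 95% of the traffic.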


2. Level 1: Single-Node Cache

The simplest implementation is a HashMap in memory. Java’s LinkedHashMap with access-order tracking gives us LRU eviction essentially for free:

Java — LRU cache with LinkedHashMap
import java.util.LinkedHashMap;
import java.util.Map;

public class LRUCache {
    private final int capacity;
    private final LinkedHashMap<String, String> store;

    public LRUCache(int capacity) {
        this.capacity = capacity;
        // accessOrder=true: get() moves the entry to the most-recent end
        this.store = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > capacity;   // evict the LRU entry on overflow
            }
        };
    }

    public synchronized String get(String key) {
        return store.get(key);   // null = cache miss
    }

    public synchronized void set(String key, String value) {
        store.put(key, value);
    }
}

This works beautifully for a single-server deployment. Gets are a HashMap lookup — sub-microsecond. Eviction is automatic. Memory is bounded.

This breaks the moment you need any of:

  • Fault tolerance — server crashes → entire cache gone → thundering herd on DB
  • Scale-out — data is too large for one machine’s RAM
  • Multiple app servers — server A has user:42 cached, server B doesn’t; the same key produces inconsistent results across servers

3. Level 2: Naive Sharding — The Modulo Trap

The natural extension is distributing keys across N cache nodes using modulo hashing:

Pseudocode — modulo sharding
node_index = hash(key) % N       // e.g. hash("user:42") % 3 = 1 → Node 1

// This is consistent: same key always routes to same node
// Works perfectly... until N changes
The problem — adding or removing a node
// 3-node cluster: hash("user:42") % 3 = 1  →  Node 1  ✓ (cache hit)

// After adding Node 4 (N=4):
// hash("user:42") % 4 = 3  →  Node 3  ✗ (cache miss — data is on Node 1!)

// Fraction of keys remapped when N changes (coprime N_old, N_new):
keys_remapped ≈ 1 - 1/max(N_old, N_new) = 1 - 1/4 = 75%

// Removing 1 node from 3 (N=2): ~67% of keys remap
// Removing 1 node from 10 (N=9): ~90% of keys remap — almost everything moves

When 75% of your cache keys suddenly miss simultaneously, every one of those requests falls through to your database. Your DB, sized for perhaps 5% of traffic, receives 100% of traffic in an instant. This is a cache stampede — it has taken down production systems at Twitter, Reddit, and GitHub.
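These remap fractions are easy to measure empirically (a quick sketch; `blake2b` is an arbitrary stand-in for whatever hash function the cluster uses):

```python
import hashlib

def node_for(key: str, n: int) -> int:
    # hash the key to a 64-bit integer, then take it modulo the node count
    h = int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")
    return h % n

def remap_fraction(n_old: int, n_new: int, samples: int = 100_000) -> float:
    keys = (f"user:{i}" for i in range(samples))
    moved = sum(1 for k in keys if node_for(k, n_old) != node_for(k, n_new))
    return moved / samples

print(f"3 -> 4 nodes: {remap_fraction(3, 4):.0%} of keys move")    # ~75%
print(f"10 -> 9 nodes: {remap_fraction(10, 9):.0%} of keys move")  # ~90%
```

Run it and the modulo trap is plain: the bigger the cluster, the closer a single membership change gets to a 100% cache wipe.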


4. Level 3: Consistent Hashing

The consistent hashing ring is one of the most elegant algorithms in distributed systems — adding a node to a 10-node cluster only moves ~10% of keys, not 90%. Invented by Karger et al. at MIT in 1997, it underpins virtually every distributed cache and DHT today.

Consistent hashing maps both nodes and keys onto a shared circular space — the ring — spanning from 0 to 2³²−1. A key is served by the first node you encounter travelling clockwise from the key’s position.

The magic property: when you add or remove one node from an N-node cluster, only ~1/N of keys need to relocate. The rest stay put.

Python — consistent hashing ring core
import hashlib, bisect

class ConsistentHashRing:
    def __init__(self, vnodes_per_node=150):
        self.vnodes_per_node = vnodes_per_node
        self.ring   = {}         # hash_pos → node_name
        self.sorted_keys = []    # sorted list of hash positions

    def _hash(self, key: str) -> int:
        # keep the low 32 bits so positions fit the 0 … 2³²−1 ring
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << 32)

    def add_node(self, node: str):
        for i in range(self.vnodes_per_node):
            vkey  = node + "-vnode-" + str(i)
            h     = self._hash(vkey)
            self.ring[h] = node
            bisect.insort(self.sorted_keys, h)

    def remove_node(self, node: str):
        for i in range(self.vnodes_per_node):
            vkey = node + "-vnode-" + str(i)
            h    = self._hash(vkey)
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        if not self.sorted_keys:
            raise KeyError("hash ring is empty")
        h   = self._hash(key)
        # first virtual node clockwise of h, wrapping around the ring
        idx = bisect.bisect(self.sorted_keys, h) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

Virtual nodes solve uneven distribution. Without them, three nodes hashed to positions 30°, 32°, and 35° would leave one node owning the remaining ~325° arc. Each physical node is therefore represented by 100–200 virtual positions spread across the ring, making load distribution nearly uniform.
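A quick empirical check of the ~1/N property, using a trimmed-down version of the ring above (32-bit MD5 positions, 150 virtual nodes per node, hypothetical node names):

```python
import hashlib, bisect

def build_ring(nodes, vnodes=150):
    # (position, owner) pairs for every virtual node, sorted by position
    pairs = sorted(
        (int(hashlib.md5(f"{n}-vnode-{i}".encode()).hexdigest(), 16) % (1 << 32), n)
        for n in nodes for i in range(vnodes)
    )
    return [h for h, _ in pairs], [n for _, n in pairs]

def lookup(positions, owners, key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << 32)
    return owners[bisect.bisect(positions, h) % len(owners)]   # clockwise walk

keys   = [f"user:{i}" for i in range(10_000)]
ring10 = build_ring([f"node{i}" for i in range(10)])
ring11 = build_ring([f"node{i}" for i in range(11)])   # add one node

moved = sum(lookup(*ring10, k) != lookup(*ring11, k) for k in keys) / len(keys)
print(f"{moved:.1%} of keys moved when growing 10 -> 11 nodes")
```

The measured figure lands near 1/11 ≈ 9%, versus ~90% for modulo sharding on the same change.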


5. Level 4: Eviction Policies

LFU sounds better than LRU in theory — why evict something recently used when you could evict the least popular item? But LFU is harder to implement correctly: old-but-popular items accumulate high counts and become “immortal”. Redis uses approximated LRU by default; LFU was only added in Redis 4.0.

When the cache is full and a new item must be inserted, which existing item gets evicted? Four common strategies:

Policy                        Evicts                                     Best for                          Weakness
LRU — Least Recently Used     Item not accessed for the longest time     General-purpose workloads         Scan-heavy workloads (one full table scan evicts the entire working set)
LFU — Least Frequently Used   Item accessed the fewest times             Stable hot-item access patterns   Adapts slowly to changing access patterns
FIFO — First In, First Out    Oldest-inserted item                       Streaming pipelines               Ignores access frequency entirely
Random                        A random item                              Simple caches, approximations     Unpredictable; occasionally evicts hot items
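LRU's scan weakness is easy to demonstrate: a hot working set that fits comfortably in the cache gets wiped out by a single pass over cold data. A minimal sketch with `OrderedDict` (illustrative sizes):

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity):
        self.capacity, self.d = capacity, OrderedDict()
    def get(self, k):
        if k not in self.d:
            return None
        self.d.move_to_end(k)           # mark most-recently-used
        return self.d[k]
    def set(self, k, v):
        self.d[k] = v
        self.d.move_to_end(k)
        if len(self.d) > self.capacity:
            self.d.popitem(last=False)  # evict least-recently-used

cache = LRU(100)
for k in range(10):                     # hot working set: keys 0-9
    cache.set(k, "hot")
for k in range(1000, 1200):             # one scan over 200 cold keys
    cache.set(k, "cold")

hot_left = sum(cache.get(k) is not None for k in range(10))
print(f"hot keys surviving the scan: {hot_left}/10")   # 0/10
```

Ten hot keys in a 100-slot cache should be safe, yet one 200-key scan evicts all of them. Scan-resistant designs (e.g. Caffeine's Window-TinyLFU) exist precisely to avoid this.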


6. Level 5: Write Policies

Reads are the easy part. The difficult question is: when the application writes new data, what happens to the cache?

Policy                       Write path                                           Consistency   Perf     Risk
Write-through                Write to cache AND DB synchronously, same request    Strong        Medium   Latency added to every write
Write-behind (write-back)    Write to cache immediately; flush to DB async        Eventual      Fast     Data loss if cache node crashes before flush
Write-around                 Write directly to DB, skip cache entirely            Strong        Medium   Cache miss on next read; cold-start penalty
Cache-aside (lazy loading)   Invalidate cache on write; populate on next miss     Near-strong   Medium   Brief stale window; thundering herd on cold start

For most production systems: write-through for critical user data, write-behind for analytics and counters. The latency overhead of write-through (one extra network round-trip to the cache) is usually acceptable compared to the risk of data loss in write-behind.

Python — write-through pattern
def update_user_profile(user_id: str, data: dict):
    # 1. Write to the DB first (source of truth)
    db.execute("UPDATE users SET ... WHERE id = %s", (user_id,))

    # 2. Immediately update the cache in the same request (synchronous)
    cache_key = "user:" + user_id
    redis.setex(cache_key, 3600, json.dumps(data))

    # 3. DB and cache are now consistent:
    #    the next read hits the cache with no stale-data window

7. Level 6: Cache Invalidation

"There are only two hard things in Computer Science: cache invalidation and naming things."
— Phil Karlton, Netscape, ~1996

Cache invalidation is hard because the cache and the database are two different systems that can diverge. Three main strategies:

TTL-based (Time-To-Live) — Every cache entry expires after N seconds. Simple, always eventually consistent. Drawback: stale data can exist for up to TTL seconds; setting TTL too short defeats the purpose of caching.

Event-based — DB changes trigger cache invalidation via events (DB triggers, CDC/Debezium, application-level pub/sub). Cache is updated in near real-time. Complex to implement; requires reliable event delivery — what if the invalidation event is lost?

Version-based — Embed a version token in the cache key: user:42:v7. When the object changes, increment the version. Old cache entries become unreachable (and eventually expire by TTL). Simple, correct, but leaves dead entries in memory until they TTL out.

Python — version-based key invalidation
def get_user(user_id: str) -> dict:
    raw = redis.get("user:" + user_id + ":version")
    version = raw.decode() if raw else "1"   # redis returns bytes (or None)
    cache_key = "user:" + user_id + ":v" + version

    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # cache miss — fetch from DB
    user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    redis.setex(cache_key, 3600, json.dumps(user))
    return user

def update_user(user_id: str, data: dict):
    db.execute("UPDATE users SET ... WHERE id = %s", (user_id,))
    # increment version — old cache entry becomes unreachable immediately
    redis.incr("user:" + user_id + ":version")

8. Level 7: Cache Stampede Prevention

A cache stampede (also called thundering herd or dog-pile effect) happens when a popular cached item's TTL expires and thousands of concurrent requests all miss simultaneously, querying the database at once. The DB slows under the load, which slows the cache refill, which leaves still more requests falling through to the DB: a death spiral. It is a production incident waiting to happen, and every major site has been bitten by it at some point.

Three solutions, in order of elegance:

A — Mutex Lock: The first request acquires a lock and regenerates the cache. All other requests wait.

Python — mutex lock pattern
import time

def get_with_lock(key: str) -> str:
    value = redis.get(key)
    if value:
        return value

    lock_key = key + ":lock"
    # NX = set only if the key does Not eXist; EX = auto-expire in 10 s
    acquired = redis.set(lock_key, "1", nx=True, ex=10)

    if acquired:
        try:
            value = expensive_db_query(key)
            redis.setex(key, 3600, value)
        finally:
            redis.delete(lock_key)   # release even if the query fails
        return value
    else:
        # another process is regenerating; wait briefly, then retry once
        time.sleep(0.05)
        return redis.get(key) or fallback(key)

B — Early Recomputation: Refresh the cache shortly before it expires. Set TTL to 1 hour, but trigger background recomputation when 50 minutes have elapsed.
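Early recomputation can be as simple as storing the write timestamp alongside the value and refreshing once a threshold fraction of the TTL has elapsed. A sketch (a plain dict stands in for Redis; `recompute` is the caller-supplied DB loader; production code would refresh in a background task rather than inline):

```python
import time

TTL = 3600          # hard expiry, seconds
REFRESH_AT = 0.83   # refresh once ~50 of the 60 minutes have elapsed

cache = {}          # key -> (value, stored_at)

def get(key, recompute):
    now = time.time()
    entry = cache.get(key)
    if entry is None or now - entry[1] > TTL:
        value = recompute(key)                 # true miss: must recompute
        cache[key] = (value, now)
        return value
    value, stored_at = entry
    if now - stored_at > REFRESH_AT * TTL:
        cache[key] = (recompute(key), now)     # refresh ahead of hard expiry
        return cache[key][0]
    return value                               # fresh: serve from cache

calls = []
def load(key):
    calls.append(key)
    return "value-of-" + key

get("user:42", load)
get("user:42", load)                 # served from cache, no second DB hit
print("DB loads so far:", calls)
```

The drawback: a fixed refresh threshold still lets many requests cross it in the same instant, which is exactly the gap the next approach closes.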

C — XFetch Algorithm (probabilistic early expiration): A mathematically elegant solution from the 2015 VLDB paper "Optimal Probabilistic Cache Stampede Prevention" by Vattani, Chierichetti, and Lowenstein. Each request independently decides, with increasing probability as expiry approaches, whether to recompute. No locks, no coordination.

Python — XFetch probabilistic early expiration
import math, time, random

def xfetch_get(key: str, beta: float = 1.0) -> str:
    """
    beta > 1.0 = eager recomputation (less stampede risk)
    beta = 1.0 = balanced (default)
    beta < 1.0 = lazy (more stampede risk)
    """
    # assumes a Redis client created with decode_responses=True, so hash
    # fields come back as str (otherwise every key/value below is bytes)
    cached = redis.hgetall(key)   # stores: value, delta, expiry

    if cached:
        value  = cached["value"]
        delta  = float(cached["delta"])   # last recompute duration (seconds)
        expiry = float(cached["expiry"])   # absolute unix expiry

        # XFetch decision: recompute if this inequality holds
        now = time.time()
        if now - beta * delta * math.log(random.random()) < expiry:
            return value   # still "fresh enough", serve it

    # recompute — this request wins the probabilistic race
    t0    = time.time()
    value = expensive_db_query(key)
    delta = time.time() - t0
    ttl   = 3600

    redis.hset(key, mapping={
        "value":  value,
        "delta":  str(delta),
        "expiry": str(time.time() + ttl),
    })
    redis.expire(key, ttl)
    return value

The insight behind XFetch: the probability of recomputing increases exponentially as the expiry approaches, weighted by how long recomputation takes (delta). A query that takes 2 seconds to recompute starts early-recomputing much sooner than a 10ms query.


9. Level 8: Two-Tier Cache (L1 + L2)

For ultra-high-throughput systems, even Redis at 0.3ms introduces too much network latency. The solution is a two-tier hierarchy:

App Server (JVM / Python process)
  └── L1: in-process cache (HashMap, ~100 ns)
        └── L2: Redis Cluster (3 shards + replicas, ~0.3 ms)
              └── PostgreSQL (primary + read replicas)

Tier               Size         Latency   Scope                         Eviction
L1 in-process      100–500 MB   ~100 ns   Per-instance                  LRU (Caffeine, Guava)
L2 Redis Cluster   50–500 GB    ~0.3 ms   Shared across all instances   LRU or LFU
Database           Unlimited    5–50 ms   Ground truth                  N/A

The critical challenge with L1: when data changes, you must invalidate L1 on all app server instances, not just the one handling the write. Solutions:

  1. Short TTL on L1 (5–30 seconds) — accept brief staleness, simple to implement
  2. Pub/Sub invalidation — on write, publish an invalidation event to a Redis channel; all app servers subscribe and clear their local L1 entry
  3. Version-tagged keys — L1 key includes a version number; stale entries are simply never hit
Python — two-tier cache with pub/sub invalidation
class TieredCache:
    def __init__(self):
        self.l1     = {}                    # in-process dict (tiny, fast)
        self.l1_ttl = {}                    # per-key expiry timestamps
        self.redis  = RedisCluster(...)
        # redis-py registers message handlers via keyword arguments and
        # needs a listener thread to actually deliver them
        self.pubsub = self.redis.pubsub()
        self.pubsub.subscribe(**{"cache:invalidate": self._on_invalidate})
        self.pubsub.run_in_thread(sleep_time=0.1, daemon=True)

    def get(self, key: str):
        # L1 check (in-process, no network)
        if key in self.l1 and time.time() < self.l1_ttl.get(key, 0):
            return self.l1[key]

        # L2 check (Redis, one network hop)
        value = self.redis.get(key)
        if value:
            self.l1[key]     = value
            self.l1_ttl[key] = time.time() + 30   # 30 s L1 TTL
            return value

        # DB fallback
        value = db.query(key)
        self.redis.setex(key, 3600, value)
        self.l1[key]     = value
        self.l1_ttl[key] = time.time() + 30
        return value

    def invalidate(self, key: str):
        self.redis.delete(key)
        # notify ALL app server instances to drop their L1 entry
        self.redis.publish("cache:invalidate", key)

    def _on_invalidate(self, msg):
        key = msg["data"]   # bytes unless the client uses decode_responses=True
        self.l1.pop(key, None)
        self.l1_ttl.pop(key, None)

10. Interactive Cache Simulator

Try building your own cache. Type GET key or SET key value to interact. The simulator applies your chosen eviction policy when the cache is full.


11. Capacity Planning

How much RAM does your distributed cache actually need? The formula:

Formula — cache RAM estimation
total_ram_bytes = num_keys
                × (avg_key_size + avg_value_size + overhead_per_entry)
                × overhead_factor       // fragmentation, hash-table slack
                × replication_factor    // each replica is a full copy

// overhead_per_entry: Redis ≈ 90 bytes/key (dict entry + object header + expiry),
//                     Memcached ≈ 60 bytes/key
// overhead_factor ≈ 1.2–1.5 (memory fragmentation, hash table load factor)
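As a function, the formula reproduces the interview back-of-envelope below (10M objects, 40-byte keys, 512-byte values; defaults are the Redis figures above):

```python
def cache_ram_gb(num_keys: int, avg_key: int, avg_value: int,
                 per_entry_overhead: int = 90,   # Redis dict entry + header
                 overhead_factor: float = 1.3,   # fragmentation, table slack
                 replicas: int = 2) -> float:
    total_bytes = (num_keys * (avg_key + avg_value + per_entry_overhead)
                   * overhead_factor * replicas)
    return total_bytes / 1e9                     # decimal GB

print(f"{cache_ram_gb(10_000_000, 40, 512):.1f} GB")   # ≈ 17 GB
```

Three Redis nodes with 8 GB each (plus headroom) would cover it; the value size dominates, so compressing values is the highest-leverage optimisation.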

The Interview Answer

If asked “Design a distributed cache” in a system design interview, here is the progression that will impress:

  1. Clarify — read-heavy or write-heavy? What consistency guarantees are needed? What’s the hot-data size? Single-region or multi-region?

  2. Start simple — in-process HashMap with LRU eviction. Acknowledge limitations immediately: not distributed, not fault-tolerant.

  3. Introduce sharding — explain modulo hashing, then immediately explain why it breaks on node changes.

  4. Introduce consistent hashing — walk through the ring, virtual nodes, and the ~1/N remapping property. This is the core concept.

  5. Eviction — LRU for general-purpose, LFU for stable hot-item workloads. Mention Redis approximated LRU.

  6. Write policy — write-through for consistency, write-behind for throughput. Explain the tradeoff.

  7. Stampede prevention — mutex lock is obvious; XFetch is the impressive answer. Show you know about probabilistic early expiration.

  8. Two-tier — mention L1 (in-process) + L2 (Redis Cluster) for latency-critical paths. Discuss invalidation.

  9. Capacity — back-of-envelope: “10M objects × (40 + 512 + 90 bytes) × 1.3 fragmentation × 2 replicas ≈ 17 GB of Redis RAM”.

The candidate who walks through all nine levels — and explains why each escalation is needed — will stand out from the majority who jump straight to “use Redis”.