System Design: Key-Value Store — From HashMap to Dynamo-Style Distributed DB
The Amazon Dynamo paper (2007) is one of the most influential papers in distributed systems — it introduced consistent hashing + vector clocks + quorum reads/writes as a practical package.
A key-value store is the simplest possible database interface: get(key), put(key, value), delete(key). Yet making one that’s distributed, fault-tolerant, and consistent is one of the hardest problems in computer science. This post walks the full journey — from a single-machine HashMap to a Dynamo-style ring that handles petabytes, partial failures, and concurrent writes.
The question: Design a distributed key-value store like Amazon DynamoDB or Apache Cassandra. Support GET/PUT/DELETE, handle node failures, maintain consistency, scale to petabytes.
1. The Problem
The key-value interface is brutally simple:
```typescript
interface KeyValueStore {
  get(key: string): Value | null;
  put(key: string, value: Value): void;
  delete(key: string): void;
}
```
Yet to make this distributed, fault-tolerant, and consistent at petabyte scale you must solve:
- Partitioning — how to split data across hundreds of nodes
- Replication — how many copies, and where
- Consistency — what happens when replicas disagree
- Failure detection — how nodes know when peers are dead
- Recovery — how to heal after partitions
We’ll build up layer by layer, explaining why each decision exists before showing how it works.
2. Level 1 — In-Memory HashMap
The simplest implementation: a hash map in memory.
```java
class SimpleKVStore {
    private final Map<String, byte[]> store = new HashMap<>();

    public byte[] get(String key) {
        return store.get(key); // O(1)
    }

    public void put(String key, byte[] value) {
        store.put(key, value); // O(1)
    }

    public void delete(String key) {
        store.remove(key); // O(1)
    }
}
```
Problems:
- No persistence — process crash loses all data
- Single machine — bound by one machine’s RAM
- No fault tolerance — one machine going down means 100% downtime
Adding a Write-Ahead Log (WAL)
Before any mutation, append it to a log file. On startup, replay the log to rebuild state.
```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class WALKVStore {
    private final Map<String, byte[]> store = new HashMap<>();
    private final Path walPath;
    private final BufferedWriter wal;

    WALKVStore(Path walPath) throws IOException {
        this.walPath = walPath;
        this.wal = Files.newBufferedWriter(walPath,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public void put(String key, byte[] value) throws IOException {
        // 1. Append to the WAL first (crash-safe); Base64 keeps the record on one line
        wal.write("PUT," + key + "," + Base64.getEncoder().encodeToString(value) + "\n");
        wal.flush(); // flush to the OS; true durability also needs an fsync of the file

        // 2. Apply to the in-memory map
        store.put(key, value);
    }

    public void recover() throws IOException {
        // Replay the WAL on restart to rebuild in-memory state
        for (String line : Files.readAllLines(walPath)) {
            String[] parts = line.split(",", 3); // at most 3 parts: op, key, value
            if (parts[0].equals("PUT")) store.put(parts[1], Base64.getDecoder().decode(parts[2]));
            else if (parts[0].equals("DEL")) store.remove(parts[1]);
        }
    }
}
```
The WAL solves durability. But the data is still on one machine. We need to scale.
3. Level 2 — Single-Node with Persistence (LSM Tree)
LSM trees are write-optimized by design: every write is a sequential append. Random writes are the nemesis of spinning disks, and LSM eliminates them entirely.
Random disk writes (updating a B-Tree in place) are the bottleneck for write-heavy workloads. The Log-Structured Merge Tree (LSM Tree) turns all writes into sequential appends:
Write path:
- Write goes to the in-memory MemTable (a sorted in-memory structure, typically a red-black tree or skip list)
- When MemTable exceeds threshold (~64 MB), flush to an immutable SSTable file on disk
- Background compaction merges SSTables, removing old versions and tombstones
Read path:
- Check MemTable (newest data first)
- Check SSTables from newest to oldest (aided by Bloom filters to skip non-existent keys)
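The write and read paths above can be sketched in a few lines, with a `TreeMap` standing in for the MemTable and in-memory sorted maps standing in for on-disk SSTables. This is a toy model with illustrative names (`ToyLSMStore`); real engines add Bloom filters, tombstones, a WAL, and background compaction:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

// Toy LSM tree: a sorted MemTable that flushes into immutable "SSTables".
class ToyLSMStore {
    private static final int MEMTABLE_LIMIT = 4;  // flush threshold (entries here, ~64 MB in practice)
    private TreeMap<String, byte[]> memTable = new TreeMap<>();
    private final Deque<TreeMap<String, byte[]>> ssTables = new ArrayDeque<>(); // newest first

    public void put(String key, byte[] value) {
        memTable.put(key, value);            // write path: sorted in-memory insert
        if (memTable.size() >= MEMTABLE_LIMIT) {
            ssTables.addFirst(memTable);     // flush: MemTable becomes an immutable SSTable
            memTable = new TreeMap<>();
        }
    }

    public byte[] get(String key) {
        byte[] v = memTable.get(key);        // read path: newest data first
        if (v != null) return v;
        for (TreeMap<String, byte[]> sst : ssTables) { // then SSTables, newest to oldest
            v = sst.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

Because newer layers are checked first, an updated key shadows its older copies in flushed SSTables until compaction removes them.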
4. Level 3 — CAP Theorem
Before distributing, we need the fundamental theorem of distributed systems. The CAP theorem says that when a network partition occurs, a system must sacrifice either consistency or availability; it can guarantee at most two of these three properties:
- Consistency (C): Every read receives the most recent write
- Availability (A): Every request receives a response (not necessarily the most recent)
- Partition Tolerance (P): The system continues operating despite network partitions
5. Level 4 — Consistent Hashing for Partitioning
A naive approach: node = hash(key) % N. Problem: when N changes (node added/removed), every key remaps. With 1 billion keys and 10 nodes, adding one node moves ~900 million keys.
Consistent hashing maps both keys and nodes onto the same circular ring (0 to 2³²). A key belongs to the first node clockwise from its hash.
- Adding a node: only keys between the new node and its predecessor need to move (~1/N of keys)
- Removing a node: only that node’s keys need to move to its successor
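A minimal ring fits in a sorted map: `TreeMap.ceilingEntry` finds the first node clockwise from a key's position. This is a sketch with illustrative names (`HashRing`, `nodeFor`); production rings also place many virtual nodes per physical node to even out the distribution:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Consistent-hash ring: nodes and keys share one hash space, and a key
// is owned by the first node clockwise from its position on the ring.
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    public String nodeFor(String key) {
        // First node clockwise from the key; wrap around past the top of the ring.
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
}
```

The minimal-movement property falls out directly: removing a node that does not own a key cannot change which node `ceilingEntry` finds for it.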
6. Level 5 — Replication
Quorum with N=3, W=2, R=2 is the sweet spot: you can tolerate one node failure on both reads and writes while still maintaining strong consistency.
Data is replicated to N consecutive nodes on the ring (the preference list). Three strategies for handling writes:
| Strategy | How | Consistency | Latency |
|---|---|---|---|
| Synchronous | Coordinator waits for all N replicas to ACK | Strong | Slowest (worst case: slowest node) |
| Asynchronous | Coordinator writes to 1, replicates in background | Eventual | Fastest |
| Quorum | Write to W, read from R, require W+R > N | Tunable | Middle |
The quorum approach is the key insight from the Dynamo paper: by tuning W and R, you can slide a dial between consistency and performance.
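The overlap argument can be checked empirically: any write set of size W and any read set of size R must intersect when W+R > N, so a read always observes the latest write. A toy simulation (`QuorumDemo` and `readSeesLatestWrite` are illustrative names):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Demonstrates the quorum overlap: with W + R > N, every read quorum
// intersects every write quorum, so some read replica holds the latest write.
class QuorumDemo {
    static boolean readSeesLatestWrite(int n, int w, int r) {
        long[] version = new long[n];                 // per-replica version counters
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < n; i++) replicas.add(i);

        long latest = 1;
        Collections.shuffle(replicas);
        for (int i = 0; i < w; i++) version[replicas.get(i)] = latest; // write to any W replicas

        Collections.shuffle(replicas);
        long best = 0;
        for (int i = 0; i < r; i++) best = Math.max(best, version[replicas.get(i)]); // read any R
        return best == latest;                        // did the read see the write?
    }
}
```

With W=1 and R=1 on N=3, the read set can miss the write set entirely, which is exactly the eventual-consistency regime.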
7. Level 6 — Conflict Resolution with Vector Clocks
With async replication, two nodes can independently accept writes for the same key, resulting in conflicting versions. How do you know which is “correct”?
Last-Write-Wins (LWW)
Use timestamps: keep the value with the highest timestamp. Simple, but clocks drift — NTP can be off by tens of milliseconds. A causally earlier write with a slightly later system clock wins incorrectly.
Vector Clocks
Each value carries a version vector — a map of {nodeId → counter}. When a node updates a value, it increments its own counter.
Causality vs. Conflict
Two events are causally related if one happened-before the other (Lamport’s happens-before relation). Vector clocks capture this precisely:
- V1 dominates V2 if every counter in V1 ≥ the corresponding counter in V2, and at least one is strictly greater → V1 is a causal successor, discard V2
- Neither dominates → concurrent → conflict, application must merge
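The dominance check translates directly into code (a sketch; `VectorClock` and its `Order` enum are illustrative names):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Version-vector comparison: a missing node entry counts as 0.
class VectorClock {
    enum Order { DOMINATES, DOMINATED, EQUAL, CONCURRENT }

    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aGreater = false, bGreater = false;
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());
        for (String node : nodes) {
            int ca = a.getOrDefault(node, 0), cb = b.getOrDefault(node, 0);
            if (ca > cb) aGreater = true;
            if (cb > ca) bGreater = true;
        }
        if (aGreater && bGreater) return Order.CONCURRENT; // neither dominates → conflict
        if (aGreater) return Order.DOMINATES;              // a is a causal successor of b
        if (bGreater) return Order.DOMINATED;
        return Order.EQUAL;
    }
}
```

The `CONCURRENT` case is the one that forces a decision: the store can keep both siblings and hand them to the application to merge, which is exactly what Dynamo does.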
8. Level 7 — Gossip Protocol for Failure Detection
In a large cluster, a central failure detector is a single point of failure. Gossip protocols give us decentralized, scalable failure detection.
How it works:
- Every T seconds, each node picks K random peers and sends a “heartbeat” message with its membership list
- The receiver merges the membership list (taking max generation/heartbeat counts)
- If a node’s heartbeat hasn’t increased in T_suspect seconds → mark suspect
- If still silent after T_dead seconds → mark dead, redistribute its keys
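The merge step above reduces to an element-wise max over heartbeat counters, with suspicion based on how long a peer's counter has been stalled. A sketch with illustrative names (`GossipView`); real detectors add incarnation numbers and indirect probing:

```java
import java.util.HashMap;
import java.util.Map;

// Gossip membership view: track the highest heartbeat counter heard per peer,
// and the local time at which that counter last advanced.
class GossipView {
    final Map<String, Long> heartbeats  = new HashMap<>(); // peer -> highest counter heard
    final Map<String, Long> lastAdvance = new HashMap<>(); // peer -> local time of last advance

    void merge(Map<String, Long> remote, long now) {
        remote.forEach((node, count) -> {
            Long current = heartbeats.get(node);
            if (current == null || count > current) { // counter advanced: peer is alive
                heartbeats.put(node, count);
                lastAdvance.put(node, now);
            }
        });
    }

    boolean isSuspect(String node, long now, long suspectTimeout) {
        // A peer whose counter has not advanced within the timeout is suspect.
        Long seen = lastAdvance.get(node);
        return seen == null || now - seen > suspectTimeout;
    }
}
```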
9. Level 8 — Read Repair & Hinted Handoff
Two mechanisms that keep the cluster eventually consistent without coordinator intervention:
Read Repair
When the coordinator sends a quorum read, it compares versions from R replicas. If any replica returns a stale version, the coordinator sends an async write to bring it up to date.
Hinted Handoff
If the target replica for a write is temporarily down, another node stores the write with a “hint” (the intended recipient’s ID). When the target comes back online, the hint-holder delivers the queued writes.
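Hinted handoff reduces to a per-target queue of deferred writes (a sketch with illustrative names like `HintStore`; real systems bound hint storage and expire hints after a window):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiConsumer;

// Hinted handoff: writes destined for a down replica are queued locally,
// tagged with the intended recipient, and replayed when it recovers.
class HintStore {
    static final class Hint {
        final String key; final byte[] value;
        Hint(String key, byte[] value) { this.key = key; this.value = value; }
    }

    private final Map<String, Queue<Hint>> hintsByTarget = new HashMap<>();

    void storeHint(String targetNode, String key, byte[] value) {
        hintsByTarget.computeIfAbsent(targetNode, t -> new ArrayDeque<>())
                     .add(new Hint(key, value));
    }

    // Called when the target is seen alive again; drains its queued writes in order.
    int deliverHints(String targetNode, BiConsumer<String, byte[]> send) {
        Queue<Hint> queue = hintsByTarget.remove(targetNode);
        if (queue == null) return 0;
        int delivered = 0;
        for (Hint h : queue) { send.accept(h.key, h.value); delivered++; }
        return delivered;
    }
}
```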
10. Full Architecture
Request flow for PUT("user:42", data) with N=3, W=2:
- Client sends to any node (becomes coordinator)
- Coordinator hashes "user:42" → position on ring → finds 3 consecutive nodes
- Coordinator sends write to all 3 replicas
- Waits for W=2 ACKs → responds success to client
- Third replica ACKs asynchronously (best-effort)
- If a replica is down → hinted handoff stores the write
- Read repair and anti-entropy keep replicas synchronized
11. Comparison
| Feature | Redis | DynamoDB | Cassandra | RocksDB |
|---|---|---|---|---|
| Persistence | Optional (RDB/AOF) | Yes (managed) | Yes (SSTable) | Yes (LSM) |
| Distribution | Cluster mode | Fully managed | Yes (ring) | Single node only |
| Consistency | Strong (single node) | Tunable | Tunable | Single-node strong |
| CAP | CP | AP | AP | — |
| Data model | Rich types (list, set, sorted set, hash) | Document + KV | Wide-column | Pure KV bytes |
| Best for | Cache, leaderboards, sessions | Serverless, variable scale | Time-series, IoT, writes | Embedded storage engine |
| Latency | Sub-ms (in-memory) | Single-digit ms | Low ms | Sub-ms |
12. Interview Cheat Sheet
When answering this question in an interview, hit these points in order:
- Clarify requirements — read-heavy vs write-heavy? Consistency vs latency? Data size?
- Start simple — HashMap + WAL → single node works, but doesn’t scale
- Add partitioning — consistent hashing for minimal key redistribution on topology changes
- Add replication — N=3, pick W and R based on consistency needs
- Failure detection — gossip protocol, no central coordinator
- Recovery — read repair for temporary skew, hinted handoff for node downtime, anti-entropy (Merkle trees) for long-term divergence
- Conflict resolution — LWW for simple cases, vector clocks when application semantics matter
What distinguishes a great answer:
- Explain why LSM trees are better than B-trees for write-heavy KV stores (sequential vs random I/O)
- Explain why consistent hashing (minimal key redistribution) vs modulo hashing (full reshuffle)
- Explain the quorum math: W+R > N means the read set and write set overlap by at least one node
- Know the tradeoff: vector clocks give you conflict detection but not conflict resolution — that’s the application’s job
Next in the series: #10 — Design a Notification System — fan-out patterns, push vs pull, and at-least-once delivery guarantees.