System Design: Key-Value Store — From HashMap to Dynamo-Style Distributed DB

Series System Design Interview Prep — #9 of 15

The Amazon Dynamo paper
(2007) is one of the most
influential papers in
distributed systems — it
introduced consistent
hashing + vector clocks
+ quorum reads/writes
as a practical package.

A key-value store is the simplest possible database interface: get(key), put(key, value), delete(key). Yet making one that’s distributed, fault-tolerant, and consistent is one of the hardest problems in computer science. This post walks the full journey — from a single-machine HashMap to a Dynamo-style ring that handles petabytes, partial failures, and concurrent writes.

The question: Design a distributed key-value store like Amazon DynamoDB or Apache Cassandra. Support GET/PUT/DELETE, handle node failures, maintain consistency, scale to petabytes.



1. The Problem

The key-value interface is brutally simple:

typescript
interface KeyValueStore {
  get(key: string): Value | null;
  put(key: string, value: Value): void;
  delete(key: string): void;
}

Yet to make this distributed, fault-tolerant, and consistent at petabyte scale you must solve:

  • Partitioning — how to split data across hundreds of nodes
  • Replication — how many copies, and where
  • Consistency — what happens when replicas disagree
  • Failure detection — how nodes know when peers are dead
  • Recovery — how to heal after partitions

We’ll build up layer by layer, explaining why each decision exists before showing how it works.


2. Level 1 — In-Memory HashMap

The simplest implementation: a hash map in memory.

java
class SimpleKVStore {
    private final Map<String, byte[]> store = new HashMap<>();

    public byte[] get(String key) {
        return store.get(key);   // O(1)
    }

    public void put(String key, byte[] value) {
        store.put(key, value);   // O(1)
    }

    public void delete(String key) {
        store.remove(key);      // O(1)
    }
}

Problems:

  • No persistence — process crash loses all data
  • Single machine — bound by one machine’s RAM
  • No fault tolerance — one machine going down means 100% downtime

Adding a Write-Ahead Log (WAL)

Before any mutation, append it to a log file. On startup, replay the log to rebuild state.

java
class WALKVStore {
    private final Map<String, byte[]> store = new HashMap<>();
    private final Path walPath;
    private final BufferedWriter wal;

    public WALKVStore(Path walPath) throws IOException {
        this.walPath = walPath;
        this.wal = Files.newBufferedWriter(walPath,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public void put(String key, byte[] value) throws IOException {
        // 1. Write to WAL first (crash-safe)
        wal.write("PUT," + key + "," + Base64.getEncoder().encodeToString(value) + "\n");
        wal.flush(); // a production WAL would also fsync (FileChannel.force) for durability
        // 2. Apply to in-memory map
        store.put(key, value);
    }

    public void delete(String key) throws IOException {
        wal.write("DEL," + key + "\n");
        wal.flush();
        store.remove(key);
    }

    public void recover() throws IOException {
        // Replay WAL on restart
        for (String line : Files.readAllLines(walPath)) {
            String[] parts = line.split(",");
            if (parts[0].equals("PUT"))
                store.put(parts[1], Base64.getDecoder().decode(parts[2]));
            else if (parts[0].equals("DEL"))
                store.remove(parts[1]);
        }
    }
}

The WAL solves durability. But the data is still on one machine. We need to scale.


3. Level 2 — Single-Node with Persistence (LSM Tree)

LSM trees are write-optimized
by design: every write is a
sequential append. Random
writes are the nemesis of
spinning disks, and LSM
eliminates them entirely.

Random disk writes (updating a B-Tree in place) are the bottleneck for write-heavy workloads. The Log-Structured Merge Tree (LSM Tree) turns all writes into sequential appends:

Write path:

  1. Write goes to the in-memory MemTable (a sorted structure, typically a skip list or red-black tree)
  2. When MemTable exceeds threshold (~64 MB), flush to an immutable SSTable file on disk
  3. Background compaction merges SSTables, removing old versions and tombstones

Read path:

  1. Check MemTable (newest data first)
  2. Check SSTables from newest to oldest (aided by Bloom filters to skip non-existent keys)

Write Path:
─────────────────────────────────────────────────────────

  Client PUT(k, v)
        │
        ▼
  ┌─────────────────────────────────────────────┐
  │ WAL (append-only, for crash recovery)       │
  └─────────────────────────────────────────────┘
        │
        ▼
  ┌─────────────────────────────────────────────┐
  │ MemTable (sorted in memory, ~64 MB cap)     │ ◄── fast O(log n)
  └──────────────────┬──────────────────────────┘
                     │ flush when full
                     ▼
  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │ SSTable₀ │  │ SSTable₁ │  │ SSTable₂ │ ◄── immutable, sorted
  └──────────┘  └──────────┘  └──────────┘
        │             │             │
        └─────────────┴─────────────┘
                      │ compaction (background)
                      ▼
              ┌──────────────┐
              │ SSTable_new  │ ◄── merged, deduped, tombstones removed
              └──────────────┘

Used by: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB

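The write and read paths above can be sketched in a few lines of Java. This is a toy in-memory model (no WAL, no compaction, no tombstones); `ToyLsmStore` and its tiny flush threshold are illustrative names, not a real engine's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy LSM store: a sorted MemTable that flushes to immutable "SSTables"
// (here just frozen sorted maps). The threshold is tiny for demonstration.
class ToyLsmStore {
    private static final int MEMTABLE_LIMIT = 4;
    private TreeMap<String, String> memTable = new TreeMap<>();
    private final List<TreeMap<String, String>> ssTables = new ArrayList<>(); // newest last

    public void put(String key, String value) {
        memTable.put(key, value);
        if (memTable.size() >= MEMTABLE_LIMIT) {
            ssTables.add(memTable);      // flush: the MemTable becomes an immutable SSTable
            memTable = new TreeMap<>();
        }
    }

    public String get(String key) {
        if (memTable.containsKey(key)) return memTable.get(key); // newest data first
        for (int i = ssTables.size() - 1; i >= 0; i--)            // then newest SSTable to oldest
            if (ssTables.get(i).containsKey(key)) return ssTables.get(i).get(key);
        return null;
    }
}
```

Note how a re-written key in the MemTable shadows its older copy in an SSTable — exactly the newest-first read order described above.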
Bloom filters are probabilistic data structures that answer "is this key definitely NOT in this SSTable?" in O(1). They eliminate most disk reads for missing keys — critical when you have dozens of SSTable files.
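A minimal Bloom filter sketch, assuming double hashing derived from `String.hashCode` (real stores use stronger hashes like MurmurHash) and illustrative, untuned sizes:

```java
import java.util.BitSet;

// Minimal Bloom filter: k bit positions per key via double hashing.
// "false" means definitely absent; "true" means possibly present.
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;              // second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) bits.set(position(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(position(key, i))) return false; // one unset bit => definitely absent
        return true;
    }
}
```

The one-sided guarantee is the whole point: a `false` answer lets the read path skip that SSTable's disk read entirely.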

4. Level 3 — CAP Theorem

Before distributing, we need the fundamental theorem of distributed systems.

CAP Theorem: A distributed system can guarantee at most two of: Consistency, Availability, and Partition Tolerance. In practice, network partitions will happen — so you're really choosing between C and A.
  • Consistency (C): Every read receives the most recent write
  • Availability (A): Every request receives a response (not necessarily the most recent)
  • Partition Tolerance (P): The system continues operating despite network partitions


5. Level 4 — Consistent Hashing for Partitioning

A naive approach: node = hash(key) % N. Problem: when N changes (node added/removed), every key remaps. With 1 billion keys and 10 nodes, adding one node moves ~900 million keys.

Consistent hashing maps both keys and nodes onto the same circular ring (0 to 2³²). A key belongs to the first node clockwise from its hash.

  • Adding a node: only keys between the new node and its predecessor need to move (~1/N of keys)
  • Removing a node: only that node’s keys need to move to its successor

Consistent Hash Ring (simplified, 4 nodes):

           hash(key_a) = 45
           │
     0     ▼                              2³²
  ──────────────────────────────────────────
  │  [Node A: 60]           [Node B: 170] │
  │                                       │
  │  [Node D: 310]          [Node C: 240] │
  ──────────────────────────────────────────

  key_a hashes to 45  → next node clockwise = Node A (60)
  key_b hashes to 200 → next node clockwise = Node C (240)

Virtual Nodes: each physical node occupies V positions on the ring (e.g. V=150). This smooths load distribution even with heterogeneous hardware, and is used by Cassandra (vnodes) and DynamoDB.

Why virtual nodes? Without them, when a node is removed, all its load moves to a single successor — creating a hotspot. With 150 virtual nodes per physical node, the departed load fans out across all remaining nodes proportionally.
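A ring with virtual nodes fits in a `TreeMap`, whose `tailMap` gives "first node clockwise" directly. This sketch uses `String.hashCode` for brevity (a real ring would use MD5 or MurmurHash, and would handle vnode hash collisions); all class and method names are illustrative.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hash ring with virtual nodes, backed by a sorted map of
// ring position -> physical node.
class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int vnodes;

    HashRing(int vnodes) { this.vnodes = vnodes; }

    void addNode(String node) {
        for (int v = 0; v < vnodes; v++)
            ring.put(hash(node + "#" + v), node); // V positions per physical node
    }

    void removeNode(String node) {
        for (int v = 0; v < vnodes; v++)
            ring.remove(hash(node + "#" + v));
    }

    // First node clockwise from the key's hash, wrapping around the ring
    String lookup(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private int hash(String s) { return s.hashCode() & 0x7fffffff; }
}
```

The key property is easy to observe: after adding a node, any given key either stays where it was or moves to the new node — never to a different existing node.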

6. Level 5 — Replication

Quorum with N=3, W=2, R=2
is the sweet spot: you can
tolerate one node failure
on both reads and writes
while still maintaining
strong consistency.

Data is replicated to N consecutive nodes on the ring (the preference list). Three strategies for handling writes:

Strategy       How                                                  Consistency   Latency
Synchronous    Coordinator waits for all N replicas to ACK          Strong        Slowest (bounded by slowest node)
Asynchronous   Coordinator writes to 1, replicates in background    Eventual      Fastest
Quorum         Write to W, read from R, require W+R > N             Tunable       Middle

The quorum approach is the key insight from the Dynamo paper: by tuning W and R, you can slide a dial between consistency and performance.
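The quorum arithmetic is simple enough to encode directly. A small helper (names are illustrative) makes the two guarantees explicit: overlap when W+R > N, and failure tolerance of N−W for writes and N−R for reads.

```java
// Quorum sanity checks: W+R > N guarantees every read set intersects
// every write set in at least one replica, so reads see the latest write.
class Quorum {
    static boolean readsSeeLatestWrite(int n, int w, int r) {
        return w + r > n;
    }
    static int writeFailuresTolerated(int n, int w) { return n - w; }
    static int readFailuresTolerated(int n, int r)  { return n - r; }
}
```

With the classic N=3, W=2, R=2: reads overlap writes, and both paths survive one dead replica. Drop to W=1, R=1 and you get the fastest path but only eventual consistency.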


7. Level 6 — Conflict Resolution with Vector Clocks

With async replication, two nodes can independently accept writes for the same key, resulting in conflicting versions. How do you know which is “correct”?

Last-Write-Wins (LWW)

Use timestamps: keep the value with the highest timestamp. Simple, but clocks drift — NTP can be off by tens of milliseconds. A causally earlier write with a slightly later system clock wins incorrectly.

Vector Clocks

Each value carries a version vector — a map of {nodeId → counter}. When a node updates a value, it increments its own counter.


Vector Clock Example:
─────────────────────────────────────────────────────────

  Initial:                   key="cart", value=[], vc={}

  Client A writes to Node 1: value=["book"], vc={N1:1}
  Client B writes to Node 2: value=["pen"],  vc={N2:1}
        ↓ (both accepted — nodes were partitioned)

  Later, Node 3 reads both versions:
    Version 1: value=["book"], vc={N1:1}
    Version 2: value=["pen"],  vc={N2:1}

  Neither vector clock dominates the other (can't compare N1:1 vs N2:1)
  → CONFLICT detected → application must merge → ["book", "pen"]

  If the versions had been:
    Version 1: value=["book"],       vc={N1:1}
    Version 2: value=["book","pen"], vc={N1:1, N2:1}
  Then V2 dominates V1 (all counters ≥) → V2 wins, no conflict.

Amazon shopping cart is the canonical example: two devices add items concurrently during a network partition. Vector clocks detect the conflict; the client merges by taking the union. This is safer than LWW (which could silently discard a purchase).

Causality vs. Conflict

Two events are causally related if one happened-before the other (Lamport’s happens-before relation). Vector clocks capture this precisely:

  • V1 dominates V2 if every counter in V1 ≥ corresponding counter in V2 → V1 is a successor, discard V2
  • Neither dominates → concurrent → conflict, application must merge
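The dominance rule above translates directly into a comparison function. A sketch, treating missing node entries as counter 0 (class and enum names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Vector clock comparison: A dominates B if every counter in A >= the
// corresponding counter in B; if each side exceeds the other somewhere,
// the versions are concurrent and must be merged by the application.
class VectorClocks {
    enum Order { DOMINATES, DOMINATED, CONCURRENT, EQUAL }

    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aGreater = false, bGreater = false;
        Map<String, Integer> all = new HashMap<>(a);
        b.forEach((k, v) -> all.merge(k, v, Math::max));
        for (String node : all.keySet()) {
            int av = a.getOrDefault(node, 0), bv = b.getOrDefault(node, 0);
            if (av > bv) aGreater = true;
            if (bv > av) bGreater = true;
        }
        if (aGreater && bGreater) return Order.CONCURRENT;
        if (aGreater) return Order.DOMINATES;
        if (bGreater) return Order.DOMINATED;
        return Order.EQUAL;
    }
}
```

Running the shopping-cart example through it: `{N1:1}` vs `{N2:1}` is CONCURRENT (conflict), while `{N1:1, N2:1}` DOMINATES `{N1:1}` (no conflict).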

8. Level 7 — Gossip Protocol for Failure Detection

In a large cluster, a central failure detector is a single point of failure. Gossip protocols give us decentralized, scalable failure detection.

How it works:

  1. Every T seconds, each node picks K random peers and sends a “heartbeat” message with its membership list
  2. The receiver merges the membership list (taking max generation/heartbeat counts)
  3. If a node’s heartbeat hasn’t increased in T_suspect seconds → mark suspect
  4. If still silent after T_dead seconds → mark dead, redistribute its keys
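Step 2 — the merge — is the heart of the protocol. A minimal sketch of one node's membership table (field and class names are illustrative; real implementations also track generation numbers and timestamps):

```java
import java.util.HashMap;
import java.util.Map;

// Gossip membership merge: on receiving a peer's heartbeat table, keep
// the maximum heartbeat counter seen for each node. A node whose counter
// stops increasing is eventually marked suspect, then dead.
class GossipState {
    final Map<String, Long> heartbeats = new HashMap<>(); // nodeId -> heartbeat counter

    void merge(Map<String, Long> fromPeer) {
        fromPeer.forEach((node, hb) -> heartbeats.merge(node, hb, Math::max));
    }
}
```

Because every exchange takes the element-wise max, information about a live node spreads to the whole cluster in O(log N) gossip rounds, with no central coordinator.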


9. Level 8 — Read Repair & Hinted Handoff

Two mechanisms that keep the cluster eventually consistent without coordinator intervention:

Read Repair

When the coordinator sends a quorum read, it compares versions from R replicas. If any replica returns a stale version, the coordinator sends an async write to bring it up to date.


Read Repair:

  Client GET("user:42")
        │
  Coordinator reads from R=2 replicas:
        │── Node A: value="Alice",     vc={N1:5} ◄── newer
        │── Node B: value="Alice_old", vc={N1:3} ◄── stale
        │
  Return {N1:5} to client (newest version)
        │
  Async: coordinator sends PUT to Node B to bring it up to date
        ↓
  Node B: value="Alice", vc={N1:5} ◄── healed


Hinted Handoff

If the target replica for a write is temporarily down, another node stores the write with a “hint” (the intended recipient’s ID). When the target comes back online, the hint-holder delivers the queued writes.


Hinted Handoff:

  PUT("order:99", data) → should go to Node C (down!)
        │
  Coordinator stores on Node D instead:
    { data: ..., hint: "deliver to Node C when it recovers" }
        │
  Node C recovers → Node D detects this (via gossip)
        │
  Node D delivers the hinted write to Node C
  Node D deletes the hint
        │
  Cluster fully consistent again

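The hint store itself is just a per-target queue. A sketch (names are illustrative; a real store would persist hints to disk and expire them):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hinted handoff sketch: writes destined for a down replica are queued
// under its id, then drained when gossip reports it healthy again.
class HintStore {
    record Hint(String key, String value) {}

    private final Map<String, List<Hint>> hintsByTarget = new HashMap<>();

    void storeHint(String targetNode, String key, String value) {
        hintsByTarget.computeIfAbsent(targetNode, n -> new ArrayList<>())
                     .add(new Hint(key, value));
    }

    // Called when gossip marks targetNode healthy: deliver hints, then drop them.
    List<Hint> drainFor(String targetNode) {
        List<Hint> hints = hintsByTarget.remove(targetNode);
        return hints == null ? List.of() : hints;
    }
}
```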
Hinted handoff has a time limit. If Node C is down for too long (e.g. a week), hints are discarded to avoid unbounded storage. This is why Cassandra uses anti-entropy with Merkle trees for long-term reconciliation — comparing tree hashes to find diverged data ranges.

10. Full Architecture


Request flow for PUT("user:42", data) with N=3, W=2:

  1. Client sends to any node (becomes coordinator)
  2. Coordinator hashes "user:42" → position on ring → finds 3 consecutive nodes
  3. Coordinator sends write to all 3 replicas
  4. Waits for W=2 ACKs → responds success to client
  5. Third replica ACKs asynchronously (best-effort)
  6. If a replica is down → hinted handoff stores the write
  7. Read repair and anti-entropy keep replicas synchronized
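Steps 3–5 — fan out to all N replicas but acknowledge after W — can be sketched with a latch. This is a toy model, not any real store's code: `Replica` is a stand-in for the RPC transport, and timeouts/hints are elided.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Coordinator write path for N replicas, W write quorum: send to everyone,
// respond to the client as soon as W replicas have acknowledged.
class Coordinator {
    interface Replica { boolean write(String key, byte[] value); }

    static boolean quorumWrite(List<Replica> replicas, int w, String key, byte[] value) {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        CountDownLatch acks = new CountDownLatch(w);
        for (Replica r : replicas)
            pool.submit(() -> { if (r.write(key, value)) acks.countDown(); });
        try {
            return acks.await(1, TimeUnit.SECONDS); // success iff W ACKs arrive in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdown(); // remaining replicas still finish asynchronously (best-effort)
        }
    }
}
```

With one replica down, W=2 still succeeds while W=3 (fully synchronous) times out — the availability/consistency dial in miniature.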

11. Comparison

Feature        Redis                              DynamoDB                   Cassandra                  RocksDB
Persistence    Optional (RDB/AOF)                 Yes (managed)              Yes (SSTable)              Yes (LSM)
Distribution   Cluster mode                       Fully managed              Yes (ring)                 Single node only
Consistency    Strong (single node)               Tunable                    Tunable                    Single-node strong
CAP            CP                                 AP                         AP                         N/A (single node)
Data model     Rich types (list, set,             Document + KV              Wide-column                Pure KV bytes
               sorted set, hash)
Best for       Cache, leaderboards, sessions      Serverless, variable       Time-series, IoT,          Embedded storage
                                                  scale                      writes                     engine
Latency        Sub-ms (in-memory)                 Single-digit ms            Low ms                     Sub-ms

12. Interview Cheat Sheet

When answering this question in an interview, hit these points in order:

  1. Clarify requirements — read-heavy vs write-heavy? Consistency vs latency? Data size?
  2. Start simple — HashMap + WAL → single node works, but doesn’t scale
  3. Add partitioning — consistent hashing for minimal key redistribution on topology changes
  4. Add replication — N=3, pick W and R based on consistency needs
  5. Failure detection — gossip protocol, no central coordinator
  6. Recovery — read repair for temporary skew, hinted handoff for node downtime, anti-entropy (Merkle trees) for long-term divergence
  7. Conflict resolution — LWW for simple cases, vector clocks when application semantics matter


The Dynamo paper (2007) is required reading. In 12 pages, Amazon described how they solved all of this in production — consistent hashing, vector clocks, quorum NWR, gossip-based membership, hinted handoff, and Merkle tree anti-entropy. Every major distributed KV store since has been a variation on this design.

What distinguishes a great answer:

  • Explain why LSM trees are better than B-trees for write-heavy KV stores (sequential vs random I/O)
  • Explain why consistent hashing (minimal key redistribution) vs modulo hashing (full reshuffle)
  • Explain the quorum math: W+R > N means the read set and write set overlap by at least one node
  • Know the tradeoff: vector clocks give you conflict detection but not conflict resolution — that’s the application’s job

Next in the series: #10 — Design a Notification System — fan-out patterns, push vs pull, and at-least-once delivery guarantees.