System Design: Distributed Transactions — 2PC, Saga Pattern, and Event Sourcing
2PC was invented for distributed databases in the 1970s. It works, but the coordinator crash problem is why modern microservices almost universally prefer Sagas instead.
An e-commerce checkout looks simple from the outside: click Buy, get a confirmation email. Under the hood, that single user action touches four independent microservices (Payment, Inventory, Shipment, and Notification), each with its own database, its own failure modes, and no awareness of what the others are doing. Keeping them consistent is one of the harder problems in distributed systems.
The question: An e-commerce order involves charging the payment, reserving inventory, creating a shipment record, and sending a confirmation email — across 4 microservices with 4 separate databases. How do you ensure all succeed or all roll back?
1. The Problem with Distributed State
In a single relational database, ACID guarantees are free:
- Atomicity — either all rows update or none do
- Consistency — constraints are always valid
- Isolation — concurrent transactions don’t see each other’s partial writes
- Durability — committed data survives crashes
Across microservices, none of these guarantees exist. Each service owns its own database. There is no shared transaction manager. The network between services is unreliable. A service can crash at any moment.
The failure scenario is concrete: Payment service charges the customer’s card ✓, then the Inventory service crashes mid-reservation. The customer is charged. No order is placed. Support tickets flood in.
2. Level 1 — The Naïve Approach (Wrong Answer)
The first instinct is to just call services sequentially:
```python
def place_order(order):
    # Step 1: charge payment
    payment_result = payment_service.charge(order.user_id, order.amount)
    # Step 2: reserve inventory ← what if this crashes?
    inventory_result = inventory_service.reserve(order.items)
    # Step 3: create shipment ← what if THIS crashes?
    shipment_result = shipment_service.create(order)
    # Step 4: send confirmation email
    email_service.send_confirmation(order.user_id, order.id)
    return "order placed"
```
There is no error handling. There is no rollback. If inventory_service.reserve() raises a network timeout, the payment is gone and no one knows. If shipment_service.create() fails after inventory is reserved, you now have reserved stock for an order that doesn’t exist.
This is unacceptable in any financial system. In a real interview, naming this anti-pattern early and immediately moving to solutions is a strong signal.
3. Level 2 — Two-Phase Commit (2PC)
2PC introduces a coordinator that orchestrates a two-round protocol across all participants:
Phase 1 — Prepare: The coordinator asks every participant: “Can you commit this transaction?” Each participant locks its resources, writes to a redo log, and votes YES or NO. Crucially, a YES vote is a promise — the participant guarantees it can commit if asked.
Phase 2 — Commit or Rollback: If every participant voted YES, the coordinator sends COMMIT to all. If any voted NO (or timed out), the coordinator sends ROLLBACK to all. Only once the coordinator gets acknowledgements from everyone does the transaction complete.
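The two rounds can be sketched in a few lines. This is a minimal in-memory illustration, not a production protocol: `Participant` and `two_phase_commit` are hypothetical names, and a real participant would lock rows and write a durable log rather than flip a string.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical in-memory participant used only for illustration."""
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared -> committed | rolled_back

    def prepare(self) -> bool:
        # A YES vote is a promise: once prepared, the participant waits for Phase 2.
        if self.can_commit:
            self.state = "prepared"
            return True
        return False

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous YES, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

Note what the sketch leaves out: timeouts, durable logging of the decision, and acknowledgements. Those omissions are exactly where the protocol gets hard in practice.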
Why 2PC Fails in Practice
The fatal flaw is the coordinator crash problem. Once a participant votes YES in Phase 1, it is locked in: it cannot unilaterally commit or roll back, and must wait for Phase 2. If the coordinator crashes after Phase 1 but before Phase 2, every prepared participant keeps holding its locks indefinitely. Strictly speaking this is blocking rather than deadlock, but the effect is similar: the affected rows stay frozen until the coordinator recovers (or an operator steps in), reads its log, and replays Phase 2.
In a microservices environment with dozens of services, this is catastrophic. 2PC also requires all participants to be available simultaneously — one slow service blocks everyone. It is effectively incompatible with the independent deployability and failure isolation that microservices promise.
4. Level 3 — The Saga Pattern
A Saga breaks the distributed transaction into a sequence of local transactions, each of which publishes an event or message when it completes. If any step fails, the Saga executes compensating transactions — undo operations — for every step that already succeeded.
The key insight: compensating transactions are not rollbacks. Rollbacks are atomic and invisible. Compensations are new transactions that undo the effect of previous ones. Other services may have already seen the intermediate state. This means Sagas provide eventual consistency, not ACID isolation.
Choreography vs Orchestration
Choreography: Services react to events. No central controller. Payment service emits PaymentCharged → Inventory service listens and reserves stock → emits InventoryReserved → Shipment service creates record → etc. Simple to set up, hard to trace and debug.
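A choreographed flow can be sketched with a tiny in-process event bus. This is an illustrative toy, assuming synchronous delivery; `EventBus`, `wire_services`, and the event names `InventoryReservationFailed`, `ShipmentCreated`, and `PaymentRefunded` are hypothetical and stand in for real services talking over a broker.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process bus; a real system would use Kafka or RabbitMQ."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus, in_stock=True):
    # Each "service" only knows which events it reacts to and which it emits.
    def reserve_inventory(order):
        if in_stock:
            bus.publish("InventoryReserved", order)
        else:
            bus.publish("InventoryReservationFailed", order)

    def create_shipment(order):
        bus.publish("ShipmentCreated", order)

    def refund_payment(order):
        # Compensation is triggered by a failure event, not by a central controller.
        bus.publish("PaymentRefunded", order)

    bus.subscribe("PaymentCharged", reserve_inventory)
    bus.subscribe("InventoryReserved", create_shipment)
    bus.subscribe("InventoryReservationFailed", refund_payment)
```

Publishing `PaymentCharged` on a wired bus walks the whole chain. The flow exists only as the sum of the subscriptions, which is why tracing it in a real system takes effort.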
Orchestration: A dedicated Saga Orchestrator sends explicit commands to each service and tracks the state machine. When a step fails, the orchestrator issues compensating commands in reverse order. More complex upfront, much easier to observe and reason about.
Compensating Transactions Are Not Free
Every forward step must have a pre-designed compensating action. RefundPayment needs to call the payment provider’s refund API. ReleaseInventory needs to decrement the reservation counter. Compensations can also fail — which requires their own retry logic and dead-letter queues. The distributed systems complexity doesn’t disappear; it moves into the compensation design.
```python
class SagaFailedError(Exception):
    """Raised when a forward step fails, after compensations have run."""

    def __init__(self, step, cause):
        super().__init__(f"saga failed at step {step!r}: {cause}")
        self.step = step


class OrderSagaOrchestrator:
    def execute(self, order):
        completed = []  # (name, compensate) for every step that succeeded
        steps = [
            ("create_order", self.create_order, self.cancel_order),
            ("charge_payment", self.charge_payment, self.refund_payment),
            ("reserve_inv", self.reserve_inv, self.release_inv),
            ("create_shipment", self.create_shipment, self.cancel_shipment),
            ("send_email", self.send_email, None),
        ]
        for name, forward, compensate in steps:
            try:
                forward(order)
                completed.append((name, compensate))
            except Exception as e:
                # run compensating transactions in reverse
                for _, comp in reversed(completed):
                    if comp:
                        comp(order)  # must be idempotent
                raise SagaFailedError(name, e) from e
```
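The orchestrator above calls each compensation once and assumes it succeeds. One way to handle compensations that fail is to retry with backoff and park the stubborn ones for manual follow-up. The helper below is a sketch under that assumption; `run_compensation` and the plain-list dead-letter queue are hypothetical.

```python
import time


def run_compensation(compensate, order, dead_letters, max_attempts=3, base_delay=0.1):
    """Retry a compensating action; record it in dead_letters if it keeps failing."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            compensate(order)  # must be idempotent, since it may run more than once
            return True
        except Exception as exc:
            last_error = exc
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    # Out of attempts: hand the work to a human or a separate repair job.
    dead_letters.append(
        {"order": order, "action": compensate.__name__, "error": repr(last_error)}
    )
    return False
```

In the orchestrator, `comp(order)` would become `run_compensation(comp, order, dead_letters)`, so one failed refund no longer prevents the remaining compensations from running.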
5. Level 4 — The Outbox Pattern
Sagas rely on events being reliably published to a message bus (Kafka, RabbitMQ). Here lies a subtle trap: the dual-write problem.
When a service handles a command, it needs to do two things atomically:
- Write to its own database (e.g., update order status to PAYMENT_CHARGED)
- Publish an event to Kafka (PaymentCharged)
These two operations cannot be wrapped in a single transaction — one is a database write, the other is a network call to a different system. If the database commits but Kafka publish fails, the downstream services never see the event. The saga stalls silently.
The Outbox Pattern solves this by making Kafka publishing an asynchronous side effect of the database write:
```sql
-- outbox table lives in the SAME database as your business data
CREATE TABLE outbox_events (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id   TEXT NOT NULL,
    aggregate_type TEXT NOT NULL,
    event_type     TEXT NOT NULL,
    payload        JSONB NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at   TIMESTAMPTZ
);
```
```python
import json

# Inside a single database transaction:
with db.transaction():
    # business write
    db.execute(
        "UPDATE orders SET status='PAYMENT_CHARGED' WHERE id=?",
        order_id
    )
    # outbox write: same transaction, same atomicity guarantee
    # (note the trailing space so the two string literals join into valid SQL)
    db.execute(
        "INSERT INTO outbox_events (aggregate_id, aggregate_type, event_type, payload) "
        "VALUES (?, 'Order', 'PaymentCharged', ?)",
        order_id, json.dumps(payload)
    )

# If this transaction commits, BOTH writes are durable.
# If it rolls back, NEITHER write happens.
# A separate relay process (Debezium CDC) tails the outbox table
# and publishes events to Kafka independently, with retries.
```
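The relay process (typically Debezium reading the database’s change-data-capture stream, or a simple poller that stamps published_at on each row it has sent) publishes the outbox event to Kafka. The relay can retry on Kafka failures without risk of data loss, because the event stays in the database until it is delivered. Note that this gives at-least-once delivery, not exactly-once: a relay that crashes after publishing but before recording its progress will send the event again, so consumers must be idempotent. The cost is a small write amplification.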
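A polling relay is the simplest alternative to CDC and shows the mechanics clearly. The sketch below uses SQLite and a simplified outbox table (integer ids, no aggregate columns) purely so it runs anywhere; `make_outbox`, `relay_once`, and the `publish` callback are hypothetical stand-ins for a real schema and a real Kafka producer.

```python
import json
import sqlite3


def make_outbox(conn):
    """Create a simplified outbox table (SQLite stand-in, not the full schema)."""
    conn.execute(
        "CREATE TABLE outbox_events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published_at TEXT)"
    )


def relay_once(conn, publish, batch_size=100):
    """Publish unpublished rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE published_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for event_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried next run.
        publish(event_type, json.loads(payload))
        conn.execute(
            "UPDATE outbox_events SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
            (event_id,),
        )
        conn.commit()
        sent += 1
    return sent
```

A crash between `publish` and the `UPDATE` causes the same event to be published again on the next run, so downstream consumers need to tolerate duplicates.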
6. Level 5 — Event Sourcing
Event sourcing flips the storage model entirely. Instead of storing the current state of an entity, you store the sequence of events that produced that state. The current state is derived by replaying events from the beginning (or from a snapshot).
```python
# Traditional storage: current state snapshot
orders_table = {
    "order_id": "ord-42",
    "status": "shipped",
    "amount": 99.00,
    "items": [...]
}

# Event sourcing: ordered log of immutable events
event_store = [
    {"type": "OrderPlaced", "orderId": "ord-42", "userId": "u-7", "ts": "10:00:01"},
    {"type": "PaymentCharged", "orderId": "ord-42", "amount": 99.00, "ts": "10:00:03"},
    {"type": "InventoryReserved", "orderId": "ord-42", "items": [...], "ts": "10:00:05"},
    {"type": "OrderShipped", "orderId": "ord-42", "tracking": "TRK-9", "ts": "10:02:10"},
]

def rebuild_order_state(events):
    state = {}
    for e in events:
        if e["type"] == "OrderPlaced":
            state = {"id": e["orderId"], "status": "placed", "userId": e["userId"]}
        elif e["type"] == "PaymentCharged":
            state["status"] = "paid"
            state["amount"] = e["amount"]
        elif e["type"] == "InventoryReserved":
            state["status"] = "reserved"
        elif e["type"] == "OrderShipped":
            state["status"] = "shipped"
            state["tracking"] = e["tracking"]
    return state
```
Event sourcing sounds simple, but the devil is in schema evolution: when your event format changes, you need to migrate or version every event ever stored. This is why many teams start with it and regret it.
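One common way to cope with schema evolution is to leave stored events untouched and upgrade them on read, a technique often called upcasting. The sketch below assumes events carry a `version` field (missing means version 1) and invents a v1 to v2 change for illustration: `amount` in dollars becomes `amount_cents`. The names `UPCASTERS` and `upcast` are hypothetical.

```python
def _payment_charged_v1_to_v2(event):
    # Hypothetical schema change: float dollars -> integer cents.
    upgraded = dict(event)  # never mutate the stored event
    upgraded["amount_cents"] = round(upgraded.pop("amount") * 100)
    upgraded["version"] = 2
    return upgraded


# One entry per (event type, from-version); each function upgrades by one version.
UPCASTERS = {
    ("PaymentCharged", 1): _payment_charged_v1_to_v2,
}


def upcast(event):
    """Return a copy of the event upgraded to the newest known version."""
    current = dict(event)
    current.setdefault("version", 1)
    # Chain single-step upgrades until no upcaster applies.
    while (current["type"], current["version"]) in UPCASTERS:
        current = UPCASTERS[(current["type"], current["version"])](current)
    return current
```

Replay code then only ever handles the latest shape of each event. The price is that every schema change adds a permanent entry to the upcaster table.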
Snapshot Optimization
For aggregates with long event histories (thousands of events), replaying from event zero on every read is expensive. The solution is periodic snapshots: after every N events, serialize the current state to a snapshot store. On read, load the most recent snapshot then replay only the events that occurred after it.
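The snapshot-then-replay read path can be sketched as follows. To keep it short, the aggregate here is a hypothetical stock counter rather than an order, and both the event list and the snapshot list live in memory; `apply_event`, `load_state`, and `append_event` are illustrative names.

```python
def apply_event(stock, event):
    # Hypothetical aggregate: a single stock level driven by two event types.
    if event["type"] == "StockAdded":
        return stock + event["qty"]
    if event["type"] == "StockReserved":
        return stock - event["qty"]
    return stock


def load_state(events, snapshots):
    """Return (state, replayed_count), starting from the most recent snapshot."""
    # Each snapshot is (number of events it covers, state at that point).
    version, state = snapshots[-1] if snapshots else (0, 0)
    tail = events[version:]
    for event in tail:
        state = apply_event(state, event)
    return state, len(tail)


def append_event(events, snapshots, event, every=100):
    """Append an event and take a snapshot after every `every` events."""
    events.append(event)
    if len(events) % every == 0:
        state, _ = load_state(events, snapshots)
        snapshots.append((len(events), state))
```

With 250 events and a snapshot every 100, a read replays only the 50 events after the last snapshot instead of all 250.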
7. Level 6 — CQRS
Command Query Responsibility Segregation (CQRS) is the natural companion to Event Sourcing. The insight is that write workloads (processing commands, enforcing invariants) and read workloads (serving queries, generating views) have almost nothing in common — so separate them.
Write side: Commands → Aggregate (validates invariants, enforces business rules) → Event Store → Event Bus
Read side: Multiple independent projections subscribe to the event bus and maintain their own read-optimized stores. An Order History Projection might maintain a flat table sorted by user + timestamp. An Inventory Projection maintains current stock levels. An Analytics Projection feeds a data warehouse. Each is rebuilt independently from the event stream.
```python
# WRITE SIDE: command handler
class OrderAggregate:
    def handle_place_order(self, cmd):
        if self.status != "new":
            raise InvalidStateError("Order already placed")
        self.apply(OrderPlaced(cmd.order_id, cmd.user_id, cmd.items))

    def apply(self, event):
        self.event_store.append(event)
        self.event_bus.publish(event)  # via outbox pattern


# READ SIDE: projections (one per read model)
class OrderHistoryProjection:
    def on_order_placed(self, event):
        self.read_db.upsert("order_history", {
            "order_id": event.order_id,
            "user_id": event.user_id,
            "placed_at": event.timestamp,
            "status": "placed"
        })

    def on_order_shipped(self, event):
        self.read_db.update(
            "order_history",
            where={"order_id": event.order_id},
            set={"status": "shipped", "tracking": event.tracking}
        )
```
The key benefit: the read model is disposable. If you need a new query pattern, build a new projection and replay all historical events into it. The event store is the single source of truth.
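Building a new read model by replay can be sketched in a few lines. The function below is a hypothetical projection that groups order ids by their current status, assuming events shaped like the dictionaries in the event sourcing example (`type` and `orderId` keys).

```python
# Hypothetical mapping from event type to the status it produces.
STATUS_AFTER = {
    "OrderPlaced": "placed",
    "PaymentCharged": "paid",
    "InventoryReserved": "reserved",
    "OrderShipped": "shipped",
}


def build_orders_by_status(events):
    """Replay the full history into a brand-new read model."""
    status_of = {}
    for e in events:
        if e["type"] in STATUS_AFTER:
            # Later events overwrite earlier ones, so the last status wins.
            status_of[e["orderId"]] = STATUS_AFTER[e["type"]]
    by_status = {}
    for order_id, status in status_of.items():
        by_status.setdefault(status, []).append(order_id)
    return by_status
```

Because the function reads only the event log, it can be added long after the events were written, and thrown away just as easily.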
8. Comparison: When to Use What
| Pattern | Consistency | Complexity | Latency | Best For |
|---|---|---|---|---|
| 2PC | Strong (ACID) | High | High (locks) | Financial, same-team DBs |
| Saga (Choreography) | Eventual | Medium | Low | Simple microservice flows |
| Saga (Orchestration) | Eventual | High | Low | Complex, observable flows |
| Outbox Pattern | Eventual | Low | Low | Reliable event publishing |
| Event Sourcing | Eventual | Very High | Low | Audit trails, time-travel |
| CQRS | Eventual | Very High | Low (reads) | High-read, complex domains |
9. Real-World: Amazon Order Placement
Order flows at Amazon’s scale are a natural fit for Saga orchestration. Here’s a simplified, illustrative walk-through of how placing a single order could work, with the compensating actions that fire on failure:
| Step | Action | Compensation |
|---|---|---|
| 1 | Create order record (PENDING) | Delete pending order |
| 2 | Authorize payment (hold, not capture) | Release payment authorization |
| 3 | Reserve inventory at warehouse | Release inventory reservation |
| 4 | Assign carrier and tracking | Cancel carrier booking |
| 5 | Capture payment (finalize charge) | Issue refund |
| 6 | Trigger fulfillment pick-pack | Cancel pick-pack job |
| 7 | Send confirmation email | (no compensation needed) |
Notice that payment authorization (step 2) and payment capture (step 5) are separate. This is intentional: releasing an authorization hold is cheaper than refunding a completed charge. The saga is designed so the expensive, hard-to-compensate operations happen as late as possible.
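The saga orchestrator persists its state to a durable store (DynamoDB would be the obvious choice on AWS). If the orchestrator itself crashes mid-saga, it rehydrates from the persisted state and resumes after the last completed step. Because a step may have run just before the crash without its completion being recorded, it can execute twice. This is why every step must be idempotent.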
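Resuming from persisted state can be sketched like this. The `store` dictionary stands in for a durable table keyed by saga id, and `run_saga` is a hypothetical helper; a real orchestrator would also persist which compensations are pending.

```python
def run_saga(saga_id, steps, store, order):
    """Run steps in order, recording progress so a restart resumes mid-saga.

    `steps` is a list of (name, action) pairs. `store` maps saga id to the
    number of completed steps and stands in for a durable database table.
    """
    done = store.setdefault(saga_id, 0)
    for index in range(done, len(steps)):
        _name, action = steps[index]
        # If the process dies inside or right after action(), this step runs
        # again on restart, which is why it has to be idempotent.
        action(order)
        store[saga_id] = index + 1  # persist progress only after success
    return store[saga_id]
```

Calling `run_saga` again with the same `saga_id` after a crash skips the steps already recorded as complete and picks up at the first unrecorded one.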
10. Idempotency Is Non-Negotiable
Every saga step — forward and compensating — must be safely retryable. A network timeout doesn’t mean the operation failed; the downstream service may have succeeded and the response was just lost. Retrying a non-idempotent operation doubles the charge, reserves twice the inventory, sends two emails.
The standard solution is the idempotency key pattern:
```python
import uuid

# Client side: generate a stable key from the logical operation.
# uuid5 is deterministic across processes; the built-in hash() is not,
# because Python salts string hashes per process.
idempotency_key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{saga_id}.{step_name}"))
# e.g., "saga-ord-42.charge_payment" → deterministic UUID

# Server side (payment service): check before processing
def charge_payment(request):
    existing = db.get("idempotency_keys", request.idempotency_key)
    if existing:
        return existing.result  # replay cached result, don't re-charge
    # stripe.charge is shorthand for the provider call, not a literal SDK method
    result = stripe.charge(request.amount, request.card_token)
    db.set("idempotency_keys", request.idempotency_key, result, ttl=86400)
    return result
```
The idempotency key table is keyed by (service, key) and stores the result. Retries within the TTL window return the cached response immediately without re-executing the business logic. Stripe, for example, supports this natively via the Idempotency-Key header, and many other payment APIs offer an equivalent mechanism.
Summary
Distributed transactions are fundamentally a problem of trust across failure boundaries. The evolution from naïve sequential calls → 2PC → Saga → Event Sourcing + CQRS reflects the industry’s growing understanding that strong consistency across microservices is usually not worth the coupling it requires.
For most e-commerce systems today the answer is Saga orchestration + Outbox pattern:
- Outbox guarantees reliable event delivery without dual-write risk
- Orchestrated Sagas give you observable, debuggable compensation flows
- Idempotency keys make every step safely retryable
- Eventually-consistent is acceptable for order processing (customers don’t expect sub-millisecond confirmation)
Reserve Event Sourcing and CQRS for domains where audit history is a first-class requirement (financial ledgers, healthcare records, compliance systems) — not as a default architectural choice.
The Saga pattern was introduced by Hector Garcia-Molina and Kenneth Salem in 1987 for long-lived transactions. The distributed systems community rediscovered it for microservices 30 years later.