System Design: Multi-Region Failover — Global Load Balancing and Zero-Downtime Disaster Recovery

Interview question: Your application runs in a single AWS region (US-East). Design it to be globally available and survive a complete regional failure. Users in Asia should get under 100 ms latency. If US-East goes down, US-West should take over within 30 seconds with no data loss. Walk me through the architecture.

This is one of the most common senior-level system design questions. It tests your understanding of DNS, database replication, distributed consistency, and operational runbooks. Let's build the answer systematically.

---

1. Why single-region is not enough

{: class="marginalia" } The AWS us-east-1 outage on December 7 2021 took down Alexa, Ring, Kindle, Prime Video, and large parts of the internet for 8+ hours. Even AWS's own status page was down — it was hosted in us-east-1.

A single-region architecture has a deceptively simple failure mode: everything in that region can go dark simultaneously, and there is no fallback. This is not a theoretical concern.

The failure categories break down into four buckets:

Natural disasters and physical failures. AWS us-east-1 suffered a major failure in December 2021 when an automated capacity-scaling activity triggered a surge of connection traffic that overwhelmed networking devices between AWS's internal network and its main network. The cascading failure was so severe that AWS's own monitoring and remediation tooling was impaired as well.

Network partitions. Undersea fiber cables suffer on the order of 100 to 200 faults per year globally. A significant cut can dramatically degrade or sever connectivity between continents, partitioning your users from your data center even if the data center itself is healthy.

Human error and bad deployments. A misconfigured firewall rule, a database migration gone wrong, or a cascading config change can take down a region-worth of infrastructure within minutes. These are statistically the most common cause of major outages.

Regulatory constraints. GDPR (EU), data localisation laws (India, Russia, China), and financial regulations in many jurisdictions require data to physically reside within specific regions. Single-region designs often cannot meet these requirements for global products.

The solution space involves running your application in multiple regions simultaneously or keeping a warm standby ready to absorb traffic on short notice.

---

2. The two fundamental models

Before designing anything, you must choose between two primary multi-region architectures. They have very different cost, complexity, and recovery characteristics.

Active-Passive (Primary-Standby)

How it works: One region (primary) serves 100% of traffic. A second region (standby) runs in a reduced or idle state but continuously receives replicated data from the primary. If the primary fails, traffic is redirected to the standby, which is promoted to accept writes.

RTO (Recovery Time Objective — how long the system is down): 30–90 seconds depending on DNS TTL and promotion time.
RPO (Recovery Point Objective — how much data can be lost): seconds, bounded by async replication lag at moment of failure.
Cost: Moderate. The standby can run at reduced capacity or use smaller instance types since it handles no production traffic.

Active-Active

How it works: Both regions serve production traffic simultaneously. Users are routed to their nearest region by a global load balancer. Each region can accept writes, and changes are replicated bidirectionally. If one region fails, 100% of traffic shifts to the surviving region — no "failover" step needed at the application layer.

RTO: Near-zero. DNS TTL of 30–60 seconds is often the only delay.
RPO: Near-zero for synchronous replication, seconds for async replication.
Cost: High. Full duplicate infrastructure in every active region. The data layer complexity is substantial — bidirectional replication introduces split-brain risks (covered in section 5).

Active-Passive vs Active-Active, with a simulated US-East failure:

| Model | Normal operation | RTO | RPO | Cost overhead |
|---|---|---|---|---|
| Active-Passive | US-East (primary) serves 100% of traffic; US-West (standby) receives replicated data | ~60s | ~5s | +35% |
| Active-Active | US-East and US-West each serve 50% of traffic | ~30s | ~1s | +70% |

---

3. DNS-based global routing

{: class="marginalia" } DNS TTL is a double-edged sword. Low TTL (30s) means fast failover propagation but more DNS queries (cost, slight latency). High TTL (300s) means stale caches linger for 5 minutes after a failover event. 60 seconds is a pragmatic default for production failover systems.

The entry point for all global traffic routing is DNS. Two services dominate this space: AWS Route 53 and Cloudflare. Both support the routing policies you need for multi-region architectures.

Latency-based routing

When a user in Tokyo queries your domain, Route 53 looks up latency data it has collected over time between that user's resolver network and the AWS regions where you have records, then returns the IP of the lowest-latency healthy region. This is not pure geographic routing; it is network-topology-aware. A user in Singapore might be routed to us-west-2 if their ISP's peering makes that faster than ap-southeast-1.
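
The selection logic can be sketched in a few lines. This is only an illustration of "lowest-latency healthy region wins": the `latencyTable` values, the network names, and the `pickRegion` helper are all hypothetical, and Route 53's real latency data is internal to the service.

```javascript
// Hypothetical latency table: resolver network -> region -> typical RTT in ms.
// The numbers are invented for illustration.
const latencyTable = {
  'tokyo-isp': { 'ap-northeast-1': 8, 'us-west-2': 105, 'us-east-1': 160 },
  'singapore-isp': { 'ap-northeast-1': 70, 'us-west-2': 165, 'us-east-1': 230 },
};

// Return the healthy region with the lowest recorded latency, or null if none.
function pickRegion(resolverNetwork, healthyRegions) {
  const latencies = latencyTable[resolverNetwork] || {};
  let best = null;
  for (const region of healthyRegions) {
    const rtt = latencies[region];
    if (rtt === undefined) continue; // no latency data for this region
    if (best === null || rtt < latencies[best]) best = region;
  }
  return best;
}

// All regions healthy: the Tokyo user lands in Tokyo.
console.log(pickRegion('tokyo-isp', ['ap-northeast-1', 'us-west-2', 'us-east-1'])); // ap-northeast-1
// Tokyo region unhealthy: the next-lowest latency wins.
console.log(pickRegion('tokyo-isp', ['us-west-2', 'us-east-1'])); // us-west-2
```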

DNS routing diagram:
🇯🇵 Tokyo user → Route 53 → ap-northeast-1 (Tokyo)
🇬🇧 London user → Route 53 → eu-west-2 (London)
🇺🇸 New York user → Route 53 → us-east-1 (N. Virginia)
🇺🇸 Seattle user → Route 53 → us-west-2 (Oregon)

Health checks

With the fast request interval, each Route 53 health checker polls your /health endpoint every 10 seconds (the standard interval is 30 seconds), and checkers run from multiple AWS locations. If 3 consecutive checks fail (about 30 seconds of confirmed failure), Route 53 stops returning that region's DNS record. Traffic automatically shifts to the next healthy region according to your fallback configuration.
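
The consecutive-failure rule is easy to model. The sketch below is a simplified stand-in, not Route 53's implementation: `HealthTracker` is a hypothetical name, and it assumes a single passing check immediately restores the healthy state.

```javascript
// Tracks consecutive failed checks and flips to unhealthy at a threshold.
class HealthTracker {
  constructor(failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
    this.consecutiveFailures = 0;
    this.healthy = true;
  }

  // Record one check result and return the current health state.
  record(checkPassed) {
    if (checkPassed) {
      // Simplification: one success resets the counter and restores health.
      this.consecutiveFailures = 0;
      this.healthy = true;
    } else {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.failureThreshold) this.healthy = false;
    }
    return this.healthy;
  }
}

const tracker = new HealthTracker(3);
// Two failures, a recovery, then three failures in a row.
const states = [true, false, false, true, false, false, false].map((ok) => tracker.record(ok));
console.log(states); // only the final check trips the threshold
```

The transient blip (two failures followed by a success) never removes the region from rotation; only the third consecutive failure does.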

Important design detail: your /health endpoint must be a deep health check. A surface-level HTTP 200 from a server that cannot reach its database is worse than useless — it will keep the unhealthy region in rotation. Your health check should verify connectivity to the database, any required caches, and any downstream dependencies.

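// Express.js: deep health check endpoint
// Assumes `db` (a Knex-style client), `redis` (a connected client), and
// `getReplicationLagSeconds()` are defined elsewhere in the application.
const express = require('express');
const app = express();

app.get('/health', async (req, res) => {
  try {
    // Check DB connectivity
    await db.raw('SELECT 1');

    // Check Redis connectivity
    await redis.ping();

    // Check replication lag is acceptable (< 30s)
    const lag = await getReplicationLagSeconds();
    if (lag > 30) throw new Error(`Replication lag critical: ${lag}s`);

    res.json({ status: 'ok', region: process.env.AWS_REGION, lag });
  } catch (err) {
    // Any failed dependency marks the whole region unhealthy
    res.status(503).json({ status: 'unhealthy', error: err.message });
  }
});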

Failover DNS records

Route 53 supports a failover routing policy with explicit primary and secondary records. When the primary health check fails, Route 53 automatically serves the secondary record. This is the Active-Passive mechanism at the DNS layer.

# Terraform: Route 53 failover configuration
resource "aws_route53_record" "primary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "primary"
  ttl             = 60
  # Placeholder IP for the US-East endpoint. ALB addresses are not static,
  # so a real ALB target would use an alias block instead of ttl/records.
  records         = ["52.1.2.3"]
  health_check_id = aws_route53_health_check.us_east.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"
  ttl            = 60
  records        = ["54.4.5.6"] # Placeholder IP for the US-West endpoint

  failover_routing_policy {
    type = "SECONDARY"
  }
  # No health check needed: only served when the primary is unhealthy
}

---

4. Database replication strategies

{: class="marginalia" } Netflix's chaos engineering practice, including Chaos Monkey and Chaos Kong (which simulates the loss of an entire AWS region), grew out of an August 2008 database corruption incident that halted DVD shipments for three days. They decided the only way to build resilience was to continuously inject failure in production.

Replicating stateless application servers across regions is trivial — deploy more EC2 instances. The database layer is where multi-region architecture gets genuinely difficult. Every strategy involves a trade-off between consistency, latency, and complexity.

Read replicas (easy win)

Read replicas solve the latency problem for read-heavy workloads. Deploy a read replica in each region. Local users get sub-10ms read latency. Writes still route to the primary region. Replication lag means replicas may serve slightly stale data — acceptable for most read paths.
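
A minimal sketch of the routing rule this implies, assuming a hypothetical `topology` object and `chooseEndpoint` helper (the hostnames are placeholders). The `requireFresh` flag is one way to let a read bypass replica staleness when it must see the latest write.

```javascript
// Hypothetical topology: one writable primary, one read replica per region.
const topology = {
  primary: 'db.us-east-1.internal',
  replicas: {
    'us-east-1': 'replica.us-east-1.internal',
    'us-west-2': 'replica.us-west-2.internal',
    'ap-northeast-1': 'replica.ap-northeast-1.internal',
  },
};

// Writes always go to the primary; reads prefer the caller's local replica.
// Reads that must observe the latest write can opt out of the replica.
function chooseEndpoint(operation, callerRegion, { requireFresh = false } = {}) {
  if (operation === 'write' || requireFresh) return topology.primary;
  // Fall back to the primary when no replica exists in the caller's region.
  return topology.replicas[callerRegion] || topology.primary;
}

console.log(chooseEndpoint('read', 'ap-northeast-1'));  // replica.ap-northeast-1.internal
console.log(chooseEndpoint('write', 'ap-northeast-1')); // db.us-east-1.internal
```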

Cross-region replication options

| Technology | Replication lag | Consistency | Auto failover | Cost |
|---|---|---|---|---|
| MySQL GTID async replication | 10–100ms | Eventually consistent | Manual / MHA | Low |
| PostgreSQL logical replication | 10–200ms | Eventually consistent | Manual / Patroni | Low |
| Aurora Global Database | < 1s | Eventually consistent | Operator- or automation-initiated (~1 min) | Medium |
| CockroachDB multi-region | Synchronous | Serializable | Automatic | High |
| Google Spanner | Synchronous | External consistency | Automatic | Very high |

Aurora Global Database is the pragmatic choice for AWS-based architectures. It replicates at the storage layer (not at the SQL layer), typically keeps cross-region lag under one second, and can promote a secondary region in about a minute once a failover is initiated. Replication is asynchronous, so writes that had not reached the secondary at the moment of failure are lost.

CockroachDB and Spanner offer synchronous multi-region writes with strong consistency guarantees. The cost is write latency: a write must be acknowledged by quorum across regions before returning to the caller. If US-East and EU-West are 80ms apart, writes take at minimum 80ms round-trip. This is often acceptable for financial or inventory systems but not for high-frequency user activity.
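
The latency floor can be estimated with a small model. This is a simplified sketch, assuming a leader-based quorum protocol in which the leader counts as one vote and network round trips dominate; `quorumWriteLatencyMs` is a hypothetical helper, and real systems add disk, processing, and protocol-specific overhead on top.

```javascript
// Estimate commit latency for a leader-based quorum write.
// followerRttsMs: round-trip times from the leader to each follower replica.
function quorumWriteLatencyMs(followerRttsMs) {
  const totalReplicas = followerRttsMs.length + 1; // followers plus the leader
  const majority = Math.floor(totalReplicas / 2) + 1;
  const followerAcksNeeded = majority - 1; // the leader votes for itself
  if (followerAcksNeeded === 0) return 0;
  // The write commits when the slowest of the fastest required followers acks.
  const sorted = [...followerRttsMs].sort((a, b) => a - b);
  return sorted[followerAcksNeeded - 1];
}

// Three replicas: leader in US-East, followers 80 ms and 150 ms away.
console.log(quorumWriteLatencyMs([80, 150])); // 80
// Five replicas: a same-region follower (2 ms) still leaves one remote ack needed.
console.log(quorumWriteLatencyMs([2, 80, 150, 160])); // 80
```

The second case shows why replica placement matters: with five replicas, two followers must acknowledge, so one nearby replica is not enough to escape the cross-region round trip.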

[Interactive: replication lag visualizer. A primary in US-East accepts writes and streams its WAL to a read-only replica in EU-West; with 50 ms of simulated network lag, the replica trails the primary by about 50 ms.]

---

5. The split-brain problem

Split-brain is the distributed systems nightmare: two nodes, both believing they are the authoritative primary, simultaneously accepting conflicting writes.

In an active-active setup with async replication: US-East accepts a write that changes user Bob's email to bob@newdomain.com. EU-West, one second later, accepts a write that changes Bob's email to bob@work.com. A network partition prevents these changes from syncing. When the partition heals, which write wins?

The problem is not merely theoretical. Cassandra, DynamoDB, and MongoDB all have configurable consistency levels precisely because they have encountered this reality at scale.

Conflict resolution strategies

Last-write-wins (LWW). Each write is timestamped. On conflict, the write with the later timestamp wins. Simple to implement, but silently discards data. Clock skew between servers (even with NTP, typically 1–10ms) can cause the wrong write to win.
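
A short sketch makes the clock-skew hazard concrete. The `lwwMerge` helper, the 5 ms skew, and the timestamps are all hypothetical values chosen for illustration.

```javascript
// Last-write-wins merge: keep whichever version carries the later timestamp.
function lwwMerge(a, b) {
  return b.timestampMs > a.timestampMs ? b : a;
}

// Hypothetical skew: US-East's clock runs 5 ms fast.
const usEastSkewMs = 5;
const realTimeMs = 1000000;

// Real order of events: US-East writes first, EU-West writes 2 ms later.
const fromUsEast = { value: 'bob@newdomain.com', timestampMs: realTimeMs + usEastSkewMs };
const fromEuWest = { value: 'bob@work.com', timestampMs: realTimeMs + 2 };

// EU-West's write is genuinely later, but US-East's skewed timestamp is larger.
const winner = lwwMerge(fromUsEast, fromEuWest);
console.log(winner.value); // bob@newdomain.com (the earlier write wins)
```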

Vector clocks / causal consistency. Each write carries a vector of version counters per node. Causally related writes can be ordered; concurrent writes are detected and flagged for resolution. Used by Amazon Dynamo, Riak. Complex to implement but provides semantic conflict detection rather than silent data loss.
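
The comparison at the heart of this approach can be sketched as follows. This is a minimal illustration, assuming clocks are plain objects mapping a node id to a counter; `compareClocks` is a hypothetical helper, not the API of any particular database.

```javascript
// Compare two vector clocks (plain objects mapping node id -> counter).
// Returns 'before', 'after', 'equal', or 'concurrent' (a relative to b).
function compareClocks(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let aAhead = false;
  let bAhead = false;
  for (const node of nodes) {
    const av = a[node] || 0; // a missing entry counts as zero
    const bv = b[node] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // neither happened-before the other
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}

// Each region advanced its own counter without seeing the other's write.
console.log(compareClocks({ 'us-east': 2, 'eu-west': 1 }, { 'us-east': 1, 'eu-west': 2 })); // concurrent
// The second clock has seen everything in the first, plus one more write.
console.log(compareClocks({ 'us-east': 1 }, { 'us-east': 2 })); // before
```

A 'concurrent' result is the signal to hand both versions to application-level resolution instead of silently picking one.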

CRDT-based merges. Conflict-free Replicated Data Types are data structures mathematically guaranteed to merge without conflicts. Counters, sets, and maps have CRDT implementations. Shopping cart contents can be merged; a "delete item" operation needs careful handling (tombstones). Redis CRDT and Riak use this approach.
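
The simplest CRDT, a grow-only counter, shows why merges cannot conflict. This is a sketch of the general technique with hypothetical helper names (`gcIncrement`, `gcMerge`, `gcValue`); it does not model deletes, which need tombstones or a richer type.

```javascript
// Grow-only counter (G-Counter): each region increments only its own slot.
function gcIncrement(counter, region, amount = 1) {
  return { ...counter, [region]: (counter[region] || 0) + amount };
}

// Merge takes the per-region maximum, so it is commutative, associative,
// and idempotent: replicas converge regardless of delivery order.
function gcMerge(a, b) {
  const merged = { ...a };
  for (const [region, count] of Object.entries(b)) {
    merged[region] = Math.max(merged[region] || 0, count);
  }
  return merged;
}

// The counter's value is the sum of all regional slots.
function gcValue(counter) {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}

// Two regions count independently during a partition, then merge.
const east = gcIncrement({}, 'us-east', 3);
const west = gcIncrement({}, 'eu-west', 2);
console.log(gcValue(gcMerge(east, west))); // 5
console.log(gcValue(gcMerge(west, east))); // 5 (order does not matter)
```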

Region affinity / geo-partitioning. Assign each data partition to an owning region. User records for EU users are owned by EU-West; user records for US users are owned by US-East. Each region is authoritative for its own partition. No cross-region write conflicts for normal operations. Reads can be served from any region via replication. This is CockroachDB's geo-partitioned replicas model.

Practical advice for interviews: When asked about active-active, acknowledge split-brain immediately. Explain that most production systems either (a) avoid it by making one region authoritative for each data shard, or (b) accept eventual consistency for non-critical data and use conflict resolution for everything else. Very few teams implement true bidirectional write conflict resolution; the operational complexity is immense.

---

6. The failover runbook — animated

{: class="marginalia" } The difference between RTO and RPO is subtle but critical. RTO is how long the system is down — a business and SLA question. RPO is how much data can be lost — a data integrity question. A system with RTO=5min and RPO=1hr might recover quickly but lose an hour of orders. Both numbers must be agreed with the business, not just engineering.

Understanding the theory is one thing. Knowing the exact sequence of events during a real failover — and where the gaps are — is what separates a good answer from a great one.

Failover timeline: US-East failure

Step 1 (T+0s): US-East stops responding. Power failure, network partition, or catastrophic deployment. All health checks begin failing.

Step 2 (T+10s to T+20s): Health checks failing (×1, ×2). Route 53 checks from multiple locations. Two failures are not yet sufficient to trigger failover, because transient blips are common.

Step 3 (T+30s): 3rd consecutive failure, DNS failover triggered. Route 53 removes US-East from DNS responses. New DNS queries resolve to US-West. Existing connections are still hitting US-East.

Step 4 (T+30s, in parallel): Aurora replica promotion begins. Failover automation (or the on-call engineer) detects the loss of the primary write endpoint and starts promoting the US-West secondary cluster, which applies any replicated changes it has already received.

Step 5 (T+30s to T+90s): DNS TTL expires and clients re-resolve. Clients cached the old DNS record. As their 60s TTL expires, they re-query and receive the US-West IP. This is the "dark window" where requests fail.

Step 6 (T+60s to T+90s): Aurora promotion complete, US-West accepts writes. The US-West replica is now the primary. Any writes made to US-East after the last replicated transaction are lost (the RPO window).

Step 7 (T+90s to T+120s): Traffic fully on US-West, PagerDuty fires. Most clients are now hitting US-West. Error rate returns to baseline. On-call engineers are paged. Incident response begins.

Step 8 (T+N minutes): US-East recovers, then re-syncs and is re-added. US-East comes back online. It must re-sync from US-West (now primary) before being re-added to DNS. Rushing to re-add it can cause a second failover if US-East is still unstable.
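
The timeline's arithmetic can be captured in a small helper. This is a rough model under stated assumptions: `failoverWindowSeconds` is a hypothetical function, DNS cache expiry and database promotion are assumed to start at detection and run in parallel, and the inputs are the figures used in this article. Note that halving the TTL alone does not shorten the window while promotion still takes 60 seconds.

```javascript
// Rough client-visible outage window for DNS-based failover.
function failoverWindowSeconds({ checkIntervalS, failureThreshold, dnsTtlS, dbPromotionS }) {
  const detectionS = checkIntervalS * failureThreshold;
  // DNS cache expiry and database promotion both start once failure is
  // detected and proceed in parallel, so the slower of the two dominates.
  const afterDetectionS = Math.max(dnsTtlS, dbPromotionS);
  return { detectionS, worstCaseS: detectionS + afterDetectionS };
}

const baseline = failoverWindowSeconds({
  checkIntervalS: 10, failureThreshold: 3, dnsTtlS: 60, dbPromotionS: 60,
});
console.log(baseline); // { detectionS: 30, worstCaseS: 90 }

// Lowering only the DNS TTL leaves the database promotion as the bottleneck.
const lowTtl = failoverWindowSeconds({
  checkIntervalS: 10, failureThreshold: 3, dnsTtlS: 30, dbPromotionS: 60,
});
console.log(lowTtl.worstCaseS); // 90
```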
---

7. Stateless vs stateful services

The reason multi-region is complex is that stateful services cannot simply be cloned. Here is the full inventory of stateful components in a typical web application and the recommended strategy for each.

| Service | Multi-region strategy | Notes |
|---|---|---|
| Application servers (Node, Java, Go) | Deploy identical copies in each region behind regional ALBs | Trivial. Keep truly stateless: no in-process session state. |
| PostgreSQL / MySQL | Aurora Global Database (async, <1s lag) | Best managed option on AWS. Failover in ~60s. |
| Redis / session cache | ElastiCache Global Datastore | Async replication. On failover, some sessions expire and users must log in again. Design for this. |
| S3 / object storage | Cross-Region Replication (CRR) | Async. New objects replicate within minutes. Existing objects need a one-time copy job. |
| Elasticsearch / OpenSearch | Cross-Cluster Replication (CCR) | Follower index in each region. Reads from local, writes to primary. |
| Kafka / event streams | MirrorMaker 2.0 | Replicates topic partitions across clusters. Consumer offsets translated. |
| Scheduled jobs / cron | Leader election: only one region runs jobs | Use a distributed lock (DynamoDB conditional writes, Redis SETNX) to elect the primary scheduler. |
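
The leader-election row can be illustrated with a lease-based lock. This is a sketch only: `LeaseStore` is a hypothetical in-memory stand-in for a store with conditional writes, and it assumes a single strongly consistent copy of the lease. A real deployment needs the lock table itself to have one authoritative copy, or two regions can each believe they hold the lease.

```javascript
// In-memory stand-in for a table that supports conditional writes.
class LeaseStore {
  constructor() {
    this.lease = null; // { owner, expiresAtMs }
  }

  // Acquire or renew the lease only if it is free, expired, or already ours.
  tryAcquire(owner, nowMs, ttlMs) {
    const free = this.lease === null || this.lease.expiresAtMs <= nowMs;
    if (free || this.lease.owner === owner) {
      this.lease = { owner, expiresAtMs: nowMs + ttlMs };
      return true;
    }
    return false;
  }
}

const store = new LeaseStore();
// US-East takes a 30-second lease at t=0 and becomes the scheduler.
const eastLeads = store.tryAcquire('us-east-1', 0, 30000);
// US-West tries at t=10s while the lease is still valid and is refused.
const westBlocked = store.tryAcquire('us-west-2', 10000, 30000);
// US-East stops renewing (region down); at t=31s the lease has expired.
const westTakesOver = store.tryAcquire('us-west-2', 31000, 30000);
console.log(eastLeads, westBlocked, westTakesOver); // true false true
```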

A special case worth discussing: user session state. If you store session data in Redis and your Redis Global Datastore has async replication lag, a failover event means some recently-authenticated users lose their sessions. They see a logged-out screen. This is acceptable in most products — it is a minor annoyance, not data loss. Design your application to handle it gracefully: clear error messages, redirect to login.

Worse would be storing any financial transaction state in a session. Never do this. All durable state must live in the database, not in cache.

---

8. Chaos engineering — testing your failover

The most dangerous failure mode in a multi-region architecture is believing your failover works without having tested it under realistic conditions. Runbooks become stale. Configuration drift happens. A health check endpoint gets accidentally broken in a deploy.

Chaos engineering is the practice of deliberately injecting failures into production systems to verify their resilience. The foundational principle: if you do not test your failure scenarios regularly, you will discover them for the first time during an actual incident.

GameDay exercises

A GameDay is a scheduled, coordinated exercise where engineers deliberately take down a region, service, or dependency during low-traffic hours (typically 2–4 AM). The team practices the entire incident response flow: detection, communication, diagnosis, failover execution, and recovery. Lessons learned feed back into the runbook.

Netflix Chaos Monkey

Netflix's Chaos Monkey tool randomly terminates EC2 instances in their production environment during business hours. The rationale: if an instance can be lost at any moment, every service must be built to survive it. Chaos Gorilla goes further and simulates the loss of an entire availability zone, and Chaos Kong simulates the loss of an entire AWS region.

# Chaos Mesh (Kubernetes) — network partition experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-us-east
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["production"]
    labelSelectors:
      "region": "us-east-1"
  direction: both
  duration: "2m"   # Partition lasts 2 minutes

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) is the managed alternative. It provides pre-built templates for injecting EC2 failures, RDS failovers, and network disruptions, with safety guardrails to prevent runaway experiments.

The only way to know your failover works is to exercise it regularly. A runbook that has never been executed under pressure is a hypothesis, not a plan.

---

9. Capacity planning and cost

Multi-region is not free. Interviewers frequently ask about cost implications, and "it's expensive" is not a sufficient answer. You should be able to reason about the magnitude.

| Component | Cost model | Rough estimate |
|---|---|---|
| Duplicate API server fleet (active-passive standby) | Reduced-capacity standby: ~30% of primary fleet | +30% compute cost |
| Aurora Global Database replication | Per-GB replicated data transferred | ~$0.20/GB across regions |
| Route 53 health checks | Per health check per month | $0.50/check/month |
| Cross-region data transfer (API responses) | Per-GB egress to other regions | ~$0.02/GB |
| S3 Cross-Region Replication | Per-request + per-GB | ~$0.005/1000 requests + $0.02/GB |
| ElastiCache Global Datastore replication | Per-GB replicated | ~$0.20/GB |

For a mid-size application running $10,000/month in a single region, expect active-passive multi-region to add 35–45% overhead (~$3,500–$4,500/month). Active-active with full duplicate capacity adds 70–100%.

The cost question always leads to: what is the cost of downtime? For a $1M/hour revenue business, even the 90-second failover window costs $25,000, and a multi-hour regional outage with no standby costs millions. Multi-region at $4,000/month ($48,000/year) easily justifies itself even at a single major outage per year. For a $10,000/month revenue business, the math is different: a read replica for latency plus a manual failover runbook may be sufficient.
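
The arithmetic is worth being able to reproduce on a whiteboard. The helpers below are a sketch with hypothetical names (`downtimeCost`, `breakEvenOutageSeconds`), and they assume revenue is lost uniformly for the duration of the outage.

```javascript
// Revenue lost during an outage, assuming revenue accrues uniformly over time.
function downtimeCost(revenuePerHour, outageSeconds) {
  return (revenuePerHour * outageSeconds) / 3600;
}

// Seconds of avoided downtime per year needed to pay for multi-region.
function breakEvenOutageSeconds(revenuePerHour, annualMultiRegionCost) {
  return (annualMultiRegionCost * 3600) / revenuePerHour;
}

// $1M/hour business, 90-second outage.
console.log(downtimeCost(1000000, 90)); // 25000
// $48,000/year of multi-region spend pays for itself after under 3 minutes of avoided downtime.
console.log(breakEvenOutageSeconds(1000000, 48000)); // 172.8
```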

---

10. Putting it all together: the reference architecture

Given the interview question — survive US-East failure, serve Asia under 100ms, US-West takes over in 30 seconds — here is the concrete architecture recommendation:

Three regions: us-east-1 (primary), us-west-2 (hot standby / secondary), ap-northeast-1 (Tokyo, read-only replica for APAC latency).

DNS: Route 53 with latency-based routing. Health checks every 10 seconds, failover threshold 3×. TTL 60 seconds. Tokyo region serves APAC reads; US regions serve writes. All write requests for APAC users are proxied back to US-East over the private AWS backbone (roughly 150 ms round trip from Tokyo, acceptable for writes).

Database: Aurora Global Database with us-east-1 as write primary, us-west-2 and ap-northeast-1 as read replicas. On primary failure, automation (or the on-call engineer) initiates failover to us-west-2, which typically completes in about a minute. The APAC replica is used only for reads: no writes, no promotion risk.

Cache: ElastiCache Global Datastore. Accept that some sessions will expire on failover. Build graceful re-authentication UX.

Object storage: S3 with CRR enabled to all three regions. Use S3 Transfer Acceleration for user uploads to minimize cross-region latency.

Chaos engineering: Monthly GameDay exercises failing us-east-1. Quarterly full-region Chaos Kong exercise.

# Summary: RTO/RPO targets vs achieved

# Target
RTO: 30 seconds
RPO: 0 seconds (no data loss)

# Achieved with Aurora Global + Route 53
RTO: 60-90 seconds  # DNS TTL 60s + Aurora promotion 60s, some overlap
RPO: 0-5 seconds    # Aurora Global replication lag at moment of failure

# To achieve strict 30s RTO:
#   - Reduce DNS TTL to 30s (higher DNS cost, more queries)
#   - Pre-warm standby connections (keep app servers hot, not idle)
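#   - Trigger Aurora Global failover from automation (for example, a health alarm) so promotion does not wait on a human; managed planned switchover only applies while the old primary is still healthy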

# To achieve strict 0s RPO:
#   - Switch to CockroachDB or Spanner (synchronous multi-region writes)
#   - Accept ~80ms write latency penalty (round-trip to quorum)

Trade-off framing for the interview

End with the trade-offs clearly articulated. An interviewer is not looking for a perfect system — they are looking for a candidate who understands that every architectural decision is a trade-off, and who can reason about those trade-offs explicitly.

For this architecture: we chose active-passive over active-active because it avoids split-brain complexity at lower cost. We accepted 60–90 seconds RTO instead of the target 30 seconds because Aurora Global promotion takes time — reducing this further requires synchronous replication and its associated write latency cost. We accepted near-zero (not absolute zero) RPO because synchronous replication would add 80ms+ to every write operation — unacceptable for interactive web traffic. These are business trade-offs, not engineering failures.