System Design: Multi-Region Failover — Global Load Balancing and Zero-Downtime Disaster Recovery
Interview question: Your application runs in a single AWS region (US-East). Design it to be globally available and survive a complete regional failure. Users in Asia should get under 100 ms latency. If US-East goes down, US-West should take over within 30 seconds with no data loss. Walk me through the architecture.
This is one of the most common senior-level system design questions. It tests your understanding of DNS, database replication, distributed consistency, and operational runbooks. Let's build the answer systematically.
---

1. Why single-region is not enough

{: class="marginalia" } The AWS us-east-1 outage on December 7, 2021 took down Alexa, Ring, Kindle, Prime Video, and large parts of the internet for 8+ hours. Even AWS's own status page was down, because it was hosted in us-east-1.

A single-region architecture has a deceptively simple failure mode: everything in that region can go dark simultaneously, and there is no fallback. This is not a theoretical concern.
The failure categories break down into four buckets:
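Natural disasters and infrastructure failures. Floods, fires, and power loss can take out a data center, and internal infrastructure faults can take out a whole region. AWS us-east-1 suffered a major failure in December 2021, triggered by an automated capacity scaling activity that set off a surge of connection attempts and overwhelmed the network devices linking AWS's internal network to its main network. The cascading failure was so severe that AWS's own tooling used to remediate the issue was also hosted in the impacted region.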
Network partitions. Undersea fiber cables suffer on the order of 100 to 200 faults per year globally. A significant cut can dramatically degrade or sever connectivity between continents, partitioning your users from your data center even if the data center itself is healthy.
Human error and bad deployments. A misconfigured firewall rule, a database migration gone wrong, or a cascading config change can take down a region-worth of infrastructure within minutes. These are statistically the most common cause of major outages.
Regulatory constraints. GDPR (EU), data localisation laws (India, Russia, China), and financial regulations in many jurisdictions require data to physically reside within specific regions. Single-region designs often cannot meet these requirements for global products.
The solution space involves running your application in multiple regions simultaneously or keeping a warm standby ready to absorb traffic on short notice.
---

2. The two fundamental models
Before designing anything, you must choose between two primary multi-region architectures. They have very different cost, complexity, and recovery characteristics.
Active-Passive (Primary-Standby)
How it works: One region (primary) serves 100% of traffic. A second region (standby) runs in a reduced or idle state but continuously receives replicated data from the primary. If the primary fails, traffic is redirected to the standby, which is promoted to accept writes.
RTO (Recovery Time Objective — how long the system is down): 30–90 seconds depending on DNS TTL and promotion time.
RPO (Recovery Point Objective — how much data can be lost): seconds, bounded by async replication lag at moment of failure.
Cost: Moderate. The standby can run at reduced capacity or use smaller instance types since it handles no production traffic.
Active-Active
How it works: Both regions serve production traffic simultaneously. Users are routed to their nearest region by a global load balancer. Each region can accept writes, and changes are replicated bidirectionally. If one region fails, 100% of traffic shifts to the surviving region — no "failover" step needed at the application layer.
RTO: Near-zero at the application layer. Health-check detection plus a DNS TTL of 30–60 seconds is often the only delay.
RPO: Near-zero for synchronous replication, seconds for async replication.
Cost: High. Full duplicate infrastructure in every active region. The data layer complexity is substantial — bidirectional replication introduces split-brain risks (covered in section 5).
3. DNS-based global routing
{: class="marginalia" } DNS TTL is a double-edged sword. Low TTL (30s) means fast failover propagation but more DNS queries (cost, slight latency). High TTL (300s) means stale caches linger for 5 minutes after a failover event. 60 seconds is a pragmatic default for production failover systems.

The entry point for all global traffic routing is DNS. Two services dominate this space: AWS Route 53 and Cloudflare. Both support the routing policies you need for multi-region architectures.
Latency-based routing
When a user in Tokyo queries your domain, Route 53 uses latency measurements between that user's resolver and the AWS regions you serve from, and returns the IP of the lowest-latency healthy region. This is not pure geographic routing; it is network-topology-aware. A user can be routed to a region that is geographically farther away if their ISP's peering makes that path faster.
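The selection rule described above can be sketched in a few lines. This is only an illustration of the decision, not how Route 53 is implemented; the `pickRegion` function and the latency figures are hypothetical.

```javascript
// Pick the healthy region with the lowest measured latency.
// latencyByRegion: { regionName: roundTripMs }, healthyRegions: Set of region names.
function pickRegion(latencyByRegion, healthyRegions) {
  let best = null;
  for (const [region, ms] of Object.entries(latencyByRegion)) {
    if (!healthyRegions.has(region)) continue; // skip regions failing health checks
    if (best === null || ms < best.ms) best = { region, ms };
  }
  return best ? best.region : null; // null when no region is healthy
}

// Illustrative (made-up) latencies as seen from a resolver in Tokyo.
const tokyoLatency = { 'ap-northeast-1': 8, 'us-west-2': 105, 'us-east-1': 160 };

pickRegion(tokyoLatency, new Set(['ap-northeast-1', 'us-west-2', 'us-east-1']));
// → 'ap-northeast-1'
pickRegion(tokyoLatency, new Set(['us-west-2', 'us-east-1']));
// → 'us-west-2' (Tokyo region unhealthy, next-lowest latency wins)
```

The second call shows why latency routing and health checks belong together: the same table yields a different answer as soon as a region drops out of the healthy set.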
Health checks
Route 53 health checks can poll your /health endpoint every 10 seconds (the fast interval; the standard interval is 30 seconds) from health checkers in multiple AWS regions. If 3 consecutive checks fail (about 30 seconds of confirmed failure), Route 53 stops returning that region's DNS record. Traffic then shifts to the next healthy region according to your fallback configuration.
Important design detail: your /health endpoint must be a deep health check. A surface-level HTTP 200 from a server that cannot reach its database is worse than useless — it will keep the unhealthy region in rotation. Your health check should verify connectivity to the database, any required caches, and any downstream dependencies.
```javascript
// Express.js: deep health check endpoint
app.get('/health', async (req, res) => {
  try {
    // Check DB connectivity
    await db.raw('SELECT 1');
    // Check Redis connectivity
    await redis.ping();
    // Check replication lag is acceptable (< 30s)
    const lag = await getReplicationLagSeconds();
    if (lag > 30) throw new Error('Replication lag critical: ' + lag + 's');
    res.json({ status: 'ok', region: process.env.AWS_REGION, lag });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: err.message });
  }
});
```
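The health check calls `getReplicationLagSeconds()` without showing it. One possible shape for that helper is sketched below, assuming a PostgreSQL streaming replica and a knex-style client whose `raw()` resolves to an object with a `rows` array. The factory name `makeReplicationLagProbe` and the `lag_seconds` alias are made up for this example; Aurora and MySQL expose lag through different mechanisms.

```javascript
// Build a lag probe around a knex-style client (assumption: db.raw() resolves to { rows }).
function makeReplicationLagProbe(db) {
  return async function getReplicationLagSeconds() {
    // pg_last_xact_replay_timestamp() is the commit time of the last replayed transaction.
    const result = await db.raw(
      'SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds'
    );
    const value = result.rows[0].lag_seconds;
    // NULL means no WAL has been replayed on this node (for example, it is the primary).
    return value === null ? 0 : Number(value);
  };
}
```

One caveat with this query: when the primary is idle, the value keeps growing even though the replica is fully caught up, so a production check would usually combine it with a heartbeat write or a byte-position comparison.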
Failover DNS records
Route 53 supports a failover routing policy with explicit primary and secondary records. When the primary health check fails, Route 53 automatically serves the secondary record. This is the Active-Passive mechanism at the DNS layer.
```hcl
# Terraform: Route 53 failover configuration
resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary"
  ttl            = 60
  records        = ["52.1.2.3"] # US-East ALB

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.us_east.id
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "secondary"
  ttl            = 60
  records        = ["54.4.5.6"] # US-West ALB

  failover_routing_policy {
    type = "SECONDARY"
  }
  # No health check needed: only activated when primary fails
}
```

---
4. Database replication strategies
{: class="marginalia" } Netflix's chaos engineering practice (Chaos Monkey, and Chaos Kong, which kills entire AWS regions) was born from a 2008 database corruption incident that halted Netflix's DVD shipments for 3 days. They decided the only way to build resilience was to continuously inject failure in production.

Replicating stateless application servers across regions is trivial: deploy more EC2 instances. The database layer is where multi-region architecture gets genuinely difficult. Every strategy involves a trade-off between consistency, latency, and complexity.
Read replicas (easy win)
Read replicas solve the latency problem for read-heavy workloads. Deploy a read replica in each region. Local users get sub-10ms read latency. Writes still route to the primary region. Replication lag means replicas may serve slightly stale data — acceptable for most read paths.
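A minimal sketch of that read/write split follows. The endpoint hostnames, the `chooseEndpoint` function, and the `requiresFreshRead` flag are all hypothetical; the flag is one common way to handle read paths that cannot tolerate stale data, such as reading back a record immediately after writing it.

```javascript
// Hypothetical endpoints; real hostnames come from your database provider.
const PRIMARY_REGION = 'us-east-1';
const endpoints = {
  'us-east-1': 'db-primary.us-east-1.example.internal',
  'us-west-2': 'db-replica.us-west-2.example.internal',
  'ap-northeast-1': 'db-replica.ap-northeast-1.example.internal',
};

function chooseEndpoint({ isWrite, localRegion, requiresFreshRead = false }) {
  // Writes, and reads that must observe the latest write, go to the primary.
  if (isWrite || requiresFreshRead) return endpoints[PRIMARY_REGION];
  // Other reads use the local replica, falling back to the primary if none exists.
  return endpoints[localRegion] ?? endpoints[PRIMARY_REGION];
}

chooseEndpoint({ isWrite: false, localRegion: 'ap-northeast-1' });
// → 'db-replica.ap-northeast-1.example.internal'
chooseEndpoint({ isWrite: true, localRegion: 'ap-northeast-1' });
// → 'db-primary.us-east-1.example.internal'
```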
Cross-region replication options
| Technology | Replication lag | Consistency | Auto failover | Cost |
|---|---|---|---|---|
| MySQL GTID async replication | 10–100ms | Eventually consistent | Manual / MHA | Low |
| PostgreSQL logical replication | 10–200ms | Eventually consistent | Manual / Patroni | Low |
| Aurora Global Database | < 1s | Eventually consistent | Automated (~1min) | Medium |
| CockroachDB multi-region | Synchronous | Serializable | Automatic | High |
| Google Spanner | Synchronous | External consistency | Automatic | Very high |
Aurora Global Database is the pragmatic choice for AWS-based architectures. It replicates at the storage layer rather than the SQL layer, typically keeps cross-region lag under 1 second, and can promote a secondary region in about a minute. Because cross-region replication is asynchronous, writes that have not yet reached the secondary at the moment of failure can be lost, so the RPO is small but not zero.
CockroachDB and Spanner offer synchronous multi-region writes with strong consistency guarantees. The cost is write latency: a write must be acknowledged by quorum across regions before returning to the caller. If US-East and EU-West are 80ms apart, writes take at minimum 80ms round-trip. This is often acceptable for financial or inventory systems but not for high-frequency user activity.
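The latency floor can be estimated with a simple model, sketched here under the assumption of a leader-based consensus protocol where the leader counts toward the quorum. The function name is hypothetical, and the model ignores disk and processing time, so real commit latency is somewhat higher.

```javascript
// Estimate the network floor for a quorum write, in milliseconds.
// followerRttMs: round-trip times from the leader to each follower replica.
function quorumCommitLatencyMs(followerRttMs) {
  const totalReplicas = followerRttMs.length + 1;     // followers plus the leader
  const majority = Math.floor(totalReplicas / 2) + 1;
  const followerAcksNeeded = majority - 1;            // the leader acknowledges itself
  if (followerAcksNeeded === 0) return 0;
  const sorted = [...followerRttMs].sort((a, b) => a - b);
  // The write commits when the slowest of the required acknowledgements arrives.
  return sorted[followerAcksNeeded - 1];
}

quorumCommitLatencyMs([80]);              // → 80 (two replicas, 80 ms apart)
quorumCommitLatencyMs([80, 150]);         // → 80 (three replicas, nearest follower suffices)
quorumCommitLatencyMs([2, 80, 150, 160]); // → 80 (five replicas, need two follower acks)
```

The last case shows why replica placement matters: a follower in the same region as the leader (2 ms) helps, but the quorum still has to wait for one remote region.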
---

5. The split-brain problem
Split-brain is the distributed systems nightmare: two nodes, both believing they are the authoritative primary, simultaneously accepting conflicting writes.
In an active-active setup with async replication: US-East accepts a write that changes user Bob's email to bob@newdomain.com. EU-West, one second later, accepts a write that changes Bob's email to bob@work.com. A network partition prevents these changes from syncing. When the partition heals, which write wins?
The problem is not merely theoretical. Cassandra, DynamoDB, and MongoDB all have configurable consistency levels precisely because they have encountered this reality at scale.
Conflict resolution strategies
Last-write-wins (LWW). Each write is timestamped. On conflict, the write with the later timestamp wins. Simple to implement, but silently discards data. Clock skew between servers (even with NTP, typically 1–10ms) can cause the wrong write to win.
Vector clocks / causal consistency. Each write carries a vector of version counters per node. Causally related writes can be ordered; concurrent writes are detected and flagged for resolution. Used by Amazon Dynamo, Riak. Complex to implement but provides semantic conflict detection rather than silent data loss.
CRDT-based merges. Conflict-free Replicated Data Types are data structures mathematically guaranteed to merge without conflicts. Counters, sets, and maps have CRDT implementations. Shopping cart contents can be merged, but a "delete item" operation needs careful handling (tombstones). Redis Enterprise Active-Active databases and Riak's data types use this approach.
Region affinity / geo-partitioning. Assign each data partition to an owning region. User records for EU users are owned by EU-West; user records for US users are owned by US-East. Each region is authoritative for its own partition. No cross-region write conflicts for normal operations. Reads can be served from any region via replication. This is CockroachDB's geo-partitioned replicas model.
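To make the contrast between the first two strategies concrete, here is a small sketch applied to the Bob's-email conflict described earlier. The record shape, function names, and timestamps are made up for illustration; production systems attach this metadata inside the storage engine rather than in application code.

```javascript
// Last-write-wins: keep whichever version has the later timestamp.
function lwwMerge(a, b) {
  return a.timestampMs >= b.timestampMs ? a : b;
}

// Compare two vector clocks (objects mapping node id to counter).
// Returns 'before', 'after', 'equal', or 'concurrent'.
function compareClocks(x, y) {
  let xAhead = false;
  let yAhead = false;
  for (const node of new Set([...Object.keys(x), ...Object.keys(y)])) {
    const xv = x[node] ?? 0;
    const yv = y[node] ?? 0;
    if (xv > yv) xAhead = true;
    if (yv > xv) yAhead = true;
  }
  if (xAhead && yAhead) return 'concurrent';
  if (xAhead) return 'after';
  if (yAhead) return 'before';
  return 'equal';
}

// Bob's email, written in two regions during a partition.
const usEast = { value: 'bob@newdomain.com', timestampMs: 1000, clock: { 'us-east': 2, 'eu-west': 1 } };
const euWest = { value: 'bob@work.com', timestampMs: 2000, clock: { 'us-east': 1, 'eu-west': 2 } };

lwwMerge(usEast, euWest).value;             // → 'bob@work.com' (the US-East write is silently dropped)
compareClocks(usEast.clock, euWest.clock);  // → 'concurrent' (conflict is detected, not hidden)
```

LWW always produces an answer, which is exactly why it loses data quietly. The vector clock comparison produces no winner at all for concurrent writes; it only tells you that a resolution policy or a human must decide.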
6. The failover runbook

{: class="marginalia" } The difference between RTO and RPO is subtle but critical. RTO is how long the system is down, a business and SLA question. RPO is how much data can be lost, a data integrity question. A system with RTO=5min and RPO=1hr might recover quickly but lose an hour of orders. Both numbers must be agreed with the business, not just engineering.

Understanding the theory is one thing. Knowing the exact sequence of events during a real failover, and where the gaps are, is what separates a good answer from a great one.
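One way to reason about that sequence is as a time budget. The sketch below uses the figures quoted elsewhere in this article (10-second health checks, a failure threshold of 3, a 60-second DNS TTL, roughly 60 seconds for database promotion). It assumes that promotion is triggered automatically at the moment of detection and runs in parallel with DNS cache expiry; if promotion is manual, add the human response time on top. The function and field names are hypothetical.

```javascript
// Worst-case failover budget in seconds.
function failoverBudget({ checkIntervalS, failureThreshold, dnsTtlS, dbPromotionS }) {
  // Time until the health checker declares the region unhealthy.
  const detectionS = checkIntervalS * failureThreshold;
  // DNS cache expiry and database promotion both start at detection and
  // proceed in parallel, so the slower of the two dominates.
  const recoveryS = Math.max(dnsTtlS, dbPromotionS);
  return { detectionS, recoveryS, totalS: detectionS + recoveryS };
}

failoverBudget({ checkIntervalS: 10, failureThreshold: 3, dnsTtlS: 60, dbPromotionS: 60 });
// → { detectionS: 30, recoveryS: 60, totalS: 90 }
```

The breakdown makes the gap visible: detection alone consumes the entire 30-second target from the interview question, before any DNS or database work has started.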
---

7. Stateless vs stateful services
The reason multi-region is complex is that stateful services cannot simply be cloned. Here is the full inventory of stateful components in a typical web application and the recommended strategy for each.
| Service | Multi-region strategy | Notes |
|---|---|---|
| Application servers (Node, Java, Go) | Deploy identical copies in each region behind regional ALBs | Trivial. Keep truly stateless — no in-process session state. |
| PostgreSQL / MySQL | Aurora Global Database (async, <1s lag) | Best managed option on AWS. Failover in ~60s. |
| Redis / session cache | ElastiCache Global Datastore | Async replication. On failover, some sessions expire — users re-login. Design for this. |
| S3 / object storage | Cross-Region Replication (CRR) | Async. New objects replicate within minutes. Existing objects need a one-time copy job. |
| Elasticsearch / OpenSearch | Cross-Cluster Replication (CCR) | Follower index in each region. Reads from local, writes to primary. |
| Kafka / event streams | MirrorMaker 2.0 | Replicates topic partitions across clusters. Consumer offsets translated. |
| Scheduled jobs / cron | Leader-election: only one region runs jobs | Use a distributed lock (DynamoDB conditional writes, Redis SETNX) to elect the primary scheduler. |
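The scheduler row deserves a concrete illustration, since it is the least obvious entry in the table. The sketch below uses an in-memory `LeaseStore` class (hypothetical) as a stand-in for the shared lock store. It mimics the semantics of Redis `SET key value NX PX ttl`, with renewal allowed for the current holder; a real implementation would issue that command or a DynamoDB conditional write instead.

```javascript
// In-memory stand-in for a shared lock store (assumption: one store visible to all regions).
class LeaseStore {
  constructor() {
    this.leases = new Map();
  }

  // Acquire or renew the lease. Succeeds if the key is free, expired,
  // or already held by the same owner.
  tryAcquire(key, owner, ttlMs, nowMs) {
    const current = this.leases.get(key);
    const heldByOther = current && current.expiresAt > nowMs && current.owner !== owner;
    if (heldByOther) return false;
    this.leases.set(key, { owner, expiresAt: nowMs + ttlMs });
    return true;
  }
}

const store = new LeaseStore();
store.tryAcquire('cron-leader', 'us-east-1', 30000, 0);     // → true  (first claimant wins)
store.tryAcquire('cron-leader', 'us-west-2', 30000, 10000); // → false (lease still held)
store.tryAcquire('cron-leader', 'us-west-2', 30000, 31000); // → true  (lease expired, standby takes over)
```

Each region's scheduler calls `tryAcquire` before every run and only executes jobs when it returns true. If the leader's region dies, it stops renewing, and the other region picks up the lease after the TTL elapses.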
A special case worth discussing: user session state. If you store session data in Redis and your Redis Global Datastore has async replication lag, a failover event means some recently-authenticated users lose their sessions. They see a logged-out screen. This is acceptable in most products — it is a minor annoyance, not data loss. Design your application to handle it gracefully: clear error messages, redirect to login.
Worse would be storing any financial transaction state in a session. Never do this. All durable state must live in the database, not in cache.
---

8. Chaos engineering: testing your failover
The most dangerous failure mode in a multi-region architecture is believing your failover works without having tested it under realistic conditions. Runbooks become stale. Configuration drift happens. A health check endpoint gets accidentally broken in a deploy.
Chaos engineering is the practice of deliberately injecting failures into production systems to verify their resilience. The foundational principle: if you do not test your failure scenarios regularly, you will discover them for the first time during an actual incident.
GameDay exercises
A GameDay is a scheduled, coordinated exercise where engineers deliberately take down a region, service, or dependency during low-traffic hours (typically 2–4 AM). The team practices the entire incident response flow: detection, communication, diagnosis, failover execution, and recovery. Lessons learned feed back into the runbook.
Netflix Chaos Monkey
Netflix's Chaos Monkey tool randomly terminates EC2 instances in their production environment during business hours. The rationale: if an instance can be lost at any moment, every service must be built to survive it. Chaos Gorilla goes further and simulates the loss of an entire availability zone; Chaos Kong simulates the loss of an entire AWS region.
```yaml
# Chaos Mesh (Kubernetes): network partition experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-us-east
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces: ["production"]
    labelSelectors:
      "region": "us-east-1"
  direction: both
  duration: "2m" # Partition lasts 2 minutes
```
AWS Fault Injection Simulator (FIS) is the managed alternative — it provides pre-built templates for injecting EC2 failures, RDS failovers, and network disruptions, with safety guardrails to prevent runaway experiments.
9. Capacity planning and cost
Multi-region is not free. Interviewers frequently ask about cost implications, and "it's expensive" is not a sufficient answer. You should be able to reason about the magnitude.
| Component | Cost model | Rough estimate |
|---|---|---|
| Duplicate API server fleet (active-passive standby) | Reduced-capacity standby: ~30% of primary fleet | +30% compute cost |
| Aurora Global Database replication | Per replicated write I/O, plus cross-region data transfer | ~$0.20 per million replicated write I/Os + ~$0.02/GB |
| Route 53 health checks | Per health check per month | $0.50/check/month |
| Cross-region data transfer (API responses) | Per-GB egress to other regions | ~$0.02/GB |
| S3 Cross-Region Replication | Per-request + per-GB | ~$0.005/1000 requests + $0.02/GB |
| ElastiCache Global Datastore replication | Per-GB replicated | ~$0.20/GB |
For a mid-size application running $10,000/month in a single region, expect active-passive multi-region to add 35–45% overhead (~$3,500–$4,500/month). Active-active with full duplicate capacity adds 70–100%.
The cost question always leads to: what is the cost of downtime? For a $1M/hour revenue business, 90 seconds of outage costs $25,000, and a multi-hour regional outage like the December 2021 event costs millions. Multi-region at $4,000/month ($48,000/year) pays for itself if it prevents even one such outage per year. For a business with $10,000/month in revenue, the math is different: a read replica for latency plus a manual failover runbook may be sufficient.
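The arithmetic is worth having at your fingertips in an interview. A small sketch, with hypothetical function names, that reproduces the figures above and adds the break-even point:

```javascript
// Revenue lost during an outage, assuming revenue accrues evenly over time.
function downtimeCost(revenuePerHour, outageSeconds) {
  return (revenuePerHour * outageSeconds) / 3600;
}

// Seconds of avoided downtime per year needed to pay for multi-region.
function breakEvenOutageSeconds(revenuePerHour, annualMultiRegionCost) {
  return (annualMultiRegionCost * 3600) / revenuePerHour;
}

downtimeCost(1_000_000, 90);               // → 25000
breakEvenOutageSeconds(1_000_000, 48_000); // → 172.8
```

At $1M/hour, multi-region breaks even after avoiding just under three minutes of downtime per year. The even-revenue assumption is a simplification; an outage during peak traffic costs more than the average suggests.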
---

10. Putting it all together: the reference architecture
Given the interview question — survive US-East failure, serve Asia under 100ms, US-West takes over in 30 seconds — here is the concrete architecture recommendation:
Three regions: us-east-1 (primary), us-west-2 (hot standby / secondary), ap-northeast-1 (Tokyo, read-only replica for APAC latency).
DNS: Route 53 with latency-based routing. Health checks every 10 seconds with a failure threshold of 3. TTL 60 seconds. The Tokyo region serves APAC reads; the US regions serve writes. All write requests from APAC users are proxied to US-East over the private AWS backbone. Expect roughly 150 ms round trip between Tokyo and N. Virginia, so the sub-100 ms target holds for reads only; state this explicitly as a trade-off for the write path.
Database: Aurora Global Database with us-east-1 as write primary, us-west-2 and ap-northeast-1 as read replicas. Automated failover to us-west-2 on primary failure (<60s). APAC replica used only for reads — no writes, no promotion risk.
Cache: ElastiCache Global Datastore. Accept that some sessions will expire on failover. Build graceful re-authentication UX.
Object storage: S3 with CRR enabled to all three regions. Use S3 Transfer Acceleration for user uploads to minimize cross-region latency.
Chaos engineering: Monthly GameDay exercises failing us-east-1. Quarterly full-region Chaos Kong exercise.
```text
# Summary: RTO/RPO targets vs achieved

# Target
RTO: 30 seconds
RPO: 0 seconds (no data loss)

# Achieved with Aurora Global + Route 53
RTO: 60-90 seconds   # DNS TTL 60s + Aurora promotion 60s, some overlap
RPO: 0-5 seconds     # Aurora Global replication lag at moment of failure

# To achieve strict 30s RTO:
# - Reduce DNS TTL to 30s (higher DNS cost, more queries)
# - Pre-warm standby connections (keep app servers hot, not idle)
# - Use Aurora Global "managed planned failover" for DNS-independent switchover

# To achieve strict 0s RPO:
# - Switch to CockroachDB or Spanner (synchronous multi-region writes)
# - Accept ~80ms write latency penalty (round-trip to quorum)
```
Trade-off framing for the interview
End with the trade-offs clearly articulated. An interviewer is not looking for a perfect system — they are looking for a candidate who understands that every architectural decision is a trade-off, and who can reason about those trade-offs explicitly.
For this architecture: we chose active-passive over active-active because it avoids split-brain complexity at lower cost. We accepted 60–90 seconds RTO instead of the target 30 seconds because Aurora Global promotion takes time — reducing this further requires synchronous replication and its associated write latency cost. We accepted near-zero (not absolute zero) RPO because synchronous replication would add 80ms+ to every write operation — unacceptable for interactive web traffic. These are business trade-offs, not engineering failures.