System Design: A/B Testing Platform — Feature Flags and Experimentation at Scale
Amazon runs 1,000+ A/B tests at any given moment. Every UI change, algorithm tweak, and pricing experiment goes through experimentation. The “1-Click Purchase” (1999) was one of the earliest, and they patented it.
Design an A/B testing platform like Optimizely or LaunchDarkly. Engineers define experiments — “show blue button to 50 % of users” — each user is consistently assigned to a variant, the platform collects metrics, and a statistics engine determines whether the difference is real. Scale: 10,000 concurrent experiments across 500 million users.
The question: Design an A/B testing / feature-flag platform. Engineers define experiments with traffic splits. Assignment must be deterministic per user, add less than 1 ms to request latency, and the system must support 10,000 simultaneous experiments across 500 M users. The platform collects conversion metrics and computes statistical significance.
1. What A/B Testing Solves
Before designing anything, ground the system in the four concrete problems it addresses:
📊 Data-driven decisions
Does the blue button outperform the green? Does removing the sidebar increase conversions? Gut feeling is replaced with statistical evidence.
🚀 Progressive rollouts
Ship to 1% → 10% → 50% → 100%. Each stage validates stability and metrics before wider exposure. Reduces blast radius on failures.
🔴 Kill switches
Instantly disable a broken feature for all users — without a deployment. The flag is turned off; the code path is never executed again.
🎯 Personalization
Show different experiences to different user segments: premium vs free, country-specific UI, power users vs casual visitors.
2. The Core Requirements
Translate the business needs into technical constraints before touching architecture.
| Requirement | Constraint | Why it matters |
|---|---|---|
| Deterministic assignment | Same user → same variant, always | User experience consistency; statistical validity |
| Latency | < 1 ms per assignment | Called on every page load; can't add perceptible delay |
| Scale | 10,000 experiments, 500 M users | Millions of assignments/sec at peak |
| No I/O on hot path | Assignment must be pure computation | Database lookups at 5 M req/s are impractical |
| Metrics collection | Collect events, aggregate, compute statistics | The whole point: measure lift and significance |
| Statistical correctness | No peeking problem; valid confidence intervals | Wrong statistics → wrong decisions → regression shipped |
3. Assignment: Deterministic Hashing with MurmurHash
The core insight: assignment must require zero I/O. No database, no cache, no network call. Pure math.

    // Deterministic variant assignment: no DB lookup needed
    function assign(userId, experimentId, trafficSplit):
        hashInput = userId + ":" + experimentId
        hashValue = murmur3(hashInput) % 100   // bucket in 0..99
        if hashValue < trafficSplit:
            return "treatment"
        else:
            return "control"

Three properties make this design correct:
- Deterministic: the same (userId, experimentId) pair always produces the same hash, and therefore the same variant. No storage needed.
- Uniform distribution: MurmurHash spreads inputs evenly, so after the modulo the buckets 0–99 are close to equally populated and a 50 % split gives roughly equal groups.
- Experiment isolation: because the experimentId is part of the hash input, each experiment buckets users independently, so adding new experiments leaves existing user-to-experiment assignments unaffected.
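
The pseudocode above can be made runnable. The following is a minimal Python sketch, assuming the 32-bit MurmurHash3 variant with seed 0 (the article does not say which variant or seed is meant). Python's standard library has no MurmurHash, so the hash is written out by hand; the names `murmur3_32` and `assign` are illustrative only.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) in pure Python."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix the remaining 0-3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: avalanche the bits of the state.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def assign(user_id: str, experiment_id: str, traffic_split: int) -> str:
    """Return "treatment" for roughly traffic_split percent of users."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    bucket = murmur3_32(key) % 100  # bucket in 0..99
    return "treatment" if bucket < traffic_split else "control"
```

Calling `assign("user_12345", "btn_color_v2", 50)` repeatedly returns the same variant every time, on any server, without any shared state.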
4. Flag Delivery: Two Approaches
Once we know how assignment works, we need to decide where flag rules live.
🖥️ Server-side evaluation
All flag rules live on the server. Request arrives → server evaluates rules in memory → serves the appropriate variant. Rules are loaded on startup and cached; evaluation is < 1 μs. Rules are never exposed to end users. This is the dominant approach for back-end services.
🌐 Client-side evaluation
Rules are downloaded to the browser SDK on app start and evaluated in-browser — zero network round-trip on the hot path. Trade-off: rules are visible to the user (can be inspected), and a large rule set means a large download. Suitable for front-end feature flags where rule confidentiality is not required.
SDK bootstrap flow (client-side): the SDK downloads the flag bundle once on app start; every flag evaluation after that is synchronous, with 0 I/O.
The “flag bundle” is a compact JSON document containing every flag rule relevant to the current user. The server pre-computes targeting-rule evaluation for the user and returns a stripped-down bundle — reducing client-side compute and hiding the full rule set.
5. Flag Storage and Rollout Rules
Each feature flag is a structured document with several logical components.

    {
      "id": "checkout_v3",
      "name": "Checkout Flow Redesign v3",
      "status": "active",        // active | inactive | archived
      "killSwitch": false,       // if true → forces ALL users to control
      "trafficSplit": 50,        // % of matched users who get treatment
      "targetingRules": [
        {
          "conditions": [
            { "attribute": "country", "op": "eq", "value": "US" },
            { "attribute": "plan", "op": "eq", "value": "pro" }
          ],
          "variant": "treatment" // force treatment for US Pro users
        }
      ],
      "metrics": ["purchase", "signup", "page_view"]
    }

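
To show how such a document might be evaluated in memory, here is a small Python sketch. The precedence is an assumption, not something the flag format defines: status and kill switch first, then the first matching targeting rule, then the percentage split. Only the "eq" operator is handled, SHA-256 from the standard library stands in for MurmurHash, and the function names are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_id: str) -> int:
    """Map (user, flag) to a stable bucket in 0..99.

    SHA-256 is a stand-in for MurmurHash so the sketch needs no third-party code.
    """
    digest = hashlib.sha256(f"{user_id}:{flag_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def matches(rule: dict, attrs: dict) -> bool:
    """A rule matches when every condition holds; only "eq" is supported here."""
    return all(
        cond["op"] == "eq" and attrs.get(cond["attribute"]) == cond["value"]
        for cond in rule["conditions"]
    )


def evaluate(flag: dict, user_id: str, attrs: dict) -> str:
    """Assumed precedence: status / kill switch, targeting rules, then the split."""
    if flag["status"] != "active" or flag["killSwitch"]:
        return "control"
    for rule in flag["targetingRules"]:
        if matches(rule, attrs):
            return rule["variant"]
    in_treatment = bucket(user_id, flag["id"]) < flag["trafficSplit"]
    return "treatment" if in_treatment else "control"


CHECKOUT_V3 = {
    "id": "checkout_v3",
    "status": "active",
    "killSwitch": False,
    "trafficSplit": 50,
    "targetingRules": [
        {
            "conditions": [
                {"attribute": "country", "op": "eq", "value": "US"},
                {"attribute": "plan", "op": "eq", "value": "pro"},
            ],
            "variant": "treatment",
        }
    ],
}
```

With this ordering, flipping `killSwitch` to true overrides both the targeting rule and the split, which is the behavior a kill switch needs.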
Storage architecture: PostgreSQL (flag definitions, source of truth) → Redis (hot flag cache, < 1 ms reads) → app servers (in-memory copy, polled every 30 s).
All 10,000 flags fit in roughly 100 MB in Redis (about 10 KB per flag), a trivially small dataset. App servers keep a local in-memory copy, updated via long-poll or SSE from the flag delivery service or by a background poll every 30 s. Flag evaluation itself hits no external service.
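
One way to keep that local copy fresh is a background thread that swaps in a complete snapshot. This is only a sketch: `FlagCache` and the `fetch_all` callable (standing in for a request to the flag delivery service) are hypothetical, and a failed refresh simply keeps the last good snapshot.

```python
import threading
from typing import Callable, Dict, Optional


class FlagCache:
    """In-process copy of all flag definitions, refreshed in the background."""

    def __init__(self, fetch_all: Callable[[], Dict[str, dict]], interval_s: float = 30.0):
        self._fetch_all = fetch_all  # hypothetical call to the flag delivery service
        self._interval_s = interval_s
        self._flags: Dict[str, dict] = {}
        self._stop = threading.Event()

    def refresh(self) -> None:
        """Fetch a full snapshot and swap it in with one reference assignment."""
        try:
            self._flags = self._fetch_all()
        except Exception:
            pass  # keep serving the last good snapshot if the fetch fails

    def get(self, flag_id: str) -> Optional[dict]:
        """Hot-path read: a dictionary lookup, no network or disk access."""
        return self._flags.get(flag_id)

    def start(self) -> threading.Thread:
        """Load once synchronously, then poll on a daemon thread."""
        self.refresh()

        def loop() -> None:
            # Event.wait returns False on timeout, so this runs once per interval.
            while not self._stop.wait(self._interval_s):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

Replacing the whole dictionary, rather than mutating it, means a request never observes a half-updated rule set.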
6. Metrics Collection
Assignment alone is useless without outcome measurement. After a user is assigned a variant, the platform must collect conversion events and attribute them to the right experiment variant.
Event schema:

    {
      "userId": "user_12345",
      "experimentId": "btn_color_v2",
      "variant": "treatment",
      "eventType": "purchase",
      "timestamp": 1718179200000,
      "value": 49.99   // optional: revenue, duration, etc.
    }

Collection pipeline: client events (click / purchase / page_view) → ingest API (stateless HTTP, write-only) → Kafka (durable queue, 50B events/day) → Flink (stream processing, deduplication) → ClickHouse (OLAP storage, fast aggregation).
ClickHouse is purpose-built for this: columnar storage, vectorized execution, and GROUP BY queries over billions of rows that can return in about a second on adequately sized hardware. A query like “count conversions by variant for experiment X in the last 7 days” reads only the columns it references (experimentId, variant, eventType, timestamp) and skips everything else.
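
The logic of that aggregation is simple enough to sketch in plain Python. In production this would be a SQL query against ClickHouse; the function below is only an illustrative stand-in that uses the event schema above, and its name is hypothetical. Counting distinct users with a set also deduplicates repeated events from the same user.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def distinct_users_by_variant(
    events: Iterable[dict], experiment_id: str, event_type: str
) -> Dict[str, int]:
    """Count distinct users per variant that emitted event_type in one experiment."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        if event["experimentId"] == experiment_id and event["eventType"] == event_type:
            users[event["variant"]].add(event["userId"])  # set drops repeats
    return {variant: len(ids) for variant, ids in users.items()}


EVENTS = [
    {"userId": "u1", "experimentId": "btn_color_v2", "variant": "treatment", "eventType": "purchase"},
    {"userId": "u1", "experimentId": "btn_color_v2", "variant": "treatment", "eventType": "purchase"},
    {"userId": "u2", "experimentId": "btn_color_v2", "variant": "control", "eventType": "purchase"},
    {"userId": "u3", "experimentId": "btn_color_v2", "variant": "control", "eventType": "page_view"},
    {"userId": "u4", "experimentId": "other_exp", "variant": "treatment", "eventType": "purchase"},
]
```

Running the same function once for the exposure event and once for the conversion event yields the visitor and conversion counts that the significance test needs.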
7. Statistical Significance Calculator
The mathematics of deciding “is this result real, or just noise?” is the hardest part of A/B testing to get right.
Z-test for proportions is the standard approach when the metric is a conversion rate:

    function zTest(Nc, kc, Nt, kt) {
      // Nc = control visitors,   kc = control conversions
      // Nt = treatment visitors, kt = treatment conversions
      var pc = kc / Nc;
      var pt = kt / Nt;
      // Pooled proportion under H0
      var p = (kc + kt) / (Nc + Nt);
      var se = Math.sqrt(p * (1 - p) * (1/Nc + 1/Nt));
      var z = (pt - pc) / se;
      // pFromZ is not shown here; it is assumed to return the two-sided p-value for |z|
      return { z: z, pValue: pFromZ(Math.abs(z)) };
    }

    // 95% CI for treatment conversion rate
    function confidenceInterval(k, N) {
      var p = k / N;
      var se = Math.sqrt(p * (1 - p) / N);
      return { lo: p - 1.96 * se, hi: p + 1.96 * se };
    }

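
The snippet above relies on a `pFromZ` helper that is not shown. As a self-contained illustration, here is the same test in Python, assuming a two-sided p-value computed from the standard normal distribution via `math.erfc`. The names `p_from_z` and `z_test` are illustrative, and the zero-variance guard is an addition of this sketch.

```python
import math
from typing import Tuple


def p_from_z(z_abs: float) -> float:
    """Two-sided p-value for a standard normal statistic: P(|Z| >= z_abs)."""
    return math.erfc(z_abs / math.sqrt(2.0))


def z_test(n_c: int, k_c: int, n_t: int, k_t: int) -> Tuple[float, float]:
    """Two-proportion z-test with a pooled standard error.

    n_c / k_c: control visitors / conversions
    n_t / k_t: treatment visitors / conversions
    """
    p_c, p_t = k_c / n_c, k_t / n_t
    pooled = (k_c + k_t) / (n_c + n_t)  # shared rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0.0:
        return 0.0, 1.0  # no variation at all, so nothing to distinguish
    z = (p_t - p_c) / se
    return z, p_from_z(abs(z))


z, p = z_test(10_000, 500, 10_000, 560)
print(f"z = {z:.3f}, p = {p:.4f}")
```

In this example a lift from 5.0 % to 5.6 % on 10,000 users per arm gives a p-value slightly above 0.05, so the result is not quite significant at the 5 % level.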
8. Mutual Exclusion and Experiment Interaction
Running 10,000 experiments simultaneously creates a subtle problem: interaction effects.
If user Alice is simultaneously in:
- Experiment A (button color: blue vs green)
- Experiment B (page layout: wide vs narrow)
…then the narrow layout might make the blue button look better for unrelated reasons. The two experiments contaminate each other’s results.
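Solution: Layered architecture
- Layer 1: one experiment on 0–50% of users, another on 50–80% of users
- Layer 2: one experiment on 0–60% of users, another on 60–100% of users
- Layer 3: one experiment on 0–30% of users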
Rules:
- Experiments within the same layer are mutually exclusive — a user can only be in one experiment per layer.
- Experiments in different layers are orthogonal — a user can be in one experiment per layer simultaneously. The layers are designed to test independent product dimensions (UI, ranking, pricing), so interaction effects are minimized.
- Each layer uses a different hash seed, so the 50 % split in Layer 1 is independent of the 60 % split in Layer 2.
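
The rules above can be sketched in a few lines of Python. Everything named here is hypothetical: the layer and experiment names, the seed strings, and the use of SHA-256 from the standard library in place of a seeded MurmurHash. The bucket ranges mirror the percentages listed for the layered design.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical configuration: each experiment owns a bucket range [lo, hi) out of 100.
LAYERS = {
    "layer_1": {"seed": "seed-1", "experiments": [("exp_a", 0, 50), ("exp_b", 50, 80)]},
    "layer_2": {"seed": "seed-2", "experiments": [("exp_c", 0, 60), ("exp_d", 60, 100)]},
    "layer_3": {"seed": "seed-3", "experiments": [("exp_e", 0, 30)]},
}


def layer_bucket(user_id: str, seed: str) -> int:
    """Per-layer bucket in 0..99; a different seed gives an independent shuffle."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def experiment_for(user_id: str, layer_name: str) -> Optional[str]:
    """Return the single experiment this user belongs to in the layer, if any."""
    layer = LAYERS[layer_name]
    b = layer_bucket(user_id, layer["seed"])
    for name, lo, hi in layer["experiments"]:
        if lo <= b < hi:
            return name
    return None  # bucket is not allocated to any experiment in this layer


def assignments(user_id: str) -> Dict[str, Optional[str]]:
    """At most one experiment per layer, across all layers at once."""
    return {layer: experiment_for(user_id, layer) for layer in LAYERS}
```

Because the ranges within a layer do not overlap, mutual exclusion holds by construction. Splitting an experiment's users into treatment and control would take one more hash, as in the assignment function from section 3.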
9. The Peeking Problem
This is the most important section and the one most A/B testing implementations get wrong.
The problem: A researcher launches an experiment with a planned runtime of 14 days. On day 3, they check the dashboard and see p = 0.04 (significant!). They stop the experiment and ship the change. Six months later, it turns out the feature had no effect — the early result was pure noise.
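Why this happens: The Z-test p-value is only valid when it is evaluated once, at the planned sample size. Checking it repeatedly, and stopping the first time it crosses 0.05, is called optional stopping. Every extra look is another chance for noise to cross the threshold, so the false positive rate climbs from the nominal 5 % to as high as 30 %, depending on how often the results are checked.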
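
The inflation is easy to reproduce with an A/A simulation, where both arms share the same true conversion rate and any "significant" result is a false positive by construction. The sketch below is illustrative only: the parameter values (400 simulated experiments, 20 looks, 100 users per arm per look, a 10 % base rate) and function names are arbitrary choices, not figures from this article.

```python
import math
import random
from typing import Tuple


def p_value(n_c: int, k_c: int, n_t: int, k_t: int) -> float:
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (k_c + k_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0.0:
        return 1.0
    z = (k_t / n_t - k_c / n_c) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(
    n_experiments: int = 400,
    looks: int = 20,
    users_per_look: int = 100,
    rate: float = 0.10,
    alpha: float = 0.05,
    seed: int = 7,
) -> Tuple[float, float]:
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    crossed_at_any_look = 0
    significant_at_end = 0
    for _ in range(n_experiments):
        n = k_c = k_t = 0
        crossed = False
        p = 1.0
        for _ in range(looks):
            # Both arms draw from the same rate: there is no real effect to find.
            k_c += sum(rng.random() < rate for _ in range(users_per_look))
            k_t += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = p_value(n, k_c, n, k_t)
            crossed = crossed or p < alpha
        crossed_at_any_look += crossed
        significant_at_end += p < alpha
    return crossed_at_any_look / n_experiments, significant_at_end / n_experiments


peeking_rate, fixed_rate = simulate()
print(f"stop at first p < 0.05: {peeking_rate:.1%}; single look at the end: {fixed_rate:.1%}")
```

The single-look rate should sit near the nominal 5 %, while the stop-at-first-crossing rate comes out several times higher. The data are identical in both cases, and only the stopping rule differs.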
Solutions:
| Approach | How it works | Trade-offs |
|---|---|---|
| Fixed-horizon test | Pre-commit to a sample size. Look at results only once. | Simple. But researchers always peek early anyway. |
| Sequential testing (mSPRT) | Always-valid p-values — mathematically correct to check at any time without inflating false positive rate. | Requires more samples to reach the same power. Used by Netflix, Booking.com. |
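| Bayesian A/B testing | Compute P(treatment > control). The posterior can be read at any sample size and yields probability statements rather than a binary reject / fail-to-reject decision. | No hard significance threshold. Requires choosing a prior. Stopping on the first favorable reading can still mislead, so the decision rule should be fixed in advance. Used by VWO and, until it was shut down in 2023, Google Optimize. |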
Bayesian framing is increasingly popular because it answers the question humans actually want: “What is the probability that treatment is better than control?” rather than “Can we reject the null hypothesis?”

    // Beta-Binomial conjugate model
    // Prior: Beta(1, 1) = uniform (no prior knowledge)
    // Posterior: Beta(1 + conversions, 1 + non-conversions)
    posterior_control   = Beta(1 + kc, 1 + Nc - kc)
    posterior_treatment = Beta(1 + kt, 1 + Nt - kt)

    // Monte Carlo: sample 10,000 times from each posterior
    wins = count(sample_treatment[i] > sample_control[i] for i in range(10000))
    P(treatment > control) = wins / 10000

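
A runnable version of the pseudocode above needs only Python's standard library, since `random.betavariate` samples from a Beta distribution. This sketch assumes the same uniform Beta(1, 1) prior; the function name and the fixed seed (used only to make runs repeatable) are illustrative.

```python
import random


def prob_treatment_beats_control(
    n_c: int, k_c: int, n_t: int, k_t: int, draws: int = 10_000, seed: int = 42
) -> float:
    """Monte Carlo estimate of P(treatment rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior: Beta(1 + conversions, 1 + non-conversions) for each arm.
        control = rng.betavariate(1 + k_c, 1 + n_c - k_c)
        treatment = rng.betavariate(1 + k_t, 1 + n_t - k_t)
        wins += treatment > control
    return wins / draws


print(prob_treatment_beats_control(10_000, 500, 10_000, 560))
```

For 500 versus 560 conversions on 10,000 users per arm, the estimate comes out at roughly 97 %. That is a more direct statement than the p-value just above 0.05 that the z-test gives for the same data.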
10. Capacity Estimate
| Component | Scale driver | Solution |
|---|---|---|
| Flag assignment | 5 M assignments/sec | Pure hash computation, no I/O; horizontally scalable app servers |
| Flag delivery | Flag bundle refreshes | Redis + CDN edge caching of flag bundles; 30 s TTL |
| Event ingestion | 50 B events/day | Kafka (600 partitions), stateless ingest API, sendBeacon on client |
| Aggregation | Streaming computation | Flink for real-time rollups; ClickHouse for historical queries |
| Flag storage | 10,000 flags × rule complexity | PostgreSQL (source of truth) + Redis (read cache); 100 MB total |
| Stat engine | Dashboard queries | Pre-aggregated daily rollups in ClickHouse; Z-test computed in-process |
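
The headline numbers in this table follow from simple arithmetic. The short Python sketch below derives the averages; real traffic has peaks well above the average, which this calculation deliberately ignores.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

events_per_day = 50_000_000_000           # 50 B events/day
kafka_partitions = 600
flag_count = 10_000
flag_store_bytes = 100 * 1000**2          # 100 MB

avg_events_per_sec = events_per_day / SECONDS_PER_DAY             # about 578,704
avg_events_per_partition = avg_events_per_sec / kafka_partitions  # about 965
rows_in_7_days = events_per_day * 7                               # 350 billion
bytes_per_flag = flag_store_bytes / flag_count                    # 10,000 bytes, about 10 KB

print(f"{avg_events_per_sec:,.0f} events/s on average")
print(f"{avg_events_per_partition:,.0f} events/s per Kafka partition")
print(f"{rows_in_7_days:,} rows in a 7-day window")
print(f"{bytes_per_flag:,.0f} bytes per flag")
```

At under a thousand events per second per partition on average, each partition has ample headroom for peaks, and 10 KB per flag leaves room for fairly elaborate targeting rules.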
Why ClickHouse for metrics storage?
A query like “count distinct users with purchase events, grouped by variant, for experiment X, last 7 days” over 350 billion rows (50 B events/day × 7 days) sounds terrifying, but ClickHouse can answer it in about a second because it touches only a small slice of that data:
- Columnar storage: only the columns the query references, such as experimentId, variant, and eventType, are read
- Vectorized execution: SIMD operations over column batches
- MergeTree partitioning: data is organized by (experimentId, date), so scans are localized
11. Full Architecture
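Control plane: engineer (defines experiment) → flag service (CRUD + validation) → PostgreSQL (source of truth) → Redis (hot cache) → app servers (in-memory copy).
Data plane: app servers serve 500 M users and assign variants with hash(userId+expId) in <1ms, no I/O.
Metrics: client events (click/purchase) → Kafka → ClickHouse (aggregate metrics) → stats engine (Z-test / Bayesian; p-value, CI, lift).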
The “peeking problem” has caused many teams to ship regressions. A team sees “p=0.04” after 3 days and ships, not realizing it was a statistical fluke. Netflix’s experiment platform uses sequential testing specifically to prevent premature decisions.
Key Takeaways for the Interview
When an interviewer asks “design an A/B testing platform”, they are probing for these specific insights:
- No I/O on the assignment hot path. murmur3(userId + ":" + experimentId) % 100 is the entire algorithm. No database, no cache read. This is the single most important insight.
- In-memory flag evaluation. App servers hold all 10,000 flags in RAM (100 MB). Flags are refreshed every 30 seconds via background poll, so evaluation adds no network latency to the request path.
- Separate control plane from data plane. Flag definition and editing (low traffic, strong consistency) lives in PostgreSQL. Flag evaluation (5 M/sec, pure compute) never touches the database.
- Kafka + ClickHouse for metrics. Event write throughput (50 B/day) requires a queue. A columnar OLAP store such as ClickHouse can aggregate billions of events in about a second.
- The peeking problem is a real problem. Don’t just say “compute a p-value.” Explain sequential testing (mSPRT) or Bayesian posteriors, and why naive Z-tests with optional stopping are broken.
- Layer architecture for mutual exclusion. With 10,000 experiments, interaction effects are real. Experiments in the same dimension (UI, ranking, pricing) must be in the same layer and therefore mutually exclusive.
Google runs ~10,000 live search experiments at any time. Their “layered” experiment design allows stacking: one experiment tests ranking algorithm, another tests UI layout, another tests ad format. Layers minimize interaction effects.
Happy shipping — and may your p-values always be valid.