System Design: A/B Testing Platform — Feature Flags and Experimentation at Scale

Series: System Design: Web Scenarios

Amazon runs 1,000+ A/B tests at any given moment. Every UI change, algorithm tweak, and pricing experiment goes through experimentation. The “1-Click Purchase” (1999) was one of the earliest, and they patented it.

Design an A/B testing platform like Optimizely or LaunchDarkly. Engineers define experiments — “show blue button to 50 % of users” — each user is consistently assigned to a variant, the platform collects metrics, and a statistics engine determines whether the difference is real. Scale: 10,000 concurrent experiments across 500 million users.

The question: Design an A/B testing / feature-flag platform. Engineers define experiments with traffic splits. Assignment must be deterministic per user, add less than 1 ms to request latency, and the system must support 10,000 simultaneous experiments across 500 M users. The platform collects conversion metrics and computes statistical significance.

1. What A/B Testing Solves

Before designing anything, ground the system in the four concrete problems it addresses:

📊 Data-driven decisions

Does the blue button outperform the green? Does removing the sidebar increase conversions? Gut feeling is replaced with statistical evidence.

🚀 Progressive rollouts

Ship to 1% → 10% → 50% → 100%. Each stage validates stability and metrics before wider exposure. Reduces blast radius on failures.

🔴 Kill switches

Disable a broken feature for all users without a deployment. Once the flag is turned off and servers pick up the change, the code path is no longer executed.

🎯 Personalization

Show different experiences to different user segments: premium vs free, country-specific UI, power users vs casual visitors.

2. The Core Requirements

Translate the business needs into technical constraints before touching architecture.

Requirement	Constraint	Why it matters
Deterministic assignment	Same user → same variant, always	User experience consistency; statistical validity
Latency	< 1 ms per assignment	Called on every page load; can't add perceptible delay
Scale	10,000 experiments, 500 M users	Millions of assignments/sec at peak
No I/O on hot path	Assignment must be pure computation	Database lookups at 5 M req/s are impractical
Metrics collection	Collect events, aggregate, compute statistics	The whole point: measure lift and significance
Statistical correctness	No peeking problem; valid confidence intervals	Wrong statistics → wrong decisions → regression shipped

3. Assignment: Deterministic Hashing with MurmurHash

The core insight: assignment must require zero I/O. No database, no cache, no network call. Pure math.

pseudocode

// Deterministic variant assignment — no DB lookup needed
function assign(userId, experimentId, trafficSplit):
  hashInput  = userId + ":" + experimentId
  hashValue  = murmur3(hashInput) % 100
  if hashValue < trafficSplit:
    return "treatment"
  else:
    return "control"

Three properties make this design correct:

Deterministic: the same (userId, experimentId) pair always produces the same hash → the same variant. No storage needed.
Uniform distribution: MurmurHash distributes inputs uniformly across 0–99, so a 50 % split gives roughly equal groups.
Experiment isolation: because the experimentId is part of the hash input, each experiment buckets users independently, so adding or changing one experiment never moves users in another.
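The pseudocode above can be made runnable with a small MurmurHash3 implementation. The following is a minimal sketch, assuming the 32-bit x86 variant of MurmurHash3 over the UTF-8 bytes of the key and a seed of 0. The source does not say which variant or seed a real platform uses, so treat both as illustrative choices.

```javascript
// MurmurHash3 (x86, 32-bit variant) over the UTF-8 bytes of a string.
// Assumption: seed defaults to 0; the article does not specify one.
function murmur3(key, seed = 0) {
  const data = new TextEncoder().encode(key);
  const c1 = 0xcc9e2d51;
  const c2 = 0x1b873593;
  let h = seed >>> 0;
  const blocks = data.length >>> 2; // number of full 4-byte blocks

  for (let i = 0; i < blocks; i++) {
    const o = i * 4;
    // Read one little-endian 32-bit block.
    let k = data[o] | (data[o + 1] << 8) | (data[o + 2] << 16) | (data[o + 3] << 24);
    k = Math.imul(k, c1);
    k = (k << 15) | (k >>> 17);
    k = Math.imul(k, c2);
    h ^= k;
    h = (h << 13) | (h >>> 19);
    h = (Math.imul(h, 5) + 0xe6546b64) | 0;
  }

  // Tail: the remaining 1 to 3 bytes.
  let k = 0;
  const tail = blocks * 4;
  switch (data.length & 3) {
    case 3: k ^= data[tail + 2] << 16; // falls through
    case 2: k ^= data[tail + 1] << 8;  // falls through
    case 1:
      k ^= data[tail];
      k = Math.imul(k, c1);
      k = (k << 15) | (k >>> 17);
      k = Math.imul(k, c2);
      h ^= k;
  }

  // Finalization mix (avalanche).
  h ^= data.length;
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // unsigned 32-bit result
}

// Deterministic variant assignment: pure computation, no I/O.
function assign(userId, experimentId, trafficSplit) {
  const bucket = murmur3(`${userId}:${experimentId}`) % 100; // 0..99
  return bucket < trafficSplit ? "treatment" : "control";
}
```

Calling `assign("user_12345", "btn_color_v2", 50)` repeatedly always returns the same variant, and counting the results over a large set of user IDs is a quick way to check that the observed share of treatment users is close to the configured split.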

4. Flag Delivery: Two Approaches

Once we know how assignment works, we need to decide where flag rules live.

🖥️ Server-side evaluation

All flag rules live on the server. Request arrives → server evaluates rules in memory → serves the appropriate variant. Rules are loaded on startup and cached; evaluation is < 1 μs. Rules are never exposed to end users. This is the dominant approach for back-end services.

🌐 Client-side evaluation

Rules are downloaded to the browser SDK on app start and evaluated in-browser — zero network round-trip on the hot path. Trade-off: rules are visible to the user (can be inspected), and a large rule set means a large download. Suitable for front-end feature flags where rule confidentiality is not required.

SDK bootstrap flow (client-side):

App Start → SDK Init → Fetch Flag Bundle → Cache Locally → Evaluate Flags (synchronous, 0 I/O)

The “flag bundle” is a compact JSON document containing every flag rule relevant to the current user. A common variant has the server pre-evaluate the targeting rules for that user and return a stripped-down bundle, which reduces client-side compute and hides the full rule set.

5. Flag Storage and Rollout Rules

Each feature flag is a structured document with several logical components.

json

{
  "id":     "checkout_v3",
  "name":   "Checkout Flow Redesign v3",
  "status": "active",         // active | inactive | archived
  "killSwitch": false,         // if true → forces ALL users to control
  "trafficSplit": 50,            // % of matched users who get treatment
  "targetingRules": [
    {
      "conditions": [
        { "attribute": "country",  "op": "eq",  "value": "US" },
        { "attribute": "plan",     "op": "eq",  "value": "pro" }
      ],
      "variant": "treatment"    // force treatment for US Pro users
    }
  ],
  "metrics": ["purchase", "signup", "page_view"]
}
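To show how the fields of this document might combine at evaluation time, here is a minimal sketch. The precedence is an assumption, not something the flag schema states: an inactive flag or an engaged kill switch returns control, then targeting rules are checked in order (all conditions in a rule must match, first matching rule wins), and everyone else falls through to the hashed traffic split. `hash32` is a hypothetical stand-in hash so the block stays short; a real SDK would use MurmurHash3 as in section 3. Only the `eq` operator from the example is supported.

```javascript
// Stand-in 32-bit string hash (FNV-1a style loop plus a final bit mix).
// Assumption: illustrative only; not MurmurHash3.
function hash32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Evaluate one condition against the user's attributes.
function matches(condition, user) {
  if (condition.op === "eq") return user[condition.attribute] === condition.value;
  throw new Error(`unsupported op: ${condition.op}`);
}

// Resolve a flag document to a variant for one user.
function evaluateFlag(flag, user) {
  // Inactive flags and the kill switch both force control.
  if (flag.status !== "active" || flag.killSwitch) return "control";

  // Targeting rules: first rule whose conditions all match wins.
  for (const rule of flag.targetingRules ?? []) {
    if (rule.conditions.every((c) => matches(c, user))) return rule.variant;
  }

  // Everyone else is bucketed by hash into the traffic split.
  const bucket = hash32(`${user.userId}:${flag.id}`) % 100;
  return bucket < flag.trafficSplit ? "treatment" : "control";
}
```

With the `checkout_v3` document above, a user with `country: "US"` and `plan: "pro"` is forced into treatment, while flipping `killSwitch` to `true` sends every user to control regardless of targeting.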

Storage architecture:

PostgreSQL (flag definitions, source of truth) → Redis (hot flag cache, < 1 ms reads) → App Server (in-memory copy, polled every 30 s)

All 10,000 flags occupy roughly 100 MB in Redis (about 10 KB per flag), a trivially small dataset. App servers keep a local in-memory copy, refreshed by a 30-second background poll or pushed via long-poll or SSE from the flag delivery service. Flag evaluation itself hits no external service.
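The in-memory copy on each app server can be sketched as a small store that refreshes in the background and serves reads synchronously. This is an illustration under assumptions: `FlagStore` and the injected `fetchFlags` function are hypothetical names, and `fetchFlags` stands in for whatever call reads the flag set from Redis or the flag delivery service.

```javascript
// In-memory flag snapshot with background refresh.
// Assumption: fetchFlags is an async function returning an array of flag documents.
class FlagStore {
  constructor(fetchFlags, intervalMs = 30_000) {
    this.fetchFlags = fetchFlags;
    this.intervalMs = intervalMs;
    this.flags = new Map(); // flagId -> flag document
    this.timer = null;
  }

  // Replace the whole snapshot in one assignment so readers never
  // observe a half-updated flag set.
  applySnapshot(docs) {
    this.flags = new Map(docs.map((flag) => [flag.id, flag]));
  }

  // Pull the latest flags; on failure keep serving the last good snapshot.
  async refresh() {
    try {
      this.applySnapshot(await this.fetchFlags());
    } catch (err) {
      // Intentionally swallowed: stale flags are better than no flags.
    }
  }

  // Load once, then poll on a fixed interval.
  async start() {
    await this.refresh();
    this.timer = setInterval(() => this.refresh(), this.intervalMs);
    this.timer.unref?.(); // Node only: do not keep the process alive
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
  }

  // Hot path: a synchronous Map lookup, no I/O.
  get(flagId) {
    return this.flags.get(flagId);
  }
}
```

Keeping the last good snapshot when a refresh fails is a design choice: a brief outage of the flag service then degrades to slightly stale flags rather than failed requests.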

6. Metrics Collection

Assignment alone is useless without outcome measurement. After a user is assigned a variant, the platform must collect conversion events and attribute them to the right experiment variant.

Event schema:

json

{
  "userId":       "user_12345",
  "experimentId": "btn_color_v2",
  "variant":      "treatment",
  "eventType":    "purchase",
  "timestamp":    1718179200000,
  "value":        49.99         // optional: revenue, duration, etc.
}

Collection pipeline:

Client Event (click / purchase / page_view) → Ingest API (stateless HTTP, write-only) → Kafka (durable queue, 50 B events/day) → Flink (stream processing, deduplication) → ClickHouse (OLAP storage, fast aggregation)

ClickHouse is purpose-built for this: columnar storage and vectorized execution mean that GROUP BY queries over billions of rows can return in under a second. A query like “count conversions by variant for experiment X in the last 7 days” reads only the columns it references (experimentId, variant, eventType, timestamp) and ignores everything else.
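The logic the pipeline applies, deduplication followed by a per-variant rollup, can be illustrated in memory on a handful of events. This sketch is not Flink or ClickHouse code. It also assumes a dedup key of (userId, experimentId, eventType, timestamp), because the event schema above carries no explicit event ID; `summarizeConversions` is a hypothetical name.

```javascript
// Deduplicate events, then aggregate per variant for one experiment and event type.
function summarizeConversions(events, experimentId, eventType) {
  const seen = new Set();      // dedup keys already processed
  const byVariant = new Map(); // variant -> running aggregate

  for (const e of events) {
    if (e.experimentId !== experimentId || e.eventType !== eventType) continue;

    // Assumed dedup key: a retried client request repeats all four fields.
    const key = [e.userId, e.experimentId, e.eventType, e.timestamp].join("|");
    if (seen.has(key)) continue;
    seen.add(key);

    if (!byVariant.has(e.variant)) {
      byVariant.set(e.variant, { users: new Set(), events: 0, revenue: 0 });
    }
    const agg = byVariant.get(e.variant);
    agg.users.add(e.userId);     // distinct converting users
    agg.events += 1;             // deduplicated event count
    agg.revenue += e.value ?? 0; // optional numeric value
  }

  // Convert the user sets to counts for reporting.
  const summary = {};
  for (const [variant, agg] of byVariant) {
    summary[variant] = { users: agg.users.size, events: agg.events, revenue: agg.revenue };
  }
  return summary;
}
```

Without the dedup step, a retried `purchase` event would double-count revenue for its variant, which is why deduplication sits in the stream processor before anything is stored.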

7. Statistical Significance Calculator

The mathematics of deciding “is this result real, or just noise?” is the hardest part of A/B testing to get right.

Z-test for proportions is the standard approach when the metric is a conversion rate:

javascript

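// One possible pFromZ: two-sided p-value for |z| under the standard normal,
// using the Abramowitz-Stegun 26.2.17 approximation of the upper tail.
function pFromZ(absZ) {
  const t = 1 / (1 + 0.2316419 * absZ);
  const density = Math.exp(-0.5 * absZ * absZ) / Math.sqrt(2 * Math.PI);
  const poly = t * (0.31938153 + t * (-0.356563782 +
    t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return 2 * density * poly;
}

function zTest(Nc, kc, Nt, kt) {
  // Nc = control visitors, kc = control conversions
  // Nt = treatment visitors, kt = treatment conversions
  const pc = kc / Nc;
  const pt = kt / Nt;
  // Pooled proportion under H0 (no difference between variants)
  const p  = (kc + kt) / (Nc + Nt);
  const se = Math.sqrt(p * (1 - p) * (1 / Nc + 1 / Nt));
  const z  = (pt - pc) / se;
  return { z: z, pValue: pFromZ(Math.abs(z)) };
}

// 95% CI for one variant's conversion rate (normal approximation)
function confidenceInterval(k, N) {
  const p  = k / N;
  const se = Math.sqrt(p * (1 - p) / N);
  return { lo: p - 1.96 * se, hi: p + 1.96 * se };
}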

8. Mutual Exclusion and Experiment Interaction

Running 10,000 experiments simultaneously creates a subtle problem: interaction effects.

If user Alice is simultaneously in:

Experiment A (button color: blue vs green)
Experiment B (page layout: wide vs narrow)

…then the narrow layout might make the blue button look better for unrelated reasons. The two experiments contaminate each other’s results.

Solution: Layered architecture

Layer 1 (UI): btn_color_v2 gets 0–50% of users, hero_image_v3 gets 50–80%, and the remaining 20% are unassigned.
Layer 2 (Ranking): search_algo_v7 gets 0–60% of users and rec_model_v4 gets 60–100%.
Layer 3 (Pricing): discount_strategy gets 0–30% of users; the remaining 70% are unassigned.

Rules:

Experiments within the same layer are mutually exclusive — a user can only be in one experiment per layer.
Experiments in different layers are orthogonal — a user can be in one experiment per layer simultaneously. The layers are designed to test independent product dimensions (UI, ranking, pricing), so interaction effects are minimized.
Each layer uses a different hash seed, so the 50 % split in Layer 1 is independent of the 60 % split in Layer 2.

Namespace isolation (alternative model): Some platforms use a single namespace of 10,000 "slots" (0–9999). Each experiment is allocated a slice of the namespace. Users are assigned to a slot via hash; the experiment that owns that slot serves them. Experiments are mutually exclusive by construction, because each slot belongs to at most one experiment.
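The layer rules can be sketched as follows, using the three example layers from this section. Everything here is illustrative: the `layers` structure, the per-layer `seed` values, and `hash32` (a stand-in seeded hash, not MurmurHash3) are assumptions made to keep the block short. A user is hashed once per layer with that layer's seed, and the experiment whose bucket range contains the result owns the user in that layer.

```javascript
// Stand-in seeded 32-bit string hash (FNV-style loop plus a final bit mix).
// Assumption: illustrative only; a production system would use MurmurHash3.
function hash32(str, seed) {
  let h = seed >>> 0;
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Each experiment owns the bucket range [from, to) inside its layer.
const layers = [
  { id: "ui", seed: 101, experiments: [
    { id: "btn_color_v2", from: 0, to: 50 },
    { id: "hero_image_v3", from: 50, to: 80 }, // buckets 80-99 stay unassigned
  ] },
  { id: "ranking", seed: 202, experiments: [
    { id: "search_algo_v7", from: 0, to: 60 },
    { id: "rec_model_v4", from: 60, to: 100 },
  ] },
  { id: "pricing", seed: 303, experiments: [
    { id: "discount_strategy", from: 0, to: 30 }, // buckets 30-99 stay unassigned
  ] },
];

// Returns, per layer, the single experiment the user belongs to (or null).
function assignLayers(userId) {
  const result = {};
  for (const layer of layers) {
    const bucket = hash32(userId, layer.seed) % 100;
    const owner = layer.experiments.find((e) => bucket >= e.from && bucket < e.to);
    result[layer.id] = owner ? owner.id : null;
  }
  return result;
}
```

Because bucket ranges within a layer never overlap, a user can land in at most one experiment per layer, and because each layer hashes with its own seed, membership in one layer says nothing about membership in another. Splitting each experiment's users into treatment and control would be a second hash on top of this, as in section 3.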

9. The Peeking Problem

This is the most important section and the one most A/B testing implementations get wrong.

The problem: A researcher launches an experiment with a planned runtime of 14 days. On day 3, they check the dashboard and see p = 0.04 (significant!). They stop the experiment and ship the change. Six months later, it turns out the feature had no effect — the early result was pure noise.

Why this happens: The Z-test p-value is only valid at the planned sample size. Checking it repeatedly — and stopping when it crosses 0.05 — is called optional stopping. It inflates the false positive rate from 5 % to as high as 30 %.

False positive simulation: Run 1,000 A/A tests (control vs control, identical variants). Check the p-value every day for 14 days. Stop and "ship" whenever p < 0.05. In a correctly run experiment, about 5% of A/A tests should appear significant. With daily peeking and optional stopping, roughly 20–25% appear significant, a four- to five-fold inflation of false discoveries, and the rate climbs past 30% as the number of looks grows.
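The A/A simulation described above is easy to reproduce. The sketch below makes several assumptions for illustration: 200 visitors per variant per day, a true conversion rate of 10% in both arms, a two-sided Z-test at the 0.05 level (|z| > 1.96), and a small seeded generator (`mulberry32`) so that runs are repeatable. "Stop as soon as p < 0.05" is modeled as "significant on at least one of the 14 daily checks".

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-sided pooled Z-test for proportions at alpha = 0.05.
function isSignificant(kc, nc, kt, nt) {
  const p = (kc + kt) / (nc + nt);
  const se = Math.sqrt(p * (1 - p) * (1 / nc + 1 / nt));
  if (se === 0) return false;
  return Math.abs((kt / nt - kc / nc) / se) > 1.96;
}

function simulateAATests({ runs = 1000, days = 14, perDay = 200, rate = 0.1, seed = 42 } = {}) {
  const rand = mulberry32(seed);
  // Number of conversions among n visitors with the same true rate.
  const conversions = (n) => {
    let k = 0;
    for (let i = 0; i < n; i++) if (rand() < rate) k++;
    return k;
  };

  let peekingHits = 0; // "shipped" because some daily check was significant
  let fixedHits = 0;   // significant at the single planned look on the last day
  for (let r = 0; r < runs; r++) {
    let kc = 0, kt = 0, n = 0;
    let everSignificant = false;
    let significantNow = false;
    for (let d = 0; d < days; d++) {
      kc += conversions(perDay);
      kt += conversions(perDay);
      n += perDay;
      significantNow = isSignificant(kc, n, kt, n);
      if (significantNow) everSignificant = true;
    }
    if (everSignificant) peekingHits++;
    if (significantNow) fixedHits++;
  }
  return { peekingRate: peekingHits / runs, fixedRate: fixedHits / runs };
}
```

Running `simulateAATests()` returns both false positive rates for the same simulated data. The exact values depend on the seed and parameters, but the fixed-horizon rate should sit near the nominal 5% while the peeking rate comes out several times higher.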

Solutions:

Approach	How it works	Trade-offs
Fixed-horizon test	Pre-commit to a sample size. Look at results only once.	Simple. But researchers always peek early anyway.
Sequential testing (mSPRT)	Always-valid p-values — mathematically correct to check at any time without inflating false positive rate.	Requires more samples to reach the same power. Used by Netflix, Booking.com.
Bayesian A/B testing	Compute P(treatment > control). Inherently valid at any sample size: results are probability statements, not a binary reject / fail-to-reject decision.	No hard significance threshold. Requires choosing a prior. Used by VWO and, until it was discontinued in 2023, Google Optimize.
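For the fixed-horizon row, the sample size to pre-commit to can be estimated before launch. The sketch below uses the standard normal-approximation formula for comparing two proportions, assuming a two-sided alpha of 0.05 and 80% power; those two settings, the function names, and the example rates are illustrative choices rather than values from the article.

```javascript
// Visitors needed per variant to detect a change from baselineRate to
// expectedRate with a two-sided test (alpha = 0.05) at 80% power.
function sampleSizePerVariant(baselineRate, expectedRate) {
  const zAlpha = 1.959964; // standard normal quantile for two-sided alpha = 0.05
  const zBeta = 0.841621;  // standard normal quantile for power = 0.80
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  const delta = expectedRate - baselineRate;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

// Days the experiment must run if all daily visitors are split across variants.
function daysToRun(nPerVariant, dailyVisitors, variants = 2) {
  return Math.ceil((nPerVariant * variants) / dailyVisitors);
}

// Example: detect a lift from 10% to 11% conversion with 5,000 visitors/day.
const nPerVariant = sampleSizePerVariant(0.10, 0.11);
const plannedDays = daysToRun(nPerVariant, 5000);
```

Small lifts are expensive: halving the detectable difference roughly quadruples the required sample, because the difference enters the formula squared.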

Bayesian framing is increasingly popular because it answers the question humans actually want: “What is the probability that treatment is better than control?” rather than “Can we reject the null hypothesis?”

pseudocode — bayesian estimate

// Beta-Binomial conjugate model
// Prior: Beta(1, 1) = uniform (no prior knowledge)
// Posterior: Beta(1 + conversions, 1 + non-conversions)

posterior_control   = Beta(1 + kc,  1 + Nc - kc)
posterior_treatment = Beta(1 + kt,  1 + Nt - kt)

// Monte Carlo: sample 10,000 times from each posterior
wins = count(sample_treatment[i] > sample_control[i]  for i in range(10000))
P(treatment > control) = wins / 10000
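A runnable version of the Monte Carlo estimate needs a Beta sampler, which JavaScript does not provide. The sketch below builds one from Gamma draws (Marsaglia-Tsang method) and Box-Muller normals, with a seeded generator for repeatability. The function names, the seed, and the choice of sampler are assumptions for illustration; the model itself is the Beta(1, 1) prior from the pseudocode.

```javascript
// Small seeded PRNG (mulberry32) so results are reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via Box-Muller.
function sampleNormal(rand) {
  let u = 0;
  while (u === 0) u = rand(); // avoid log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rand());
}

// Gamma(shape, 1) via Marsaglia-Tsang. Valid for shape >= 1, which always
// holds here because the Beta(1, 1) prior adds 1 to every count.
function sampleGamma(shape, rand) {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x, v;
    do {
      x = sampleNormal(rand);
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;
    const u = rand();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) as a ratio of two Gamma draws.
function sampleBeta(a, b, rand) {
  const x = sampleGamma(a, rand);
  const y = sampleGamma(b, rand);
  return x / (x + y);
}

// Monte Carlo estimate of P(treatment > control) under Beta(1, 1) priors.
function probTreatmentBeatsControl(Nc, kc, Nt, kt, { draws = 10000, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  let wins = 0;
  for (let i = 0; i < draws; i++) {
    const pControl = sampleBeta(1 + kc, 1 + Nc - kc, rand);
    const pTreatment = sampleBeta(1 + kt, 1 + Nt - kt, rand);
    if (pTreatment > pControl) wins++;
  }
  return wins / draws;
}
```

For identical data in both arms the estimate hovers around 0.5, and it approaches 1 as the treatment's observed conversion rate pulls clearly ahead of control's.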

10. Capacity Estimate

Concurrent experiments: 10,000
Peak assignments/sec: 5 M/s
Flag eval latency: < 1 ms
Metric events: 50 B/day
ClickHouse ingestion: ~5 TB/day
Redis flag cache: ~100 MB

Component	Scale driver	Solution
Flag assignment	5 M assignments/sec	Pure hash computation, no I/O; horizontally scalable app servers
Flag delivery	Flag bundle refreshes	Redis + CDN edge caching of flag bundles; 30 s TTL
Event ingestion	50 B events/day	Kafka (600 partitions), stateless ingest API, sendBeacon on client
Aggregation	Streaming computation	Flink for real-time rollups; ClickHouse for historical queries
Flag storage	10,000 flags × rule complexity	PostgreSQL (source of truth) + Redis (read cache); 100 MB total
Stat engine	Dashboard queries	Pre-aggregated daily rollups in ClickHouse; Z-test computed in-process
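The figures in this section are related by simple arithmetic, which is worth being able to reproduce on a whiteboard. The sketch below derives the implied averages from the stated inputs, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and an even spread of events across the day and across Kafka partitions; `capacityEstimate` is a hypothetical helper name.

```javascript
// Back-of-envelope numbers derived from the capacity figures above.
function capacityEstimate() {
  const SECONDS_PER_DAY = 86_400;
  const eventsPerDay = 50e9;     // 50 B metric events/day
  const bytesPerDay = 5e12;      // ~5 TB/day ClickHouse ingestion
  const kafkaPartitions = 600;
  const flagCacheBytes = 100e6;  // ~100 MB Redis flag cache
  const flagCount = 10_000;

  const avgEventsPerSec = eventsPerDay / SECONDS_PER_DAY; // about 579 K events/s
  return {
    avgEventsPerSec,
    eventsPerPartitionPerSec: avgEventsPerSec / kafkaPartitions, // about 965
    bytesPerEvent: bytesPerDay / eventsPerDay,                   // 100 bytes
    rowsInSevenDays: eventsPerDay * 7,                           // 350 billion
    bytesPerFlag: flagCacheBytes / flagCount,                    // 10 KB
  };
}
```

These are daily averages; peak traffic will be higher, so the per-partition figure is a floor for sizing rather than a target.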

Why ClickHouse for metrics storage?

A query like “count distinct users with purchase events, grouped by variant, for experiment X, last 7 days” over 350 billion rows (50 B events/day × 7 days) sounds terrifying, but ClickHouse can complete it in under a second because it touches only a small fraction of that data:

Columnar storage: only the userId, experimentId, variant, eventType, and timestamp columns are read
Vectorized execution: SIMD operations over column batches
MergeTree ordering: with the table sorted by (experimentId, date), the sparse primary index skips data belonging to other experiments and dates, so scans are localized

11. Full Architecture

Control Plane (low traffic): Engineer (defines experiment) → Flag Service API (CRUD + validation) → PostgreSQL (source of truth) → Redis (hot cache) → App Servers (in-memory copy)

Data Plane (high traffic): User Request (500 M users) → Assign Variant (hash(userId+expId)) → Serve Experience (< 1 ms, no I/O) → Event Fired (click/purchase) → Kafka → Flink → ClickHouse

Analysis Plane: ClickHouse Query (aggregate metrics) → Stat Engine (Z-test / Bayesian) → Dashboard (p-value, CI, lift)

The “peeking problem” has caused many teams to ship regressions. A team sees “p=0.04” after 3 days and ships, not realizing it was a statistical fluke. Netflix’s experiment platform uses sequential testing specifically to prevent premature decisions.

Key Takeaways for the Interview

When an interviewer asks “design an A/B testing platform”, they are probing for these specific insights:

No I/O on the assignment hot path. murmur3(userId + ":" + experimentId) % 100 is the entire algorithm. No database, no cache read. This is the single most important insight.
In-memory flag evaluation. App servers hold all 10,000 flags in RAM (100 MB). Flags are refreshed every 30 seconds via background poll. No network latency is added on the request path.
Separate control plane from data plane. Flag definition and editing (low traffic, strong consistency) is PostgreSQL. Flag evaluation (5 M/sec, pure compute) never touches the database.
Kafka + ClickHouse for metrics. Event write throughput (50 B/day) requires a durable queue in front of storage. A columnar OLAP store such as ClickHouse can aggregate billions of events in under a second, a workload that row-oriented OLTP databases are not designed for.
The peeking problem is a real problem. Don’t just say “compute a p-value.” Explain sequential testing (mSPRT) or Bayesian posteriors, and why naive Z-tests with optional stopping are broken.
Layer architecture for mutual exclusion. With 10,000 experiments, interaction effects are real. Experiments in the same dimension (UI, ranking, pricing) must be in the same layer and therefore mutually exclusive.

Google runs ~10,000 live search experiments at any time. Their “layered” experiment design allows stacking: one experiment tests ranking algorithm, another tests UI layout, another tests ad format. Layers minimize interaction effects.

Happy shipping — and may your p-values always be valid.