System Design: Rate Limiter — From Token Bucket to Distributed Throttling
Series #3 of 15 in System Design Interview deep-dives. Each post stands alone, but they build on each other.
The question: Design a rate limiter that limits API calls to 100 requests per minute per user. Make it work across a distributed cluster of 10 API servers.
This deceptively simple prompt hides a rabbit hole of trade-offs: accuracy vs. memory, simplicity vs. correctness, single-server vs. distributed. Let’s go deep.
1. Why Rate Limiting?
Before writing a single line of code, it’s worth internalising why this problem matters. Rate limiting sits at the intersection of reliability, fairness, and economics.
Prevent abuse & DDoS — Without limits, a single misbehaving client (or attacker) can saturate your servers and degrade service for everyone else. A rate limiter is the first line of defence.
Fair usage — In a multi-tenant system, one large customer shouldn’t be able to monopolise shared resources at the expense of smaller ones.
Cost control — Many APIs call downstream paid services (LLMs, SMS, geocoding). A single runaway script can generate a surprise $40,000 cloud bill overnight.
Real-world examples:
Public API Rate Limits in the Wild
| Service | Limit | Window | Key |
|---|---|---|---|
| Twitter / X (free tier) | 15 calls | 15 minutes | App token |
| Stripe | 100 requests | 1 second | API key |
| GitHub REST API | 5,000 requests | 1 hour | OAuth token |
| OpenAI (GPT-4o) | 500 requests | 1 minute | Org + model |
| Twilio SMS | 1 message/sec | sustained | Phone number |
2. Level 1 — In-Memory Counter (Single Server)
The simplest possible solution: keep a HashMap in RAM. For each user, store the request count and the window start time. At the start of every new minute, reset the counter.
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InMemoryRateLimiter {
    private final Map<String, long[]> store = new ConcurrentHashMap<>();
    private static final int LIMIT = 100;
    private static final long WINDOW = 60_000; // 1 minute in ms

    /**
     * Returns true if request is allowed, false if rate-limited.
     * store[userId] = [count, windowStartMs]
     */
    public synchronized boolean allow(String userId) {
        long now = System.currentTimeMillis();
        long[] state = store.computeIfAbsent(userId, k -> new long[]{0, now});
        if (now - state[1] >= WINDOW) {
            state[0] = 0;   // reset count
            state[1] = now; // new window start
        }
        if (state[0] < LIMIT) {
            state[0]++;
            return true;
        }
        return false;
    }
}
```
Problems with this approach:
- 💀 State lost on restart — every deploy wipes the counters; a user who made 99 requests just before a rolling restart gets a free 100 more.
- 💀 Doesn’t work across servers — with 10 API servers each keeping independent state, a user can make 100 requests per server = 1,000 requests per minute total.
- ✅ Zero latency (in-process)
- ✅ Fine for development / single-instance tools
3. Level 2 — Fixed Window Counter (Redis)
Move the counter to Redis. INCR is atomic; EXPIRE sets the TTL so the key auto-deletes at the end of the window.
```python
import math
import time

import redis

r = redis.Redis(host='localhost')

def allow_fixed_window(user_id: str, limit: int = 100) -> bool:
    window = math.floor(time.time() / 60)      # current 1-minute window
    key = "rl:" + user_id + ":" + str(window)  # e.g. "rl:u42:28591234"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, 60)
    count, _ = pipe.execute()
    return count <= limit
```
Simple, fast, and works across any number of servers. But there is a nasty edge case.
The Boundary Spike Problem
The sliding window counter uses a tiny fraction of the memory of the sliding window log for near-identical accuracy — that's the production sweet spot.
Suppose a user fires 100 requests at 11:00:59 (the last second of one window) and 100 more at 11:01:01 (the first second of the next). Both bursts are allowed — 200 requests in two seconds, double the stated limit. This is the fundamental flaw of fixed-window counting: a client can double-spend by straddling a window boundary.
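A minimal simulation (my own illustration, not from the original post) makes the spike concrete — a fixed 60-second window admits up to 2× the limit around a boundary:

```python
from collections import defaultdict

LIMIT = 100
counts = defaultdict(int)  # window index -> request count

def allow_fixed(ts: float) -> bool:
    """Fixed-window check: bucket requests by integer minute."""
    window = int(ts // 60)
    if counts[window] < LIMIT:
        counts[window] += 1
        return True
    return False

# 100 requests at t=59.5s land in window 0 ...
burst1 = sum(allow_fixed(59.5) for _ in range(100))
# ... and 100 more at t=60.5s land in window 1
burst2 = sum(allow_fixed(60.5) for _ in range(100))
print(burst1 + burst2)  # 200 — double the limit, within one second
```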
4. Level 3 — Sliding Window Log
Instead of a counter, store the timestamp of every request in a sorted set. On each new request:
- Remove all timestamps older than `now - 60s`
- Count remaining entries
- If count < limit → add timestamp and allow; else deny
```python
def allow_sliding_log(user_id: str, limit: int = 100) -> bool:
    now = time.time()
    cutoff = now - 60
    key = "rl_log:" + user_id
    with r.pipeline() as pipe:
        pipe.zremrangebyscore(key, 0, cutoff)  # remove old entries
        pipe.zcard(key)                        # count remaining
        pipe.zadd(key, {str(now): now})        # add current timestamp
        pipe.expire(key, 70)                   # TTL safety net
        _, count, _, _ = pipe.execute()
    if count >= limit:             # count excludes the just-added entry
        r.zrem(key, str(now))      # undo the add — request denied
        return False
    return True
```
Accuracy: Perfect — no boundary spike problem.
Memory cost: Each request stores a full 64-bit timestamp. A user making 100 req/min stores 100 entries. With 1 million users at the limit, that's ~800 MB of raw timestamps alone — and Redis sorted-set overhead multiplies the real footprint several times over. At scale this becomes painful.
5. Level 4 — Sliding Window Counter (Recommended)
The elegant middle ground: keep only two counters (current window + previous window), then compute a weighted estimate of the rolling window count.
The Weighted Formula
rolling_count = prev_count × overlap_ratio + curr_count
overlap_ratio = the fraction of the previous window that still falls inside our 60-second lookback.
If we're 15 seconds into the current window, overlap_ratio = (60 − 15) / 60 = 0.75.
Example: User made 80 requests in the previous minute. We’re 20 seconds into the current minute and they’ve made 30 requests so far.
rolling_count = 80 × (40/60) + 30 = 53.3 + 30 = 83.3 → allowed (< 100)
```python
def allow_sliding_counter(user_id: str, limit: int = 100) -> bool:
    now = time.time()
    window_size = 60.0
    curr_window = math.floor(now / window_size)
    prev_window = curr_window - 1
    elapsed = now - (curr_window * window_size)
    overlap = (window_size - elapsed) / window_size

    curr_key = "rl:" + user_id + ":" + str(curr_window)
    prev_key = "rl:" + user_id + ":" + str(prev_window)

    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.incr(curr_key)
    pipe.expire(curr_key, 120)
    prev_count_raw, curr_count, _ = pipe.execute()

    prev_count = int(prev_count_raw) if prev_count_raw else 0
    rolling = prev_count * overlap + curr_count
    if rolling > limit:
        r.decr(curr_key)  # undo the increment — request denied
        return False
    return True
```
Memory: 2 keys per user (each storing a single integer). At the 100 req/min limit that is 50× fewer entries than the sliding log (2 vs. up to 100 per user), with accuracy typically within 0.1% of the true rolling count.
6. Level 5 — Token Bucket (Interactive Demo)
Token bucket is the algorithm used by most major API gateways (AWS API Gateway, Nginx, Kong). The intuition:
- A bucket holds up to `capacity` tokens
- Tokens are added at `refill_rate` tokens/second
- Each request consumes 1 token
- If the bucket is empty → 429 Too Many Requests
- The bucket can accumulate tokens up to `capacity`, enabling burst handling
Token bucket allows bursting, which is usually what you want for APIs — a legitimate user doing a one-time large batch shouldn't be penalised the same way as a sustained abuser.
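The interactive demo doesn't survive in text form, but the core mechanics can be sketched in a few lines (a single-process sketch with illustrative parameters, not a production implementation):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full: allows an initial burst
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Accumulate tokens for the idle period, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A bucket of 10 tokens refilling at 100/60 ≈ 1.67 tokens/s sustains
# 100 req/min on average, with a burst allowance of 10.
bucket = TokenBucket(capacity=10, refill_rate=100 / 60)
burst = sum(bucket.allow() for _ in range(15))
print(burst)  # 10 — the other 5 must wait for tokens to refill
```

Note the asymmetry with fixed windows: the burst size (`capacity`) and the sustained rate (`refill_rate`) are tuned independently.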
7. Level 6 — Leaky Bucket
Leaky bucket is the mirror image of token bucket. Instead of tokens, think of a queue with a hole at the bottom that drains at a fixed rate.
🪣 Token Bucket (burst-friendly): Tokens accumulate when idle. Burst requests consume banked tokens. Allows short bursts above the average rate. Good for user-facing APIs where occasional spikes are legitimate.

🚰 Leaky Bucket (no bursting, smooth output): Requests enter a FIFO queue. Processed at a fixed drain rate. No bursting — requests are smoothed to a constant output rate. Ideal for metered billing or downstream services that can't handle spikes.
```python
import time

class LeakyBucket:
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity      # max queue depth
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = 0                # current queue depth
        self.last_drain = time.time()

    def _drain(self):
        now = time.time()
        elapsed = now - self.last_drain
        drained = elapsed * self.drain_rate
        self.queue = max(0, self.queue - drained)
        self.last_drain = now

    def allow(self) -> bool:
        self._drain()
        if self.queue < self.capacity:
            self.queue += 1
            return True
        return False  # queue full — drop request
```
8. Level 7 — Distributed Rate Limiting
This is where the interview question gets interesting. With 10 API servers, naive approaches fail:
Approach A — Sticky sessions: Route each user to the same server using consistent hashing. Simple, but defeats load balancing; one hot server is a bottleneck.
Approach B — Centralized Redis: All servers talk to a single Redis. Works, but Redis becomes a single point of failure and adds 1–2ms of network latency to every request.
Approach C — Redis Cluster + Lua atomic script: Use a Redis Cluster (3 primaries, 3 replicas). The critical insight: the check-and-increment must be atomic. A Lua script in Redis runs atomically — no other command can interleave.
Lua scripts in Redis are atomic — this is crucial for rate limiting. Without atomicity you have a race condition where two requests can both pass a limit of 1.
```lua
-- rate_limit.lua
-- KEYS[1] = rate limit key (e.g. "rl:u42:28591234")
-- ARGV[1] = limit (e.g. "100")
-- ARGV[2] = window_seconds (e.g. "60")
-- Returns: {allowed, current_count, ttl}
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local ttl = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
    -- First request in this window — set expiry
    redis.call('EXPIRE', key, ttl)
end

if count > limit then
    -- Undo the increment — we must not let it drift upward
    redis.call('DECR', key)
    return {0, count - 1, redis.call('TTL', key)}
end
return {1, count, redis.call('TTL', key)}
```
```python
import math
import time

import redis

r = redis.Redis(host='redis-cluster')

# Load once at startup — returns a SHA hash
with open('rate_limit.lua') as f:
    LUA_SHA = r.script_load(f.read())

def allow(user_id: str, limit: int = 100) -> tuple[bool, int, int]:
    window = math.floor(time.time() / 60)
    key = "rl:" + user_id + ":" + str(window)
    # EVALSHA avoids re-sending the script body on every request
    allowed, count, ttl = r.evalsha(LUA_SHA, 1, key, limit, 60)
    return bool(allowed), int(count), int(ttl)
```
Approach D — Local approximation + async sync: Each server keeps a local counter. Every 100ms, all servers gossip their deltas to a central store and pull the global total. A request is allowed if local_count + last_known_global < limit. This trades perfect accuracy for near-zero added latency. Netflix uses a variant of this.
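A minimal single-process sketch of this local-approximation idea (class and method names are my own, and the dict stands in for the shared store — this is not Netflix's actual design):

```python
import threading

class ApproxLimiter:
    """One instance per API server: counts locally, syncs periodically."""

    def __init__(self, limit: int):
        self.limit = limit
        self.local_delta = 0        # requests seen since the last sync
        self.last_known_global = 0  # stale view of the cluster-wide count
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            # Admit based on the (possibly stale) global count plus our delta
            if self.last_known_global + self.local_delta < self.limit:
                self.local_delta += 1
                return True
            return False

    def sync(self, store: dict):
        """Push our delta and pull the global total (would run every ~100ms;
        `store` stands in for Redis or gossip state in this sketch)."""
        with self.lock:
            store["total"] = store.get("total", 0) + self.local_delta
            self.local_delta = 0
            self.last_known_global = store["total"]

# Two servers sharing a 100/min budget: staleness lets them briefly overshoot
store = {}
a, b = ApproxLimiter(100), ApproxLimiter(100)
allowed = sum(a.allow() for _ in range(60)) + sum(b.allow() for _ in range(60))
print(allowed)        # 120 — both admit 60 before any sync happens
a.sync(store); b.sync(store)
b_denied = b.allow()  # False — b's global view (120) is over the limit
a_stale = a.allow()   # True  — a's view is still 60; the accuracy trade-off
a.sync(store)
a_after = a.allow()   # False — after syncing, a sees 121 and denies
```

The overshoot in the example is exactly the accuracy cost the approach accepts in exchange for keeping the hot path entirely in-process.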
9. Level 8 — Rate Limiting by Different Keys
Real systems enforce limits at multiple granularities simultaneously.
```
# By user ID (most common)
rate_limit:user:{userId}:{window}

# By IP address (unauthenticated traffic)
rate_limit:ip:{ipAddr}:{window}

# By API key (for partner integrations)
rate_limit:apikey:{hashedKey}:{window}

# By endpoint (prevent expensive endpoints from being hammered)
rate_limit:endpoint:{endpoint}:{window}

# Hierarchical: user + specific endpoint
rate_limit:user:{userId}:endpoint:{endpoint}:{window}

# Global (platform-wide circuit breaker)
rate_limit:global:{window}
```
```python
def allow_hierarchical(user_id: str, endpoint: str, ip: str) -> bool:
    # All three must pass — deny on the first failure.
    # Note: counters checked before a failing one are still incremented.
    checks = [
        (ip, 30),                        # 30/min per IP (unauthenticated guard)
        (user_id, 100),                  # 100/min per user (main SLA)
        (user_id + ":" + endpoint, 20),  # 20/min per user per endpoint
    ]
    for key_suffix, limit in checks:
        allowed, _, _ = allow(key_suffix, limit)
        if not allowed:
            return False
    return True
```
10. Algorithm Comparison
Fixed Window Counter

| Property | Value |
|---|---|
| Memory per user | O(1) — 1 integer key |
| Time complexity | O(1) |
| Allows bursting | Yes (at window boundary) |
| Boundary spike problem | Yes — up to 2× limit in 1s |
| Redis operations | 2 (INCR + EXPIRE) |
| Production use | Simple internal tools, batch jobs |
The simplest distributed approach. Works well when brief spikes at window boundaries are acceptable and you prioritise operational simplicity. The 2× burst window is often not a problem for internal APIs.
Sliding Window Log

| Property | Value |
|---|---|
| Memory per user | O(requests/window) — 1 timestamp per request |
| Time complexity | O(log N) — sorted set operations |
| Allows bursting | Yes (accurate rolling window) |
| Boundary spike problem | None — true sliding window |
| Redis operations | 4 (ZREMRANGE + ZCARD + ZADD + EXPIRE) |
| Production use | When accuracy is critical; low-traffic premium APIs |
Perfectly accurate but costly at scale. Storing a timestamp per request means at 100 req/min × 1M users = 100M Redis entries just for rate limit logs. Most systems can't afford this overhead.
Sliding Window Counter

| Property | Value |
|---|---|
| Memory per user | O(1) — 2 integer keys |
| Time complexity | O(1) |
| Allows bursting | Yes (within rolling window) |
| Boundary spike problem | ~0.1% error — statistically negligible |
| Redis operations | 3 (GET prev + INCR curr + EXPIRE) |
| Production use | Recommended default — Cloudflare, Figma |
The sweet spot: near-perfect accuracy with O(1) memory. The weighted formula introduces a tiny approximation error (~0.1%) in exchange for a small fraction of the memory the log approach needs. This is what you should propose in most interviews.
Token Bucket

| Property | Value |
|---|---|
| Memory per user | O(1) — count + last refill timestamp |
| Time complexity | O(1) |
| Allows bursting | Yes — by design |
| Boundary spike problem | None — continuous time model |
| Redis operations | 1 Lua script (atomic GET+SET) |
| Production use | AWS API Gateway, Kong, Nginx limit_req |
Best for traffic shaping. Burst capacity is explicit and controlled. Ideal when you want to allow a user to make 10 requests immediately after a long idle period, but not sustain 10× the limit indefinitely.
11. Response Headers — Telling Clients What Happened
Retry-After is the most important header — a good client will back off and retry after that interval rather than hammering the server and making things worse.
A good rate limiter communicates its state to clients through standardised headers. This enables clients to self-throttle and implement exponential backoff correctly.
Allowed request (200 OK):
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1713261600 # Unix epoch of window reset
Content-Type: application/json
Rate-limited request (429 Too Many Requests):
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1713261600
Retry-After: 37 # seconds until window resets
Content-Type: application/json
{
"error": "rate_limit_exceeded",
"message": "Too many requests. Retry after 37 seconds.",
"limit": 100,
"window": "60s"
}
```python
import time

from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware

class RateLimitMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        user_id = request.headers.get("X-User-Id", request.client.host)
        allowed, count, ttl = allow(user_id)  # the Lua-backed limiter above
        if not allowed:
            return Response(
                content='{"error":"rate_limit_exceeded"}',
                status_code=429,
                headers={
                    "X-RateLimit-Limit": "100",
                    "X-RateLimit-Remaining": "0",
                    "Retry-After": str(ttl),
                    "Content-Type": "application/json",
                },
            )
        response = await call_next(request)
        response.headers["X-RateLimit-Limit"] = "100"
        response.headers["X-RateLimit-Remaining"] = str(max(0, 100 - count))
        response.headers["X-RateLimit-Reset"] = str(int(time.time()) + ttl)
        return response
```
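On the client side, a well-behaved consumer should honour these headers. A hypothetical helper (the function name and fallback policy are my own) that computes how long to wait before retrying:

```python
def backoff_seconds(status: int, headers: dict, attempt: int) -> float:
    """Seconds a polite client should wait before retrying a request."""
    if status != 429:
        return 0.0  # not rate-limited — no wait needed
    if "Retry-After" in headers:
        return float(headers["Retry-After"])  # server-specified wait wins
    return min(2 ** attempt, 60)              # else exponential backoff, capped

print(backoff_seconds(200, {}, 0))                     # 0.0
print(backoff_seconds(429, {"Retry-After": "37"}, 0))  # 37.0
print(backoff_seconds(429, {}, 3))                     # 8
```

Preferring the server's Retry-After over a locally computed backoff avoids the thundering-herd effect of many clients retrying on the same schedule.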
The Interview Answer
If asked “Design a rate limiter” in a system design interview, here is the progression to walk through:
- Clarify — per-user or per-IP? What granularity (req/sec, req/min, req/day)? Single server or distributed? Hard limit or soft?
- Start simple — in-memory HashMap. Acknowledge its limitations immediately.
- Evolve to Redis — INCR + EXPIRE. Mention atomicity.
- Identify the edge case — fixed window boundary spike. Propose sliding window counter.
- Discuss Token Bucket — for burst-tolerant systems. Mention leaky bucket as the smoothed alternative.
- Address distribution — Redis Cluster + Lua script for atomicity. Mention gossip/local-approx as the high-performance alternative.
- Round off — key schema, response headers, monitoring (track rejection rates per user).
The sliding window counter + Redis Cluster + Lua is the answer that will impress most interviewers. It is O(1) memory, near-perfect accuracy, distributed, and atomic.