System Design: Rate Limiter — From Token Bucket to Distributed Throttling

Series #3 of 15 in System Design Interview deep-dives. Each post stands alone, but they build on each other.

The question: Design a rate limiter that limits API calls to 100 requests per minute per user. Make it work across a distributed cluster of 10 API servers.

This deceptively simple prompt hides a rabbit hole of trade-offs: accuracy vs. memory, simplicity vs. correctness, single-server vs. distributed. Let’s go deep.


1. Why Rate Limiting?

Before writing a single line of code, it’s worth internalising why this problem matters. Rate limiting sits at the intersection of reliability, fairness, and economics.

Prevent abuse & DDoS — Without limits, a single misbehaving client (or attacker) can saturate your servers and degrade service for everyone else. A rate limiter is the first line of defence.

Fair usage — In a multi-tenant system, one large customer shouldn’t be able to monopolise shared resources at the expense of smaller ones.

Cost control — Many APIs call downstream paid services (LLMs, SMS, geocoding). A single runaway script can generate a surprise $40,000 cloud bill overnight.

Real-world examples:

Public API Rate Limits in the Wild

Service | Limit | Window | Key
Twitter / X (free tier) | 15 calls | 15 minutes | App token
Stripe | 100 requests | 1 second | API key
GitHub REST API | 5,000 requests | 1 hour | OAuth token
OpenAI (GPT-4o) | 500 requests | 1 minute | Org + model
Twilio SMS | 1 message/sec | sustained | Phone number

2. Level 1 — In-Memory Counter (Single Server)

The simplest possible solution: keep a HashMap in RAM. For each user, store the request count and the window start time. At the start of every new minute, reset the counter.

Java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InMemoryRateLimiter {
    private final Map<String, long[]> store = new ConcurrentHashMap<>();
    private static final int LIMIT  = 100;
    private static final long WINDOW = 60_000; // 1 minute in ms

    /**
     * Returns true if request is allowed, false if rate-limited.
     * store[userId] = [count, windowStartMs]
     */
    public synchronized boolean allow(String userId) {
        long now = System.currentTimeMillis();
        long[] state = store.computeIfAbsent(userId, k -> new long[]{0, now});

        if (now - state[1] >= WINDOW) {
            state[0] = 0;      // reset count
            state[1] = now;   // new window start
        }
        if (state[0] < LIMIT) { state[0]++; return true; }
        return false;
    }
}

Problems with this approach:

  • 💀 State lost on restart — every deploy wipes the counters; a user who made 99 requests just before a rolling restart gets a free 100 more.
  • 💀 Doesn’t work across servers — with 10 API servers each keeping independent state, a user can make 100 requests per server = 1,000 requests per minute total.
  • ✅ Zero latency (in-process)
  • ✅ Fine for development / single-instance tools

3. Level 2 — Fixed Window Counter (Redis)

Move the counter to Redis. INCR is atomic; EXPIRE sets the TTL so the key auto-deletes at the end of the window.

Python
import redis, time, math

r = redis.Redis(host='localhost')

def allow_fixed_window(user_id: str, limit: int = 100) -> bool:
    window = math.floor(time.time() / 60)          # current 1-minute window
    key    = "rl:" + user_id + ":" + str(window) # e.g. "rl:u42:28591234"

    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, 60)
    count, _ = pipe.execute()

    return count <= limit

Simple, fast, and works across any number of servers. But there is a nasty edge case.

The Boundary Spike Problem

Window A (0:00–1:00)  |  Window B (1:00–2:00)
A user sends 100 requests in the last 2 s of window A (around t = 0:58), then 100 more in the first 2 s of window B (around t = 1:02) — 200 requests in about 4 seconds, yet both windows report ≤ 100.


This is the fundamental flaw of fixed-window counting: a client can double-spend by straddling a window boundary.
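To see the double-spend concretely, here is a toy single-process fixed-window counter (a sketch for illustration, not the Redis version above) that admits all 200 boundary-straddling requests:

```python
def make_fixed_window(limit: int = 100, window: int = 60):
    """Toy in-memory fixed-window counter keyed by window number."""
    counts: dict[int, int] = {}

    def allow(ts: float) -> bool:
        w = int(ts // window)              # which 60s window this falls in
        counts[w] = counts.get(w, 0) + 1
        return counts[w] <= limit

    return allow

allow = make_fixed_window()
# 100 requests at t = 58s (window 0), then 100 at t = 61s (window 1):
# every single one is admitted — 200 requests in ~3 seconds.
admitted = sum(allow(t) for t in [58.0] * 100 + [61.0] * 100)
```

Each window individually stays at its limit of 100, so the limiter sees nothing wrong.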


4. Level 3 — Sliding Window Log

Instead of a counter, store the timestamp of every request in a sorted set. On each new request:

  1. Remove all timestamps older than now - 60s
  2. Count remaining entries
  3. If count < limit → add timestamp and allow; else deny
Python / Redis
def allow_sliding_log(user_id: str, limit: int = 100) -> bool:
    now    = time.time()
    cutoff = now - 60
    key    = "rl_log:" + user_id

    with r.pipeline() as pipe:
        pipe.zremrangebyscore(key, 0, cutoff)  # remove old entries
        pipe.zcard(key)                        # count remaining
        pipe.zadd(key, {str(now): now})        # add current timestamp
        pipe.expire(key, 70)                   # TTL safety net
        _, count, _, _ = pipe.execute()

    if count >= limit:          # count is taken *before* the add
        r.zrem(key, str(now))   # undo the add — request denied
        return False
    return True

Accuracy: Perfect — no boundary spike problem.

Memory cost: Each request stores a full 64-bit timestamp. A user making 100 req/min stores 100 entries. With 1 million users at the limit, that’s ~800 MB just for timestamps. At scale this becomes painful.
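The back-of-envelope arithmetic behind that figure:

```python
users        = 1_000_000   # users currently at the limit
entries      = 100         # one timestamp per request in the window
bytes_per_ts = 8           # 64-bit timestamp

raw = users * entries * bytes_per_ts
# 800_000_000 bytes ≈ 800 MB — and that is *before* Redis sorted-set
# per-entry overhead, which multiplies the real footprint several times.
```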


5. Level 4 — Sliding Window Counter

The elegant middle ground: keep only two counters (current window + previous window), then compute a weighted estimate of the rolling window count.

The Weighted Formula

rolling_count = prev_count × overlap_ratio + curr_count

overlap_ratio = the fraction of the previous window that still falls inside our 60-second lookback.
If we're 15 seconds into the current window, overlap_ratio = (60 − 15) / 60 = 0.75.

Example: User made 80 requests in the previous minute. We’re 20 seconds into the current minute and they’ve made 30 requests so far.

rolling_count = 80 × (40/60) + 30 = 53.3 + 30 = 83.3 → allowed (< 100)
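The same arithmetic as a quick sanity check:

```python
prev_count, curr_count = 80, 30
window, elapsed = 60.0, 20.0           # 20 s into the current window

overlap = (window - elapsed) / window  # fraction of prev window in lookback
rolling = prev_count * overlap + curr_count
# rolling = 80 * (40/60) + 30 ≈ 83.33 → under the limit of 100, so allowed
```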

Python / Redis
def allow_sliding_counter(user_id: str, limit: int = 100) -> bool:
    now         = time.time()
    window_size = 60.0
    curr_window = math.floor(now / window_size)
    prev_window = curr_window - 1
    elapsed     = now - (curr_window * window_size)
    overlap     = (window_size - elapsed) / window_size

    curr_key = "rl:" + user_id + ":" + str(curr_window)
    prev_key = "rl:" + user_id + ":" + str(prev_window)

    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.incr(curr_key)
    pipe.expire(curr_key, 120)
    prev_count_raw, curr_count, _ = pipe.execute()

    prev_count   = int(prev_count_raw) if prev_count_raw else 0
    rolling      = prev_count * overlap + curr_count

    if rolling > limit:
        r.decr(curr_key)   # undo the increment — request denied
        return False
    return True

Memory: 2 keys per user (each storing a single integer) — orders of magnitude less memory than the sliding log at scale, with accuracy typically within a fraction of a percent of the true rolling count.


6. Level 5 — Token Bucket (Interactive Demo)

Token bucket is the algorithm used by most major API gateways (AWS API Gateway, Nginx, Kong). The intuition:

  • A bucket holds up to capacity tokens
  • Tokens are added at refill_rate tokens/second
  • Each request consumes 1 token
  • If the bucket is empty → 429 Too Many Requests
  • The bucket can accumulate tokens up to capacity, enabling burst handling

[Interactive demo: a bucket of capacity 10 tokens refilling at 1 token/sec; incoming requests consume tokens, with live counters for tokens remaining, total allowed, and total rejected.]

Token bucket allows bursting, which is usually what you want for APIs — a legitimate user doing a one-time large batch shouldn’t be penalised the same way as a sustained abuser.


7. Level 6 — Leaky Bucket

Leaky bucket is the mirror image of token bucket. Instead of tokens, think of a queue with a hole at the bottom that drains at a fixed rate.

🪣 Token Bucket — tokens accumulate when idle; burst requests consume banked tokens, allowing short bursts above the average rate. Good for user-facing APIs where occasional spikes are legitimate. (Burst-friendly.)

🚰 Leaky Bucket — requests enter a FIFO queue and are processed at a fixed drain rate. No bursting: output is smoothed to a constant rate. Ideal for metered billing or for downstream services that can't handle spikes. (No bursting; smooth output.)

Python
import time

class LeakyBucket:
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity   = capacity    # max queue depth
        self.drain_rate = drain_rate  # requests processed per second
        self.queue      = 0           # current queue depth
        self.last_drain = time.time()

    def _drain(self):
        now      = time.time()
        elapsed  = now - self.last_drain
        drained  = elapsed * self.drain_rate
        self.queue = max(0, self.queue - drained)
        self.last_drain = now

    def allow(self) -> bool:
        self._drain()
        if self.queue < self.capacity:
            self.queue += 1
            return True
        return False   # queue full — drop request

8. Level 7 — Distributed Rate Limiting

This is where the interview question gets interesting. With 10 API servers, naive approaches fail:

Approach A — Sticky sessions: Route each user to the same server using consistent hashing. Simple, but defeats load balancing; one hot server is a bottleneck.

Approach B — Centralized Redis: All servers talk to a single Redis. Works, but Redis becomes a single point of failure and adds 1–2ms of network latency to every request.

Approach C — Redis Cluster + Lua atomic script: Use a Redis Cluster (3 primaries, 3 replicas). The critical insight: the check-and-increment must be atomic. A Lua script in Redis runs atomically — no other command can interleave.

Lua scripts in Redis are atomic — this is crucial for rate limiting. Without atomicity you have a race condition where two requests can both pass a limit of 1.

Lua (Redis EVALSHA)
-- rate_limit.lua
-- KEYS[1] = rate limit key  (e.g. "rl:u42:28591234")
-- ARGV[1] = limit            (e.g. "100")
-- ARGV[2] = window_seconds   (e.g. "60")
-- Returns: {allowed, current_count, ttl}

local key   = KEYS[1]
local limit = tonumber(ARGV[1])
local ttl   = tonumber(ARGV[2])

local count = redis.call('INCR', key)

if count == 1 then
    -- First request in this window — set expiry
    redis.call('EXPIRE', key, ttl)
end

if count > limit then
    -- Undo the increment — we must not let it drift upward
    redis.call('DECR', key)
    return {0, count - 1, redis.call('TTL', key)}
end

return {1, count, redis.call('TTL', key)}
Python — loading and calling the Lua script
import redis, math, time

r = redis.Redis(host='redis-cluster')

# Load once at startup — returns a SHA hash
with open('rate_limit.lua') as f:
    LUA_SHA = r.script_load(f.read())

def allow(user_id: str, limit: int = 100) -> tuple[bool, int, int]:
    window = math.floor(time.time() / 60)
    key    = "rl:" + user_id + ":" + str(window)
    # evalsha avoids re-sending the script body on every request
    allowed, count, ttl = r.evalsha(LUA_SHA, 1, key, limit, 60)
    return bool(allowed), int(count), int(ttl)

Approach D — Local approximation + async sync: Each server keeps a local counter. Every 100ms, all servers gossip their deltas to a central store and pull the global total. A request is allowed if local_count + last_known_global < limit. This trades perfect accuracy for near-zero added latency. Netflix uses a variant of this.
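A hypothetical sketch of Approach D — the class name, the dict standing in for the central store, and the timer-driven sync are all illustrative assumptions, not Netflix's actual implementation:

```python
import time

class LocalApproxLimiter:
    """Each server counts locally and periodically folds its delta into a
    shared global total. A plain dict stands in for the central store."""

    def __init__(self, shared_store: dict, limit: int, sync_interval: float = 0.1):
        self.shared        = shared_store  # stands in for Redis / gossip state
        self.limit         = limit
        self.sync_interval = sync_interval
        self.local_delta   = 0             # requests seen since last sync
        self.last_global   = 0             # global count as of last sync
        self.last_sync     = time.monotonic()

    def _maybe_sync(self, key: str) -> None:
        now = time.monotonic()
        if now - self.last_sync >= self.sync_interval:
            # Push our delta, pull the new global total
            self.shared[key] = self.shared.get(key, 0) + self.local_delta
            self.local_delta = 0
            self.last_global = self.shared[key]
            self.last_sync = now

    def allow(self, key: str) -> bool:
        self._maybe_sync(key)
        # Approximate: global total may be up to sync_interval stale
        if self.last_global + self.local_delta < self.limit:
            self.local_delta += 1
            return True
        return False
```

Between syncs, N servers can each overshoot by their local delta, so the global limit can briefly be exceeded — that is the accuracy/latency trade being made.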


9. Level 8 — Rate Limiting by Different Keys

Real systems enforce limits at multiple granularities simultaneously.

Key schema
# By user ID (most common)
rate_limit:user:{userId}:{window}

# By IP address (unauthenticated traffic)
rate_limit:ip:{ipAddr}:{window}

# By API key (for partner integrations)
rate_limit:apikey:{hashedKey}:{window}

# By endpoint (prevent expensive endpoints from being hammered)
rate_limit:endpoint:{endpoint}:{window}

# Hierarchical: user + specific endpoint
rate_limit:user:{userId}:endpoint:{endpoint}:{window}

# Global (platform-wide circuit breaker)
rate_limit:global:{window}
Python — hierarchical rate limiting
def allow_hierarchical(user_id: str, endpoint: str, ip: str) -> bool:
    # All three must pass — deny on the first failure
    checks = [
        (ip,                   30),   # 30/min per IP (unauthenticated guard)
        (user_id,             100),   # 100/min per user (main SLA)
        (user_id + ":" + endpoint, 20),   # 20/min per user per endpoint
    ]
    for key_suffix, limit in checks:
        allowed, _, _ = allow(key_suffix, limit)
        if not allowed:
            return False
    return True

10. Algorithm Comparison

Fixed Window Counter
  • Memory per user: O(1) — 1 integer key
  • Time complexity: O(1)
  • Allows bursting: Yes (at window boundary)
  • Boundary spike problem: Yes — up to 2× limit in seconds
  • Redis operations: 2 (INCR + EXPIRE)
  • Production use: Simple internal tools, batch jobs

The simplest distributed approach. Works well when brief spikes at window boundaries are acceptable and you prioritise operational simplicity. The 2× burst window is often not a problem for internal APIs.

Sliding Window Log
  • Memory per user: O(requests/window) — 1 timestamp per request
  • Time complexity: O(log N) — sorted set operations
  • Allows bursting: Yes (accurate rolling window)
  • Boundary spike problem: None — true sliding window
  • Redis operations: 4 (ZREMRANGEBYSCORE + ZCARD + ZADD + EXPIRE)
  • Production use: When accuracy is critical; low-traffic premium APIs

Perfectly accurate but costly at scale. Storing a timestamp per request means at 100 req/min × 1M users = 100M Redis entries just for rate limit logs. Most systems can't afford this overhead.

Sliding Window Counter
  • Memory per user: O(1) — 2 integer keys
  • Time complexity: O(1)
  • Allows bursting: Yes (within rolling window)
  • Boundary spike problem: ~0.1% error — statistically negligible
  • Redis operations: 3 (GET prev + INCR curr + EXPIRE)
  • Production use: Recommended default — Cloudflare, Figma

The sweet spot: near-perfect accuracy with O(1) memory. The weighted formula introduces a tiny approximation error (typically well under 1%) in exchange for orders of magnitude less memory than the log approach. This is what you should propose in most interviews.

Token Bucket
  • Memory per user: O(1) — count + last refill timestamp
  • Time complexity: O(1)
  • Allows bursting: Yes — by design
  • Boundary spike problem: None — continuous time model
  • Redis operations: 1 Lua script (atomic GET+SET)
  • Production use: AWS API Gateway, Kong, Nginx limit_req

Best for traffic shaping. Burst capacity is explicit and controlled. Ideal when you want to allow a user to make 10 requests immediately after a long idle period, but not sustain 10× the limit indefinitely.

[Interactive simulation: 150 requests in 60 s (limit = 100/min, burst of 10 requests at t = 58 s), comparing how many of the 150 are admitted by Fixed Window, Sliding Log, Sliding Counter, and Token Bucket.]

11. Response Headers — Telling Clients What Happened

Retry-After is the most important header — a good client will back off and retry after that interval rather than hammering the server and making things worse.

A good rate limiter communicates its state to clients through standardised headers. This enables clients to self-throttle and implement exponential backoff correctly.

Allowed request (200 OK):

200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1713261600 # Unix epoch of window reset
Content-Type: application/json

Rate-limited request (429 Too Many Requests):

429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1713261600
Retry-After: 37 # seconds until window resets
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Retry after 37 seconds.",
  "limit": 100,
  "window": "60s"
}
Python / FastAPI middleware
import time

from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware

class RateLimitMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        user_id  = request.headers.get("X-User-Id", request.client.host)
        allowed, count, ttl = allow(user_id)

        if not allowed:
            return Response(
                content='{"error":"rate_limit_exceeded"}',
                status_code=429,
                headers={
                    "X-RateLimit-Limit":     "100",
                    "X-RateLimit-Remaining": "0",
                    "Retry-After":           str(ttl),
                    "Content-Type":          "application/json",
                }
            )

        response = await call_next(request)
        response.headers["X-RateLimit-Limit"]     = "100"
        response.headers["X-RateLimit-Remaining"] = str(100 - count)
        response.headers["X-RateLimit-Reset"]     = str(int(time.time()) + ttl)
        return response

The Interview Answer

If asked “Design a rate limiter” in a system design interview, here is the progression to walk through:

  1. Clarify — per-user or per-IP? What granularity (req/sec, req/min, req/day)? Single server or distributed? Hard limit or soft?
  2. Start simple — in-memory HashMap. Acknowledge its limitations immediately.
  3. Evolve to Redis — INCR + EXPIRE. Mention atomicity.
  4. Identify the edge case — fixed window boundary spike. Propose sliding window counter.
  5. Discuss Token Bucket — for burst-tolerant systems. Mention leaky bucket as the smoothed alternative.
  6. Address distribution — Redis Cluster + Lua script for atomicity. Mention gossip/local-approx as the high-performance alternative.
  7. Round off — key schema, response headers, monitoring (track rejection rates per user).

The sliding window counter + Redis Cluster + Lua is the answer that will impress most interviewers. It is O(1) memory, near-perfect accuracy, distributed, and atomic.