System Design: Notification System — Push, Email, SMS at Billions of Scale

Series System Design Interview Prep — #8 of 15

APNs (Apple Push Notification service) has a 4 KB payload limit per notification: if your payload is larger, store the data server-side and send a “fetch” notification instead.

Notifications are the silent backbone of modern apps. A like on Instagram, a two-factor authentication code, a “your package has shipped” text — none of these feel special until the moment they fail to arrive. Designing a system that reliably delivers billions of notifications across push, email, and SMS channels is a deceptively hard problem that touches queuing theory, distributed systems, third-party API quirks, and user experience all at once.

The question: Design a notification system like Facebook/Instagram’s that sends push notifications, emails, and SMS. Handle 10 million notifications/day across multiple channels, with delivery guarantees and user preferences.


1. The Problem

  • 10M notifications/day (~116/sec average)
  • 2,778/sec peak (a 10M marketing blast sent over 1 hour)
  • 3 channels: Push · Email · SMS
  • 99.9% target delivery (≤ 9 hours downtime/yr)

Notifications are everywhere: a like on Instagram, a payment receipt, a delivery update. Three delivery channels exist, each with radically different characteristics:

  • Push (iOS APNs / Android FCM) — free, sub-second latency, but only reaches online devices and has a 4 KB payload limit
  • Email (SendGrid/SES) — cheap at $0.0001/msg, high reliability (~99%), but latency ranges from 1 to 60 seconds
  • SMS (Twilio) — expensive at $0.0075/msg, near-universal reach, 98% open rate, ~99.9% delivery

The core challenge: at 10M/day average, a single marketing blast can spike to 2,778 messages/second — a 24× surge. The system must absorb that spike without losing messages, slowing down the user-facing API, or hammering third-party provider rate limits.
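The headline numbers are worth being able to re-derive on a whiteboard; a quick back-of-envelope sketch:

```python
# Back-of-envelope traffic math for the stated load.
DAILY_VOLUME = 10_000_000

avg_per_sec = DAILY_VOLUME / 86_400         # spread evenly over 24 hours
blast_per_sec = DAILY_VOLUME / 3_600        # entire volume sent in one hour
surge_factor = blast_per_sec / avg_per_sec  # headroom the queue must absorb

print(round(avg_per_sec))    # 116
print(round(blast_per_sec))  # 2778
print(round(surge_factor))   # 24
```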


2. Level 1 — Synchronous, In-Request Notification

The simplest possible design: when a user likes a photo, the API handler calls APNs directly before returning a response.

python
def like_photo(user_id, photo_id):
    photo = db.get_photo(photo_id)
    db.create_like(user_id, photo_id)

    # BAD: synchronous notification inline
    device_token = db.get_device_token(photo.owner_id)
    apns.send(device_token, "Someone liked your photo!")
    # if APNs is slow or down, the entire API call hangs

    return "ok"

Problems:

  • If APNs responds in 3s, every like API call takes 3s
  • If APNs is down, likes fail completely
  • No retry — if the call fails, the notification is lost forever
  • No rate limiting — a viral post could spawn thousands of concurrent APNs connections

3. Level 2 — Async with a Simple Queue

Decouple the notification from the API response. The handler publishes an event to a queue; a background worker consumes and sends it.

python
# API handler — returns instantly
def like_photo(user_id, photo_id):
    photo = db.get_photo(photo_id)   # needed below for the owner's ID
    db.create_like(user_id, photo_id)
    queue.push({
        "type": "like",
        "recipient_id": photo.owner_id,
        "actor_id": user_id,
    })
    return "ok"  # fast!

# Background worker
def worker_loop():
    while True:
        event = queue.pop()
        device_token = db.get_device_token(event["recipient_id"])
        apns.send(device_token, build_message(event))

This is significantly better: the API is fast and the notification is decoupled. But there are still gaps:

  • Single queue = SPOF — if the queue node dies, all notifications are lost (unless you use durable queuing)
  • No per-channel routing — push, email, and SMS all compete on the same queue
  • No priority — a 2FA SMS sits behind 10M marketing emails
  • No backpressure — a marketing blast overwhelms the single worker

4. Level 3 — Channel-Specific Workers + Fanout via Kafka

The production-grade architecture introduces Kafka for durable, partitioned, high-throughput messaging with independent consumer groups per channel.

The topology (shown as an interactive diagram in the online version):

🖥 API / Backend (event producer)
  → ⚡ Kafka (topic: notifications)
      → 📱 Push Worker (APNs / FCM), with its own push DLQ
      → 📧 Email Worker (SendGrid / SES), with its own email DLQ
      → 📟 SMS Worker (Twilio), with its own SMS DLQ

Each channel worker is an independent Kafka consumer group, and each has a dedicated dead letter queue (DLQ) for messages that exhaust their retries.

Why Kafka over RabbitMQ here? Kafka’s consumer groups give each channel worker independent read cursors — email falling behind doesn’t slow down push. Messages are replayed on crash without redelivery to other consumers. For 10M/day that’s only ~116 msg/sec, well within a single Kafka broker’s capacity, but Kafka’s durability story (replicated, durable log) makes it worth the operational overhead.
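The independent-cursor property is easy to demonstrate with a toy in-memory model (not real Kafka; just a shared log with one read offset per consumer group):

```python
# Toy model of Kafka consumer groups: one shared append-only log,
# one independent read offset per group. Illustrative only; a real
# deployment would use a Kafka client library.
class Log:
    def __init__(self):
        self.records = []
        self.offsets = {}            # group_id -> next index to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group_id, max_records=1):
        start = self.offsets.get(group_id, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group_id] = start + len(batch)
        return batch

log = Log()
for i in range(5):
    log.publish({"notif_id": i})

# Push workers drain quickly; email workers lag behind.
push_seen = log.poll("push-workers", max_records=5)
email_seen = log.poll("email-workers", max_records=2)

print(len(push_seen))   # 5 -- push is fully caught up
print(len(email_seen))  # 2 -- email lags, but push was unaffected
```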


5. Level 4 — User Preferences & Opt-Out

Before sending any notification, check whether the user actually wants it.

Notification event received
  → Fetch user preferences (Redis cache, then MySQL fallback)
  → Is this a critical notification (2FA, payment, security alert)?
      • YES: send on ALL opted-in channels (ignore marketing opt-outs)
      • NO (marketing / social): check per-channel preference
          • Push: send only if the user opted in
          • Email: send only if the user opted in
          • SMS: never used for marketing

The preference service has two layers: a Redis cache for hot-path lookups (TTL = 5 minutes) and a MySQL database as the source of truth. When a user updates preferences in the app, both are updated: MySQL first (durable), then Redis cache is invalidated (so next request re-fetches).

python
def should_send(user_id, notif_type, channel):
    pref_key = "pref:" + str(user_id)
    prefs = redis.get(pref_key)

    if not prefs:
        prefs = db.get_preferences(user_id)
        redis.setex(pref_key, 300, prefs)   # TTL 5 min

    # Critical notifications bypass marketing opt-outs, but still
    # require the channel to be active (token / address on file)
    if notif_type in CRITICAL_TYPES:
        return prefs.get("channel_" + channel + "_active", True)

    # Marketing: honour the global opt-out first...
    if not prefs.get("marketing_enabled", True):
        return False

    # ...then the per-channel opt-in
    return prefs.get("channel_" + channel + "_enabled", True)

6. Level 5 — Delivery Guarantees

The dead letter queue is your forensics tool: every failed notification lands there with the full error. Without it, you’re flying blind on delivery failures.

At-least-once vs exactly-once: Kafka provides at-least-once delivery by default. For notifications, a duplicate push (“Someone liked your photo!” × 2) is annoying but not catastrophic. Exactly-once requires distributed transactions across Kafka + the provider API — complex and slow.

The practical solution: idempotency keys. Each notification event carries a notif_id (UUID generated by the producer). Workers pass this key to the provider:

python
# Worker deduplication with idempotency key
def send_push_with_idempotency(event):
    dedup_key = "sent:" + event["notif_id"] + ":" + "push"

    # Check if already sent (Redis SET NX with TTL)
    acquired = redis.set(dedup_key, 1, nx=True, ex=86400)
    if not acquired:
        logger.info("Duplicate, skipping: " + event["notif_id"])
        return

    apns.send(
        device_token=event["device_token"],
        payload=event["payload"],
        apns_id=event["notif_id"]  # APNs native dedup key
    )

The Redis SET NX (set if not exists) is atomic — even if two worker replicas pick up the same Kafka message during a rebalance, only one will acquire the lock and send. The key expires after 24 hours to avoid unbounded memory growth.
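The race can be sanity-checked with an in-memory stand-in for Redis (the FakeRedis class below is a hypothetical mock, not the real client):

```python
# In-memory stand-in for Redis SET NX, to show that of two worker
# replicas racing on the same dedup key, exactly one "wins" the send.
class FakeRedis:
    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None          # redis-py returns None when NX fails
        self.store[key] = value
        return True

r = FakeRedis()
sends = []

def worker(replica, notif_id):
    # Same key construction as send_push_with_idempotency above
    if r.set("sent:" + notif_id + ":push", 1, nx=True, ex=86400):
        sends.append(replica)    # only the lock winner sends

worker("worker-a", "abc-123")    # first replica acquires the key
worker("worker-b", "abc-123")    # duplicate delivery after a rebalance

print(sends)   # ['worker-a'] -- exactly one send
```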


7. Level 6 — Retry Logic & Failure Handling

Different errors require different responses:

  • BadDeviceToken (APNs): remove the token from the DB; do not retry
  • Unregistered (FCM): remove the registration ID; do not retry
  • ServiceUnavailable (APNs / FCM): exponential backoff; retry up to 4×
  • Hard bounce (email): add to suppression list; do not retry
  • Soft bounce (email): retry once after 1 hour
  • Rate limit 429 (any channel): respect the Retry-After header; retry
  • Invalid number (SMS): mark invalid in the user DB; do not retry
  • Carrier filtering (SMS): escalate to ops; review message content
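The matrix above maps naturally onto a dispatch table in the workers; a sketch with illustrative error names (real provider SDKs surface these errors differently):

```python
# Map (error, channel) -> handling policy. Names are illustrative.
RETRYABLE = "retry_with_backoff"
DROP = "drop_and_clean_up"
ESCALATE = "escalate_to_ops"

POLICY = {
    ("BadDeviceToken", "push"): DROP,        # remove token, no retry
    ("Unregistered", "push"): DROP,          # remove registration ID
    ("ServiceUnavailable", "push"): RETRYABLE,
    ("HardBounce", "email"): DROP,           # add to suppression list
    ("SoftBounce", "email"): RETRYABLE,      # retry once after 1 hour
    ("RateLimited", "any"): RETRYABLE,       # respect Retry-After
    ("InvalidNumber", "sms"): DROP,
    ("CarrierFiltered", "sms"): ESCALATE,
}

def action_for(error, channel):
    # Fall back to the channel-agnostic entry, then default to retry.
    return POLICY.get((error, channel), POLICY.get((error, "any"), RETRYABLE))

print(action_for("BadDeviceToken", "push"))   # drop_and_clean_up
print(action_for("RateLimited", "email"))     # retry_with_backoff
```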

Exponential backoff schedule (jitter added to prevent thundering herd):

python
import random, time

BACKOFF_BASE = 1.0    # seconds
BACKOFF_MAX  = 8.0    # seconds
MAX_RETRIES  = 4

def send_with_retry(send_fn, event):
    for attempt in range(MAX_RETRIES):
        try:
            return send_fn(event)
        except PermanentError:
            raise   # never retry permanent errors
        except TransientError:
            if attempt == MAX_RETRIES - 1:
                break   # final attempt failed; skip the pointless sleep
            delay = min(BACKOFF_BASE * (2 ** attempt), BACKOFF_MAX)
            jitter = delay * 0.2 * random.random()
            time.sleep(delay + jitter)

    # Exhausted retries — emit to DLQ
    dlq.publish(event)

8. Level 7 — Notification Templates & Personalization

Hard-coding notification text in worker code doesn’t scale. The Template Service stores versioned templates in a DB and renders them at send time:

json (template stored in DB)
{
  "template_id": "order_shipped_push",
  "channel": "push",
  "locale": "en",
  "title": "Your order is on its way!",
  "body": "Hi [[ name ]], order [[ orderId ]] has shipped!",
  "ab_variant": "A"
}
Note: template syntax uses [[ name ]] in these examples to avoid conflicts with this blog’s Liquid templating. In production, use your preferred engine: Mustache ({{name}}), Jinja2 ({{ name }}), or Handlebars.
python
def render_template(template_id, locale, variables):
    tmpl = template_db.get(template_id, locale=locale)
    body = tmpl["body"]

    # Simple variable substitution (no Liquid conflicts)
    for key, value in variables.items():
        body = body.replace("[[ " + key + " ]]", str(value))

    return body

# Usage
msg = render_template(
    "order_shipped_push",
    locale=user.locale,
    variables={"name": "Alice", "orderId": "ORD-4821"}
)
# → "Hi Alice, order ORD-4821 has shipped!"

A/B testing: Templates carry an ab_variant field. The orchestrator randomly assigns users to variant A or B (using user_id % 2) and selects the matching template. Analytics later reveal which copy drives higher open or click rates.
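A minimal sketch of that assignment, assuming templates are fetched as a list of variant dicts (the pick_template helper is hypothetical):

```python
# Deterministic A/B assignment: the same user always lands in the same
# variant, so open-rate comparisons stay consistent across sends.
def assign_variant(user_id):
    return "A" if user_id % 2 == 0 else "B"

def pick_template(templates, user_id):
    # templates: variants sharing a template_id, differing in ab_variant
    variant = assign_variant(user_id)
    return next(t for t in templates if t["ab_variant"] == variant)

templates = [
    {"template_id": "order_shipped_push", "ab_variant": "A",
     "title": "Your order is on its way!"},
    {"template_id": "order_shipped_push", "ab_variant": "B",
     "title": "Good news: your order shipped!"},
]

print(pick_template(templates, user_id=42)["ab_variant"])  # A (42 is even)
print(pick_template(templates, user_id=7)["ab_variant"])   # B
```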

Localization: Templates are stored per (template_id, locale) pair. If the user’s locale doesn’t have a matching template, fall back to en. This enables a single notification event to be rendered in 50+ languages by the worker at send time.
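A minimal sketch of the fallback, with a hypothetical in-memory template_store standing in for the template DB:

```python
# Locale fallback: try the user's locale first, then fall back to "en".
# template_store is a stand-in for the (template_id, locale) keyed DB.
template_store = {
    ("order_shipped_push", "en"): "Your order is on its way!",
    ("order_shipped_push", "fr"): "Votre commande est en route !",
}

def get_template(template_id, locale):
    tmpl = template_store.get((template_id, locale))
    if tmpl is None:
        tmpl = template_store[(template_id, "en")]   # default locale
    return tmpl

print(get_template("order_shipped_push", "fr"))  # French template
print(get_template("order_shipped_push", "de"))  # no German -> English fallback
```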


9. Level 8 — Analytics & Tracking

Knowing a notification was sent is not the same as knowing it was seen. Tracking the full funnel:

  • Sent (all channels): logged by the worker on a successful provider call
  • Delivered (all channels): provider delivery webhook → event stream
  • Opened: push via an app SDK callback; email via a 1×1 tracking pixel
  • Clicked: push via a deep link; email via a redirect through a tracking domain; SMS via an app link
  • Bounced: push as a token error, SMS as a carrier error; provider webhook → suppression list update

All events flow to an event stream (another Kafka topic: notif_events) and are consumed by an analytics pipeline that writes to ClickHouse for fast OLAP queries:

sql
-- Delivery funnel for campaign 1042
SELECT
    channel,
    countIf(event_type = 'sent')      AS sent,
    countIf(event_type = 'delivered')  AS delivered,
    countIf(event_type = 'opened')     AS opened,
    round(countIf(event_type='opened') * 100.0
          / nullIf(countIf(event_type='delivered'), 0), 2) AS open_rate_pct
FROM notif_events
WHERE campaign_id = 1042
  AND ts >= now() - INTERVAL 7 DAY
GROUP BY channel;

10. Interactive: Notification Flow Simulator

(Interactive widget in the online version: the simulator steps a notification from the API server, through Kafka, to the push, email, and SMS workers, showing each worker’s state.)

11. Cost Comparison & Calculator

SMS is 75× more expensive than email but has a 98% open rate vs ~20% for email. Choose your channel based on message urgency and audience, not cost alone.

  • 📱 Push (FCM/APNs): free · < 1 s latency · ~95% reliability (device must be online) · ~15% open rate
  • 📧 Email (SendGrid): $0.0001/msg · 1–60 s · ~99% reliability · ~20% open rate
  • 📟 SMS (Twilio, US): $0.0075/msg · < 5 s · ~99.9% reliability · ~98% open rate
(The online version includes an interactive cost calculator based on the per-message rates above.)
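The cost calculation itself reduces to a few lines, using the per-message rates from the table above:

```python
# Per-channel cost of a campaign of n messages, at the table's US rates.
RATES = {"push": 0.0, "email": 0.0001, "sms": 0.0075}  # USD per message

def campaign_cost(n, channel):
    return round(n * RATES[channel], 2)

for channel in RATES:
    print(channel, campaign_cost(10_000_000, channel))
# push 0.0       (FCM/APNs are free)
# email 1000.0   ($1k to email everyone)
# sms 75000.0    ($75k: why SMS is reserved for critical messages)
```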

12. Priority Queues

Not all notifications are equal. A 2FA code that doesn’t arrive means a locked-out user; a marketing email that arrives 10 minutes late is fine.

Priority routing in Kafka:

  • critical (2FA / payment / security alerts) → partition 0 (priority) → dedicated worker pool (3×) → provider, with no queue wait
  • marketing (newsletters / promotions) → partitions 1–15 (standard) → shared worker pool → provider, which may batch

The separation is enforced at the producer level: notification events include a priority field (critical or standard), and the producer routes them to different Kafka topics or partitions:
python
def publish_notification(event):
    topic    = "notif-critical" if event["priority"] == "critical" else "notif-standard"
    producer.send(topic, value=event, key=event["recipient_id"].encode())

# Dedicated workers only subscribe to notif-critical
critical_consumer = KafkaConsumer(
    "notif-critical",
    group_id="push-workers-critical",
    # no max_poll_records limit — drain as fast as possible
)

# Shared workers subscribe to notif-standard
standard_consumer = KafkaConsumer(
    "notif-standard",
    group_id="push-workers-standard",
    max_poll_records=100,  # batch for throughput
)

SLA targets by priority:

  • Critical (2FA, payment): P99 latency < 1 second; alert if queue depth > 10 messages
  • Transactional (order shipped): P99 latency < 30 seconds; alert if queue depth > 1,000 messages
  • Marketing (newsletter): P99 latency < 15 minutes; alert if queue depth > 1M messages

13. Putting It All Together

The complete architecture:
API → Kafka (2 topics: critical + standard) → 3 channel consumer groups (push, email, sms) per topic → preference check (Redis + MySQL) → idempotency check (Redis SET NX) → template render → provider call → analytics event to ClickHouse → DLQ for failures.

The system handles 10M/day easily and can scale to billions with horizontal Kafka partitioning and worker auto-scaling on consumer group lag. Key interview talking points:

  1. Decouple producers from consumers with Kafka — the API never waits on providers
  2. Per-channel workers with independent consumer groups — email slowness never delays push
  3. User preferences checked before every send — respect opt-outs, GDPR-compliant
  4. Idempotency keys prevent duplicate sends during Kafka rebalances
  5. Priority routing gives 2FA codes a dedicated fast lane
  6. DLQ for every channel — never lose a failure silently
  7. Analytics pipeline closes the loop — you know delivery rate, open rate, bounce rate

At Facebook scale, notifications are a product in themselves: the team that owns the notification infrastructure is separate from the teams that trigger notifications.

A common follow-up question: what if a user has 10 devices? The device registry stores all active tokens per user. The push worker fans out to all tokens in parallel. If any token returns BadDeviceToken, it’s removed; if all fail, it falls back to email. This is why the idempotency key must include the device token: notif_id + ":" + device_token, not just notif_id.
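A sketch of that per-device fanout, with hypothetical helpers standing in for Redis and the provider call:

```python
# Fan out one notification to every registered device token. The dedup
# key includes the token, so retrying one failed device never suppresses
# sends to the others.
def fanout_push(notif_id, tokens, send_fn, redis_set_nx):
    sent, dead = [], []
    for token in tokens:
        key = "sent:" + notif_id + ":" + token    # per-device idempotency key
        if not redis_set_nx(key):
            continue                              # this device already handled
        if send_fn(token):
            sent.append(token)
        else:
            dead.append(token)                    # e.g. BadDeviceToken: remove
    return sent, dead

# Usage with in-memory stand-ins for Redis SET NX and the APNs call:
seen = set()
def set_nx(key):
    if key in seen:
        return False
    seen.add(key)
    return True

sent, dead = fanout_push("n-1", ["tok-a", "tok-b"],
                         lambda t: t != "tok-b", set_nx)
print(sent, dead)   # ['tok-a'] ['tok-b']
```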

Another common extension: digest notifications. Instead of sending 50 “X liked your photo” pushes, batch them into “Alice and 49 others liked your photo”. The fanout service groups events by recipient within a 30-second window before publishing to Kafka. This reduces provider API calls and prevents notification fatigue — one of the most impactful product improvements you can make to a notification system.
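The grouping step can be sketched as follows, assuming the 30-second window's events have already been collected (build_digests is a hypothetical helper):

```python
# Collapse many "X liked your photo" events per recipient into one
# digest message. A real fanout service would buffer events in a
# 30-second window per recipient before calling this.
from collections import defaultdict

def build_digests(events):
    by_recipient = defaultdict(list)
    for e in events:
        by_recipient[e["recipient_id"]].append(e["actor_name"])

    digests = {}
    for recipient, actors in by_recipient.items():
        if len(actors) == 1:
            digests[recipient] = actors[0] + " liked your photo"
        else:
            digests[recipient] = "%s and %d others liked your photo" % (
                actors[0], len(actors) - 1)
    return digests

events = [{"recipient_id": 7, "actor_name": "Alice"}] + [
    {"recipient_id": 7, "actor_name": "User%d" % i} for i in range(49)]
print(build_digests(events)[7])   # Alice and 49 others liked your photo
```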