System Design: Notification System — Push, Email, SMS at Billions of Scale

Series System Design Interview Prep — #8 of 15

APNs (Apple Push Notification service) has a 4 KB payload limit per notification: if your payload is larger, store the data server-side and send a “fetch” notification instead.

Notifications are the silent backbone of modern apps. A like on Instagram, a two-factor authentication code, a “your package has shipped” text — none of these feel special until the moment they fail to arrive. Designing a system that reliably delivers billions of notifications across push, email, and SMS channels is a deceptively hard problem that touches queuing theory, distributed systems, third-party API quirks, and user experience all at once.

The question: Design a notification system like Facebook/Instagram’s that sends push notifications, emails, and SMS. Handle 10 million notifications/day across multiple channels, with delivery guarantees and user preferences.


1. The Problem

  • 10M notifications/day (~116/sec average)
  • 2,778/sec peak (a 10M marketing blast sent over 1 hour)
  • 3 channels: Push · Email · SMS
  • 99.9% target delivery (≤ 9 hours downtime/yr)

Notifications are everywhere: a like on Instagram, a payment receipt, a delivery update. Three delivery channels exist, each with radically different characteristics:

  • Push (iOS APNs / Android FCM) — free, sub-second latency, but only reaches online devices and has a 4 KB payload limit
  • Email (SendGrid/SES) — cheap at $0.0001/msg, high reliability (~99%), but latency ranges from 1 to 60 seconds
  • SMS (Twilio) — expensive at $0.0075/msg, near-universal reach, 98% open rate, ~99.9% delivery

The core challenge: at 10M/day average, a single marketing blast can spike to 2,778 messages/second — a 24× surge. The system must absorb that spike without losing messages, slowing down the user-facing API, or hammering third-party provider rate limits.
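The headline numbers are worth being able to re-derive on a whiteboard; a quick back-of-envelope sketch:

```python
# Back-of-envelope traffic math for the stated load.
DAILY_VOLUME = 10_000_000

avg_per_sec = DAILY_VOLUME / 86_400         # spread evenly over 24 hours
blast_per_sec = DAILY_VOLUME / 3_600        # entire volume sent in one hour
surge_factor = blast_per_sec / avg_per_sec  # headroom the queue must absorb

print(round(avg_per_sec))    # 116
print(round(blast_per_sec))  # 2778
print(round(surge_factor))   # 24
```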


2. Level 1 — Synchronous, In-Request Notification

The simplest possible design: when a user likes a photo, the API handler calls APNs directly before returning a response.

python
def like_photo(user_id, photo_id):
    photo = db.get_photo(photo_id)
    db.create_like(user_id, photo_id)

    # BAD: synchronous notification inline
    device_token = db.get_device_token(photo.owner_id)
    apns.send(device_token, "Someone liked your photo!")
    # if APNs is slow or down, the entire API call hangs

    return "ok"

Problems:

  • If APNs responds in 3s, every like API call takes 3s
  • If APNs is down, likes fail completely
  • No retry — if the call fails, the notification is lost forever
  • No rate limiting — a viral post could spawn thousands of concurrent APNs connections

3. Level 2 — Async with a Simple Queue

Decouple the notification from the API response. The handler publishes an event to a queue; a background worker consumes and sends it.

python
# API handler — returns instantly
def like_photo(user_id, photo_id):
    photo = db.get_photo(photo_id)   # needed below for the owner's ID
    db.create_like(user_id, photo_id)
    queue.push({
        "type": "like",
        "recipient_id": photo.owner_id,
        "actor_id": user_id,
    })
    return "ok"  # fast!

# Background worker
def worker_loop():
    while True:
        event = queue.pop()
        device_token = db.get_device_token(event["recipient_id"])
        apns.send(device_token, build_message(event))

This is significantly better: the API is fast and the notification is decoupled. But there are still gaps:

  • Single queue = SPOF — if the queue node dies, all notifications are lost (unless you use durable queuing)
  • No per-channel routing — push, email, and SMS all compete on the same queue
  • No priority — a 2FA SMS sits behind 10M marketing emails
  • No backpressure — a marketing blast overwhelms the single worker

4. Level 3 — Channel-Specific Workers + Fanout via Kafka

The production-grade architecture introduces Kafka for durable, partitioned, high-throughput messaging with independent consumer groups per channel.

The topology (shown as an interactive diagram in the online version):

🖥 API / Backend (event producer)
  → ⚡ Kafka (topic: notifications)
      → 📱 Push Worker (APNs / FCM), with its own push DLQ
      → 📧 Email Worker (SendGrid / SES), with its own email DLQ
      → 📟 SMS Worker (Twilio), with its own SMS DLQ

Each channel worker is an independent Kafka consumer group, and each has a dedicated dead letter queue (DLQ) for messages that exhaust their retries.

Why Kafka over RabbitMQ here? Kafka’s consumer groups give each channel worker independent read cursors — email falling behind doesn’t slow down push. Messages are replayed on crash without redelivery to other consumers. For 10M/day that’s only ~116 msg/sec, well within a single Kafka broker’s capacity, but Kafka’s durability story (replicated, durable log) makes it worth the operational overhead.
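The independent-cursor property is easy to demonstrate with a toy in-memory model (not real Kafka; just a shared log with one read offset per consumer group):

```python
# Toy model of Kafka consumer groups: one shared append-only log,
# one independent read offset per group. Illustrative only; a real
# deployment would use a Kafka client library.
class Log:
    def __init__(self):
        self.records = []
        self.offsets = {}            # group_id -> next index to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, group_id, max_records=1):
        start = self.offsets.get(group_id, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group_id] = start + len(batch)
        return batch

log = Log()
for i in range(5):
    log.publish({"notif_id": i})

# Push workers drain quickly; email workers lag behind.
push_seen = log.poll("push-workers", max_records=5)
email_seen = log.poll("email-workers", max_records=2)

print(len(push_seen))   # 5 -- push is fully caught up
print(len(email_seen))  # 2 -- email lags, but push was unaffected
```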


5. Level 4 — User Preferences & Opt-Out

Before sending any notification, check whether the user actually wants it.

Notification event received
  → Fetch user preferences (Redis cache, then MySQL fallback)
  → Is this a critical notification (2FA, payment, security alert)?
      • YES: send on ALL opted-in channels (ignore marketing opt-outs)
      • NO (marketing / social): check per-channel preference
          • Push: send only if the user opted in
          • Email: send only if the user opted in
          • SMS: never used for marketing

The preference service has two layers: a Redis cache for hot-path lookups (TTL = 5 minutes) and a MySQL database as the source of truth. When a user updates preferences in the app, both are updated: MySQL first (durable), then Redis cache is invalidated (so next request re-fetches).

python
def should_send(user_id, notif_type, channel):
    pref_key = "pref:" + str(user_id)
    prefs = redis.get(pref_key)

    if not prefs:
        prefs = db.get_preferences(user_id)
        redis.setex(pref_key, 300, prefs)   # TTL 5 min

    # Critical notifications bypass marketing opt-outs, but still
    # require the channel to be active (token / address on file)
    if notif_type in CRITICAL_TYPES:
        return prefs.get("channel_" + channel + "_active", True)

    # Marketing: honour the global opt-out first...
    if not prefs.get("marketing_enabled", True):
        return False

    # ...then the per-channel opt-in
    return prefs.get("channel_" + channel + "_enabled", True)

6. Level 5 — Delivery Guarantees

The dead letter queue is your forensics tool: every failed notification lands there with the full error. Without it, you’re flying blind on delivery failures.

At-least-once vs exactly-once: Kafka provides at-least-once delivery by default. For notifications, a duplicate push (“Someone liked your photo!” × 2) is annoying but not catastrophic. Exactly-once requires distributed transactions across Kafka + the provider API — complex and slow.

The practical solution: idempotency keys. Each notification event carries a notif_id (UUID generated by the producer). Workers pass this key to the provider:

python
# Worker deduplication with idempotency key
def send_push_with_idempotency(event):
    dedup_key = "sent:" + event["notif_id"] + ":" + "push"

    # Check if already sent (Redis SET NX with TTL)
    acquired = redis.set(dedup_key, 1, nx=True, ex=86400)
    if not acquired:
        logger.info("Duplicate, skipping: " + event["notif_id"])
        return

    apns.send(
        device_token=event["device_token"],
        payload=event["payload"],
        apns_id=event["notif_id"]  # APNs native dedup key
    )

The Redis SET NX (set if not exists) is atomic — even if two worker replicas pick up the same Kafka message during a rebalance, only one will acquire the lock and send. The key expires after 24 hours to avoid unbounded memory growth.
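The race can be sanity-checked with an in-memory stand-in for Redis (the FakeRedis class below is a hypothetical mock, not the real client):

```python
# In-memory stand-in for Redis SET NX, to show that of two worker
# replicas racing on the same dedup key, exactly one "wins" the send.
class FakeRedis:
    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None          # redis-py returns None when NX fails
        self.store[key] = value
        return True

r = FakeRedis()
sends = []

def worker(replica, notif_id):
    # Same key construction as send_push_with_idempotency above
    if r.set("sent:" + notif_id + ":push", 1, nx=True, ex=86400):
        sends.append(replica)    # only the lock winner sends

worker("worker-a", "abc-123")    # first replica acquires the key
worker("worker-b", "abc-123")    # duplicate delivery after a rebalance

print(sends)   # ['worker-a'] -- exactly one send
```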


7. Level 6 — Retry Logic & Failure Handling

Different errors require different responses:

  • BadDeviceToken (APNs): remove the token from the DB; do not retry
  • Unregistered (FCM): remove the registration ID; do not retry
  • ServiceUnavailable (APNs / FCM): exponential backoff; retry up to 4×
  • Hard bounce (email): add to suppression list; do not retry
  • Soft bounce (email): retry once after 1 hour
  • Rate limit 429 (any channel): respect the Retry-After header; retry
  • Invalid number (SMS): mark invalid in the user DB; do not retry
  • Carrier filtering (SMS): escalate to ops; review message content
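The matrix above maps naturally onto a dispatch table in the workers; a sketch with illustrative error names (real provider SDKs surface these errors differently):

```python
# Map (error, channel) -> handling policy. Names are illustrative.
RETRYABLE = "retry_with_backoff"
DROP = "drop_and_clean_up"
ESCALATE = "escalate_to_ops"

POLICY = {
    ("BadDeviceToken", "push"): DROP,        # remove token, no retry
    ("Unregistered", "push"): DROP,          # remove registration ID
    ("ServiceUnavailable", "push"): RETRYABLE,
    ("HardBounce", "email"): DROP,           # add to suppression list
    ("SoftBounce", "email"): RETRYABLE,      # retry once after 1 hour
    ("RateLimited", "any"): RETRYABLE,       # respect Retry-After
    ("InvalidNumber", "sms"): DROP,
    ("CarrierFiltered", "sms"): ESCALATE,
}

def action_for(error, channel):
    # Fall back to the channel-agnostic entry, then default to retry.
    return POLICY.get((error, channel), POLICY.get((error, "any"), RETRYABLE))

print(action_for("BadDeviceToken", "push"))   # drop_and_clean_up
print(action_for("RateLimited", "email"))     # retry_with_backoff
```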

Exponential backoff schedule (jitter added to prevent thundering herd):

python
import random, time

BACKOFF_BASE = 1.0    # seconds
BACKOFF_MAX  = 8.0    # seconds
MAX_RETRIES  = 4

def send_with_retry(send_fn, event):
    for attempt in range(MAX_RETRIES):
        try:
            return send_fn(event)
        except PermanentError:
            raise   # never retry permanent errors
        except TransientError:
            if attempt == MAX_RETRIES - 1:
                break   # final attempt failed; skip the pointless sleep
            delay = min(BACKOFF_BASE * (2 ** attempt), BACKOFF_MAX)
            jitter = delay * 0.2 * random.random()
            time.sleep(delay + jitter)

    # Exhausted retries — emit to DLQ
    dlq.publish(event)

8. Level 7 — Notification Templates & Personalization

Hard-coding notification text in worker code doesn’t scale. The Template Service stores versioned templates in a DB and renders them at send time:

json (template stored in DB)
{
  "template_id": "order_shipped_push",
  "channel": "push",
  "locale": "en",
  "title": "Your order is on its way!",
  "body": "Hi [[ name ]], order [[ orderId ]] has shipped!",
  "ab_variant": "A"
}
Note: template syntax uses [[ name ]] in these examples to avoid conflicts with this blog’s Liquid templating. In production, use your preferred engine: Mustache ({{name}}), Jinja2 ({{ name }}), or Handlebars.
python
def render_template(template_id, locale, variables):
    tmpl = template_db.get(template_id, locale=locale)
    body = tmpl["body"]

    # Simple variable substitution (no Liquid conflicts)
    for key, value in variables.items():
        body = body.replace("[[ " + key + " ]]", str(value))

    return body

# Usage
msg = render_template(
    "order_shipped_push",
    locale=user.locale,
    variables={"name": "Alice", "orderId": "ORD-4821"}
)
# → "Hi Alice, order ORD-4821 has shipped!"

A/B testing: Templates carry an ab_variant field. The orchestrator randomly assigns users to variant A or B (using user_id % 2) and selects the matching template. Analytics later reveal which copy drives higher open or click rates.
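A minimal sketch of that assignment, assuming templates are fetched as a list of variant dicts (the pick_template helper is hypothetical):

```python
# Deterministic A/B assignment: the same user always lands in the same
# variant, so open-rate comparisons stay consistent across sends.
def assign_variant(user_id):
    return "A" if user_id % 2 == 0 else "B"

def pick_template(templates, user_id):
    # templates: variants sharing a template_id, differing in ab_variant
    variant = assign_variant(user_id)
    return next(t for t in templates if t["ab_variant"] == variant)

templates = [
    {"template_id": "order_shipped_push", "ab_variant": "A",
     "title": "Your order is on its way!"},
    {"template_id": "order_shipped_push", "ab_variant": "B",
     "title": "Good news: your order shipped!"},
]

print(pick_template(templates, user_id=42)["ab_variant"])  # A (42 is even)
print(pick_template(templates, user_id=7)["ab_variant"])   # B
```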

Localization: Templates are stored per (template_id, locale) pair. If the user’s locale doesn’t have a matching template, fall back to en. This enables a single notification event to be rendered in 50+ languages by the worker at send time.
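A minimal sketch of the fallback, with a hypothetical in-memory template_store standing in for the template DB:

```python
# Locale fallback: try the user's locale first, then fall back to "en".
# template_store is a stand-in for the (template_id, locale) keyed DB.
template_store = {
    ("order_shipped_push", "en"): "Your order is on its way!",
    ("order_shipped_push", "fr"): "Votre commande est en route !",
}

def get_template(template_id, locale):
    tmpl = template_store.get((template_id, locale))
    if tmpl is None:
        tmpl = template_store[(template_id, "en")]   # default locale
    return tmpl

print(get_template("order_shipped_push", "fr"))  # French template
print(get_template("order_shipped_push", "de"))  # no German -> English fallback
```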


9. Level 8 — Analytics & Tracking

Knowing a notification was sent is not the same as knowing it was seen. Tracking the full funnel:

  • Sent (all channels): logged by the worker on a successful provider call
  • Delivered (all channels): provider delivery webhook → event stream
  • Opened: push via an app SDK callback; email via a 1×1 tracking pixel
  • Clicked: push via a deep link; email via a redirect through a tracking domain; SMS via an app link
  • Bounced: push as a token error, SMS as a carrier error; provider webhook → suppression list update

All events flow to an event stream (another Kafka topic: notif_events) and are consumed by an analytics pipeline that writes to ClickHouse for fast OLAP queries:

sql
-- Delivery funnel for campaign 1042
SELECT
    channel,
    countIf(event_type = 'sent')      AS sent,
    countIf(event_type = 'delivered')  AS delivered,
    countIf(event_type = 'opened')     AS opened,
    round(countIf(event_type='opened') * 100.0
          / nullIf(countIf(event_type='delivered'), 0), 2) AS open_rate_pct
FROM notif_events
WHERE campaign_id = 1042
  AND ts >= now() - INTERVAL 7 DAY
GROUP BY channel;

10. Interactive: Notification Flow Simulator

(Interactive widget in the online version: the simulator steps a notification from the API server, through Kafka, to the push, email, and SMS workers, showing each worker’s state.)

11. Cost Comparison & Calculator

SMS is 75× more expensive than email but has a 98% open rate vs ~20% for email. Choose your channel based on message urgency and audience, not cost alone.

  • 📱 Push (FCM/APNs): free · < 1 s latency · ~95% reliability (device must be online) · ~15% open rate
  • 📧 Email (SendGrid): $0.0001/msg · 1–60 s · ~99% reliability · ~20% open rate
  • 📟 SMS (Twilio, US): $0.0075/msg · < 5 s · ~99.9% reliability · ~98% open rate
(The online version includes an interactive cost calculator based on the per-message rates above.)
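The cost calculation itself reduces to a few lines, using the per-message rates from the table above:

```python
# Per-channel cost of a campaign of n messages, at the table's US rates.
RATES = {"push": 0.0, "email": 0.0001, "sms": 0.0075}  # USD per message

def campaign_cost(n, channel):
    return round(n * RATES[channel], 2)

for channel in RATES:
    print(channel, campaign_cost(10_000_000, channel))
# push 0.0       (FCM/APNs are free)
# email 1000.0   ($1k to email everyone)
# sms 75000.0    ($75k: why SMS is reserved for critical messages)
```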

12. Priority Queues

Not all notifications are equal. A 2FA code that doesn’t arrive means a locked-out user; a marketing email that arrives 10 minutes late is fine.

Priority routing in Kafka:

  • critical (2FA / payment / security alerts) → partition 0 (priority) → dedicated worker pool (3×) → provider, with no queue wait
  • marketing (newsletters / promotions) → partitions 1–15 (standard) → shared worker pool → provider, which may batch

The separation is enforced at the producer level: notification events include a priority field (critical or standard), and the producer routes them to different Kafka topics or partitions:
python
def publish_notification(event):
    topic    = "notif-critical" if event["priority"] == "critical" else "notif-standard"
    producer.send(topic, value=event, key=event["recipient_id"].encode())

# Dedicated workers only subscribe to notif-critical
critical_consumer = KafkaConsumer(
    "notif-critical",
    group_id="push-workers-critical",
    # no max_poll_records limit — drain as fast as possible
)

# Shared workers subscribe to notif-standard
standard_consumer = KafkaConsumer(
    "notif-standard",
    group_id="push-workers-standard",
    max_poll_records=100,  # batch for throughput
)

SLA targets by priority:

  • Critical (2FA, payment): P99 latency < 1 second; alert if queue depth > 10 messages
  • Transactional (order shipped): P99 latency < 30 seconds; alert if queue depth > 1,000 messages
  • Marketing (newsletter): P99 latency < 15 minutes; alert if queue depth > 1M messages

13. Putting It All Together

The complete architecture:
API → Kafka (2 topics: critical + standard) → 3 channel consumer groups (push, email, sms) per topic → preference check (Redis + MySQL) → idempotency check (Redis SET NX) → template render → provider call → analytics event to ClickHouse → DLQ for failures.

The system handles 10M/day easily and can scale to billions with horizontal Kafka partitioning and worker auto-scaling on consumer group lag. Key interview talking points:

  1. Decouple producers from consumers with Kafka — the API never waits on providers
  2. Per-channel workers with independent consumer groups — email slowness never delays push
  3. User preferences checked before every send — respect opt-outs, GDPR-compliant
  4. Idempotency keys prevent duplicate sends during Kafka rebalances
  5. Priority routing gives 2FA codes a dedicated fast lane
  6. DLQ for every channel — never lose a failure silently
  7. Analytics pipeline closes the loop — you know delivery rate, open rate, bounce rate

At Facebook scale, notifications are a product in themselves: the team that owns the notification infrastructure is separate from the teams that trigger notifications.

A common follow-up question: what if a user has 10 devices? The device registry stores all active tokens per user. The push worker fans out to all tokens in parallel. If any token returns BadDeviceToken, it’s removed; if all fail, it falls back to email. This is why the idempotency key must include the device token: notif_id + ":" + device_token, not just notif_id.
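A sketch of that per-device fanout, with hypothetical helpers standing in for Redis and the provider call:

```python
# Fan out one notification to every registered device token. The dedup
# key includes the token, so retrying one failed device never suppresses
# sends to the others.
def fanout_push(notif_id, tokens, send_fn, redis_set_nx):
    sent, dead = [], []
    for token in tokens:
        key = "sent:" + notif_id + ":" + token    # per-device idempotency key
        if not redis_set_nx(key):
            continue                              # this device already handled
        if send_fn(token):
            sent.append(token)
        else:
            dead.append(token)                    # e.g. BadDeviceToken: remove
    return sent, dead

# Usage with in-memory stand-ins for Redis SET NX and the APNs call:
seen = set()
def set_nx(key):
    if key in seen:
        return False
    seen.add(key)
    return True

sent, dead = fanout_push("n-1", ["tok-a", "tok-b"],
                         lambda t: t != "tok-b", set_nx)
print(sent, dead)   # ['tok-a'] ['tok-b']
```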

Another common extension: digest notifications. Instead of sending 50 “X liked your photo” pushes, batch them into “Alice and 49 others liked your photo”. The fanout service groups events by recipient within a 30-second window before publishing to Kafka. This reduces provider API calls and prevents notification fatigue — one of the most impactful product improvements you can make to a notification system.
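The grouping step can be sketched as follows, assuming the 30-second window's events have already been collected (build_digests is a hypothetical helper):

```python
# Collapse many "X liked your photo" events per recipient into one
# digest message. A real fanout service would buffer events in a
# 30-second window per recipient before calling this.
from collections import defaultdict

def build_digests(events):
    by_recipient = defaultdict(list)
    for e in events:
        by_recipient[e["recipient_id"]].append(e["actor_name"])

    digests = {}
    for recipient, actors in by_recipient.items():
        if len(actors) == 1:
            digests[recipient] = actors[0] + " liked your photo"
        else:
            digests[recipient] = "%s and %d others liked your photo" % (
                actors[0], len(actors) - 1)
    return digests

events = [{"recipient_id": 7, "actor_name": "Alice"}] + [
    {"recipient_id": 7, "actor_name": "User%d" % i} for i in range(49)]
print(build_digests(events)[7])   # Alice and 49 others liked your photo
```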