System Design: Notification System — Push, Email, SMS at Billions of Scale
> **APNs (Apple Push Notification service) has a 4 KB payload limit per notification.** If your payload is larger, store the data server-side and send a "fetch" notification instead.
Notifications are the silent backbone of modern apps. A like on Instagram, a two-factor authentication code, a “your package has shipped” text — none of these feel special until the moment they fail to arrive. Designing a system that reliably delivers billions of notifications across push, email, and SMS channels is a deceptively hard problem that touches queuing theory, distributed systems, third-party API quirks, and user experience all at once.
The question: Design a notification system like Facebook/Instagram’s that sends push notifications, emails, and SMS. Handle 10 million notifications/day across multiple channels, with delivery guarantees and user preferences.
1. The Problem
Notifications are everywhere: a like on Instagram, a payment receipt, a delivery update. Three delivery channels exist, each with radically different characteristics:
- Push (iOS APNs / Android FCM) — free, sub-second latency, but only reaches online devices and has a 4 KB payload limit
- Email (SendGrid/SES) — cheap at $0.0001/msg, high reliability (~99%), but latency ranges from 1 to 60 seconds
- SMS (Twilio) — expensive at $0.0075/msg, near-universal reach, 98% open rate, ~99.9% delivery
The core challenge: at 10M/day average, a single marketing blast can spike to 2,778 messages/second — a 24× surge. The system must absorb that spike without losing messages, slowing down the user-facing API, or hammering third-party provider rate limits.
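The back-of-envelope numbers above can be checked directly. A minimal sketch of the capacity math (the one-hour blast window is an assumption for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_volume = 10_000_000            # 10M notifications/day
avg_rate = daily_volume / SECONDS_PER_DAY
# sustained average: ~116 msg/sec

# A 10M-message marketing blast pushed out over one hour:
blast_rate = 10_000_000 / 3600
# ~2,778 msg/sec

surge_factor = blast_rate / avg_rate
print(round(avg_rate), round(blast_rate), round(surge_factor))
# → 116 2778 24
```

The 24× surge factor is why the queue between producers and workers must absorb bursts, not just average load.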
2. Level 1 — Synchronous, In-Request Notification
The simplest possible design: when a user likes a photo, the API handler calls APNs directly before returning a response.
```python
def like_photo(user_id, photo_id):
    photo = db.get_photo(photo_id)
    db.create_like(user_id, photo_id)
    # BAD: synchronous notification inline
    device_token = db.get_device_token(photo.owner_id)
    apns.send(device_token, "Someone liked your photo!")
    # if APNs is slow or down, the entire API call hangs
    return "ok"
```
Problems:
- If APNs responds in 3s, every like API call takes 3s
- If APNs is down, likes fail completely
- No retry — if the call fails, the notification is lost forever
- No rate limiting — a viral post could spawn thousands of concurrent APNs connections
3. Level 2 — Async with a Simple Queue
Decouple the notification from the API response. The handler publishes an event to a queue; a background worker consumes and sends it.
```python
# API handler — returns instantly
def like_photo(user_id, photo_id):
    photo = db.get_photo(photo_id)
    db.create_like(user_id, photo_id)
    queue.push({
        "type": "like",
        "recipient_id": photo.owner_id,
        "actor_id": user_id,
    })
    return "ok"  # fast!

# Background worker
def worker_loop():
    while True:
        event = queue.pop()
        device_token = db.get_device_token(event["recipient_id"])
        apns.send(device_token, build_message(event))
```
This is significantly better: the API is fast and the notification is decoupled. But there are still gaps:
- Single queue = SPOF — if the queue node dies, all notifications are lost (unless you use durable queuing)
- No per-channel routing — push, email, and SMS all compete on the same queue
- No priority — a 2FA SMS sits behind 10M marketing emails
- No backpressure — a marketing blast overwhelms the single worker
4. Level 3 — Channel-Specific Workers + Fanout via Kafka
The production-grade architecture introduces Kafka for durable, partitioned, high-throughput messaging with independent consumer groups per channel.
(Architecture diagram: the event producer publishes to the Kafka topic `notifications`; push, email, and SMS workers each consume via their own consumer group.)
Why Kafka over RabbitMQ here? Kafka’s consumer groups give each channel worker independent read cursors — email falling behind doesn’t slow down push. Messages are replayed on crash without redelivery to other consumers. For 10M/day that’s only ~116 msg/sec, well within a single Kafka broker’s capacity, but Kafka’s durability story (replicated, durable log) makes it worth the operational overhead.
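The independent-cursor behavior can be illustrated with a toy in-memory model (this is not real Kafka, just a sketch of the durable-log-plus-per-group-offset idea; all names are illustrative):

```python
# Toy model of Kafka consumer groups: one durable log, one independent
# read cursor (offset) per consumer group.
class Log:
    def __init__(self):
        self.entries = []
        self.offsets = {}  # group_id -> next offset to read

    def append(self, event):
        self.entries.append(event)

    def poll(self, group_id, max_records=10):
        start = self.offsets.get(group_id, 0)
        batch = self.entries[start:start + max_records]
        self.offsets[group_id] = start + len(batch)  # commit new offset
        return batch

log = Log()
for i in range(5):
    log.append({"notif_id": i})

# Push workers drain quickly...
push_batch = log.poll("push-workers", max_records=5)
# ...while email workers lag behind with smaller batches.
email_batch = log.poll("email-workers", max_records=2)

print(len(push_batch), len(email_batch))  # → 5 2
```

Each group's offset advances independently: email being three messages behind has no effect on push.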
5. Level 4 — User Preferences & Opt-Out
Before sending any notification, check whether the user actually wants it.
(Diagram: preference lookups hit the Redis cache with MySQL as fallback; critical notifications (2FA, payment, security alerts) ignore marketing opt-outs.)
The preference service has two layers: a Redis cache for hot-path lookups (TTL = 5 minutes) and a MySQL database as the source of truth. When a user updates preferences in the app, both are updated: MySQL first (durable), then Redis cache is invalidated (so next request re-fetches).
```python
def should_send(user_id, notif_type, channel):
    pref_key = "pref:" + str(user_id)
    prefs = redis.get(pref_key)
    if not prefs:
        prefs = db.get_preferences(user_id)
        redis.setex(pref_key, 300, prefs)  # TTL 5 min

    # Critical notifications bypass marketing opt-outs
    if notif_type in CRITICAL_TYPES:
        return prefs.get("channel_" + channel + "_active", True)

    # Marketing: honour all opt-outs
    if not prefs.get("marketing_enabled", True):
        return False
    return prefs.get("channel_" + channel + "_enabled", True)
```
6. Level 5 — Delivery Guarantees
> **The dead letter queue is your forensics tool.** Every failed notification lands there with the full error. Without it, you're flying blind on delivery failures.
At-least-once vs exactly-once: Kafka provides at-least-once delivery by default. For notifications, a duplicate push (“Someone liked your photo!” × 2) is annoying but not catastrophic. Exactly-once requires distributed transactions across Kafka + the provider API — complex and slow.
The practical solution: idempotency keys. Each notification event carries a notif_id (UUID generated by the producer). Workers pass this key to the provider:
```python
# Worker deduplication with idempotency key
def send_push_with_idempotency(event):
    dedup_key = "sent:" + event["notif_id"] + ":push"
    # Check if already sent (Redis SET NX with TTL)
    acquired = redis.set(dedup_key, 1, nx=True, ex=86400)
    if not acquired:
        logger.info("Duplicate, skipping: " + event["notif_id"])
        return
    apns.send(
        device_token=event["device_token"],
        payload=event["payload"],
        apns_id=event["notif_id"],  # APNs native dedup key
    )
```
The Redis SET NX (set if not exists) is atomic — even if two worker replicas pick up the same Kafka message during a rebalance, only one will acquire the lock and send. The key expires after 24 hours to avoid unbounded memory growth.
7. Level 6 — Retry Logic & Failure Handling
Different errors require different responses:
| Error | Channel | Action |
|---|---|---|
| BadDeviceToken | APNs | Remove token from DB. Do not retry. |
| Unregistered | FCM | Remove registration ID. Do not retry. |
| ServiceUnavailable | APNs / FCM | Exponential backoff. Retry up to 4×. |
| Bounce (hard) | Email | Add to suppression list. Do not retry. |
| Bounce (soft) | Email | Retry once after 1 hour. |
| Rate limit (429) | Any | Respect Retry-After header. Retry. |
| Invalid number | SMS | Mark invalid in user DB. Do not retry. |
| Carrier filtering | SMS | Escalate to ops. Review message content. |
Exponential backoff schedule (jitter added to prevent thundering herd):
```python
import random
import time

BACKOFF_BASE = 1.0   # seconds
BACKOFF_MAX = 8.0    # seconds
MAX_RETRIES = 4

def send_with_retry(send_fn, event):
    for attempt in range(MAX_RETRIES):
        try:
            return send_fn(event)
        except PermanentError:
            raise  # never retry permanent errors
        except TransientError:
            delay = min(BACKOFF_BASE * (2 ** attempt), BACKOFF_MAX)
            jitter = delay * 0.2 * random.random()
            time.sleep(delay + jitter)
    # Exhausted retries — emit to DLQ
    dlq.publish(event)
```
8. Level 7 — Notification Templates & Personalization
Hard-coding notification text in worker code doesn’t scale. The Template Service stores versioned templates in a DB and renders them at send time:
```json
{
  "template_id": "order_shipped_push",
  "channel": "push",
  "locale": "en",
  "title": "Your order is on its way!",
  "body": "Hi [[ name ]], order [[ orderId ]] has shipped!",
  "ab_variant": "A"
}
```
These examples use `[[ name ]]` placeholders to avoid conflicts with Liquid templating. In production, use your preferred engine — Mustache (`{{name}}`), Jinja2 (`{{ name }}`), or Handlebars.
```python
def render_template(template_id, locale, variables):
    tmpl = template_db.get(template_id, locale=locale)
    body = tmpl["body"]
    # Simple variable substitution (no Liquid conflicts)
    for key, value in variables.items():
        body = body.replace("[[ " + key + " ]]", str(value))
    return body

# Usage
msg = render_template(
    "order_shipped_push",
    locale=user.locale,
    variables={"name": "Alice", "orderId": "ORD-4821"},
)
# → "Hi Alice, order ORD-4821 has shipped!"
```
A/B testing: Templates carry an ab_variant field. The orchestrator randomly assigns users to variant A or B (using user_id % 2) and selects the matching template. Analytics later reveal which copy drives higher open or click rates.
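The deterministic assignment can be sketched in a few lines; the template dicts and variant-selection helper below are illustrative, not a prescribed API:

```python
# Deterministic A/B assignment: user_id % 2 gives each user a stable
# variant, so the same user always sees the same copy.
def pick_variant(user_id: int) -> str:
    return "A" if user_id % 2 == 0 else "B"

def select_template(templates, user_id):
    # `templates` is a list of template dicts carrying an "ab_variant" field
    variant = pick_variant(user_id)
    for tmpl in templates:
        if tmpl["ab_variant"] == variant:
            return tmpl
    return templates[0]  # fall back when only one variant exists

candidates = [
    {"template_id": "order_shipped_push", "ab_variant": "A",
     "body": "Your order shipped!"},
    {"template_id": "order_shipped_push", "ab_variant": "B",
     "body": "Good news! Your order is on the way."},
]
print(select_template(candidates, user_id=7)["ab_variant"])  # → B
```

Hashing on `user_id` (rather than assigning randomly per send) keeps each user's experience consistent across the whole campaign.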
Localization: Templates are stored per (template_id, locale) pair. If the user’s locale doesn’t have a matching template, fall back to en. This enables a single notification event to be rendered in 50+ languages by the worker at send time.
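The locale fallback can be sketched as a lookup on the `(template_id, locale)` pair with `en` as the default; the in-memory dict stands in for the template DB:

```python
# Locale fallback (sketch): look up (template_id, locale), default to "en".
TEMPLATES = {
    ("order_shipped_push", "en"): "Hi [[ name ]], your order has shipped!",
    ("order_shipped_push", "fr"): "Bonjour [[ name ]], votre commande est en route !",
}

def get_template(template_id, locale):
    tmpl = TEMPLATES.get((template_id, locale))
    if tmpl is None:
        # No template for this locale: fall back to English.
        tmpl = TEMPLATES[(template_id, "en")]
    return tmpl

print(get_template("order_shipped_push", "de"))
# no German template → English fallback
```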
9. Level 8 — Analytics & Tracking
Knowing a notification was sent is not the same as knowing it was seen. Tracking the full funnel:
| Event | Push | Email | SMS | How |
|---|---|---|---|---|
| Sent | ✓ | ✓ | ✓ | Logged by worker on successful provider call |
| Delivered | ✓ | ✓ | ✓ | Provider delivery webhook → event stream |
| Opened | ✓ | ✓ | ✗ | Push: app SDK callback; Email: 1×1 tracking pixel |
| Clicked | ✓ (deep link) | ✓ | ✓ (link) | Email: redirect via tracking domain; Push: deep link |
| Bounced | ✓ (token err) | ✓ | ✓ (carrier err) | Provider webhook → suppression list update |
All events flow to an event stream (another Kafka topic: notif_events) and are consumed by an analytics pipeline that writes to ClickHouse for fast OLAP queries:
```sql
-- Delivery funnel for campaign 1042
SELECT
    channel,
    countIf(event_type = 'sent') AS sent,
    countIf(event_type = 'delivered') AS delivered,
    countIf(event_type = 'opened') AS opened,
    round(countIf(event_type = 'opened') * 100.0
          / nullIf(countIf(event_type = 'delivered'), 0), 2) AS open_rate_pct
FROM notif_events
WHERE campaign_id = 1042
  AND ts >= now() - INTERVAL 7 DAY
GROUP BY channel;
```
11. Cost Comparison & Calculator
> **SMS is 75× more expensive than email but has a 98% open rate vs ~20% for email.** Choose your channel based on message urgency and audience, not cost alone.
| Channel | Cost / message | Latency | Reliability | Open Rate |
|---|---|---|---|---|
| 📱 Push (FCM/APNs) | Free | < 1 s | ~95% (device must be online) | ~15% |
| 📧 Email (SendGrid) | $0.0001 | 1 – 60 s | ~99% | ~20% |
| 📟 SMS (Twilio, US) | $0.0075 | < 5 s | ~99.9% | ~98% |
12. Priority Queues
Not all notifications are equal. A 2FA code that doesn’t arrive means a locked-out user; a marketing email that arrives 10 minutes late is fine.
The separation is enforced at the producer level: notification events include a `priority` field (`critical` or `standard`), and the producer routes them to different Kafka topics or partitions:
```python
def publish_notification(event):
    topic = "notif-critical" if event["priority"] == "critical" else "notif-standard"
    producer.send(topic, value=event, key=event["recipient_id"].encode())

# Dedicated workers only subscribe to notif-critical
critical_consumer = KafkaConsumer(
    "notif-critical",
    group_id="push-workers-critical",
    # no max_poll_records limit — drain as fast as possible
)

# Shared workers subscribe to notif-standard
standard_consumer = KafkaConsumer(
    "notif-standard",
    group_id="push-workers-standard",
    max_poll_records=100,  # batch for throughput
)
```
SLA targets by priority:
| Type | P99 Latency | Queue Depth Alert |
|---|---|---|
| Critical (2FA, payment) | < 1 second | Alert if > 10 messages |
| Transactional (order shipped) | < 30 seconds | Alert if > 1,000 messages |
| Marketing (newsletter) | < 15 minutes | Alert if > 1M messages |
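The queue-depth alert thresholds from the table above can be encoded as a simple check against consumer-group lag; the tier names and function are illustrative, not a real monitoring API:

```python
# Queue-depth alerting thresholds per priority tier (sketch).
# Values mirror the SLA table above.
DEPTH_ALERT_THRESHOLDS = {
    "critical": 10,
    "transactional": 1_000,
    "marketing": 1_000_000,
}

def should_page(tier: str, queue_depth: int) -> bool:
    """Return True if the consumer-group lag for this tier warrants an alert."""
    return queue_depth > DEPTH_ALERT_THRESHOLDS[tier]

print(should_page("critical", 25))       # → True  (2FA queue is backing up)
print(should_page("marketing", 50_000))  # → False (well within tolerance)
```

In practice the `queue_depth` input would come from Kafka consumer-group lag metrics rather than a direct queue count.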
13. Putting It All Together
API → Kafka (2 topics: critical + standard) → 3 channel consumer groups (push, email, sms) per topic → preference check (Redis + MySQL) → idempotency check (Redis SET NX) → template render → provider call → analytics event to ClickHouse → DLQ for failures.
The system handles 10M/day easily and can scale to billions with horizontal Kafka partitioning and worker auto-scaling on consumer group lag. Key interview talking points:
- Decouple producers from consumers with Kafka — the API never waits on providers
- Per-channel workers with independent consumer groups — email slowness never delays push
- User preferences checked before every send — respect opt-outs, GDPR-compliant
- Idempotency keys prevent duplicate sends during Kafka rebalances
- Priority routing gives 2FA codes a dedicated fast lane
- DLQ for every channel — never lose a failure silently
- Analytics pipeline closes the loop — you know delivery rate, open rate, bounce rate
At Facebook scale,
notifications are a product
in themselves — the team
that owns the notification
infrastructure is separate
from the teams that trigger
notifications.
A common follow-up question: what if a user has 10 devices? The device registry stores all active tokens per user. The push worker fans out to all tokens in parallel. If any token returns BadDeviceToken, it’s removed; if all fail, it falls back to email. This is why the idempotency key must include the device token: notif_id + ":" + device_token, not just notif_id.
Another common extension: digest notifications. Instead of sending 50 “X liked your photo” pushes, batch them into “Alice and 49 others liked your photo”. The fanout service groups events by recipient within a 30-second window before publishing to Kafka. This reduces provider API calls and prevents notification fatigue — one of the most impactful product improvements you can make to a notification system.
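The grouping-and-aggregation step can be sketched as follows; the event shapes and message wording are illustrative:

```python
# Digest batching (sketch): buffer like-events per recipient during a
# short window, then emit one aggregated notification per recipient.
from collections import defaultdict

WINDOW_SECONDS = 30  # events collected during one batching window

def build_digest(events):
    """Collapse buffered events for one recipient into a single message."""
    actors = [e["actor_name"] for e in events]
    if len(actors) == 1:
        return actors[0] + " liked your photo"
    return actors[0] + " and " + str(len(actors) - 1) + " others liked your photo"

buffer = defaultdict(list)
for e in [
    {"recipient_id": 7, "actor_name": "Alice"},
    {"recipient_id": 7, "actor_name": "Bob"},
    {"recipient_id": 7, "actor_name": "Carol"},
    {"recipient_id": 9, "actor_name": "Dave"},
]:
    buffer[e["recipient_id"]].append(e)

for recipient_id, events in buffer.items():
    print(recipient_id, build_digest(events))
# → 7 Alice and 2 others liked your photo
# → 9 Dave liked your photo
```

Four raw events become two notifications, and the reduction grows with fanout: a viral post generating 50 likes in the window costs one provider call instead of 50.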