A system design interview guide to building a notification service that reliably fans out push notifications, SMS, and email to millions of users without losing messages, sending duplicates, or spamming anyone.
A notification system is one of those services that looks trivial from the outside ("just send a message") and turns out to be a surprisingly rich design problem once you account for scale, third-party gateways you do not control, and the hard requirement that you neither lose a message nor send it twice. The interesting work is not the sending itself but everything around it: deciding what contact information to keep, decoupling the callers from the slow external gateways, retrying without duplicating, respecting user preferences, and being able to prove afterwards whether a notification was actually delivered. This guide walks through a design that handles all of that, building up from the channels outward.
The first thing to pin down is that a notification system almost never delivers messages itself. Instead it hands each message to a third-party gateway that owns the last mile to the device or inbox. There are several distinct channels, each with its own gateway, its own credentials, and its own quirks. Your design has to treat them as separate pipelines rather than assuming one uniform "send" path.
| Channel | Gateway | What it does |
|---|---|---|
| iOS push | APNs (Apple Push Notification service) | Delivers push notifications to iPhones and iPads. You authenticate to APNs and send the payload addressed to a per-device token. |
| Android push | FCM (Firebase Cloud Messaging) | Google's equivalent for Android devices (and a cross-platform option). Again addressed by a device registration token. |
| SMS | An SMS provider (e.g. a commercial messaging API) | Sends a text message to a phone number. Usually metered per message and subject to carrier rules. |
| Email | An email provider / transactional email service | Sends email to an address, handling deliverability concerns like SPF, DKIM, and bounce tracking. |
The common thread is that all four are external, networked, and unreliable from your point of view. They can be slow, rate-limited, or temporarily down, and you cannot fix them — you can only react. That single fact drives most of the architecture that follows: you must isolate yourself from each gateway so that one misbehaving channel cannot stall the others, and you must be able to retry safely when a gateway returns an error.
Before you can send anything you need to know where to send it, and that information has to be collected and kept up to date well before the first notification fires. The natural collection points are when a user signs up and when they install the app on a new device.
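For each user you want to store the user id, the device tokens for every device registered for push, a phone number for SMS, and an email address.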
When the mobile app is installed and registers for push, it obtains a device token from the OS and sends it to your backend; you persist it against the user id. Sign-up captures the phone number and email. All of this lands in a database that the notification system reads at send time to resolve a user id into concrete addresses.
    # on app install / push registration
    function register_device(user_id, platform, device_token):
        db.devices.upsert(user_id, platform, device_token, last_seen=now())

    # on sign-up
    function register_contact(user_id, phone, email):
        db.users.upsert(user_id, normalize(phone), email)
A schema sketch keeps the one-to-many device relationship explicit:
| Table | Key fields | Notes |
|---|---|---|
| users | user_id, phone, email, locale, timezone | One row per user. Locale and timezone matter later for templating and quiet hours. |
| devices | device_id, user_id, platform, token, last_seen | Many rows per user. Stale tokens get cleaned up when a gateway reports them invalid. |
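The two tables above can be sketched with an in-memory SQLite database. This is a minimal illustration, not a production schema: the column types, the `open_db`, `register_device`, and `tokens_for` helpers, and the choice of `device_id` as the upsert key are all assumptions made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id  TEXT PRIMARY KEY,
    phone    TEXT,
    email    TEXT,
    locale   TEXT,
    timezone TEXT
);
CREATE TABLE devices (
    device_id TEXT PRIMARY KEY,
    user_id   TEXT NOT NULL REFERENCES users(user_id),
    platform  TEXT NOT NULL,   -- e.g. "ios" or "android"
    token     TEXT NOT NULL,
    last_seen REAL NOT NULL
);
"""

def open_db():
    """Create an in-memory database with the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def register_device(conn, device_id, user_id, platform, token, now):
    # Upsert: re-registering the same device refreshes its token and last_seen.
    conn.execute(
        "INSERT INTO devices (device_id, user_id, platform, token, last_seen) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(device_id) DO UPDATE SET "
        "token = excluded.token, last_seen = excluded.last_seen",
        (device_id, user_id, platform, token, now),
    )

def tokens_for(conn, user_id):
    """Resolve a user id into every (platform, token) pair to push to."""
    rows = conn.execute(
        "SELECT platform, token FROM devices WHERE user_id = ? ORDER BY device_id",
        (user_id,),
    )
    return rows.fetchall()
```

Keeping devices in their own table means a send to one user fans out to as many rows as that user has devices, and a token a gateway reports as invalid can be deleted without touching the user row.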
With channels and contact info in place, the central design is the path a notification takes from the service that wants to send it to the device that receives it. The key move is to put a queue between the request and the actual sending, so the fast, synchronous part (accepting the request) is decoupled from the slow, unreliable part (talking to the gateway).

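Reading the flow left to right: a calling service sends a request to the notification servers, which authenticate the caller, apply rate limits, resolve the user's contact info, render the template, persist the message to the notification log, and enqueue it on the channel's queue. Workers then pull from each queue and hand messages to the matching gateway. The accept path looks like this: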
    function notify(caller, user_id, channel, template_id, params):
        authenticate(caller)                       # reject untrusted services
        if not rate_limiter.allow(user_id, channel):
            return DROPPED                         # don't spam the user
        contact = cache.get_contact(user_id)       # falls back to DB
        body = templates.render(template_id, params, contact.locale)
        msg = build(notification_id=uuid(), user_id, channel, contact, body)
        log.write(msg, status="send-pending")
        queue[channel].enqueue(msg)                # return fast; send is async
        return ACCEPTED
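It is tempting to use one big queue for everything, but giving each channel its own queue is one of the highest-leverage decisions in the design. The reasons compound: a slow or failing gateway backs up only its own queue (isolation), each queue absorbs bursts while its workers catch up (buffering), each channel's worker pool can be sized to its own volume (independent scaling), and a full queue pushes back on that channel alone (backpressure).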
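A small sketch of per-channel isolation, using Python's standard `queue.Queue` as a stand-in for a real message broker. The channel names, the `make_queues` and `enqueue` helpers, and the capacity value are illustrative assumptions; a real deployment would use a durable broker rather than in-process queues.

```python
import queue

CHANNELS = ("ios_push", "android_push", "sms", "email")

def make_queues(capacity):
    # One bounded queue per channel; the capacity is an illustrative value.
    return {channel: queue.Queue(maxsize=capacity) for channel in CHANNELS}

def enqueue(queues, channel, msg):
    """Return True if accepted, False if that channel's queue is full."""
    try:
        queues[channel].put_nowait(msg)
        return True
    except queue.Full:
        # Backpressure: only the congested channel pushes back.
        return False
```

If the SMS queue fills up because its gateway is slow, `enqueue(queues, "sms", ...)` starts returning `False` while `enqueue(queues, "email", ...)` keeps succeeding, which is the isolation property in miniature.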
The defining requirement of a notification system is usually stated as two rules that pull in opposite directions: do not lose a notification, and do not send the same notification twice. Satisfying both at once is the heart of the reliability design.
Three mechanisms work together:
- Persist first. Before enqueueing, write the notification to the notification log with status send-pending. If a worker crashes mid-send, the record survives and can be retried. Never treat an in-memory message as the source of truth.
- Retry on error. When a gateway returns a retryable error, requeue the message with backoff, up to a maximum number of attempts, and mark it failed once that limit is reached.
- Dedup on notification_id. Every notification carries a unique notification_id. Because retries (and at-least-once queue delivery) mean a message can be processed more than once, the worker checks whether that id has already been sent before calling the gateway. This is what prevents a user from receiving the same alert twice.

    function worker_process(msg):
        if log.already_sent(msg.notification_id):        # dedup: don't send twice
            return
        try:
            gateway[msg.channel].send(msg)
            log.mark(msg.notification_id, status="sent")
            analytics.record(msg.notification_id, "sent")
        except RetryableError:
            msg.attempts += 1                             # count this failed attempt
            if msg.attempts < MAX_ATTEMPTS:
                queue[msg.channel].requeue(msg, backoff(msg.attempts))
            else:
                log.mark(msg.notification_id, status="failed")   # give up loudly
        except InvalidTokenError:
            db.devices.remove(msg.device_id)              # clean up, do not retry
The combination gives at-least-once delivery with an effectively exactly-once user experience: persistence and retries ensure the message is not lost, and the notification-id dedup check ensures the user does not see it twice even though the message may flow through the worker more than once.
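The interplay of dedup and retry can be exercised with a small, self-contained sketch. Everything here is an assumption made for illustration: the dict standing in for the durable notification log, the `process` and `backoff_seconds` names, the default of five attempts, and exponential backoff with full jitter as the backoff policy (the text above only says "backoff").

```python
import random

class RetryableError(Exception):
    """Assumed error type for timeouts, 5xx responses, and rate limiting."""

def backoff_seconds(attempt, base=1.0, cap=60.0, rng=random.random):
    # Exponential backoff with full jitter: scale min(cap, base * 2**attempt)
    # by a random factor so retries from many workers do not synchronize.
    return rng() * min(cap, base * (2 ** attempt))

def process(msg, log, send, max_attempts=5):
    """Handle one delivery of msg and return the resulting status.

    log maps notification_id -> status and stands in for the durable log.
    """
    nid = msg["notification_id"]
    if log.get(nid) == "sent":
        return "duplicate"        # already sent: skip the gateway call
    try:
        send(msg)
    except RetryableError:
        msg["attempts"] += 1
        if msg["attempts"] >= max_attempts:
            log[nid] = "failed"   # give up loudly
            return "failed"
        return "retry"            # caller requeues after backoff_seconds(...)
    log[nid] = "sent"
    return "sent"
```

One caveat worth raising in an interview: if a worker crashes after the gateway accepts the message but before the log is updated, the retry will send again, so the check narrows the duplicate window rather than closing it completely.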
Most notifications are not bespoke one-off strings; they are the same message structure filled in with different values — "Your order #{order_id} has shipped," "{name} liked your post." Embedding that text inline in every calling service is a maintenance and consistency nightmare. A notification template store solves this by keeping reusable, parameterized content in one place.
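A template is a named, versioned piece of content with placeholders. Callers reference a template by id and supply the parameters; the notification servers render the final body at send time. This buys several things at once: the same content is reused across every calling service, wording stays consistent because it is edited in one place, and each template can carry per-channel and per-locale variants.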
    # template: "order_shipped" with placeholders
    templates["order_shipped"] = {
        "push": "Your order {order_id} is on the way!",
        "email": "Hi {name}, order {order_id} shipped and arrives {eta}.",
    }

    render("order_shipped", {"order_id": 482, "name": "Sam", "eta": "Friday"}, channel="email")
    # -> "Hi Sam, order 482 shipped and arrives Friday."
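Since the notification servers render with the contact's locale, a template store also needs a rule for locales it has no translation for. The sketch below assumes one reasonable rule, falling back to a default locale. The `(template_id, channel)` key layout, the `DEFAULT_LOCALE` constant, and the Spanish text are illustrative assumptions, not part of the design above.

```python
DEFAULT_LOCALE = "en"

# Hypothetical store layout: (template_id, channel) -> {locale: text}.
TEMPLATES = {
    ("order_shipped", "email"): {
        "en": "Hi {name}, order {order_id} shipped and arrives {eta}.",
        "es": "Hola {name}, tu pedido {order_id} fue enviado y llega {eta}.",
    },
}

def render(template_id, channel, locale, params):
    variants = TEMPLATES[(template_id, channel)]
    # Fall back to the default locale when no translation exists.
    text = variants.get(locale, variants[DEFAULT_LOCALE])
    # str.format raises KeyError if a placeholder has no matching parameter,
    # so a caller that forgets a value fails loudly instead of sending "{eta}".
    return text.format(**params)
```

Failing on a missing parameter at render time, before the message is persisted and queued, keeps half-filled notifications from ever reaching a gateway.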
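Just because you can send a notification does not mean you should. Two controls protect the user from being overwhelmed: user settings that record which channels and categories the user has opted out of, and per-user rate limiting that caps how often they are contacted. They are also what keep your sending reputation (and SMS bill) healthy.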
    function should_send(user_id, channel, category):
        prefs = settings.get(user_id)
        if not prefs.opted_in(channel, category):
            return False              # respect opt-out
        if not rate_limiter.allow(user_id, category):
            return False              # don't spam
        return True
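The rate limiter itself is left abstract above. One common choice is a sliding-window counter per (user, category), sketched below. The `SlidingWindowLimiter` class, the set of opt-out triples, and the explicit `now` argument (passed in so the behavior is deterministic) are assumptions for illustration; a production limiter would typically live in a shared store rather than process memory.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` sends per (user, category) in any window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._sent = defaultdict(deque)   # (user_id, category) -> timestamps

    def allow(self, user_id, category, now):
        stamps = self._sent[(user_id, category)]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()              # forget sends outside the window
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True

def should_send(opt_outs, limiter, user_id, channel, category, now):
    # opt_outs: set of (user_id, channel, category) triples the user declined.
    if (user_id, channel, category) in opt_outs:
        return False                      # checked first, so it uses no budget
    return limiter.allow(user_id, category, now)
```

Checking the opt-out before the limiter means a suppressed notification does not consume the user's rate budget for that category.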
Because the actual delivery happens inside third parties you do not control, you cannot reason about the system without recording what happened at each step. Two pieces give you that visibility.
- The notification log records each notification's status as it moves through send-pending → sent → delivered, or failed. This is both the reliability backbone (it is what makes persist-first and dedup possible) and the audit trail you consult when a user asks "why didn't I get the alert?"
- The analytics service records sent, delivered, and clicked events for each notification.

Tracking sent, delivered, and clicked as distinct events matters because the gaps between them are diagnostic. A high sent-but-not-delivered rate points at the gateway or stale tokens; a high delivered-but-not-clicked rate points at content. Without these events you are flying blind through services you cannot otherwise inspect.
    analytics.record(notification_id, "send-pending")   # accepted, queued
    analytics.record(notification_id, "sent")           # handed to gateway
    analytics.record(notification_id, "delivered")      # gateway receipt
    analytics.record(notification_id, "clicked")        # user engagement
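To make the diagnostic gaps concrete, here is a minimal sketch that turns a stream of recorded events into delivery and click rates. The `funnel` function and the `(notification_id, event_name)` pair format are assumptions for illustration; a real analytics service would aggregate in a data store, not in memory.

```python
from collections import defaultdict

def funnel(events):
    """events: iterable of (notification_id, event_name) pairs."""
    seen = defaultdict(set)
    for notification_id, name in events:
        # Sets make repeated events (e.g. duplicate receipts) count once.
        seen[name].add(notification_id)
    sent = len(seen["sent"])
    delivered = len(seen["delivered"])
    clicked = len(seen["clicked"])
    return {
        "sent": sent,
        "delivery_rate": delivered / sent if sent else 0.0,
        "click_rate": clicked / delivered if delivered else 0.0,
    }
```

A low `delivery_rate` suggests looking at the gateway or at stale tokens, while a healthy `delivery_rate` with a low `click_rate` suggests looking at the content.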
A notification system is a resilient dispatch layer in front of several unreliable external gateways. Its design is a handful of decisions that reinforce each other:
| Concern | Mechanism |
|---|---|
| How do messages actually reach devices? | Third-party gateways per channel: APNs (iOS push), FCM (Android push), an SMS provider, an email provider. |
| Where do we send to? | Contact info — user id, device tokens, phone, email — collected at sign-up and install, stored in a DB. |
| How do callers trigger sends? | Notification servers authenticate the caller, apply rate limits, render the template, persist, and enqueue. |
| How do we stay isolated from flaky gateways? | A message queue per channel: isolation, buffering, independent scaling, and backpressure. |
| How do we avoid losing messages? | Persist first to the notification log, then retry on error with backoff. |
| How do we avoid sending twice? | Dedup on a unique notification id before calling the gateway. |
| How do we keep content consistent? | A template store with reusable, parameterized, localizable content. |
| How do we avoid spamming users? | User settings / opt-out plus per-user, per-category rate limiting. |
| How do we know what happened? | A notification log plus an analytics service tracking sent, delivered, and clicked. |