Designing a Notification System

A system design interview guide to building a notification service that reliably fans out push notifications, SMS, and email to millions of users without losing messages, sending duplicates, or spamming anyone.

A notification system is one of those services that looks trivial from the outside ("just send a message") and turns out to be a surprisingly rich design problem once you account for scale, third-party gateways you do not control, and the hard requirement that you neither lose a message nor send it twice. The interesting work is not the sending itself but everything around it: deciding what contact information to keep, decoupling the callers from the slow external gateways, retrying without duplicating, respecting user preferences, and being able to prove afterwards whether a notification was actually delivered. This guide walks through a design that handles all of that, building up from the channels outward.

Contents

  1. Channels and Gateways
  2. Gathering Contact Info
  3. End-to-End Architecture
  4. Why a Queue per Channel
  5. Reliability: No Lost Data
  6. Templates
  7. Settings and Rate Limiting
  8. Observability
  9. Summary

1. Channels and Gateways

The first thing to pin down is that a notification system almost never delivers messages itself. Instead it hands each message to a third-party gateway that owns the last mile to the device or inbox. There are several distinct channels, each with its own gateway, its own credentials, and its own quirks. Your design has to treat them as separate pipelines rather than assuming one uniform "send" path.

| Channel | Gateway | What it does |
| --- | --- | --- |
| iOS push | APNs (Apple Push Notification service) | Delivers push notifications to iPhones and iPads. You authenticate to APNs and send the payload addressed to a per-device token. |
| Android push | FCM (Firebase Cloud Messaging) | Google's equivalent for Android devices (and a cross-platform option). Again addressed by a device registration token. |
| SMS | An SMS provider (e.g. a commercial messaging API) | Sends a text message to a phone number. Usually metered per message and subject to carrier rules. |
| Email | An email provider / transactional email service | Sends email to an address, handling deliverability concerns like SPF, DKIM, and bounce tracking. |

The common thread is that all four are external, networked, and unreliable from your point of view. They can be slow, rate-limited, or temporarily down, and you cannot fix them — you can only react. That single fact drives most of the architecture that follows: you must isolate yourself from each gateway so that one misbehaving channel cannot stall the others, and you must be able to retry safely when a gateway returns an error.

A useful reframing for an interview: you are not building a "message sender," you are building a resilient dispatch layer in front of four flaky external services. Every design choice — queues, retries, dedup — exists to absorb the unreliability of the gateways.
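
One way to make "separate pipelines behind one dispatch layer" concrete is a small common interface with one implementation per channel. The sketch below is illustrative only: `Message`, `Gateway`, `FakeGateway`, and `dispatch` are hypothetical names, and the fake gateways merely record what they were asked to send instead of calling APNs, FCM, or a real provider.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Message:
    notification_id: str
    channel: str      # "ios_push", "android_push", "sms", or "email"
    address: str      # device token, phone number, or email address
    body: str


class Gateway(Protocol):
    """Anything that can take a message over the last mile."""

    def send(self, msg: Message) -> None: ...


class FakeGateway:
    """Stand-in for a real provider client; records what it was asked to send."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sent: list[Message] = []

    def send(self, msg: Message) -> None:
        self.sent.append(msg)


# One gateway per channel, looked up by channel name.
GATEWAYS: dict[str, Gateway] = {
    "ios_push": FakeGateway("APNs"),
    "android_push": FakeGateway("FCM"),
    "sms": FakeGateway("sms-provider"),
    "email": FakeGateway("email-provider"),
}


def dispatch(msg: Message) -> None:
    """Route a message to the gateway that owns its channel."""
    try:
        gateway = GATEWAYS[msg.channel]
    except KeyError:
        raise ValueError(f"unknown channel: {msg.channel}") from None
    gateway.send(msg)
```

Keeping credentials, payload formats, and error handling inside each gateway implementation means the rest of the system only ever deals with a channel name and a message.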

2. Gathering Contact Info

Before you can send anything you need to know where to send it, and that information has to be collected and kept up to date well before the first notification fires. The natural collection points are when a user signs up and when they install the app on a new device.

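For each user you want to store the user id, a device token for every device the app is installed on (for push), a phone number (for SMS), and an email address.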

When the mobile app is installed and registers for push, it obtains a device token from the OS and sends it to your backend; you persist it against the user id. Sign-up captures the phone number and email. All of this lands in a database that the notification system reads at send time to resolve a user id into concrete addresses.

# on app install / push registration
function register_device(user_id, platform, device_token):
  db.devices.upsert(user_id, platform, device_token, last_seen=now())

# on sign-up
function register_contact(user_id, phone, email):
  db.users.upsert(user_id, normalize(phone), email)

A schema sketch keeps the one-to-many device relationship explicit:

| Table | Key fields | Notes |
| --- | --- | --- |
| users | user_id, phone, email, locale, timezone | One row per user. Locale and timezone matter later for templating and quiet hours. |
| devices | device_id, user_id, platform, token, last_seen | Many rows per user. Stale tokens get cleaned up when a gateway reports them invalid. |
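
The schema can be tried out with an in-memory SQLite database. This is a minimal sketch rather than a production layout: the column types, the use of `INSERT OR REPLACE` as the upsert, and the helper names `open_db`, `tokens_for`, and `remove_token` are all assumptions made for illustration.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE users (
    user_id  TEXT PRIMARY KEY,
    phone    TEXT,
    email    TEXT,
    locale   TEXT,
    timezone TEXT
);
CREATE TABLE devices (
    device_id TEXT PRIMARY KEY,
    user_id   TEXT NOT NULL REFERENCES users(user_id),
    platform  TEXT NOT NULL,
    token     TEXT NOT NULL,
    last_seen REAL NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register_device(conn, device_id, user_id, platform, token):
    # INSERT OR REPLACE acts as an upsert keyed on device_id, so a device
    # that re-registers with a fresh token overwrites its old row.
    conn.execute(
        "INSERT OR REPLACE INTO devices VALUES (?, ?, ?, ?, ?)",
        (device_id, user_id, platform, token, time.time()),
    )


def tokens_for(conn, user_id, platform):
    # One user can have many devices, so this returns a list.
    rows = conn.execute(
        "SELECT token FROM devices WHERE user_id = ? AND platform = ? "
        "ORDER BY device_id",
        (user_id, platform),
    )
    return [row[0] for row in rows]


def remove_token(conn, token):
    # Called when a gateway reports the token as invalid.
    conn.execute("DELETE FROM devices WHERE token = ?", (token,))
```

Keying the upsert on the device rather than the token is what keeps a token refresh from leaving a stale duplicate row behind.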

3. End-to-End Architecture

With channels and contact info in place, the central design is the path a notification takes from the service that wants to send it to the device that receives it. The key move is to put a queue between the request and the actual sending, so the fast, synchronous part (accepting the request) is decoupled from the slow, unreliable part (talking to the gateway).

Figure: Notification system architecture. A service triggers a notification; the notification servers authenticate the caller and apply rate limits, look up device and user data from cache and DB, render from the template store, and enqueue onto a per-channel queue. Workers pull from the queue, call the third-party gateway with retry on error, and the notification reaches the device. A notification log and analytics service track the message through send-pending, sent, and click events.

Reading the flow left to right, the request path on the notification servers can be sketched as:

function notify(caller, user_id, channel, template_id, params):
  authenticate(caller)                       # reject untrusted services
  if not rate_limiter.allow(user_id, channel):
    return DROPPED                           # don't spam the user
  contact = cache.get_contact(user_id)       # falls back to DB
  body    = templates.render(template_id, params, contact.locale)
  msg     = build(notification_id=uuid(), user_id=user_id, channel=channel, contact=contact, body=body)
  log.write(msg, status="send-pending")
  queue[channel].enqueue(msg)                # return fast; send is async
  return ACCEPTED
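
For a version of the accept path that actually runs, the sketch below swaps every dependency for an in-memory stand-in. `CONTACTS`, `TEMPLATES`, `LOG`, and `QUEUES` are hypothetical placeholders for the cache/DB, template store, notification log, and per-channel queues, and authentication and rate limiting are left out to keep the example short.

```python
import uuid
from collections import defaultdict, deque

# In-memory stand-ins for the real stores (all hypothetical).
CONTACTS = {"u1": {"email": "sam@example.com", "locale": "en"}}
TEMPLATES = {"welcome": "Welcome, {name}!"}
LOG: dict[str, dict] = {}            # notification_id -> record
QUEUES = defaultdict(deque)          # channel -> pending notification ids


def notify(user_id: str, channel: str, template_id: str, params: dict) -> str:
    contact = CONTACTS.get(user_id)
    if contact is None or channel not in contact:
        return "DROPPED"             # nowhere to send on this channel
    notification_id = str(uuid.uuid4())
    record = {
        "user_id": user_id,
        "channel": channel,
        "address": contact[channel],
        "body": TEMPLATES[template_id].format(**params),
        "status": "send-pending",
    }
    LOG[notification_id] = record             # persist first ...
    QUEUES[channel].append(notification_id)   # ... then enqueue
    return "ACCEPTED"                         # the gateway call happens later
```

The caller gets its answer as soon as the record is logged and queued; nothing on this path waits on a third-party gateway.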

4. Why a Queue per Channel

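It is tempting to use one big queue for everything, but giving each channel its own queue is one of the highest-leverage decisions in the design. The reasons compound: isolation (a slow or failing gateway backs up only its own queue), buffering (bursts are absorbed while workers drain at whatever pace the gateway allows), independent scaling (each channel's worker pool is sized for its own load), and backpressure (the depth of a queue tells you when a channel is falling behind).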

The per-channel queue is what turns "we depend on four flaky external services" into "each flaky service is contained behind its own buffer." It is the structural reason a single gateway outage degrades one channel instead of taking the whole system down.
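
The isolation and backpressure properties can be demonstrated with the standard library alone. This is a toy sketch under stated assumptions: the channel names, the tiny `MAX_DEPTH`, and the `enqueue` and `depths` helpers are hypothetical, and a real deployment would use a durable message broker, not in-process queues.

```python
import queue

CHANNELS = ("ios_push", "android_push", "sms", "email")
MAX_DEPTH = 3  # deliberately tiny for the demonstration

# One bounded queue per channel.
queues = {channel: queue.Queue(maxsize=MAX_DEPTH) for channel in CHANNELS}


def enqueue(channel: str, msg: str) -> bool:
    """Try to enqueue without blocking; False signals backpressure."""
    try:
        queues[channel].put_nowait(msg)
        return True
    except queue.Full:
        return False


def depths() -> dict[str, int]:
    """Per-channel backlog, the number to alert and autoscale on."""
    return {channel: q.qsize() for channel, q in queues.items()}
```

If the SMS gateway stalls and its queue fills, `enqueue("sms", ...)` starts returning False while `enqueue("email", ...)` keeps succeeding, which is exactly the containment a single shared queue cannot give you.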

5. Reliability: No Lost Data

The defining requirement of a notification system is usually stated as two rules that pull in opposite directions: do not lose a notification, and do not send the same notification twice. Satisfying both at once is the heart of the reliability design.

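Three mechanisms work together: persisting every notification to the notification log before it is enqueued, retrying retryable gateway errors with backoff, and deduplicating on a unique notification id before each send.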

function worker_process(msg):
  if log.already_sent(msg.notification_id):   # dedup: don't send twice
    return
  try:
    gateway[msg.channel].send(msg)
    log.mark(msg.notification_id, status="sent")
    analytics.record(msg.notification_id, "sent")
  except RetryableError:
    msg.attempts += 1                         # count the failed attempt so retries are bounded
    if msg.attempts < MAX_ATTEMPTS:
      queue[msg.channel].requeue(msg, backoff(msg.attempts))
    else:
      log.mark(msg.notification_id, status="failed")  # give up loudly
  except InvalidTokenError:
    db.devices.remove(msg.device_id)          # clean up, do not retry
    log.mark(msg.notification_id, status="failed")    # don't leave it send-pending

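The combination gives at-least-once delivery with an effectively exactly-once user experience: persistence and retries ensure the message is not lost, and the notification-id dedup check keeps the user from seeing it twice even though the message may pass through the worker more than once. The one remaining gap is a worker that crashes after the gateway accepts the message but before the log is marked sent; a retry in that window can still produce a duplicate, so the guarantee is best-effort rather than absolute.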
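
The retry and dedup logic is small enough to run in isolation. The sketch below is one possible shape, not a definitive implementation: the constants, the full-jitter backoff policy, and the `process` signature (which takes the gateway call, the set of already-sent ids, and the requeue function as arguments) are assumptions chosen so the example has no external dependencies.

```python
import random


class RetryableError(Exception):
    """Transient gateway failure (timeout, 5xx, rate limit)."""


MAX_ATTEMPTS = 5
BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0


def backoff(attempt: int) -> float:
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(MAX_DELAY_S, BASE_DELAY_S * 2**attempt)].
    """
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return random.uniform(0, ceiling)


def process(msg: dict, send, sent_ids: set, requeue) -> str:
    """Handle one queued message and return the resulting status."""
    if msg["notification_id"] in sent_ids:
        return "duplicate"                      # already delivered: skip
    try:
        send(msg)
    except RetryableError:
        msg["attempts"] += 1                    # count this failed attempt
        if msg["attempts"] >= MAX_ATTEMPTS:
            return "failed"                     # retries exhausted
        requeue(msg, backoff(msg["attempts"]))
        return "retrying"
    sent_ids.add(msg["notification_id"])
    return "sent"
```

Jitter matters when a gateway recovers from an outage: without it, every message that failed at the same moment would retry at the same moment too.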

6. Templates

Most notifications are not bespoke one-off strings; they are the same message structure filled in with different values — "Your order #{order_id} has shipped," "{name} liked your post." Embedding that text inline in every calling service is a maintenance and consistency nightmare. A notification template store solves this by keeping reusable, parameterized content in one place.

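A template is a named, versioned piece of content with placeholders. Callers reference a template by id and supply the parameters; the notification servers render the final body at send time. This buys several things at once: consistent wording across every calling service, a single place to edit copy, and per-channel and per-locale variants of the same message.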

# template: "order_shipped" with placeholders
templates["order_shipped"] = {
  "push":  "Your order {order_id} is on the way!",
  "email": "Hi {name}, order {order_id} shipped and arrives {eta}.",
}

render("order_shipped", {order_id: 482, name: "Sam", eta: "Friday"}, channel="email")
# -> "Hi Sam, order 482 shipped and arrives Friday."
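
A runnable rendering function needs little more than `str.format`. In this sketch the `(channel, locale)` keying, the fallback to a default locale, and the Spanish variant are illustrative assumptions layered on top of the example template above.

```python
TEMPLATES = {
    "order_shipped": {
        ("push", "en"): "Your order {order_id} is on the way!",
        ("email", "en"): "Hi {name}, order {order_id} shipped and arrives {eta}.",
        ("push", "es"): "¡Tu pedido {order_id} está en camino!",
    }
}
DEFAULT_LOCALE = "en"


def render(template_id: str, params: dict, channel: str,
           locale: str = DEFAULT_LOCALE) -> str:
    variants = TEMPLATES[template_id]
    # Fall back to the default locale when no translation exists.
    text = variants.get((channel, locale)) or variants[(channel, DEFAULT_LOCALE)]
    try:
        return text.format(**params)
    except KeyError as missing:
        # Fail at render time instead of sending a half-filled message.
        raise ValueError(
            f"template {template_id!r} needs parameter {missing}"
        ) from None
```

Rejecting a render with missing parameters is a deliberate choice here: a notification that reads "Hi {name}" is worse than one that is never sent.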

7. Settings and Rate Limiting

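Just because you can send a notification does not mean you should. Two controls protect the user from being overwhelmed, and they are also what keep your sending reputation (and SMS bill) healthy: user settings, which record what each user has opted into per channel and category, and rate limiting, which caps how many notifications a user receives.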

function should_send(user_id, channel, category):
  prefs = settings.get(user_id)
  if not prefs.opted_in(channel, category):
    return False                              # respect opt-out
  if not rate_limiter.allow(user_id, category):
    return False                              # don't spam
  return True

A user who gets spammed turns off notifications entirely, or marks your email as spam, which damages deliverability for everyone. Restraint is a feature: the settings and rate-limiting layer protects the long-term value of the channel.
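
The rate limiter itself can take several forms. The sketch below uses a sliding window keyed by user and category; the class name, the injectable `clock`, and the choice of a sliding window over a token bucket are assumptions, and a multi-server deployment would keep the timestamps in a shared store instead of process memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends per (user, category) within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._sent = defaultdict(deque)   # (user_id, category) -> send timestamps

    def allow(self, user_id: str, category: str) -> bool:
        now = self.clock()
        stamps = self._sent[(user_id, category)]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True
```

Keying on category as well as user lets a security alert through even after the user has hit the cap on promotional messages.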

8. Observability

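Because the actual delivery happens inside third parties you do not control, you cannot reason about the system without recording what happened at each step. Two pieces give you that visibility: the notification log, which records each message's status as it moves from send-pending to sent, and an analytics service, which tracks sent, delivered, and clicked events.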

Tracking sent, delivered, and clicked as distinct events matters because the gaps between them are diagnostic. A high sent-but-not-delivered rate points at the gateway or stale tokens; a high delivered-but-not-clicked rate points at content. Without these events you are flying blind through services you cannot otherwise inspect.

analytics.record(notification_id, "send-pending")  # accepted, queued
analytics.record(notification_id, "sent")          # handed to gateway
analytics.record(notification_id, "delivered")     # gateway receipt
analytics.record(notification_id, "clicked")       # user engagement
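
Once those events are recorded, the gaps between stages are a simple aggregation. This is a minimal sketch assuming events arrive as `(notification_id, stage)` pairs; the `funnel` and `drop_off` helpers are hypothetical names.

```python
from collections import defaultdict

STAGES = ("send-pending", "sent", "delivered", "clicked")


def funnel(events):
    """Count distinct notification ids that reached each stage."""
    reached = defaultdict(set)
    for notification_id, stage in events:
        reached[stage].add(notification_id)   # a set ignores repeated events
    return {stage: len(reached[stage]) for stage in STAGES}


def drop_off(counts):
    """Fraction lost between consecutive stages; None if the earlier stage is empty."""
    rates = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        key = f"{earlier}->{later}"
        if counts[earlier] == 0:
            rates[key] = None
        else:
            rates[key] = 1 - counts[later] / counts[earlier]
    return rates
```

Counting distinct ids instead of raw events keeps a retried send from inflating the "sent" number and hiding a real delivery problem.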

9. Summary

A notification system is a resilient dispatch layer in front of several unreliable external gateways. Its design is a handful of decisions that reinforce each other:

| Concern | Mechanism |
| --- | --- |
| How do messages actually reach devices? | Third-party gateways per channel: APNs (iOS push), FCM (Android push), an SMS provider, an email provider. |
| Where do we send to? | Contact info (user id, device tokens, phone, email) collected at sign-up and install, stored in a DB. |
| How do callers trigger sends? | Notification servers authenticate the caller, apply rate limits, render the template, persist, and enqueue. |
| How do we stay isolated from flaky gateways? | A message queue per channel: isolation, buffering, independent scaling, and backpressure. |
| How do we avoid losing messages? | Persist first to the notification log, then retry on error with backoff. |
| How do we avoid sending twice? | Dedup on a unique notification id before calling the gateway. |
| How do we keep content consistent? | A template store with reusable, parameterized, localizable content. |
| How do we avoid spamming users? | User settings / opt-out plus per-user, per-category rate limiting. |
| How do we know what happened? | A notification log plus an analytics service tracking sent, delivered, and clicked. |

The recurring theme: every part of the design exists to absorb the unreliability of services you do not control. Queues isolate them, persistence and retries survive them, dedup tames the retries, and the log and analytics make the whole opaque chain observable.