Design a notification system: PM interview answer

Q: Design a notification system.

How to answer the notification system design question as a PM, covering channel tradeoffs, frequency controls, and the 2026 AI-agent angle.

This question is not asking you to recite Kafka. It is asking whether you understand why notification systems fail as products, not as infrastructure.

Scope before you draw anything

The first thing a strong candidate says is: “What kind of product is this, and who is the user?” A transactional system (bank alerts, order confirmations) has different trust requirements than a social engagement system (likes, follows) or a workplace tool (mentions, task assignments). The channel mix, reliability model, and frequency caps are all downstream of that answer.

Then anchor on scale. At 50M daily users receiving 5 notifications each, you are handling 250M notifications per day. That averages roughly 2,900 per second, and if traffic bunches to about six times the average at peak, roughly 17,000 per second. Name that number. Interviewers want to see you reason about scale before components, not after.
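The arithmetic behind those figures can be sketched in a few lines. The peak factor of 6 is an assumption, not something a real traffic profile guarantees; it is chosen here because it takes the daily total to roughly 17,000 per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def notifications_per_second(daily_users: int, per_user: int, peak_factor: float = 1.0) -> float:
    """Return the send rate in notifications per second.

    peak_factor scales the daily average to model bursty traffic (assumed value).
    """
    daily_total = daily_users * per_user
    return daily_total / SECONDS_PER_DAY * peak_factor


average = notifications_per_second(50_000_000, 5)
peak = notifications_per_second(50_000_000, 5, peak_factor=6.0)
print(f"average: {average:,.0f}/s, peak (assumed 6x): {peak:,.0f}/s")
```

In an interview, stating the assumed peak factor out loud matters more than the exact multiplier.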

What the architecture actually needs to do

The core flow is: event producer (app action, agent action, scheduler) feeds a queue (Kafka or equivalent), workers fan out to channel handlers (APNs for iOS, FCM for Android, SMTP for email, SMS gateway), a preference store filters and personalizes, and a device token registry keeps addresses current.
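As a rough illustration of that flow, the sketch below uses an in-memory queue and dictionaries as stand-ins for Kafka, the preference store, and the device token registry. Every name here (`Event`, `PREFERENCES`, `ADDRESSES`, `fan_out`) is hypothetical, and real channel handlers would call APNs, FCM, SMTP, or an SMS gateway instead of returning tuples.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    category: str
    body: str


# Hypothetical stand-in for the preference store: user -> category -> channels.
PREFERENCES = {
    "u1": {"order_update": ["push", "email"]},
    "u2": {"order_update": ["email"]},
}

# Hypothetical stand-in for the token/address registry.
ADDRESSES = {
    ("u1", "push"): "device-token-abc",
    ("u1", "email"): "u1@example.com",
    ("u2", "email"): "u2@example.com",
}


def fan_out(queue: deque) -> list:
    """Drain the queue and return (channel, address, body) deliveries."""
    deliveries = []
    while queue:
        event = queue.popleft()
        # The preference store decides which channels this user allows.
        channels = PREFERENCES.get(event.user_id, {}).get(event.category, [])
        for channel in channels:
            address = ADDRESSES.get((event.user_id, channel))
            if address is None:
                continue  # no current address for this channel, so skip it
            deliveries.append((channel, address, event.body))
    return deliveries
```

A user with no stored preferences receives nothing in this sketch, which matches the "default to less" stance rather than defaulting to every channel.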

Push versus pull is not a binary choice. Use push delivery for real-time alerts and a pull-based notification inbox for history, the way Gmail and Slack both do. A user who missed a push can still find the message; a push-only system loses it.

On reliability: at-least-once delivery with exponential backoff is the standard for most notifications (duplicates are annoying, missed order confirmations are worse). Financial alerts need idempotency keys and deduplication before delivery. Name the tradeoff; do not pretend it does not exist.
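A minimal sketch of both halves of that tradeoff, assuming a hypothetical `send` callable that returns `True` on success. The retry loop gives at-least-once behavior with exponential backoff; the `Deduplicator` shows how an idempotency key suppresses a repeat. A production version would likely add jitter and a persistent key store, which this sketch omits.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds and return the number of attempts used.

    Waits base_delay, then 2x, 4x, ... between attempts (exponential backoff).
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("delivery failed after retries")


class Deduplicator:
    """Remembers idempotency keys so a retried event is delivered only once."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, key: str) -> bool:
        """Return True the first time a key is seen, False on any repeat."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

For a financial alert, the caller would check `first_time(idempotency_key)` before handing the message to `send_with_retry`, so a retry upstream does not become a second alert.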

The product problem the interviewer actually cares about

Delivery is a solved infrastructure problem. The unsolved problem is relevance and trust.

Android push opt-ins dropped from 85% to 67% in a single year. iOS sits at 43.9%. Your system will reach roughly half of users by push. A single irrelevant push causes 10% of users to opt out entirely; two to five low-relevance notifications per week drive nearly half of users to disable push altogether. The notification system you design determines whether the product has a push channel in six months.

This means the preference model is not a settings page bolted on at the end. It must be per-channel and per-category, carry its own frequency controls, and be timezone-aware. Quiet hours (10pm to 8am local) are not optional. The default should be less than the user expects, not more.
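One way to picture such a preference model is the sketch below. The names (`Preferences`, `may_send`) are hypothetical, a fixed UTC offset stands in for a real timezone lookup (so daylight saving is ignored), and applying quiet hours only to push is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta, timezone


@dataclass
class Preferences:
    utc_offset_hours: int                      # simplification: fixed offset, no DST
    enabled: set = field(default_factory=set)  # (channel, category) pairs opted into
    quiet_start: time = time(22, 0)            # 10pm local
    quiet_end: time = time(8, 0)               # 8am local


def in_quiet_hours(prefs: Preferences, now_utc: datetime) -> bool:
    """Return True if now_utc falls inside the user's local quiet window."""
    local_tz = timezone(timedelta(hours=prefs.utc_offset_hours))
    local = now_utc.astimezone(local_tz).time()
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local < prefs.quiet_end
    # The window wraps past midnight (for example 22:00 to 08:00).
    return local >= prefs.quiet_start or local < prefs.quiet_end


def may_send(prefs: Preferences, channel: str, category: str, now_utc: datetime) -> bool:
    """Opt-in by default: send only for enabled pairs, and hold push in quiet hours."""
    if (channel, category) not in prefs.enabled:
        return False
    if channel == "push" and in_quiet_hours(prefs, now_utc):
        return False
    return True
```

Because `enabled` starts empty, a new user receives nothing until a channel and category pair is explicitly turned on, which is the "less than the user expects" default expressed in code.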

On channel economics: SMS costs approximately $0.01 per message. At 1M messages per day, that is $10,000 per day. A PM is expected to know when not to send the SMS, which means reserving it for truly critical transactional messages (fraud alerts, two-factor codes) and using in-app banners or email for everything else.
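The cost arithmetic and the "when not to send SMS" rule can be sketched together. The category names and the in-app fallback are assumptions made for illustration; the $0.01 figure is the approximate per-message cost quoted above.

```python
SMS_COST_USD = 0.01  # approximate per-message cost from the text

# Hypothetical category names for the truly critical transactional messages.
CRITICAL_CATEGORIES = {"fraud_alert", "two_factor_code"}


def daily_sms_cost(messages_per_day: int, cost_per_message: float = SMS_COST_USD) -> float:
    """Return the daily SMS spend in US dollars."""
    return messages_per_day * cost_per_message


def pick_channel(category: str) -> str:
    """Reserve SMS for critical categories; everything else goes in-app (assumed default)."""
    return "sms" if category in CRITICAL_CATEGORIES else "in_app"


print(f"${daily_sms_cost(1_000_000):,.0f} per day at 1M messages")
print(pick_channel("fraud_alert"), pick_channel("weekly_digest"))
```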

The 2026 complication that separates strong candidates

By the end of 2026, 40% of enterprise applications are projected to embed AI agents. Each agent can generate notification events autonomously. A single subscription renewal that previously sent one email can now cascade into five notifications across email, in-app banner, Slack, mobile push, and webhook. Microsoft data shows employees already receive 153 Teams messages and 117 emails daily, with interruptions every two minutes. A 23-minute recovery cost per interruption is not a statistic to cite for color; it is the product consequence of a poorly orchestrated notification system.

A strong answer names this: the system needs an orchestration layer that deduplicates across agent-triggered and human-triggered events before any message goes out. The question is no longer “can we send this?” It is “should we, and does the user know an agent caused it?”
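A minimal sketch of that orchestration step, under the assumption that every event carries a `topic` key identifying the underlying business event and a `source` field saying who triggered it. Both fields, and the "first event wins" rule, are hypothetical choices for illustration; a real layer would also need a time window and a channel priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    topic: str    # e.g. "renewal:sub_42"; identifies the underlying business event
    source: str   # "agent" or "human"
    channel: str


def orchestrate(events: list) -> list:
    """Collapse events to one notification per (user, topic); the first event wins."""
    chosen = {}
    for event in events:
        key = (event.user_id, event.topic)
        if key not in chosen:
            chosen[key] = event
    return list(chosen.values())


def attribution(event: Event) -> str:
    """Return a label telling the user whether an agent caused the notification."""
    if event.source == "agent":
        return "Triggered by an automated agent"
    return "Triggered by a person"
```

With this in place, the five-message renewal cascade described above would collapse to a single notification, and the user would see whether an agent or a person caused it.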

How to close

Define success as three numbers: delivery rate (did it arrive?), opt-in retention rate (is push still enabled next month?), and actionability (click-through or dismissal rate as a proxy for relevance). Throughput alone is an engineering metric. The interviewer wants to see you own the product outcome.
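Those three numbers reduce to simple ratios. The sketch below uses hypothetical counter names and made-up example counts purely to show the definitions; it is not data from any real product.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def success_metrics(sent: int, delivered: int, clicked: int,
                    push_enabled_start: int, push_enabled_end: int) -> dict:
    """Compute the three success numbers from raw counts (hypothetical inputs)."""
    return {
        "delivery_rate": ratio(delivered, sent),                         # did it arrive?
        "opt_in_retention": ratio(push_enabled_end, push_enabled_start), # still enabled?
        "click_through_rate": ratio(clicked, delivered),                 # relevance proxy
    }


# Made-up example counts, for illustration only.
metrics = success_metrics(sent=1000, delivered=950, clicked=95,
                          push_enabled_start=1000, push_enabled_end=800)
print(metrics)
```

Note that `push_enabled_end` should count only users from the starting cohort; otherwise new sign-ups can mask churn in opt-ins.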

Strong answer

"Before I design anything: is this consumer social, a workplace tool, or transactional? Those have different trust models. I'll assume a consumer social product at 50M DAU, which means roughly 250M notifications per day at five per user. The core product problem is not delivery, it's opt-in retention. Android opt-ins dropped from 85% to 67% in a year; iOS is at 43.9%. Every design choice I make is oriented around not accelerating that decline. So I'd tier channels by urgency: transactional gets push plus email, marketing gets in-app only, system alerts get push with quiet hour suppression. Rate caps are per-user and per-category, not a single global number, because a user who wants order updates doesn't want to be throttled alongside promotional messages. On reliability: at-least-once with exponential backoff for most types, idempotency keys for financial. And because AI agents now generate notification events autonomously, I'd add an orchestration layer that deduplicates across agent-triggered and human-triggered events before delivery. Success is delivery rate, opt-in retention, and click-through rate, in that order."

Weak answer

"We need Kafka for the queue, APNs for iOS, FCM for Android, a preference store, and a rate limit of 10 per hour." No scoping, no channel economics, no opt-out data, no 2026 agent context, no definition of success beyond throughput. The interviewer sees an engineering diagram with PM vocabulary sprayed on top.

The PM judgment

The interviewer is checking whether you understand that viability now depends on user trust. A notification system that erodes opt-in rates is not a functioning system at any throughput. The candidate who treats relevance as a reliability constraint, not a nice-to-have, is the one who clears the bar.