system design · hard
Walk me through a system you designed. What were the tradeoffs?
At Stripe, this question is not a whiteboard exercise. The boxes-and-arrows sketch takes roughly 10 minutes. The next 45 minutes are a structured interrogation of every assumption in that sketch. Candidates who prepare for the drawing fail. Candidates who prepare for the failure probes pass.
What Stripe is actually evaluating
Stripe interviewers care about API design and developer experience more than raw scale metrics. The internal bar, confirmed by multiple prep sources and firsthand accounts, is whether you can “reason about system properties.” That means: when a caller misuses your endpoint, when a downstream service goes down, when one tenant’s traffic floods the shared worker pool, does your architecture hold? Or does it only work on the happy path?
The question is not asking you to prove engineering depth. It is asking you to prove you owned a real architectural decision: that you understood the constraints, chose among genuine options, and can explain what broke in production and why.
There are four follow-up probe types that Stripe interviewers reliably escalate through:
- External endpoint failure. What happens when the merchant’s webhook receiver returns a 500, or times out entirely? Fire-and-forget is wrong. How does your retry logic work and what’s the backoff curve?
- Noisy neighbor. A single high-volume tenant with a slow endpoint is starving the shared worker pool. How does your architecture isolate tenants? Per-tenant queues? Circuit breakers? Rate limits per integration?
- Security boundary. Could a malicious actor craft an endpoint URL that causes your delivery system to make requests inside your internal network? This is an SSRF-class vulnerability. Did you validate destination URLs? Allowlist them?
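- Exactly-once delivery. Can you guarantee a webhook fires exactly once? The correct answer is no. Over an unreliable network the sender cannot tell a lost request from a lost acknowledgment, so it must either retry (and risk a duplicate) or give up (and risk a drop). The workable design is at-least-once delivery plus idempotency keys on the receiver side, so that duplicates are harmless. If you claim you can guarantee exactly-once in the queue, you fail this probe.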
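The last two probes have fairly concrete answers, and a small sketch can make them easier to talk through. The Python below is illustrative only: `is_safe_destination`, `IdempotentReceiver`, and the `evt_123` identifier are hypothetical names, not Stripe's API. It assumes the caller has already resolved the hostname and passes the addresses in. A real sender would also need to connect to the same address it validated, otherwise a DNS rebinding attack can swap the target after the check.

```python
import ipaddress
from typing import Any, Callable
from urllib.parse import urlsplit


def is_safe_destination(url: str, resolved_ips: list[str]) -> bool:
    """Return True only for HTTPS URLs whose resolved addresses are all public.

    `resolved_ips` is passed in (an assumption of this sketch) so the check
    can be exercised without DNS.
    """
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    if not resolved_ips:
        return False
    # is_global is False for loopback, private, link-local (including the
    # 169.254.169.254 metadata address), and other non-public ranges.
    return all(ipaddress.ip_address(ip).is_global for ip in resolved_ips)


class IdempotentReceiver:
    """Receiver-side dedup: apply each event ID at most once.

    The in-memory dict stands in for a durable store with a uniqueness
    constraint on the event ID.
    """

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def handle(self, event_id: str, payload: Any,
               apply: Callable[[Any], Any]) -> Any:
        if event_id in self._results:
            # Duplicate delivery: return the stored result, skip the side effect.
            return self._results[event_id]
        result = apply(payload)
        self._results[event_id] = result
        return result


# Usage: the sender retries the same event three times.
charges: list[dict] = []
receiver = IdempotentReceiver()
for _ in range(3):
    receiver.handle("evt_123", {"amount": 500}, charges.append)
```

After the loop, `charges` holds one entry rather than three, which is the property that makes sender-side retries safe.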
At staff level, interviewers also expect you to treat rollout, observability, and cross-team alignment as part of the answer, not an afterthought you mention when prompted.
How to pick the right system
Pick a system where you made a consequential architectural call you can defend with reasoning, not just outcome. The system should have at least one real failure mode you discovered in production, and at least one decision where you chose among two or more genuine options with different tradeoffs.
Avoid systems where you were an observer. If you are describing what the team built rather than a decision you owned, the interviewer will surface this quickly: “Why did you choose that approach over X?” If you don’t have an answer grounded in the constraint you were solving, you are describing someone else’s design.
Good candidates for the system: a webhook delivery pipeline, a rate limiter for an external API, an event fan-out system, an async job queue with retry semantics, a real-time pricing or rules engine. In 2026, LLM pipelines, RAG layers, and agent loops also qualify and often produce richer tradeoff conversations (see below).
The narrative arc
The structure that works has five beats: the constraint that forced a decision, the options you evaluated, the specific tradeoff you made and why, what broke or surprised you in production, and what you’d do differently.
strong
"We built a webhook delivery system for third-party integrations. The constraint was guaranteed delivery with acceptable latency: we couldn't use fire-and-forget, but we also couldn't block the main payment processing path. We evaluated three approaches: synchronous delivery (too risky, one slow merchant could delay our pipeline), polling (too much latency for event-driven use cases), and an async queue with retry. We chose the queue with exponential backoff and jitter. The tradeoff we underestimated was tenant isolation. A single high-volume merchant with a slow endpoint was starving the shared worker pool within two weeks of launch. We patched it with per-tenant queues and circuit breakers, but the real lesson was that the happy-path architecture looked sound and the failure-mode architecture was completely different. If I did it again, I'd design for the noisy neighbor from day one. On idempotency: we required receivers to implement idempotency keys because exactly-once delivery isn't something we can guarantee at the network layer. That decision protected us from duplicate-charge bugs when we retried."
weak
"I designed a notification system for our mobile app. We had a backend service that sent push notifications via Firebase. The tradeoff was between push and pull. We went with push because it's more real-time. It worked well and engagement went up." This fails on every Stripe dimension. No failure mode reasoning. No API design thinking. No delivery guarantee discussion. "Engagement went up" is not a tradeoff, it is a vague outcome. The first follow-up ("what happens when Firebase is unavailable?") ends the answer because the candidate never thought past the happy path. The story is also self-contained in a way that signals the PM was a bystander, not the decision-maker.
The 2026 reframe: AI systems are valid systems
Before 2025, “system you designed” almost always meant infrastructure: queues, caches, databases, APIs. In 2026, many PM candidates’ most recent system is an LLM pipeline, a RAG layer, an agent loop, or an eval harness. These are valid and often produce richer tradeoff conversations because the vocabulary maps directly to business outcomes.
Tradeoffs worth naming in an AI system: latency vs. accuracy (fast cheap model vs. slow accurate one), retrieval precision vs. recall (tight context windows miss relevant chunks; loose retrieval adds noise), human-in-the-loop vs. full automation (where is the right guardrail and why), cost-per-query as a direct infrastructure constraint on unit economics.
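The cost-per-query tradeoff is easier to defend with arithmetic than with adjectives. The Python sketch below shows the calculation. The function names are hypothetical and the token counts and prices are made-up placeholders, not any vendor's actual rates.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_1k: float,
                   output_price_per_1k: float) -> float:
    """Model cost of one query, given per-1,000-token prices."""
    return (input_tokens / 1000) * input_price_per_1k \
        + (output_tokens / 1000) * output_price_per_1k


def gross_margin(revenue_per_query: float, cost: float) -> float:
    """Fraction of per-query revenue left after model cost."""
    if revenue_per_query <= 0:
        raise ValueError("revenue_per_query must be positive")
    return (revenue_per_query - cost) / revenue_per_query


# Placeholder numbers: 2,000 input tokens and 500 output tokens per query.
cheap = cost_per_query(2000, 500, input_price_per_1k=0.001,
                       output_price_per_1k=0.002)
# A hypothetical larger model at ten times the price for the same traffic.
large = cost_per_query(2000, 500, input_price_per_1k=0.01,
                       output_price_per_1k=0.02)
```

With these placeholder figures and one cent of revenue per query, the cheaper model leaves a positive margin and the larger one does not. That is the kind of statement an interviewer can push on.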
The probe that matters most in 2026: when you made the system more capable, did you make it more viable? Feasibility is no longer the constraint. The interesting tradeoffs are whether the infrastructure costs support a margin-positive business, and whether the system’s behavior (speed, predictability, error handling) actually serves the person using it, not just the demo. A Stripe interviewer probing an LLM-based system will ask the same escalating failure questions: what happens when the model returns a malformed response, when latency spikes, when one tenant’s query volume consumes the context budget of others?
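For the malformed-response probe, one defensible answer is to validate, retry a bounded number of times, then degrade explicitly. A minimal Python sketch of that pattern follows. `call_model` is a hypothetical stand-in for whatever client the system uses, and the `answer` and `confidence` fields are an assumed response schema.

```python
import json
from typing import Callable, Optional


def parse_model_reply(raw: str,
                      required: tuple = ("answer", "confidence")) -> Optional[dict]:
    """Return the parsed reply, or None if it is not a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required):
        return None
    return data


def answer_with_fallback(call_model: Callable[[str], str], prompt: str,
                         max_attempts: int = 2) -> dict:
    """Retry a bounded number of times, then return an explicit fallback."""
    for _ in range(max_attempts):
        parsed = parse_model_reply(call_model(prompt))
        if parsed is not None:
            return parsed
    # Degrade visibly instead of passing malformed output downstream.
    return {"answer": None, "confidence": 0.0, "fallback": True}
```

The point to make in the interview is that the fallback is a product decision as much as an engineering one: the caller gets a well-formed result it can branch on, and the retry bound keeps a misbehaving model from multiplying cost and latency.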
Own the tradeoff reasoning. That is the whole interview.
For more on the Stripe loop format and what clears the bar at each round, see Stripe’s interview process. For the idempotency question that often appears alongside this one, see GET, POST, PUT, DELETE: idempotency.