execution · standard

Phased rollout vs full launch: how do you decide?

How do you decide between a phased rollout and a full launch?

Updated Jun 2026 Calibrated to the strong-hire bar

The interviewer is not checking whether you know what a canary release is. They are checking whether you can reason about risk under pressure, name the variables that actually matter, and hold your ground when a stakeholder says “just ship it.” Most candidates answer this by listing generic tradeoffs. Strong candidates apply a ranked decision framework to the specific feature in front of them.

The four variables, in order

Before naming a percentage or a timeline, work through these four questions out loud:

  1. Reversibility of the harm. If this feature breaks, can you undo what happened? A broken recommendation widget is annoying. A payment confirmation bug that double-charges customers, or an agentic feature that sent emails on users’ behalf, is not reversible. Irreversible harm always phases. This is not a judgment call.

  2. Blast radius. How many users does a bad outcome touch before you can stop it? A full launch means 100% of users are exposed to the first failure. A 1% canary limits that to 1%. The question is not “do we accept risk” but “how many users does risk touch before we can stop it?”

  3. Signal quality at partial traffic. Can you learn what you need to learn at 1% or 5%? For most conventional features, yes. For rare-event detection (fraud patterns, accessibility failures, long-tail edge cases), you may need much larger N to get meaningful signal. If partial traffic produces noisy or incomplete data, phasing delays your ability to decide and may itself be the riskier choice.

  4. Cost of delay. Phasing takes time. If a competitor is shipping the same thing, or if a regulatory window closes, or if the feature is purely additive with no side effects, the cost of delay may outweigh the risk of shipping wide. Phasing is not always safer, it is sometimes just slower.

Structure a strong answer

strong

"The right choice depends on four variables I always check in order: reversibility of the harm, blast radius at each phase, the quality of signal I get at partial traffic, and the cost of delay. Let me apply them to this feature specifically.

If the harm is irreversible (say this feature modifies user data, initiates financial transactions, or takes actions on a user's behalf), I phase regardless of schedule pressure. You can't unsend an email.

If the harm is reversible, I look at blast radius. I'd want a kill switch instrumented before the first user sees this: a feature flag that halts the rollout in under 15 minutes, no code deploy required. Then I'd run internal dogfood, move to a closed beta with opt-in power users, then a 1–5% random slice with pre-defined gate criteria before widening. At each gate, I need a specific guardrail metric threshold: not just 'error rate looks fine,' but 'if p99 latency exceeds X ms or error rate crosses Y%, we stop and investigate before advancing.'

If there's pressure to skip the phase, I'd reframe it: a full rollout that fails and triggers an emergency rollback takes longer and costs more user trust than a two-week canary. The question is whether we want to learn on 5% or on 100%.

That said, I'd push for a full launch if the feature is purely additive, opt-in, with no side effects on other users' data, and we need large N quickly to get meaningful signal. Phasing isn't automatically safer."

weak

"It depends on the risk level. I'd start with a small group and expand if things look good." This lists concepts without applying them. It conflates phasing with caution and never mentions what "things looking good" actually means in metric terms. It doesn't distinguish reversible from irreversible harm, doesn't name a guardrail metric, doesn't address the kill switch, and will break under any follow-up pressure.

The vocabulary that signals senior thinking

A few terms that carry weight when used correctly:

  • Guardrail metric: a metric you are not trying to move, but monitor to catch collateral damage. Distinct from the success metric. Candidates who only name success metrics are answering a weaker version of this question.
  • Kill switch: a feature flag that can halt the rollout in under 15 minutes with no code deploy. Any risky feature needs one before the first user sees it.
  • Rollback SLA: “we can revert in 15 minutes” is a different risk profile than “we need a two-week migration to undo.” State it explicitly.
  • Canary release vs. feature flag: these are different tools. A canary release routes a percentage of live traffic to a new deployment version. A feature flag toggles a feature within the same deployment. Conflating them is a junior tell. Google’s Limelight, Meta’s Gatekeeper, and Amazon’s Weblab are internal feature flagging systems, knowing these names signals technical fluency without being gratuitous.

The AI-specific layer

In 2026, AI features add a complication that most prep resources miss. A conventional feature either works or it doesn’t, and you see it in error rates within minutes. An LLM-powered feature can work for 99.5% of prompts and silently fail on a long-tail of edge cases that only appear at scale, in specific languages, or under adversarial use. A 1% canary may not surface those at all.

For AI features, partial traffic monitoring is necessary but not sufficient. You also need a parallel eval harness running adversarial inputs alongside the rollout. The guardrail metrics change too: hallucination rate, refusal rate, and per-query cost are not relevant to conventional features but are critical for LLM features. A per-query cost spike under load can make a full launch financially damaging even if it is technically stable.

For agentic features specifically, the reversibility question dominates everything else. An agent that sends emails, books appointments, or executes transactions on a user’s behalf creates outcomes that cannot be undone. For these, a phased rollout is not a conservative choice, it is the minimum viable safety protocol. Name the reversibility check before you name any percentage.

The PM judgment

The interviewer is testing risk reasoning, not process recitation. They want to see that you can name the specific failure mode for this specific feature, quantify how bad it is and for how many users, and define the exact conditions under which you would stop. Show that you think about launches and about failures.

Asked at