How would you design an A/B test for a core product flow?

How to structure an A/B test answer in a PM interview: hypothesis, metrics, guardrails, and the 2026 wrinkle of non-deterministic AI features.

Design an A/B test for a core product flow

This question is a competency filter. Interviewers at Meta, Airbnb, and Stripe run experiments constantly; they are not checking whether you know the word “hypothesis.” They are checking whether you know what breaks a real experiment and what to do when significance is reached but the answer is still ambiguous.

Five beats, in order

1. Hypothesis (30 seconds). Slot your thinking into this shape: “We believe that [specific change] for [segment] will increase [primary metric] by at least [MDE] because [mechanism]. Our assumption is that [the friction we’re removing] is the binding constraint.” The minimum detectable effect is not optional. If you cannot name the smallest lift worth shipping, you cannot compute a sample size, and experienced interviewers will probe exactly that.

2. Metrics (60 seconds). Name one primary metric, two supporting metrics, and two guardrail metrics with explicit thresholds set before the test starts. For a checkout flow: primary is checkout completion rate, supporting metrics are time-to-complete and add-to-cart-to-purchase rate, guardrails are payment error rate (kill if it rises more than 2% relative) and support tickets filed within 24 hours of checkout (kill if it rises more than 5% relative). Setting guardrail thresholds after results arrive is p-hacking by a different name.

3. Design (60 seconds). Randomize at the user level, not the session level. With session randomization the same user can see both experiences across visits, which contaminates both arms, blurs the measured effect, and violates the independence your significance test assumes. For marketplace or social-sharing flows, flag network interference: if variant users and control users interact (referrals, shared listings, marketplace matching), a standard user-level split gives a biased estimate of the true effect. Propose cluster randomization or a geo holdout instead. Run for a minimum of two full calendar weeks to capture the weekly cycle; a Monday-to-Friday run produces a biased sample for consumer products. For safety-critical paths, expose 10% of users to the variant and keep 90% in control, with a kill switch, rather than splitting 50/50.

4. Analysis (30 seconds). Pre-commit to the analysis date and do not peek. Checking results repeatedly and stopping at the first significant reading inflates your false-positive rate; use sequential testing methods if early stopping is genuinely required. After significance is reached, segment by new versus returning user, device type, and your top user cohorts before declaring a winner. A variant that wins in aggregate can quietly hurt mobile users or new users. Also check for novelty effect: if the lift is concentrated in week one, plan to keep a holdout group for 30 days after launch before rolling out to 100%.

5. Decision (30 seconds). Four outcomes: primary wins with no guardrail trips and consistent segment results means staged rollout; primary wins but a guardrail trips means do not ship and investigate; null result means evaluate whether the MDE was right or the change was too small; inconclusive means run longer or increase traffic. “Statistically significant” is not a ship decision on its own.
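Beats 1, 3, and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production experimentation stack: the function names, the 60% baseline completion rate, the 5% significance level, and the 80% power are all hypothetical, and the rule used to separate "null" from "inconclusive" (whether the confidence interval rules out the MDE) is one reasonable convention rather than something the beats above prescribe. The segment-consistency check from beat 4 is left out for brevity.

```python
import hashlib
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Users needed per arm to detect an absolute lift of mde_abs in a rate.

    Uses the standard two-sided, two-proportion normal approximation.
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def assign_variant(user_id, experiment, treatment_share=0.5):
    """User-level bucketing: hashing the user ID keeps a user in one arm
    across every session and visit."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def decide(ci_low, ci_high, mde_abs, guardrail_changes, guardrail_limits):
    """Map a result onto the four outcomes.

    ci_low / ci_high: confidence interval for the absolute lift in the
    primary metric. guardrail_changes / guardrail_limits: relative rise
    observed and the pre-registered kill threshold, keyed by metric name.
    """
    tripped = [m for m, rise in guardrail_changes.items()
               if rise > guardrail_limits[m]]
    if tripped:
        return "do not ship: investigate " + ", ".join(tripped)
    if ci_low > 0:
        return "staged rollout"
    if ci_high < mde_abs:  # assumption: the interval rules out the MDE
        return "null: revisit the MDE or the size of the change"
    return "inconclusive: run longer or increase traffic"


# Hypothetical checkout example: 60% baseline, 1.5-point MDE.
limits = {"payment_error_rate": 0.02, "support_tickets_24h": 0.05}
observed = {"payment_error_rate": 0.01, "support_tickets_24h": 0.08}
print(sample_size_per_arm(baseline=0.60, mde_abs=0.015))
print(assign_variant("user-123", "checkout-fee-breakdown"))
print(decide(0.004, 0.031, 0.015, observed, limits))
```

The last call returns a "do not ship" result even though the interval sits above zero, because the support-ticket guardrail exceeded its pre-set 5% limit. Passing `treatment_share=0.1` to `assign_variant` gives the smaller exposure suggested for safety-critical paths.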

The 2026 wrinkle: testing AI features

If the core flow includes a generative step (RAG search, LLM-powered support, AI recommendation), the variant is a distribution, not a fixed treatment. The same user, same prompt, same session can get a different response on different days. This breaks the assumption of a single, well-defined treatment that underpins standard A/B testing: users in the variant arm are not all receiving the same thing.

The fix has four parts: freeze the model version and system prompt for the full test window; use eval scoring (automated rubric or LLM-as-judge) on a sampled output set as your quality metric rather than relying solely on click or completion signals; add user retry rate and override rate as leading indicators of quality degradation; and treat latency and cost per query as required guardrail metrics. A variant that scores higher on quality but costs three times more per inference is a trade-off to present to leadership, not an automatic ship decision.
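The quality, leading-indicator, and guardrail metrics above could be rolled up per arm roughly as follows. This is a hedged sketch: the `QueryLog` fields, the 0-to-1 eval score scale, and the 10% latency and 25% cost limits are hypothetical placeholders, and the comparison is a plain difference of summaries with no significance testing.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class QueryLog:
    eval_score: float  # rubric or LLM-as-judge score, assumed to be in [0, 1]
    retried: bool      # user re-ran the query
    overridden: bool   # user discarded or edited the AI output
    latency_ms: float
    cost_usd: float


def summarize(logs):
    """Per-arm summary of quality, leading indicators, and guardrails.

    Needs at least two log entries for the percentile calculation.
    """
    n = len(logs)
    return {
        "eval_score": mean(log.eval_score for log in logs),
        "retry_rate": sum(log.retried for log in logs) / n,
        "override_rate": sum(log.overridden for log in logs) / n,
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        "p95_latency_ms": quantiles([log.latency_ms for log in logs], n=20)[-1],
        "cost_per_query_usd": mean(log.cost_usd for log in logs),
    }


def compare(control, treatment, max_latency_rise=0.10, max_cost_rise=0.25):
    """Return the quality delta plus any guardrails or indicators that moved
    the wrong way. The limits are hypothetical and would be fixed before launch."""
    def rel(key):
        return treatment[key] / control[key] - 1

    flags = []
    if rel("p95_latency_ms") > max_latency_rise:
        flags.append("latency")
    if rel("cost_per_query_usd") > max_cost_rise:
        flags.append("cost")
    if treatment["retry_rate"] > control["retry_rate"]:
        flags.append("retry_rate")
    if treatment["override_rate"] > control["override_rate"]:
        flags.append("override_rate")
    return {
        "quality_delta": treatment["eval_score"] - control["eval_score"],
        "flags": flags,
    }


# Hypothetical logs: the variant scores higher but costs three times as much.
latencies = [100.0, 110.0, 120.0, 130.0]
control_logs = [QueryLog(0.70, False, False, ms, 0.01) for ms in latencies]
variant_logs = [QueryLog(0.80, False, False, ms, 0.03) for ms in latencies]
print(compare(summarize(control_logs), summarize(variant_logs)))
```

In this example the variant improves the eval score but trips the cost flag, which is the trade-off case that goes to leadership rather than straight to rollout.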

Strong answer

"I'd anchor this on Stripe's checkout confirmation step. Hypothesis: displaying an itemized fee breakdown before the final submit button will increase checkout completion rate by at least 1.5 percentage points for first-time users on mobile, because fee uncertainty is the primary abandonment driver in exit surveys. Primary metric: checkout completion rate. Guardrails: payment error rate and support contact rate, both with pre-set kill thresholds. Randomize at user ID, run two full weeks, segment by new versus returning and by device before calling a winner. If the primary metric wins but support volume ticks up, I do not ship until I understand why."

Weak answer

"I'd set up a control and variant, run until I hit statistical significance, then ship the winner." No primary metric named. No guardrails. No randomization unit. No novelty check. Interviewers who run experiments will immediately ask "what if support volume spikes?" and this answer has nowhere to go.

What the interviewer is actually scoring

They want to see that you treat “statistically significant” as a necessary condition, not a sufficient one. The candidates who clear the bar at tier-1 companies name guardrail thresholds before the test starts, flag network interference when the flow involves social or marketplace dynamics, and, in 2026, acknowledge that a generative feature is a distribution and requires eval scoring alongside behavioral metrics. Those three things separate a PM who has shipped experiments from one who has read about them.

The eval-scoring layer is the same underlying skill you need for the non-deterministic AI variant of this question and for questions about how to measure success on a new AI product.
