A/B test (definition)

An A/B test is a controlled experiment: a population is split randomly, each group sees one variant, and a pre-specified metric determines which wins. Most PM candidates can recite that sentence. What separates a strong-hire answer is knowing when a statistically significant result still should not ship, and what correct experiment design looks like when the product is an LLM.

The four inputs to name before any traffic is split

Interview answers that collapse this step to “run the test and check” fail the judgment question. Specify all four before launch:

Primary success metric. One number, pre-committed: D7 retention, checkout completion rate, task completion rate. Not a family of metrics.
Guardrail metrics. Metrics that must not regress. Session crash rate, p95 latency, hallucination rate for AI features. A guardrail regression blocks a ship decision even when the primary metric is statistically significant.
Sample size. Calculated from three inputs: minimum detectable effect (MDE), significance level (alpha, conventionally 0.05), and statistical power (conventionally 0.80, meaning a 20% chance of missing a real effect of at least MDE size). The formula also needs the metric's baseline rate or variance, taken from historical data. Most PM candidates cannot name all three. Netflix and Airbnb commonly target 0.90 power given the cost of missing a real effect at scale.
Run duration, pre-committed. Fixed before launch. Two to four weeks is standard for engagement features, because the novelty effect produces a short-term positive signal that decays as users adjust to anything new.
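
The sample-size input can be sketched in a few lines. This is a minimal sketch, assuming a proportion-style metric (such as D7 retention) and the standard normal-approximation formula for a two-sided two-proportion test. The function name sample_size_per_arm and the example values (40% baseline, 1-point absolute MDE) are illustrative assumptions, not figures from a real experiment.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm for a two-sided two-proportion z-test.

    baseline: control-arm rate, e.g. 0.40 for 40% D7 retention (assumed value)
    mde_abs:  minimum detectable effect as an absolute lift, e.g. 0.01
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm Bernoulli variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 40% baseline retention, 1 percentage point MDE.
    print(sample_size_per_arm(0.40, 0.01, power=0.80))
    print(sample_size_per_arm(0.40, 0.01, power=0.90))
```

Under these assumed inputs the result lands in the high tens of thousands of users per arm, and moving power from 0.80 to 0.90 raises the requirement further. Halving the MDE roughly quadruples the sample, which is why the MDE has to be chosen before launch rather than after.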

The peeking problem

Checking results mid-experiment and stopping early inflates the false positive rate: the more times you check, the more likely a spurious significant result appears by chance. Interviewers at Meta, Airbnb, and Netflix probe for this explicitly. They have publicly named it the most common failure mode from candidates who memorized a framework without understanding why the rules exist. One scheduled read-out at the pre-committed end date is the correct posture.
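
The inflation can be demonstrated with a small simulation. This is a sketch under stated assumptions: both arms share the same true conversion rate (an A/A test, so every significant result is a false positive), a pooled two-proportion z-test is run after each simulated day, and all parameter values (400 experiments, 14 days, 200 users per arm per day, 10% conversion) are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided p-value for a difference in conversion rate, n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positives(n_experiments: int = 400, days: int = 14,
                             users_per_day: int = 200, rate: float = 0.10,
                             alpha: float = 0.05, seed: int = 7):
    """Return (false positive rate with daily peeking, rate with one final read)."""
    rng = random.Random(seed)
    hits_peeking = 0
    hits_single_read = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        significant_on_any_day = False
        p = 1.0
        for day in range(1, days + 1):
            # Both arms draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            p = z_test_p_value(conv_a, conv_b, day * users_per_day)
            if p < alpha:
                significant_on_any_day = True  # a peeker would have stopped here
        hits_peeking += significant_on_any_day
        hits_single_read += p < alpha          # p from the final day only
    return hits_peeking / n_experiments, hits_single_read / n_experiments
```

Calling simulate_false_positives() returns two rates. The single-read rate should sit near the nominal alpha, while the "stop at the first significant day" rate comes out several times higher, even though nothing about the product changed.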

When a significant result still should not ship

A p-value below 0.05 is not a ship trigger. Hold even with significance when:

A guardrail metric regressed. A 1.2-percentage-point rise in crash rate blocks the ship regardless of retention lift. File the bug first.
The novelty effect likely explains the movement and the test did not run long enough to clear it.
Network effects contaminated the split. Airbnb and Uber marketplace experiments require cluster-based or switchback designs, not naive user splits, because control and variant share supply and demand.
The effect is statistically real but smaller than the minimum lift needed to justify the engineering maintenance cost.

Inconclusive results: do not re-run the same test. Form a new hypothesis or run qualitative research first.
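
The hold conditions above can be written down as an explicit decision rule. This is a hedged sketch, not a standard library or a published framework: the Readout fields, the ship_decision name, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    p_value: float
    observed_lift: float        # absolute lift on the primary metric
    min_worthwhile_lift: float  # smallest lift that justifies maintenance cost
    guardrail_regressed: bool
    ran_full_duration: bool     # reached the pre-committed end date
    split_contaminated: bool    # e.g. arms share marketplace supply and demand


def ship_decision(r: Readout, alpha: float = 0.05) -> str:
    """Apply the hold conditions before treating significance as a ship signal."""
    if r.guardrail_regressed:
        return "hold: guardrail regressed, file the bug first"
    if r.split_contaminated:
        return "hold: redesign as cluster-based or switchback"
    if not r.ran_full_duration:
        return "hold: novelty effect not cleared"
    if r.p_value >= alpha:
        return "inconclusive: form a new hypothesis or run qualitative research"
    if r.observed_lift < r.min_worthwhile_lift:
        return "hold: effect too small to justify maintenance cost"
    return "ship"
```

The design choice worth noting is that the p-value check comes fourth, not first. Guardrails, contamination, and duration are evaluated before significance is even considered.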

A/B testing for AI products

Standard experiment design breaks in specific ways for LLMs and agents.

Non-determinism. Even at temperature zero, LLMs produce variance due to floating-point precision and Mixture-of-Experts routing. A CHI 2026 paper documented this directly: standard splits undercount behavioral variance in agent experiments. AI experiments require larger samples or replicated eval runs to reach the same confidence interval as a UI experiment.
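
One way to account for run-to-run variance is to score the same variant several times and report an interval across the replicates. The sketch below assumes each replicate yields a single aggregate score (for example, a task completion rate on a fixed eval set) and uses a normal approximation. The function name replicate_ci and the scores are hypothetical, and with only a handful of replicates a t-based interval would be wider and more appropriate.

```python
import math
from statistics import NormalDist, mean, stdev


def replicate_ci(run_scores, confidence: float = 0.95):
    """Mean and normal-approximation confidence interval across replicated eval runs."""
    k = len(run_scores)
    if k < 2:
        raise ValueError("need at least two replicated runs to estimate variance")
    m = mean(run_scores)
    se = stdev(run_scores) / math.sqrt(k)  # standard error of the mean across runs
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, m - z * se, m + z * se


if __name__ == "__main__":
    # Hypothetical scores from five runs of the same prompt on the same eval set.
    scores = [0.80, 0.84, 0.78, 0.82, 0.81]
    print(replicate_ci(scores))
```

If two prompt variants have overlapping intervals at this stage, a single production run per arm is unlikely to separate them, which is the practical meaning of needing larger samples or more replicates.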

Offline eval before production traffic. In 2026, a PM can run ten prompt variants through an eval harness before lunch. Tools like Langfuse natively support prompt A/B testing with weighted traffic splits in production. Confident AI reported 50-plus research-backed metrics available for AI evaluation, including hallucination rate, latency p95, and task completion rate. Metrics of this kind belong in the guardrail column. The correct workflow is offline eval first, then production A/B only for variants that clear the offline threshold.
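
The offline-first workflow amounts to a gate in front of the production experiment. The sketch below is tool-agnostic and does not use any real eval library's API: the EvalResult fields, the variants_for_production function, and every threshold value are hypothetical placeholders that a team would replace with its own guardrails.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    variant: str
    hallucination_rate: float    # fraction of eval cases flagged as hallucinated
    latency_p95_ms: float
    task_completion_rate: float


def variants_for_production(results,
                            max_hallucination: float = 0.02,
                            max_latency_p95_ms: float = 2000.0,
                            min_task_completion: float = 0.85):
    """Return the names of variants that clear every offline threshold."""
    return [
        r.variant
        for r in results
        if r.hallucination_rate <= max_hallucination
        and r.latency_p95_ms <= max_latency_p95_ms
        and r.task_completion_rate >= min_task_completion
    ]


if __name__ == "__main__":
    # Hypothetical offline results for three prompt variants.
    offline = [
        EvalResult("v1", 0.05, 1500.0, 0.90),  # fails hallucination threshold
        EvalResult("v2", 0.01, 1800.0, 0.88),  # clears all thresholds
        EvalResult("v3", 0.01, 2600.0, 0.91),  # fails latency threshold
    ]
    print(variants_for_production(offline))
```

Only the variants returned by the gate would be assigned production traffic, which keeps known-bad prompts away from real users.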

Agent experiments. A CHI 2026 Amazon study validated that 1,000 LLM agents can reproduce directional outcomes from large-scale human experiments, the first published validation of agent simulation for A/B test design. Agents compound errors across steps, so monitor compound-step behavior, not just terminal outcomes. A small early-step divergence cascades into a meaningfully different user experience by the end of a session.
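
The compounding can be made concrete with a little arithmetic. This is a simplified sketch that assumes every step succeeds independently with the same probability, which real agents will not satisfy exactly. The function name and the example rates are illustrative.

```python
def session_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be a probability")
    return per_step_success ** steps


if __name__ == "__main__":
    # Two hypothetical variants that differ by one point per step, over 20 steps.
    print(round(session_success_rate(0.99, 20), 3))
    print(round(session_success_rate(0.98, 20), 3))
```

Under this assumption, a one-point gap per step (0.99 versus 0.98) grows to roughly a fifteen-point gap in sessions that finish without any error over 20 steps. That is why per-step metrics belong next to the terminal outcome in the read-out.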

Strong vs. weak answers

Strong

"Hypothesis: changing [variable] will lift D7 retention by [MDE] because [rationale]. Primary metric is D7 retention. Guardrail is session crash rate. Sample size comes from MDE, alpha 0.05, and power 0.80: roughly 40,000 users per arm for this expected effect size. Run duration is 21 days, pre-committed, to clear the novelty window. I read results once at day 21. Retention up and crashes flat: ship. Crashes up meaningfully: do not ship, file the bug. For a prompt variant, I run the eval harness offline first and expose production traffic only to variants that pass the hallucination and latency guardrails."

Weak

"I'd split 50/50, run for a week, and ship if p is under 0.05." This misses sample size calculation entirely (a week may be wildly insufficient or wasteful depending on traffic volume), treats p below 0.05 as a binary ship trigger without guardrail metrics, and shows no awareness of the peeking problem or novelty effect. Interviewers at Airbnb, Meta, and Netflix have named this pattern, technically plausible but decision-shallow, as the most common failure mode they see.

For the metric selection step, see North Star Metric and cohort analysis. For controlled rollout before an experiment is ready, see feature flag.
