ai · hard

How do you A/B test a non-deterministic AI feature?

How would you design an A/B test for an AI feature whose outputs are non-deterministic?

Updated Jun 2026 · Calibrated to the strong-hire bar

This question has a clean tell: candidates who know standard A/B testing but have never actually run an AI experiment answer confidently and incorrectly. The non-determinism is not a footnote to manage; it is the central design problem. There are three distinct variance sources to control before the experiment begins, and the interviewer at an AI-first company in 2026 expects you to name all three.

The three variance sources

Most candidates only think about one. Name them explicitly:

  • Model variance: the same prompt can return different outputs across calls, even at temperature 0. Non-associative floating-point operations, hardware configuration, and mixture-of-experts routing via batch composition all introduce variance even with a seed set (arXiv 2602.07150). This is not a solvable problem; it is a controllable one.
  • Evaluation variance: if you use an LLM judge to score output quality (which you should), that judge is also non-deterministic. A single LLM judge has plus-or-minus 10 to 14 percent variance depending on provider, based on Scale AI data across GPT-4o, Claude, and Gemini. A panel of three judges with semantically equivalent prompts cuts that variance by at least 50 percent.
  • Population variance: different user segments have different priors on AI quality. Power users of the existing product may notice degradation that new users never see. Segment-level analysis is not optional.
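To make the judge-panel claim concrete, here is a minimal simulation sketch. It assumes each judge's score is the true quality plus independent Gaussian noise; the noise level (`judge_sd`), the true score, and the function name are illustrative assumptions, not figures from the cited data. Under that independence assumption, averaging three judges removes about two thirds of the variance. Real judges that share a base model or prompt style are correlated, so the reduction in practice will be smaller.

```python
import random
import statistics


def simulate_judge_variance(n_judges, n_trials=20000, true_score=0.7,
                            judge_sd=0.12, seed=0):
    """Variance of the panel-mean score when each judge adds independent noise.

    All parameters are hypothetical; judge_sd is an illustrative noise level.
    """
    rng = random.Random(seed)
    panel_means = []
    for _ in range(n_trials):
        # Each judge scores the same output with its own independent noise.
        scores = [rng.gauss(true_score, judge_sd) for _ in range(n_judges)]
        panel_means.append(statistics.fmean(scores))
    return statistics.pvariance(panel_means)


single_judge_var = simulate_judge_variance(1)
panel_of_three_var = simulate_judge_variance(3)
# Fraction of the single-judge variance removed by the panel.
reduction = 1 - panel_of_three_var / single_judge_var
print(f"variance reduction from a 3-judge panel: {reduction:.0%}")
```

If the measured reduction on your own judges is well below what independence predicts, that is a sign the judges are correlated and the panel needs more diverse models or prompts.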

Structure a strong answer

strong

"Before touching live traffic, I'd run an offline eval on a golden set. The new variant needs to clear a quality gate scored by a panel of at least three LLM judges with semantically equivalent prompts; that cuts judge variance by more than half compared to a single judge. This is not an alternative to the online A/B test: the offline eval gates model quality, and the online test measures whether quality translates to user value. Both are required. Then I'd lock the experiment config: same model version, same prompt hash, same retrieval snapshot, same temperature. Any of those drifting mid-experiment changes what the treatment actually is and makes the results uninterpretable. For the online experiment design, I'd choose the method based on the feature type. Classic A/B for independent user decisions like an AI reply suggestion. Interleaving for ranking or recommendation AI, where user taste is the confound and you can run with five to ten times fewer users by showing results from both variants in the same session. Shadow deployment for agentic or high-stakes features where you cannot risk live harm. My metric hierarchy: task completion or goal achievement is the north star; LLM-as-judge quality score on a held-out session sample is the quality gate; return rate over seven days is the lovability signal, because an AI feature can lift engagement while quietly eroding trust; cost-per-quality-completion is the viability floor. For variance reduction I'd use LLM-featurized covariates, specifically query intent and session type, as CUPED-style control variates to improve precision. This is a published technique as of 2026 (arXiv 2606.08853): LLMs featurize unstructured input to create control variates that reduce noise in the experiment. I'd also define the kill condition up front: if the quality judge score drops below threshold in any major user segment mid-experiment, I stop regardless of what the north star shows. CTR going up while trust erodes is not a win."
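The CUPED-style adjustment mentioned in the answer above can be sketched as follows. This is the standard single-covariate CUPED formula applied to synthetic data, not the method from the cited paper. The covariate `x` stands in for a numeric feature that an LLM might derive from the query before treatment (for example an intent or difficulty score); that feature, the effect size, and the noise levels are all made-up assumptions. The one requirement the sketch relies on is that the covariate is computed from pre-treatment information, so it cannot be affected by the variant.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / len(x)
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


def effect(values, treated):
    """Difference in means: treatment minus control."""
    t = [v for v, is_t in zip(values, treated) if is_t]
    c = [v for v, is_t in zip(values, treated) if not is_t]
    return statistics.fmean(t) - statistics.fmean(c)


# Synthetic experiment (all numbers are illustrative assumptions).
rng = random.Random(42)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # hypothetical pre-treatment covariate
treated = [i % 2 == 0 for i in range(n)]      # alternating assignment for the demo
true_lift = 0.1
y = [0.8 * xi + (true_lift if t else 0.0) + rng.gauss(0.0, 0.5)
     for xi, t in zip(x, treated)]

y_adj = cuped_adjust(y, x)
print("raw variance:     ", round(statistics.pvariance(y), 3))
print("adjusted variance:", round(statistics.pvariance(y_adj), 3))
print("adjusted effect:  ", round(effect(y_adj, treated), 3))
```

The adjusted metric keeps the same expected treatment effect but has lower variance, so the same traffic yields a tighter confidence interval. How much it helps depends entirely on how strongly the covariate predicts the outcome.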

weak

"I'd run a standard 50/50 split for two weeks, use a large sample size to account for the randomness, and check if engagement went up at p less than 0.05." Interviewers at AI-first companies have heard this answer from candidates who read the Wikipedia entry the night before. It ignores evaluation variance (the metric itself is noisy), assumes the model is stable mid-experiment, skips the offline eval stage entirely, and mistakes engagement for quality. A feature that nudges users to click more while generating worse outputs will pass this test and fail in production.

The offline eval is not optional

The weak candidate treats offline evals and online A/B tests as alternatives: “we can test in production.” The strong candidate knows they are sequential. The offline eval on a golden set catches whether the model variant is qualitatively better before it touches live users. The online test then answers a different question: does that quality improvement produce user behavior change? Skipping the offline stage means your A/B test conflates model quality with user response, and you cannot isolate which is driving the result.
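One way to express the offline gate is a paired comparison on the golden set: score both variants on the same items and require the lift to be reliably above zero. The sketch below uses a percentile bootstrap over per-item score differences. The function name, the bootstrap approach, and the example scores are assumptions for illustration; the scores would in practice be panel-averaged judge scores.

```python
import random
import statistics


def offline_gate(baseline_scores, variant_scores, min_lift=0.0,
                 n_boot=2000, seed=0):
    """Pass if the lower end of a 95% bootstrap interval on mean lift > min_lift.

    Scores must be paired: index i in both lists is the same golden-set item.
    """
    if len(baseline_scores) != len(variant_scores):
        raise ValueError("scores must be paired per golden-set item")
    diffs = [v - b for b, v in zip(baseline_scores, variant_scores)]
    rng = random.Random(seed)
    # Resample the per-item differences with replacement and collect the means.
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lower_bound = boot_means[int(0.025 * n_boot)]
    return lower_bound > min_lift, lower_bound


# Hypothetical panel-averaged scores on a tiny golden set.
baseline = [0.62, 0.70, 0.55, 0.81, 0.66, 0.74]
variant = [0.68, 0.73, 0.61, 0.83, 0.70, 0.79]
passed, lower = offline_gate(baseline, variant)
print("gate passed:", passed, "lower bound on lift:", round(lower, 3))
```

A real golden set would be far larger than six items. The pairing matters: comparing the two variants on the same items removes item difficulty from the comparison.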

When A/B testing is the wrong tool

Interviewers at search and recommendation companies such as Google and Perplexity expect you to know when to reach for interleaving instead. For AI ranking features, user taste is a massive confound: the same user responding differently on different days can swamp a small model improvement signal. Interleaving shows results from both variants in a single session and controls for that taste variance directly. You need five to ten times fewer users to reach the same power. For agentic features that write emails, execute code, or modify account state, shadow deployment is the right starting point: you run the new model in parallel, log its outputs, and compare to the human or baseline path without any user exposure.
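Interleaving is easier to reason about with a concrete merge procedure. The sketch below follows the team-draft idea: the two rankers take turns picking their best not-yet-shown item, a coin flip breaks ties on who goes next, and each click is credited to the ranker that contributed the clicked item. The text does not say which interleaving variant to use, so treat this choice, the function names, and the example documents as assumptions.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Merge two rankings; return the merged list and each item's team."""
    rng = random.Random(seed)
    merged, team_of = [], {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    total_items = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total_items:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = "A" if rng.random() < 0.5 else "B"
        pick = next((d for d in rankings[team] if d not in team_of), None)
        if pick is None:
            # This ranking is exhausted; the other team fills the slot.
            team = "B" if team == "A" else "A"
            pick = next(d for d in rankings[team] if d not in team_of)
        merged.append(pick)
        team_of[pick] = team
        picks[team] += 1
    return merged, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the team with more clicks wins the session."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


merged, team_of = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], seed=7
)
wins = credit_clicks(["d5", "d2", "d6"], team_of)
print(merged, wins)
```

Because every user sees items from both rankers in the same session, each session yields a within-user preference, which is where the sample-size saving comes from.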

The metric hierarchy in 2026

The PM who measures only CTR is measuring feasibility (the model can produce output the user clicks). The PM who layers in the seven-day return rate and cost-per-quality-completion is also measuring lovability (users trust it enough to come back, because the AI met them where they were without being obnoxious about it) and viability (the unit economics work). In 2026, feasibility is assumed. Interviewers are testing whether you know what to measure beyond it.
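The text does not define cost-per-quality-completion precisely, so the sketch below assumes one plausible reading: total inference spend divided by the number of sessions that both completed the task and cleared the judge-score threshold. The `Session` fields, the threshold, and the example costs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    cost_usd: float      # inference cost attributed to the session
    completed: bool      # did the user complete the task?
    judge_score: float   # panel-averaged quality score in [0, 1]


def cost_per_quality_completion(sessions, quality_threshold=0.7):
    """Total spend divided by completions that also clear the quality bar."""
    total_cost = sum(s.cost_usd for s in sessions)
    good = sum(
        1 for s in sessions
        if s.completed and s.judge_score >= quality_threshold
    )
    # No qualifying completions means the cost per good outcome is unbounded.
    return total_cost / good if good else float("inf")


sessions = [
    Session(cost_usd=0.02, completed=True, judge_score=0.90),
    Session(cost_usd=0.03, completed=True, judge_score=0.50),   # completed, low quality
    Session(cost_usd=0.01, completed=False, judge_score=0.80),  # abandoned
    Session(cost_usd=0.04, completed=True, judge_score=0.75),
]
cpqc = cost_per_quality_completion(sessions)
print(f"cost per quality completion: ${cpqc:.3f}")
```

Note that the numerator includes the cost of failed and low-quality sessions. That is deliberate in this reading: a variant that burns tokens on outputs nobody can use should look expensive.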

The kill condition most candidates omit

Defining when you stop the experiment is as important as defining when you ship. Set the kill condition before the experiment runs: a quality judge score below threshold in any major segment, or a seven-day return rate that diverges from the control by more than a defined margin, triggers a stop regardless of north star movement. If you cannot articulate the kill condition, you have not designed the experiment.
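Writing the kill condition as a function before launch forces the thresholds to be explicit. The sketch below is one possible encoding: the segment names, threshold values, and function signature are hypothetical, and it reads "diverges from the control" as a drop in the treatment's return rate, which is an assumption.

```python
def should_kill(segment_judge_scores, quality_floor,
                control_return_rate, treatment_return_rate,
                max_return_rate_drop):
    """Return (kill, reasons). North-star movement is deliberately not an input."""
    reasons = []
    # Condition 1: judge score below the floor in any major segment.
    for segment, score in sorted(segment_judge_scores.items()):
        if score < quality_floor:
            reasons.append(
                f"judge score {score:.2f} below floor {quality_floor:.2f} "
                f"in segment '{segment}'"
            )
    # Condition 2: seven-day return rate falls too far below control.
    drop = control_return_rate - treatment_return_rate
    if drop > max_return_rate_drop:
        reasons.append(
            f"7-day return rate down {drop:.3f} vs control "
            f"(margin {max_return_rate_drop:.3f})"
        )
    return bool(reasons), reasons


kill, reasons = should_kill(
    segment_judge_scores={"power_users": 0.62, "new_users": 0.80},
    quality_floor=0.70,
    control_return_rate=0.40,
    treatment_return_rate=0.39,
    max_return_rate_drop=0.03,
)
print(kill, reasons)
```

Leaving the north star out of the function signature mirrors the rule in the text: a stop is triggered regardless of how the primary metric is moving.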