ai · hard

"How would you validate an AI model before shipping it?"

Updated Jun 2026 · Calibrated to the strong-hire bar

This question tests whether you treat validation as a phased discipline or a launch checklist. The failure mode is rattling off metrics (precision, recall, F1, latency) without explaining who decides if an output is correct, what thresholds block the ship, or how online signals connect to business outcomes. Interviewers at Anthropic and OpenAI have explicitly said: if a candidate can’t name their actual metric and explain why they chose that threshold, the interview is over in their minds. They’re checking whether you owned the work or read about it.

In 2026, the harder question is not “does the model work?” That bar is cheap to clear. The real validation question is: does this model solve something users are willing to pay for (viable), and does it meet them where they are without being obnoxious (lovable)? Technical accuracy is the floor. Prove the floor, then prove the ceiling.

Structure a strong answer

Four layers, in order. Each has a clear handoff to the next.

Layer 1: define what good looks like before touching the model. Write a spec precise enough that two domain experts reach the same pass/fail verdict independently. If two experts disagree, you don’t have a validation problem yet. You have a spec problem. Source the first 20-50 eval cases from real failures and bug reports, not synthetic edge cases. Synthetic edge cases measure the eval designer’s imagination; real failures measure the model’s actual gaps.
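The two-expert test in Layer 1 can be made concrete with a small agreement check. This is a minimal sketch, assuming each expert's verdicts arrive as a mapping from case id to pass/fail; the function names and the 5% disagreement budget are illustrative assumptions, not values from any standard.

```python
def disagreements(expert_a: dict[str, bool], expert_b: dict[str, bool]) -> list[str]:
    """Return the ids of cases where the two experts reached different verdicts."""
    shared = expert_a.keys() & expert_b.keys()
    return sorted(cid for cid in shared if expert_a[cid] != expert_b[cid])


def spec_is_ready(
    expert_a: dict[str, bool],
    expert_b: dict[str, bool],
    max_disagreement_rate: float = 0.05,  # assumed budget, tune per product
) -> bool:
    """True when independent verdicts disagree rarely enough to trust the spec."""
    shared = expert_a.keys() & expert_b.keys()
    if not shared:
        raise ValueError("experts share no labelled cases")
    rate = len(disagreements(expert_a, expert_b)) / len(shared)
    return rate <= max_disagreement_rate


# Hypothetical verdicts on three cases drawn from bug reports.
a = {"bug-101": True, "bug-102": False, "bug-103": True}
b = {"bug-101": True, "bug-102": True, "bug-103": True}
print(disagreements(a, b))   # the cases to discuss and tighten the spec around
print(spec_is_ready(a, b))
```

The disagreement list is more useful than the rate itself: each entry points at a sentence in the spec that two experts read differently.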

Layer 2: offline gate. Run automated evals on every commit that touches prompts, retrieval, or agent logic. Choose grader types by what you’re measuring: code-based graders (fast, deterministic) for structured output and unit-testable tasks; model-based graders with explicit rubrics for open-ended outputs, calibrated against human raters; human graders for the subjective tail. The calibration step matters: candidates who say “LLM-as-judge” without mentioning calibration against human raters are signaling shallow knowledge. A grader that isn’t calibrated is just a second model with opinions.
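One common way to quantify calibration against human raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes binary pass/fail labels from one model-based grader and one human rater on the same items; treating identical single-label raters as kappa 1.0 is a convention chosen here, since the statistic is undefined in that case.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement between a model grader and a human rater."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share of items the judge passed
    p_human = sum(human) / n   # share of items the human passed
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters used one label throughout; kappa is undefined
    return (observed - expected) / (1 - expected)


# Hypothetical labels: the judge passes one item the human failed.
judge_labels = [True, True, False, False]
human_labels = [True, False, False, False]
print(cohens_kappa(judge_labels, human_labels))  # 0.5
```

Raw agreement here is 75%, but kappa is only 0.5, which is why chance correction matters when most items share one label.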

Separate capability evals from regression evals. Capability evals start at low pass rates intentionally: you’re measuring what the model can learn, so a 40% score is informative, not alarming. Regression evals must stay near 100%; they guard the floor and catch degradation. A mature eval suite graduates tests from capability to regression as they saturate. A capability eval sitting at 97% forever is a dead test: the model has already cleared it, so it provides no new signal about headroom. Move it to the regression suite and add harder cases.
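The split between the two suite types can be sketched as a review step that treats them differently. The `SuiteResult` type, the suite names, and both cutoffs below are hypothetical; the point is only that a low score blocks in one suite and is expected in the other.

```python
from dataclasses import dataclass

REGRESSION_FLOOR = 0.98   # assumed: regression suites below this block the ship
SATURATION_LEVEL = 0.95   # assumed: capability suites above this are saturated


@dataclass(frozen=True)
class SuiteResult:
    name: str
    kind: str      # "capability" or "regression"
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total == 0:
            raise ValueError(f"suite {self.name!r} has no cases")
        return self.passed / self.total


def review(suites: list[SuiteResult]) -> tuple[list[str], list[str]]:
    """Return (blocking regression suites, capability suites ready to graduate)."""
    blocking, graduate = [], []
    for s in suites:
        if s.kind == "regression" and s.pass_rate < REGRESSION_FLOOR:
            blocking.append(s.name)
        elif s.kind == "capability" and s.pass_rate >= SATURATION_LEVEL:
            graduate.append(s.name)
    return blocking, graduate


suites = [
    SuiteResult("refund_flow", "regression", 96, 100),           # degraded: blocks
    SuiteResult("multi_step_planning", "capability", 40, 100),   # informative
    SuiteResult("date_parsing", "capability", 97, 100),          # saturated
]
print(review(suites))
```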

For agentic products, measure pass@k and pass^k separately. pass@k means at least one success in k tries; pass^k means all k tries succeed. Frontier models can hit pass@k near 100% on a coding task while pass^k collapses to near 0% at k=10. For a research assistant, one-in-ten may be acceptable. For a customer-facing support agent handling a refund, every interaction must succeed. Know which one your product requires before you pick a threshold.
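The gap between the two metrics follows from simple arithmetic. As a sketch, assume each trial succeeds independently with the same probability p (a simplification; real trials are often correlated). Then pass@k is 1 - (1 - p)^k and pass^k is p^k. The 75% per-trial rate below is an illustrative value, not a measured one.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


p = 0.75  # assumed per-trial success rate
for k in (1, 3, 10):
    print(k, round(pass_at_k(p, k), 4), round(pass_pow_k(p, k), 4))
# At k=10, pass@k is above 0.999 while pass^k is roughly 0.056.
```

The same model looks nearly perfect or nearly useless depending on which metric the product actually needs.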

A real example worth knowing: a benchmark grader that penalized “96.12” because the spec expected an exact decimal format left a model scoring 42% on tasks it was actually solving correctly. After the grader was fixed, the same model scored 95%. Low eval scores are sometimes grader bugs, not model bugs. Check this before escalating to model changes.
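A rigid grader of this kind usually compares strings. A tolerant alternative parses the answer and compares numerically, as in the sketch below. The reference value 96.124991 and the relative tolerance are hypothetical, chosen only to show how “96.12” can fail one grader and pass the other.

```python
import math


def grade_exact(output: str, expected: str) -> bool:
    """Rigid grader: passes only on an exact string match."""
    return output.strip() == expected


def grade_numeric(output: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Tolerant grader: passes when the parsed number is close to the reference."""
    try:
        value = float(output.strip())
    except ValueError:
        return False  # not a number at all, so a genuine failure
    return math.isclose(value, expected, rel_tol=rel_tol)


print(grade_exact("96.12", "96.124991"))    # False: format mismatch only
print(grade_numeric("96.12", 96.124991))    # True: within tolerance
print(grade_numeric("95.0", 96.124991))     # False: a real wrong answer
```

The tolerance belongs in the spec, not in the grader author's head: decide how many significant digits count as correct before reading the scores.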

Layer 3: canary and shadow mode before full exposure. Shadow routing sends real traffic through the new model without serving the output. Canary releases serve a small cohort. Both let you watch for distribution shift: the long-tail phrasings and edge intents that no curated offline dataset anticipated. This is the primary reason you need online evals: not redundancy with offline, but coverage of the space that didn’t exist in your test set.
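The difference between the two modes can be sketched in one request handler. This is an illustration under stated assumptions: the models are plain callables, the log is an in-memory list, and cohort assignment hashes a user id. A production system would typically run the shadow call asynchronously so it adds no latency.

```python
import hashlib
from typing import Callable


def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically place a stable slice of users in the canary cohort."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def handle(
    request: str,
    user_id: str,
    current_model: Callable[[str], str],
    candidate_model: Callable[[str], str],
    shadow_log: list,
    canary_percent: float = 1.0,
) -> str:
    # Canary: a small cohort is actually served by the candidate.
    if in_canary(user_id, canary_percent):
        return candidate_model(request)
    # Shadow: everyone else is served by the current model, while the
    # candidate's output is recorded for comparison and never shown.
    served = current_model(request)
    try:
        shadow_log.append(
            {"request": request, "served": served, "shadow": candidate_model(request)}
        )
    except Exception as exc:  # a candidate failure must not affect the user
        shadow_log.append({"request": request, "served": served, "error": repr(exc)})
    return served
```

Hashing the user id keeps a user in the same cohort across requests, so the canary experience stays consistent for them.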

Layer 4: online monitoring as a continuous discipline. Track hallucination rate by category: intrinsic (contradicts provided context), extrinsic (unsupported by any source), fabrication (invented fact). Track task completion, fallback rate, user correction rate, and downstream business signals (resolution rate, repeat contact rate, retention). Set explicit thresholds before launch. Which are blocking and which are alerting? Don’t pick thresholds after you see the data.
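Writing the thresholds down as data before launch makes the blocking and alerting distinction hard to fudge later. The sketch below is one possible shape; every metric name and limit in it is a placeholder assumption, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool
    blocking: bool  # True blocks the rollout; False only raises an alert


# Placeholder limits, agreed before any launch data is seen.
THRESHOLDS = [
    Threshold("task_completion_rate", 0.90, higher_is_better=True, blocking=True),
    Threshold("intrinsic_hallucination_rate", 0.01, higher_is_better=False, blocking=True),
    Threshold("fallback_rate", 0.10, higher_is_better=False, blocking=False),
    Threshold("user_correction_rate", 0.15, higher_is_better=False, blocking=False),
]


def evaluate(observed: dict[str, float]) -> tuple[list[str], list[str]]:
    """Return (blocking breaches, alerting breaches) for the observed metrics."""
    blocking, alerting = [], []
    for t in THRESHOLDS:
        value = observed.get(t.metric)
        # A metric that was never reported counts as a breach.
        breached = value is None or (
            value < t.limit if t.higher_is_better else value > t.limit
        )
        if breached:
            (blocking if t.blocking else alerting).append(t.metric)
    return blocking, alerting
```

Treating a missing metric as a breach is a deliberate design choice here: a dashboard that silently stops reporting should not read as healthy.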

Safety is not a dedicated phase. It’s a dimension of every eval, and it cuts both ways: does the model refuse or intervene when it shouldn’t, and does it produce harmful output when it should decline? Raise it before minute 40 of any loop at an AI-native company. Candidates who reach that point without naming safety signal that they see it as an afterthought.

The 2026 layer. Online metrics must include signals that users trust and act on the output, not just that it’s technically accurate: opt-out rate, satisfaction delta, downstream conversion or retention. Add anti-obnoxiousness checks: how often does the AI surface when the user didn’t want it, and does it stay quiet when it should? Feasibility is free. Viability and lovability are what prove the model should exist at all.

strong

"I think of validation in four layers, and the handoff between them matters as much as any individual metric. First, I define what good looks like before touching the model: writing a spec precise enough that two domain experts reach the same pass/fail verdict independently, and sourcing the first 20-50 eval cases from real failures, not synthetic examples. Second, I build an offline gate: automated evals on every commit that touches prompts, retrieval, or agent logic. I choose grader types by what I'm measuring (code-based for deterministic tasks, model-based with rubrics for open-ended output, and human graders for the tail). The model-based graders need calibration against human raters; otherwise you just have a second model with opinions. I separate capability evals from regression evals. Capability evals start at low pass rates intentionally; regression evals must stay near 100% or they're alerting me to degradation. For any agentic feature, I measure pass@k and pass^k separately. A model that gets it right one-in-ten tries is acceptable for a research assistant and a blocker for a customer-facing agent. Third, I use shadow mode and canary releases to catch distribution shift: the phrasings and intents that didn't exist in my test set. That's the reason online evals exist: coverage of what you couldn't anticipate, not redundancy with offline. Fourth, I set explicit blocking and alerting thresholds before I see the data. Hallucination rate by category (intrinsic, extrinsic, fabrication), task completion, fallback rate, user correction rate, and downstream signals like resolution rate and repeat contact rate. Safety appears in every layer as a dimension: does the model trigger when it shouldn't, and stay quiet when it should? Finally, the 2026 layer: online metrics must include viability and lovability signals, not just accuracy. Is the resolution rate translating to retention? Is the AI triggering when users don't want it? 
That's what proves the model is worth shipping."

weak

"I'd look at precision, recall, F1, and latency, then run an A/B test before launch." This fails in four ways: it treats validation as a single gate rather than a phased discipline; it says nothing about who or what decides if an output is correct (the grader design problem); it conflates capability evals with regression evals; and it never connects technical signals to business outcomes. Interviewers are specifically checking whether you can name your metric, explain your threshold, and show you owned the outcome. Listing the right words is not enough.

The PM judgment

The grader calibration problem is the most commonly missed tell. Most candidates say “LLM-as-judge” and move on. Practitioners say “LLM-as-judge, calibrated against human raters on a stratified sample.” That distinction signals real ownership. For RAG systems specifically, Ragas (often written RAGAS) is an open-source evaluation framework whose metrics cover retrieval quality (context precision, context recall) and answer quality (faithfulness, answer relevancy, semantic similarity to ground truth). Worth naming if your AI feature uses retrieval.
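“Stratified” means the calibration sample is drawn per category, so rare but important intents are not drowned out by the common ones. A minimal sketch, assuming each logged output carries a category field; the `intent` key and the per-stratum size are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable


def stratified_sample(
    items: list[dict],
    key: Callable[[dict], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> list[dict]:
    """Draw up to per_stratum items from each category for human rating."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    groups: dict = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for stratum in sorted(groups):  # assumes category labels are sortable
        members = groups[stratum]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical traffic: 90 refund requests, 10 legal questions.
outputs = [{"id": i, "intent": "refund" if i < 90 else "legal"} for i in range(100)]
to_rate = stratified_sample(outputs, key=lambda o: o["intent"], per_stratum=5)
print(len(to_rate))  # 10: five per intent, regardless of traffic share
```

A uniform random sample of ten from the same traffic would often contain no legal questions at all, leaving the grader uncalibrated exactly where errors are most costly.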

Viability: does resolution rate translate to retention and revenue? Can you point to a metric that users pay to have solved? Lovability: does the model meet users where they are (correct channel, correct timing, correct confidence level) without surfacing when they didn’t want it? The offline eval proves the model can work. The online eval proves it does work. The viability and lovability layer proves it should exist.

Follow-up questions to prepare for

“What’s the difference between pass@k and pass^k, and when does it matter?” pass@k is at least one success in k tries; pass^k is all k tries succeed. For customer-facing agents, pass^k is the relevant metric because every interaction must succeed, not just one in ten.

“How do you know if a low eval score is a model bug or a grader bug?” Check whether two human raters would both mark the output as failing. If they’d disagree, fix the spec and the grader first. Low scores from rigid graders (penalizing valid format variants, for instance) are grader bugs. Fix those before debugging the model.

“What’s the difference between a capability eval and a regression eval?” Capability evals measure what the model can learn; they should start at low pass rates and trend up. Regression evals guard the floor and must stay near 100%. A capability test stuck at 97% provides no new signal: graduate it to the regression suite and add fresh, harder cases to keep measuring headroom.

“How do you handle safety in your validation process?” Safety is a dimension of every layer, not a dedicated gate. In offline evals, it’s a category of test cases (harmful outputs, over-triggering). In online monitoring, it’s over-trigger rate and opt-out rate. It should surface before minute 40 of any loop at an AI-native company.