The eval harness interview question explained

The eval harness question is now the single best predictor of hire decisions at AI-native companies. Interviewers bolted it onto loops in the last 18 months because it cannot be answered by someone who was adjacent to the work. It is not an engineering question. It is a product judgment question: did you own the mechanism that decided whether your AI feature was worth shipping?

The OpenAI CPO has said it plainly: the most important thing a PM can learn to do is write evals. Interviewers who have internalized this will not let a vague answer about “monitoring dashboards” through.

Why the question exists

In 2026, feasibility is largely free. What separates a shippable AI product from an expensive experiment is whether the PM can prove the model is good enough to charge for (viable) and fails in ways users can absorb (lovable). The eval harness answers both questions. A product without one is not a product; it is a guess.

The four AI-washed tells

Interviewers listen for all four. A single answer often shows more than one.

Collapsing offline and online evals into one blob. “We monitored accuracy and tracked feedback” treats two different activities as one. Offline evals run against fixed datasets during development. Online evals score live traffic after launch, and when that same scoring logic gates a response before the user sees it, it is acting as a guardrail. Same logic, different operational role. Candidates who conflate them have never owned either.
No fallback when outputs fail. A strong candidate describes a confidence gate routing low-confidence outputs to a human queue, a graceful degradation path, or a hard feature kill. No fallback plan is a lovability failure: users hit the model at its worst with no recovery path.
Treating the model as fixed and deterministic. AI-washed candidates speak about the model as if it stays approved forever. Real AI PMs know a model can degrade silently after a provider update, a shift in input distribution, or a retrieval change in a RAG pipeline. They have a story about catching drift.
Never having killed an AI feature. Every real AI PM has a graveyard. Interviewers call concrete answers “numbers and scars.” A hallucination rate story sounds like: “we were at 4% at launch, got to about 1% after better retrieval and a confidence gate, and killed the feature entirely when we could not get below 3% on legal documents.” The AI-washed candidate has shipped things that worked fine. They have not made a kill call.
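
The fallback described in the second tell can be sketched in a few lines. This is a minimal illustration, not a production design: the `ModelOutput` type, the `route_output` function, the 0.7 floor, and the `legal` kill list are all hypothetical names and values chosen for the example, and it assumes the model exposes a reasonably calibrated confidence score.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SERVE = "serve"          # show the model output to the user
    HUMAN_QUEUE = "human"    # hand the request to a human reviewer
    DISABLED = "disabled"    # feature is killed for this input type


@dataclass
class ModelOutput:
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]
    doc_type: str


# Hypothetical values for illustration only.
CONFIDENCE_FLOOR = 0.7
KILLED_DOC_TYPES = {"legal"}


def route_output(output: ModelOutput) -> Route:
    """Decide what the user sees for a single model output."""
    # A hard kill wins over everything else: the feature is off for this type.
    if output.doc_type in KILLED_DOC_TYPES:
        return Route.DISABLED
    # Low-confidence outputs never reach the user directly.
    if output.confidence < CONFIDENCE_FLOOR:
        return Route.HUMAN_QUEUE
    return Route.SERVE
```

The point of the sketch is that the floor and the kill list are product decisions expressed as two constants. A candidate who owned the gate can say what those values were and why.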

Weak vs. strong

Weak answer

"We monitored accuracy and tracked user feedback, and we had dashboards set up to watch for issues. We iterated based on what users told us and kept improving the model over time."

This collapses offline regression testing and production monitoring, names no metric, states no threshold, and describes no incident. Interviewers conclude the candidate was present in meetings where AI was discussed, not the one making calls when it broke.

Strong answer

"Our offline eval set had 400 hand-labeled examples, 60 of which were edge cases we seeded deliberately: short queries, ambiguous intent, inputs that looked like hallucination bait. The metric was task completion rate with a secondary hallucination check via a judge model. Our go/no-go threshold was 92% task completion and under 2% hallucination; I owned the gate. In production, task completion came in 4 points lower than offline suggested; we traced it to a retrieval mismatch on mobile queries within a week. The fallback was a confidence gate: anything below 0.7 routed to a human queue. We used that gate to kill the feature on one document type entirely when we could not close the gap."

The five moves in sequence: describe the offline eval set and its deliberate edge cases; name the specific metric (not “accuracy” generically); state the threshold and who owned the go/no-go gate; describe what diverged in production and how fast you caught it; walk through the fallback or kill decision.
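
For readers who have not built one, the go/no-go gate in the strong answer reduces to a small check over labeled eval results. The sketch below assumes each example has already been scored for task completion and flagged (or not) by a judge model. The `EvalResult` type and `go_no_go` function are hypothetical names; only the 92% and 2% thresholds come from the example answer.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    task_completed: bool  # did the output accomplish the labeled task?
    hallucinated: bool    # did the judge model flag a hallucination?


# Thresholds from the example answer above; a real team would set its own.
MIN_TASK_COMPLETION = 0.92
MAX_HALLUCINATION = 0.02


def go_no_go(results: list[EvalResult]) -> dict:
    """Summarize an offline eval run and apply the launch thresholds."""
    if not results:
        raise ValueError("eval set is empty")
    n = len(results)
    completion = sum(r.task_completed for r in results) / n
    hallucination = sum(r.hallucinated for r in results) / n
    return {
        "task_completion": completion,
        "hallucination_rate": hallucination,
        # Ship only if both conditions hold: at least 92% task completion
        # and strictly under 2% hallucination.
        "ship": completion >= MIN_TASK_COMPLETION
        and hallucination < MAX_HALLUCINATION,
    }
```

On a 400-example set, 372 completions and 6 flagged hallucinations would pass; 360 completions would not. Being able to state the numbers at that level of detail is what the third move asks for.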

Follow-up probes

Candidates with canned answers collapse here. Interviewers use these to stress-test the structure.

“What hallucination rate would you refuse to launch above?” (Tests whether you have a principled threshold.)
“How would you know the model is quietly getting worse after launch?” (Tests silent drift awareness.)
“What happened when your model metric and your product metric diverged?” (Tests whether you can tie a model score to a product outcome. Google has probed F1 scores directly; an inability to answer ended the interview.)
“Walk me through the last AI feature you killed.” If the answer is “I haven’t,” the interview is over.
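
One concrete answer to the silent-drift probe is a rolling comparison between production scores and the offline baseline. The sketch below is one simple way to do that, not a standard method: the `DriftMonitor` class, its default tolerance, and its window size are hypothetical, and it assumes a sample of live outputs is scored with the same pass/fail logic used offline.

```python
from collections import deque
from typing import Optional


class DriftMonitor:
    """Flags when the rolling production pass rate falls below the offline baseline."""

    def __init__(self, baseline: float, tolerance: float = 0.03, window: int = 200):
        self.baseline = baseline    # pass rate measured on the offline eval set
        self.tolerance = tolerance  # allowed drop before alerting (hypothetical)
        self.scores = deque(maxlen=window)  # most recent scored production samples

    def record(self, passed: bool) -> None:
        """Add one scored production sample to the window."""
        self.scores.append(1.0 if passed else 0.0)

    def rolling_rate(self) -> Optional[float]:
        """Pass rate over the window, or None until the window is full."""
        if len(self.scores) < self.scores.maxlen:
            return None
        return sum(self.scores) / len(self.scores)

    def drifted(self) -> bool:
        """True when the rolling rate is more than `tolerance` below baseline."""
        rate = self.rolling_rate()
        return rate is not None and (self.baseline - rate) > self.tolerance
```

With a 0.92 baseline and a 0.03 tolerance, a rolling rate of 0.88 trips the alert. That is the kind of 4-point gap the strong answer describes catching within a week.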

22% of candidates now use AI assistance in real-time interviews. The follow-up sequence is where that collapses: scripted openers have no incident, no numbers, and no graveyard.

The viable/lovable frame

Evals are not quality control. They are the instrument that answers: is this model good enough that someone would pay for the outcome (viable), and does it fail in ways users can absorb (lovable)? A confidence gate is a lovability decision, not an engineering one. The PM who frames it that way owns the product. The one who does not is AI-washed.

The eval harness deep-dive covers the full mechanics. The “build an eval portfolio” project guide covers how to produce the artifact before you interview.
