Eval harness PM interview answer: the structure that clears the bar

The eval harness question is not a technical question. It is a product judgment question: what does “good” mean for this output, and how do you know when you have it? Interviewers at Anthropic, OpenAI, and Meta use this question to sort candidates who have shipped AI in production from candidates who have only read about it. The sort happens in under two minutes, on one axis: can you draw the offline/online line without prompting?

What the interviewer is sorting for

The fastest seniority signal in an AI PM interview is the offline/online split. Candidates who collapse the two into “we’d track accuracy and user satisfaction” cannot specify accuracy of what, measured against what ground truth, or caught at what point in the release cycle. That answer reveals they’ve never owned a ship gate for an AI feature. Anthropic has a dedicated AI safety and ethics round; OpenAI embeds the quality-gate question throughout its technical depth questions. If you arrive at minute 40 of a case study without mentioning how you’d catch a regression before it ships, you have already lost that portion of the interview.

The eval harness question sits at the intersection of Execution and Technical Depth, which together account for 30% of the AI PM interview scorecard. Name the components explicitly. Do not describe them in the abstract.

The answer structure

Open with the offline layer and its regression gate. Then the online layer and its confidence gate. Then the feedback loop. Call each component by name.

Strong answer

"My eval harness has two layers. Offline: I'd work with domain experts to hand-label a golden set of 100 to 300 examples drawn from real production inputs, real failure reports, and the edge cases we've already seen break. Each row has an input, an expected output, and a dimension score. Before any model or prompt change ships, that golden set runs in CI and blocks the PR if pass rate drops more than two points or if hallucination rate crosses the hard threshold I'd set for this context: under 3% for a customer-facing support feature. I'd also run a shadow deployment for 72 hours before cutover, running the candidate model on live traffic in parallel without surfacing results to users, then diff the outputs against the current baseline. Online: code-based checks run across 100% of live traces (format, refusal rate, latency). I'd sample 10% of traces for LLM-judge scoring on factual grounding and tone. Any output below the confidence cutoff routes to a human-review queue rather than shipping directly to the user. The feedback loop is explicit: every new failure pattern that survives human review gets added to the offline golden set so it's caught at the CI gate next time. Post-launch, the metric I watch daily is intent-specific pass rate, not aggregate pass rate. A single aggregate score hides the failure modes. If I saw the helpfulness score drop from 81% to 64% on a Monday morning, my first move would be to check whether a backend deployment over the weekend altered the retrieved context, not to rewrite the prompt."

Weak answer

"We'd track accuracy and user satisfaction scores." This collapses offline and online evals into a single vague category. It does not say accuracy of what or against what ground truth, names no mechanism for catching regressions before they reach users, sets no threshold or gate, and treats quality as a post-launch observation rather than a pre-launch blocker. Interviewers at Anthropic and OpenAI specifically probe for the offline/online distinction. Candidates who can't draw that line are screened out at this stage.

The components to name, in order

Golden set (offline): 50 to 500 hand-labeled examples is the operating range. Start with 20 to 50 real failures from manual testing and bug reports. This is not a random sample. It is a curated collection of the production inputs that expose the seams in your feature. The golden set is the IP. Not the prompt, not the model.
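To make the row shape concrete, here is a minimal sketch of a golden-set record and a sanity check on the set. The class name, the `intent` and `source` fields, and `validate_golden_set` are all hypothetical names chosen for illustration; only the input, expected output, dimension score, and the 50-to-500 range come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    """One hand-labeled row: input, expected output, per-dimension scores."""
    input_text: str
    expected_output: str
    intent: str   # assumed field, useful later for per-intent pass rates
    source: str   # assumed field, e.g. "bug_report" or "production"
    dimension_scores: dict = field(default_factory=dict)


def validate_golden_set(rows, min_size=50, max_size=500):
    """Return a list of problems; an empty list means the set looks usable."""
    problems = []
    if not (min_size <= len(rows) <= max_size):
        problems.append(f"size {len(rows)} outside {min_size}-{max_size}")
    seen = set()
    for i, row in enumerate(rows):
        if not row.input_text.strip() or not row.expected_output.strip():
            problems.append(f"row {i}: empty input or expected output")
        if row.input_text in seen:
            problems.append(f"row {i}: duplicate input")
        seen.add(row.input_text)
    return problems
```

A check like this does not judge label quality. It only catches the mechanical problems (too few rows, blanks, duplicates) that make a golden set misleading.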

Regression gate: The CI check that blocks a PR if pass rate drops by more than a defined threshold. Two points is a reasonable starting gate for most customer-facing features. Hallucination rate has a hard ceiling, not a soft trend line. State the number. “Under 3% hallucination rate for this context” signals you’ve actually set that threshold in production.
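The gate logic is small enough to sketch. The function below is an illustrative assumption, not a standard API: it takes rates as fractions, blocks when pass rate drops by more than two points, and treats the 3% hallucination figure as a hard ceiling.

```python
def regression_gate(baseline_pass_rate, candidate_pass_rate, hallucination_rate,
                    max_drop_points=2.0, hallucination_ceiling=0.03):
    """Return (ok, reasons). All rates are fractions in [0, 1]."""
    reasons = []
    # Round to avoid float noise flipping the result at exactly two points.
    drop_points = round((baseline_pass_rate - candidate_pass_rate) * 100, 6)
    if drop_points > max_drop_points:
        reasons.append(
            f"pass rate dropped {drop_points:.1f} points (limit {max_drop_points})"
        )
    # "Under 3%" means 3% itself fails the gate.
    if hallucination_rate >= hallucination_ceiling:
        reasons.append(
            f"hallucination rate {hallucination_rate:.1%} is not under "
            f"{hallucination_ceiling:.0%}"
        )
    return (not reasons, reasons)
```

In a CI job, a caller would typically print the reasons and exit with a non-zero status when `ok` is false, which is what actually blocks the PR.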

Shadow deployment: Running the candidate model or prompt version in parallel against live traffic before cutover. Users never see the candidate output during shadow mode. You see the diff. Naming this component signals you understand the release mechanics of AI features, not just the model mechanics.
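A rough sketch of the diff step, assuming both versions log their outputs keyed by a shared trace ID. The function name and the exact-match comparison are simplifying assumptions; a real diff would more likely score the changed pairs with a judge or a rubric instead of comparing strings.

```python
def shadow_diff(baseline_outputs, candidate_outputs):
    """Compare outputs for traces that both versions handled.

    Both arguments are dicts mapping trace_id -> output text.
    """
    shared = sorted(set(baseline_outputs) & set(candidate_outputs))
    changed = [
        trace_id for trace_id in shared
        if baseline_outputs[trace_id].strip() != candidate_outputs[trace_id].strip()
    ]
    change_rate = len(changed) / len(shared) if shared else 0.0
    return {"compared": len(shared), "changed_ids": changed, "change_rate": change_rate}
```

The changed IDs are the useful part: they are the traces a reviewer reads before cutover.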

Online sampling rate: Code-based checks on 100% of live traces. LLM-judge scoring on 5 to 15% of traces to manage cost. This is the standard operating range. Stating a specific rate signals hands-on familiarity.
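One way to implement the split is sketched below, under assumptions: every trace gets cheap deterministic checks, and a hash of the trace ID decides whether it also goes to the LLM judge. The function names, the 3000 ms latency budget, and the choice of SHA-256 bucketing are illustrative and not drawn from any particular tool.

```python
import hashlib


def in_judge_sample(trace_id, sample_rate=0.10):
    """Deterministically place roughly sample_rate of traces in the judge sample."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def code_checks(output_text, latency_ms, max_latency_ms=3000):
    """Cheap checks that can run on 100% of traces."""
    return {
        "non_empty": bool(output_text.strip()),
        "latency_ok": latency_ms <= max_latency_ms,
    }
```

Hashing the trace ID instead of calling a random number generator means the same trace always lands in or out of the sample, which makes judge scores reproducible when a trace is re-processed.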

Human-review queue: Outputs below a confidence cutoff route to human review rather than shipping directly. This is a quality and safety mechanism. Candidates who skip it signal they’ve never designed a fallback for the cases the model gets wrong.
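The routing rule can be sketched in a few lines. The `ScoredOutput` type, the `route` function, and the 0.7 cutoff are hypothetical; the source of the confidence score (a judge model, a classifier, or a heuristic) is left open here.

```python
from dataclasses import dataclass


@dataclass
class ScoredOutput:
    trace_id: str
    text: str
    confidence: float  # assumed to be a score in [0, 1]


def route(outputs, cutoff=0.7):
    """Split outputs into those delivered directly and those queued for review."""
    deliver, review_queue = [], []
    for out in outputs:
        if out.confidence < cutoff:
            review_queue.append(out)
        else:
            deliver.append(out)
    return deliver, review_queue
```

The cutoff is a product decision: lowering it ships more answers unreviewed, raising it grows the queue and the review cost.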

Intent-specific pass rate: The headline post-launch metric, beneath which sit precision/recall by request type, latency percentiles, and cost per query. Aggregate pass rate is a lagging indicator that hides which user intents are failing. Segment by intent from day one.
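A small sketch shows why the aggregate hides the problem. The function name and the intent labels in the usage example are made up for illustration.

```python
from collections import defaultdict


def pass_rate_by_intent(results):
    """results is an iterable of (intent, passed) pairs; returns intent -> pass rate."""
    counts = defaultdict(lambda: [0, 0])   # intent -> [passed, total]
    for intent, passed in results:
        counts[intent][1] += 1
        if passed:
            counts[intent][0] += 1
    return {intent: passed / total for intent, (passed, total) in counts.items()}


# Hypothetical traffic: billing is healthy, refunds are mostly failing.
results = (
    [("billing", True)] * 95 + [("billing", False)] * 5
    + [("refund", True)] * 5 + [("refund", False)] * 15
)
rates = pass_rate_by_intent(results)
aggregate = sum(passed for _, passed in results) / len(results)
```

With these made-up numbers the aggregate is about 83%, which looks tolerable, while the refund intent passes only 25% of the time.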

The follow-up you should expect

“What would you do if a metric dropped post-launch?” Walk the triage path: isolate by intent, check whether a non-model change (retrieved context, schema, backend data) explains the drop, diff the failure cases against the golden set to find the pattern, then add those cases to the golden set before shipping the fix. The answer that fails this follow-up is any answer that starts with “we’d retrain.” Retraining is the last resort, not the reflex.
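The last two triage steps, diffing failures against the golden set and adding the new ones, can be sketched as follows. The function name, the dict shape of a failure record, and the whitespace-and-case normalization are assumptions for illustration; matching on normalized text is a crude stand-in for recognizing that two inputs exercise the same pattern.

```python
def uncovered_failures(failures, golden_inputs):
    """Return failure cases whose input is not already in the golden set.

    failures is a list of dicts with at least an "input" key.
    golden_inputs is an iterable of input strings already in the golden set.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    covered = {normalize(text) for text in golden_inputs}
    fresh, seen = [], set()
    for failure in failures:
        key = normalize(failure["input"])
        if key not in covered and key not in seen:
            seen.add(key)
            fresh.append(failure)
    return fresh
```

Whatever this returns still needs a human-labeled expected output before it joins the golden set.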

The viable/lovable check

In 2026, the eval harness is where PM judgment lives. Feasibility is free. Any model can be spun up, any feature can be prototyped. The question is whether the outputs are worth paying for (viable) and whether they meet users where they are without embarrassing them or breaking their trust (lovable). A PM who cannot speak to their eval harness has outsourced the product quality decision to engineering. The eval harness answer is your proof that you have not done that.

Coaches who have placed 47 candidates into AI PM roles at $300K-plus cite one interview-losing mistake in particular: telling a behavioral story about an AI feature you shipped without knowing your own pass rate or hallucination threshold. If you are telling a story about shipping an AI feature, know the number.

For the mechanics of building a golden set from scratch, see the eval harness for PMs guide. For how this connects to production incidents, see when the AI is wrong.
