Eval (AI evaluation)

An eval is a structured test that checks whether an AI system’s output meets a defined quality standard. In a software context, the closest analog is a unit test: you feed in an input, collect the output, and score it pass or fail. The difference is that AI outputs are often probabilistic and subjective, which is exactly what makes evals a PM problem rather than just an engineering one. At Anthropic, OpenAI, Google DeepMind, and Meta AI, fluency with evals is now a standard PM interview requirement. The wrong answer is “we A/B test it” or “that’s for the ML team.”

The four components of an eval

Every eval has the same skeleton:

Input: a task or scenario representative of real user traffic, such as “a customer asks why their order shows as delivered but never arrived”
Task: the system under test, meaning the prompt, the model, the tool calls, and any RAG pipeline
Scorer: the pass/fail logic that grades the output, either code-based or an LLM judge
Expected output (optional): a ground-truth answer used when the correct response is unambiguous

The PM’s specific contribution is defining the success criteria and curating the input cases. Engineers build the infrastructure; PMs decide what “good” looks like and write rubrics specific enough to be tested.
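As a rough illustration of how the four components fit together, here is a minimal sketch in Python. The names EvalCase, run_eval, fake_support_agent, and mentions_claim are hypothetical and chosen for this example only; a real task would call a model or agent rather than return a canned string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    input: str                      # scenario representative of real user traffic
    expected: Optional[str] = None  # optional ground-truth answer


# Task: the system under test. Scorer: grades one output as pass (True) or fail (False).
Task = Callable[[str], str]
Scorer = Callable[[str, Optional[str]], bool]


def run_eval(cases: list[EvalCase], task: Task, scorer: Scorer) -> float:
    """Run every case through the task and return the fraction that pass."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(scorer(task(c.input), c.expected) for c in cases)
    return passed / len(cases)


# Hypothetical stand-in for the real system, so the sketch runs without a model.
def fake_support_agent(user_message: str) -> str:
    return "Sorry about that. I have opened a claim for the missing package."


# Hypothetical success criterion: the reply must mention opening a claim.
def mentions_claim(output: str, expected: Optional[str]) -> bool:
    return "claim" in output.lower()


cases = [EvalCase("My order shows as delivered but never arrived. Why?")]
pass_rate = run_eval(cases, fake_support_agent, mentions_claim)
```

The scorer is the part a PM would specify: swapping mentions_claim for a stricter rubric changes what "good" means without touching the infrastructure.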

Two kinds of scorer: code-based and LLM-as-judge

Code-based evals are deterministic, fast, and cheap. They check things like: does the output include a valid JSON object, does the response cite one of the provided documents, does the agent call the right tool, does the output match an expected string? They cannot score tone, helpfulness, or whether a response actually solved the user’s problem.
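The checks listed above could look something like the following sketch. The function names and the idea of passing tool calls as a list of tool names are assumptions made for illustration; real traces will have their own shape.

```python
import json


def is_valid_json_object(output: str) -> bool:
    """Pass if the output parses as a JSON object (not an array or a scalar)."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def cites_provided_document(output: str, doc_ids: list[str]) -> bool:
    """Pass if the output mentions at least one of the provided document IDs."""
    return any(doc_id in output for doc_id in doc_ids)


def called_expected_tool(tool_calls: list[str], expected_tool: str) -> bool:
    """Pass if the expected tool name appears among the agent's tool calls."""
    return expected_tool in tool_calls


def matches_expected(output: str, expected: str) -> bool:
    """Pass on an exact string match, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()
```

Each function returns a plain boolean, which is what keeps these checks cheap enough to run on every output.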

LLM-as-judge (LLMaaJ) uses a second model to score the first model’s output against a natural-language rubric. This handles subjective quality: groundedness, appropriate confidence, tone fit, whether the answer addressed the actual question. The cost is real: LLM judges are slower and more expensive, and they must be calibrated against human reviewers before you trust them. A rubric that two human reviewers cannot independently score the same way will produce noise, not signal.
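One way to make "calibrated against human reviewers" concrete is to compare the judge's verdicts with human verdicts on the same cases. The sketch below computes raw agreement and Cohen's kappa, which discounts agreement expected by chance. The label lists are made-up illustration data, and the choice of kappa as the calibration statistic is an assumption of this example rather than something the approach requires.

```python
def agreement_rate(a: list[bool], b: list[bool]) -> float:
    """Fraction of cases where the two raters gave the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement_rate(a, b)
    n = len(a)
    p_a, p_b = sum(a) / n, sum(b) / n
    chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if chance == 1.0:
        # Both raters gave every case the same single label, so kappa is
        # undefined; this sketch reports 0.0 to mean "no information".
        return 0.0
    return (observed - chance) / (1 - chance)


# Made-up verdicts on eight cases (True = pass).
human_labels = [True, True, False, False, True, False, True, True]
judge_labels = [True, True, False, True, True, False, True, False]

raw_agreement = agreement_rate(judge_labels, human_labels)  # 0.75
kappa = cohens_kappa(judge_labels, human_labels)            # about 0.47
```

The same two functions can be pointed at two human reviewers instead of a judge and a human, which is a quick test of whether the rubric itself is scoreable.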

Most production eval suites combine both. Code-based evals run on everything; LLM judges run on a sampled subset where subjective quality matters.

Offline evals vs online evals

Offline evals run against a fixed, curated dataset before shipping. They gate releases and catch regressions in CI pipelines. A dataset of 20 to 50 cases drawn from real user failures and known edge cases is enough to start. Two human reviewers should independently reach the same pass/fail verdict before a case is added; ambiguous cases produce ambiguous signal.
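A sketch of the two mechanics in this paragraph: admitting a case only when independent reviewers agree, and turning a suite result into a CI gate. The field names reviewer_a and reviewer_b and the 0.95 threshold are assumptions for illustration; the right threshold depends on the product.

```python
def admit_cases(candidates: list[dict]) -> list[dict]:
    """Keep only candidate cases where both reviewers gave the same verdict."""
    return [c for c in candidates if c["reviewer_a"] == c["reviewer_b"]]


def release_gate(results: list[bool], min_pass_rate: float = 0.95) -> int:
    """Return a process exit code: 0 lets CI proceed, 1 blocks the release."""
    if not results:
        return 1  # an empty suite should never count as a pass
    pass_rate = sum(results) / len(results)
    return 0 if pass_rate >= min_pass_rate else 1


# Hypothetical candidate cases with two independent reviewer verdicts each.
candidates = [
    {"input": "Where is my refund?", "reviewer_a": "fail", "reviewer_b": "fail"},
    {"input": "Cancel my plan", "reviewer_a": "pass", "reviewer_b": "fail"},
]
dataset = admit_cases(candidates)  # the disputed second case is left out
```

In a CI job, the script would end with something like sys.exit(release_gate(results)) so that a failing suite stops the pipeline.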

Online evals sample live production traffic continuously. Because the full set of real user inputs is far more varied than any curated dataset, online evals surface failure modes that offline datasets miss. Most teams sample 5 to 15% of production traffic for LLM judging due to cost. The two work together: offline evals prevent known regressions, online evals find new ones.
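Sampling a fixed share of traffic for judging can be done deterministically by hashing a trace identifier, so the same trace is always either in or out of the sample. This is one common approach, not necessarily what any given team uses; the function name should_judge and the 10% default are assumptions for this sketch.

```python
import hashlib


def should_judge(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly `sample_rate` of traces for LLM judging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Count how many of 10,000 hypothetical trace IDs fall into the sample.
selected = sum(should_judge(f"trace-{i}") for i in range(10_000))
```

Hash-based sampling avoids storing a per-trace random decision and makes reruns reproducible, at the cost of the sample being only approximately the target rate.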

pass@k vs pass^k: a distinction that maps to product requirements

This is a technical detail with direct product implications, and it comes up in interviews.

pass@k is the probability that at least one of k attempts succeeds. It is the right metric for creative or exploratory tasks where one good answer is sufficient: generate ten tagline options, one just needs to be usable.

pass^k is the probability that all k attempts succeed. It is the right metric for customer-facing features where consistency is the requirement. A support agent that resolves the issue 75% of the time sounds acceptable until you compute pass^k. Assuming independent trials, the probability of three consecutive successful interactions is 0.75^3, roughly 42%, and the probability of ten consecutive successes drops to about 6%. If your product’s reliability SLA depends on the agent working every time, a 75% per-trial success rate is not close to good enough.
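Both metrics follow directly from the per-trial success rate if trials are assumed independent, which is a simplifying assumption made for this sketch. The function names are chosen for this example.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


p = 0.75
at_least_one_of_10 = pass_at_k(p, 10)  # about 0.999999
all_3 = pass_hat_k(p, 3)               # 0.421875, roughly 42%
all_10 = pass_hat_k(p, 10)             # about 0.0563, roughly 6%
```

The same 75% agent looks almost perfect under pass@10 and nearly unusable under pass^10, which is why the choice of metric has to follow from the product requirement.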

Anthropic’s engineering team notes that a 0% pass@100 score is most often a signal of a broken eval or a grader bug, not an incapable model. Debugging your scorer before concluding the model is at fault is standard practice.

Two categories that serve different purposes

Capability evals start at low pass rates and climb as the system improves. They define what the product is trying to do and measure progress toward it.

Regression evals are maintained near 100% and trigger alerts on drops. Once a capability is working, a regression eval protects it. The two categories run on different cadences with different alerting.
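The difference in alerting can be sketched as follows: regression suites alert when they fall below a floor, while capability suites are only tracked. The SuiteResult type, the suite names, and the 0.98 floor are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    name: str
    kind: str         # "capability" or "regression"
    pass_rate: float  # fraction of cases passed, between 0.0 and 1.0


def regression_alerts(results: list[SuiteResult], floor: float = 0.98) -> list[str]:
    """Return the names of regression suites that dropped below the floor.

    Capability suites are expected to sit at low pass rates while the
    feature is being built, so they never trigger an alert here.
    """
    return [
        r.name
        for r in results
        if r.kind == "regression" and r.pass_rate < floor
    ]


# Made-up results from one run.
results = [
    SuiteResult("multi-step refunds", "capability", 0.35),
    SuiteResult("order lookup", "regression", 1.00),
    SuiteResult("address change", "regression", 0.91),
]
alerts = regression_alerts(results)  # ["address change"]
```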

Connecting eval scores to business metrics

An eval score is not a business metric. An 87% coherence score does not tell you whether users returned the next week. The step many teams skip is joining eval results to product engagement data (retention, resolution rate, session depth) under the same user identity. When eval scores move, does retention follow? That join is what turns model quality work into a business case, and it is a PM’s job to insist on it.
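The join described above can be sketched with plain dictionaries keyed on a shared user ID. All data below is invented for illustration, and the field names user_id and passed are assumptions; in practice this would be a warehouse query across the eval store and the product analytics tables.

```python
from collections import defaultdict


def retention_by_eval_outcome(
    eval_results: list[dict], retained: dict[str, bool]
) -> dict[bool, float]:
    """Join eval verdicts to retention by user ID; return retention per verdict."""
    counts = defaultdict(lambda: [0, 0])  # verdict -> [retained users, total users]
    for row in eval_results:
        user_id = row["user_id"]
        if user_id not in retained:
            continue  # no engagement record under this identity, skip
        bucket = counts[row["passed"]]
        bucket[0] += int(retained[user_id])
        bucket[1] += 1
    return {verdict: kept / total for verdict, (kept, total) in counts.items()}


# Invented example data: one sampled interaction per user.
eval_results = [
    {"user_id": "u1", "passed": True},
    {"user_id": "u2", "passed": True},
    {"user_id": "u3", "passed": False},
    {"user_id": "u4", "passed": False},
]
retained_next_week = {"u1": True, "u2": True, "u3": False, "u4": True}

retention = retention_by_eval_outcome(eval_results, retained_next_week)
```

With this toy data, users whose interaction passed were retained at 100% and users whose interaction failed at 50%. A gap like that is a correlation to investigate, not proof that eval quality drives retention.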

A concrete example: trace analysis of a support agent revealed that 34% of inventory-related failures came from a tool returning stock counts but omitting restock dates. Two targeted evals, one checking that the restock date field was present, one checking that the agent referenced it in the response, made a previously invisible failure pattern auditable and fixable.
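The two targeted evals from this example might look like the sketch below. The field name restock_date and the tool output shape are assumptions; the source does not specify the tool's schema. The second check matches the date string verbatim, which a real implementation would likely loosen to handle different date formats.

```python
def tool_output_has_restock_date(tool_output: dict) -> bool:
    """Eval 1: the inventory tool returned a non-empty restock date."""
    return bool(tool_output.get("restock_date"))


def response_references_restock_date(response: str, tool_output: dict) -> bool:
    """Eval 2: the agent's reply mentions the restock date the tool returned."""
    date = tool_output.get("restock_date")
    if not isinstance(date, str) or not date:
        return False  # nothing to reference, so the check cannot pass
    return date in response


# Hypothetical tool outputs and agent reply.
complete = {"sku": "A-100", "stock": 0, "restock_date": "2026-03-14"}
missing = {"sku": "A-100", "stock": 0}
reply = "That item is out of stock and is expected back on 2026-03-14."
```

Splitting the failure into two checks separates a tool problem (the field is missing) from an agent problem (the field is present but ignored), so each can be routed to the right owner.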

What “I’ll know good when I see it” actually costs

In 2026, any behavior can be built. The constraint is no longer whether something is technically possible. The harder question is whether it works reliably and whether users trust it enough to keep paying for it. An eval is how a PM encodes their judgment about quality into something measurable and repeatable across every model update, every prompt change, and every new traffic pattern.

A PM who does not own evals is handing quality decisions to whoever changed the prompt last. A PM who owns the eval owns the product’s definition of good.

What a strong answer sounds like in an interview

strong

"An eval is a structured test with four parts: an input that represents real user scenarios, the system under test, a scorer that grades the output pass or fail, and optionally a ground-truth expected output. I use code-based scorers for deterministic checks like tool call verification or schema validation, and LLM-as-judge for subjective quality like tone or groundedness, calibrated against human reviewers first. Offline evals on a curated dataset of 20 to 50 real failure cases gate releases. Online evals sample 5 to 15% of production traffic to catch failure modes the dataset missed. For a customer-facing feature where consistency matters, I track pass^k rather than per-trial success rate: a 75% per-trial rate drops to roughly 42% reliability across three consecutive interactions and to about 6% across ten. I join eval scores to retention and resolution data so model quality improvements show up as business outcomes, not just benchmark numbers."

weak

"We'd just A/B test it and see which version performs better. Or we could run it by the ML team to evaluate the model quality." This conflates A/B testing (a distribution mechanism, not a quality gate) with pre-ship evaluation, and defers the definition of quality to someone else. In an AI PM interview at an AI-first company, this is a fast disqualifier.

For how to build an eval from scratch, see eval harness for PMs. For a worked interview answer on designing an eval for a specific feature, see design an eval for a support agent. For the broader context on why viable and lovable are now the bottleneck, see feasibility is free.