Design an evaluation framework for a customer-support AI agent before it talks to real customers.

How to answer the AI PM eval-design question: offline vs online evals, real thresholds, MiHR/MaHR, shadow mode, and the flywheel that makes evals a product moat.

"Design an eval for a customer-support AI agent"

Evals are the defining AI PM skill of 2026. This question separates candidates who treat AI as a black box from those who can decide whether a probabilistic system is safe to ship. The first move is a clarifying question most candidates skip: is this a deflection agent (answers and closes the ticket) or a resolution agent (takes actions such as issuing refunds or changing account state)? The failure modes and thresholds differ entirely. A deflection agent that confidently closes a ticket without resolving it is worse than a human who escalates. Name that distinction before the interviewer has to ask.

Structure a strong answer

Separate offline (pre-launch, fixed dataset) from online (post-launch, real traffic). Name the metrics, the golden set, the two hallucination rate primitives, the launch bar, and the bridge to production.

Strong answer

"First: is this a deflection agent or a resolution agent? If it can issue refunds or change plans, the thresholds get much tighter. Assuming resolution scope: offline, I'd build a golden set of 300 to 500 real tickets, labeled with the correct resolution and the reasoning chain, weighted toward the high-stakes tail: billing disputes, account lockouts, policy edge cases. I'd measure four things: (1) Resolution accuracy on the golden set. (2) MaHR, macro hallucination rate: the share of responses containing at least one hallucination, capped at 1% on any action touching money or account state. I'd also track MiHR, micro hallucination rate, which counts hallucinatory statements per response (useful for spotting models that hedge with many small fabrications rather than one obvious one). (3) Escalation correctness: did the agent hand off exactly when it should? This needs its own precision/recall framing, not just an accuracy number. (4) Policy faithfulness: scored by LLM-as-judge against the actual policy document, not model priors. Human spot-checks cover the lowest-confidence 10% by model confidence score. My launch bar: zero critical hallucinations (fabricated policy, wrong entitlement), under 1% MaHR on billing actions, escalation precision above 90%. The bridge to production is shadow mode: run the agent in parallel with human agents, compare outputs, expose nothing to customers yet. Then a canary at 5% traffic with automatic human-in-the-loop fallback for any response below a 0.7 confidence score. Online I watch deflection rate not as a goal but as context: high deflection with a high 48-hour re-contact rate is worse than lower deflection. The 48h re-contact rate is the true resolution proxy. I track CSAT on deflected versus escalated tickets separately. The flywheel: every escalation becomes a labeled eval case. The golden set grows with production traffic, which means the eval gets harder over time. 
That's intentional: you're raising the bar as the model improves."
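The launch bar in the strong answer can be written down as a single gate function. What follows is a minimal sketch, not a reference implementation: the `EvalResult` fields and the `launch_gate` name are hypothetical, and it assumes each golden-set case has already been scored by a judge or a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One scored golden-set case. All field names are hypothetical."""
    is_billing_action: bool           # touches money or account state
    has_hallucination: bool           # at least one hallucinated statement
    has_critical_hallucination: bool  # fabricated policy or wrong entitlement
    should_escalate: bool             # label from the golden set
    did_escalate: bool                # what the agent actually did


def launch_gate(results: list[EvalResult]) -> dict:
    """Apply the launch bar: zero critical hallucinations, billing MaHR
    under 1%, escalation precision above 90%."""
    billing = [r for r in results if r.is_billing_action]
    # Assumption: a golden set with no billing cases fails the gate.
    billing_mahr = (
        sum(r.has_hallucination for r in billing) / len(billing) if billing else 1.0
    )
    critical = sum(r.has_critical_hallucination for r in results)

    escalated = [r for r in results if r.did_escalate]
    needed = [r for r in results if r.should_escalate]
    true_positives = sum(r.should_escalate for r in escalated)
    precision = true_positives / len(escalated) if escalated else 0.0
    recall = true_positives / len(needed) if needed else 0.0

    passed = critical == 0 and billing_mahr < 0.01 and precision > 0.90
    return {
        "passed": passed,
        "critical_hallucinations": critical,
        "billing_mahr": billing_mahr,
        "escalation_precision": precision,
        "escalation_recall": recall,
    }
```

Recall is reported next to precision even though only precision gates the launch in this sketch. An agent that almost never escalates can post high precision while missing most of the cases it should have handed off.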

Weak answer

"I'd test it on some questions, see if the answers look good, then ship and monitor." This fails for four reasons: no golden set means no baseline and no regression detection; no launch bar means the ship decision is vibes; "looks good" won't catch hallucinated policy details buried in a confident-sounding response; monitoring without a defined alert threshold means you find out about failures after a reputational incident. Candidates who skip hallucination, safety, and measurement methodology are flagged immediately at frontier labs. The interview ends mentally before it ends literally.

The four failure modes to name

Interviewers who have actually operated a support agent probe for a concrete failure taxonomy. Generic accuracy is not enough. Name these:

Hallucinated policy: the agent confidently states a refund window or entitlement that does not exist in the policy document. Across 37 models in 2026, hallucination rates range from 15% to 52% on complex domains; in high-stakes support contexts the rate rises further. A 1% MaHR cap on billing actions is meaningful precisely because it is hard to achieve with off-the-shelf models.
Wrong escalation: the agent handles a case it should have handed off (or escalates one it could have resolved). Escalation correctness needs its own precision/recall framing. An agent that escalates everything looks good on hallucination metrics and terrible on deflection cost.
Tone failure: technically correct response, wrong register (dismissive, over-apologetic, or legalistic). LLM-as-judge catches this reliably when the rubric is written against your actual brand voice document, not a generic “helpful” label.
Refusal cascade: the agent refuses edge cases so aggressively that users re-contact at higher rates than before the agent existed. High deflection masking low resolution is the stealth failure mode. It shows up in the 48h re-contact rate before it shows up in CSAT.
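The 48h re-contact rate that exposes a refusal cascade is simple to compute once tickets and inbound contacts are joined. The sketch below assumes a hypothetical `Ticket` record and matches re-contacts by customer ID alone. A production version would likely also match on issue or topic, which this example does not attempt.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    """Hypothetical ticket record, not a real helpdesk schema."""
    customer_id: str
    closed_at: datetime
    deflected: bool  # closed by the agent with no human handoff


def recontact_rate(
    tickets: list[Ticket],
    contacts: list[tuple[str, datetime]],
    window: timedelta = timedelta(hours=48),
) -> float:
    """Share of deflected tickets whose customer contacted support again
    within `window` of the ticket closing."""
    deflected = [t for t in tickets if t.deflected]
    if not deflected:
        return 0.0
    recontacted = sum(
        any(
            cid == t.customer_id and t.closed_at < ts <= t.closed_at + window
            for cid, ts in contacts
        )
        for t in deflected
    )
    return recontacted / len(deflected)
```

Read this number next to deflection rate, not in place of it: rising deflection with a rising re-contact rate is the signature of the stealth failure described above.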

Hallucinations in customer support produce an estimated 18% increase in escalation rates and contribute to roughly 30% of AI-related reputational incidents. These are not QA statistics; they are viability numbers.

The multi-turn problem

Most candidates evaluate single-turn exchanges. Real support conversations are stateful. A user who says “that didn’t fix it” in message four is testing whether the agent tracks context across the conversation, not whether it answers question four in isolation. DeepEval now supports multi-turn goldens for synthetic support datasets. If your golden set contains only single-turn tickets, you are missing the failure modes that appear after three exchanges: context drift, contradicting a prior response, and misreading the user’s emotional escalation from the conversation history.
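One way to keep yourself honest about multi-turn coverage is to store each golden as a full conversation and measure how much of the set goes past three user messages. The structure below is a hypothetical sketch in plain Python. It is not DeepEval's API, and the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str     # "user" or "agent"
    content: str


@dataclass
class MultiTurnGolden:
    """Hypothetical golden-set entry for a stateful support conversation."""
    turns: list[Turn]
    expected_outcome: str  # e.g. "refund issued" or "escalated to billing"


def user_turn_count(golden: MultiTurnGolden) -> int:
    return sum(1 for t in golden.turns if t.role == "user")


def multi_turn_coverage(goldens: list[MultiTurnGolden], min_user_turns: int = 3) -> float:
    """Share of goldens with at least `min_user_turns` user messages."""
    if not goldens:
        return 0.0
    deep = sum(1 for g in goldens if user_turn_count(g) >= min_user_turns)
    return deep / len(goldens)
```

A coverage figure near zero means the golden set cannot surface context drift or self-contradiction, however strong the single-turn scores look.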

MiHR vs MaHR: why both matter

MaHR (macro hallucination rate) is the share of responses that contain at least one hallucination. It is your launch gate. MiHR (micro hallucination rate) is the share of statements within a response that are hallucinated, averaged across responses. Two models can post the same MaHR while one of them packs its failing responses with many small fabrications that a user might not catch; MiHR is what separates them. Naming both signals that you have operated an eval, not just read about one.
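A small worked example makes the difference concrete. This sketch assumes MiHR is computed as the fraction of statements in each response that are hallucinated, averaged across responses, and that a reviewer or judge has already flagged each statement. The function name and the input shape are hypothetical.

```python
def hallucination_rates(responses: list[list[bool]]) -> tuple[float, float]:
    """Return (MaHR, MiHR) from per-statement hallucination flags.

    Each inner list holds one flag per factual statement in a response;
    True means that statement was marked as hallucinated.
    """
    if not responses:
        raise ValueError("need at least one response")
    # MaHR: share of responses with at least one hallucinated statement.
    mahr = sum(any(flags) for flags in responses) / len(responses)
    # MiHR: per-response share of hallucinated statements, then averaged.
    per_response = [sum(flags) / len(flags) if flags else 0.0 for flags in responses]
    mihr = sum(per_response) / len(responses)
    return mahr, mihr


# Two hypothetical models, three responses each. Both fail on exactly one
# response, but model_b's failing response is mostly fabricated.
model_a = [[False, False, False, False], [False, False, False, True], [False, False]]
model_b = [[False, False, False, False], [False, True, True, True], [False, False]]

mahr_a, mihr_a = hallucination_rates(model_a)
mahr_b, mihr_b = hallucination_rates(model_b)
```

Both models have the same MaHR (one failing response out of three), yet model_b's MiHR is three times model_a's. A gate built on MaHR alone would miss that.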

LLM-as-judge vs human review

LLM-as-judge scales well for: policy faithfulness (is the response grounded in the actual policy document rather than model priors?), tone adherence (does it match the brand voice rubric?), and resolution completeness (did it address the stated issue?). Human review is non-negotiable for: regulatory exposure (anything touching financial data, account termination, or accessibility claims), the lowest-confidence 10% of responses by model confidence score, and any case the agent itself flagged as ambiguous. The combination gives you coverage at scale with a human backstop on the decisions that create legal or reputational risk.
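The split between judge and human can be expressed as a routing rule. The sketch below is one possible reading of the criteria above: the `Response` fields are hypothetical, and it assumes the agent exposes a usable confidence score per response.

```python
import math
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical record for one agent response awaiting review."""
    response_id: str
    confidence: float                   # model confidence score in [0, 1]
    regulatory_exposure: bool = False   # financial data, termination, accessibility
    agent_flagged_ambiguous: bool = False


def select_for_human_review(
    responses: list[Response], lowest_fraction: float = 0.10
) -> set[str]:
    """Return IDs that go to a human: regulatory cases, agent-flagged cases,
    and the lowest-confidence slice of the batch."""
    must_review = {
        r.response_id
        for r in responses
        if r.regulatory_exposure or r.agent_flagged_ambiguous
    }
    k = math.ceil(lowest_fraction * len(responses))
    by_confidence = sorted(responses, key=lambda r: r.confidence)
    lowest = {r.response_id for r in by_confidence[:k]}
    return must_review | lowest
```

Everything outside the returned set stays with the LLM judge. The lowest-confidence slice is taken over the whole batch, so a regulatory case that also has low confidence does not free up an extra slot.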

Shadow mode to canary: the bridge most candidates miss

Skipping directly from offline eval to production is the most common architectural mistake in candidate answers. Shadow mode runs the agent in parallel with your human support team, comparing outputs without customer exposure. It surfaces distribution shift between your golden set and real traffic before any customer sees a response. Once shadow mode clears the launch bar, a canary at 5% traffic with a hard fallback threshold (anything below 0.7 confidence routes to a human agent, automatically) gives you a production signal without full exposure. Define your rollback criteria before you go live: if MaHR on billing actions exceeds 1% over a 24-hour window, the canary pauses and the failing cases are re-labeled and added to the golden set.
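The canary rules above are concrete enough to sketch in code. This is an illustration under stated assumptions, not a deployment recipe: the function names are hypothetical, the 5% split uses a hash of the customer ID (one common way to get a stable bucket, not something the approach above prescribes), and billing events are assumed to arrive as `(timestamp, has_hallucination)` pairs after review.

```python
import hashlib
from datetime import datetime, timedelta


def in_canary(customer_id: str, percent: int = 5) -> bool:
    """Stable bucket assignment: the same customer always gets the same answer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def route(confidence: float, threshold: float = 0.7) -> str:
    """Hard fallback: anything below the threshold goes to a human agent."""
    return "human" if confidence < threshold else "agent"


def should_pause_canary(
    billing_events: list[tuple[datetime, bool]],
    now: datetime,
    window: timedelta = timedelta(hours=24),
    max_mahr: float = 0.01,
) -> bool:
    """Rollback criterion: pause if billing MaHR over the window exceeds the cap."""
    recent = [flag for ts, flag in billing_events if now - window <= ts <= now]
    if not recent:
        return False  # assumption: no billing traffic is not a reason to pause
    return sum(recent) / len(recent) > max_mahr
```

Writing the pause condition as code before launch is the point: the threshold is then a reviewed decision, not something negotiated in the middle of an incident.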

Why this is the new bar

In 2026, feasibility is essentially free: the model exists, the API is cheap, the UI is a weekend build. The eval is the product decision. Viable (will customers trust it enough to use it and pay for it?) and lovable (does it resolve issues in the channel and moment they need, without being obnoxious about it?) cannot be answered by a model card. They can only be answered by a purpose-built eval that covers real failure modes, not synthetic ones constructed to make the model look good.

When the model is a commodity, the eval is the product moat. Go deeper in "the eval harness for PMs" and see what a working eval portfolio looks like in "build an eval portfolio project". For the guardrails that sit alongside the eval, see "design an agent guardrails system".