
Set a hallucination rate above which you would refuse to ship

What hallucination rate would cause you to refuse to launch?

Updated Jun 2026 · Calibrated to the strong-hire bar

This question is a screening mechanism for one thing: does the candidate treat safety as a pre-launch commitment or a post-launch observation? The interviewer already knows the “right” number is context-dependent. They are checking whether you can name a specific threshold, explain how you measured it, and describe the operational machinery that enforces it.

Structure a strong answer

Start with measurement, then give a tiered threshold anchored to the harm profile of your use case, then name the fallback path, then describe the post-launch monitoring and rollback trigger. Four beats, in order.

Strong answer

"Before I name a threshold I need to define what I'm measuring: hallucination rate on a hand-labeled golden dataset of at least 200 prompts that cover the real distribution of user queries, evaluated before each release. That's the offline gate.

For a customer-facing chatbot, I'd set a hard no-ship line at 3% on that golden dataset during shadow deployment, with a target below 1.5% before general availability. For any output touching a regulated or high-stakes domain (finance, legal, medical), the line drops to 0.5%. These aren't arbitrary: 2026 frontier models run at roughly 3 to 19% depending on task type, so a customer-facing rate below 2% is achievable but not free. In production data I've seen, 2.4% on an 18,000-chat-per-week support product meant roughly 160 escalations and 40 extra agent-hours daily, and dropping below 1% cut escalations by 28%.

The fallback path matters as much as the number. The first time the model is confidently wrong in front of a real user, the answer is not a retry. A confidence gate routes low-confidence outputs to a human-review queue. Every incident is logged, tagged, and reviewed weekly to update the golden dataset and trigger retraining if the pattern repeats.

Launch is not a one-time gate. I'd set a production sampling rate of 2% of outputs reviewed weekly, with an automated rollback to the prior model version if the online rate exceeds the threshold by more than 0.5 percentage points for 48 hours. 'Refuse to launch' in practice means shadow mode, then 1% canary, then full rollout with a feature-flag kill-switch. It's not binary."
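The rollback rule in the last beat of that answer (online rate above the threshold by more than 0.5 percentage points for 48 hours) can be sketched in a few lines. This is a minimal illustration, not a production monitor: `RateSample` and `should_roll_back` are hypothetical names, and the sketch assumes the online rate arrives as periodic samples from the weekly review pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class RateSample:
    """One measurement of the online hallucination rate (hypothetical shape)."""
    timestamp: datetime
    hallucination_rate: float  # fraction, e.g. 0.024 for 2.4%


def should_roll_back(
    samples: Sequence[RateSample],
    threshold: float,
    margin: float = 0.005,                 # 0.5 percentage points
    window: timedelta = timedelta(hours=48),
) -> bool:
    """Return True only if the rate stayed above threshold + margin
    for every sample in the trailing window, and the samples actually
    cover the whole window."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp)
    window_start = ordered[-1].timestamp - window
    # Without data reaching back to the start of the window, the breach
    # cannot be called sustained yet.
    if ordered[0].timestamp > window_start:
        return False
    recent = [s for s in ordered if s.timestamp >= window_start]
    return all(s.hallucination_rate > threshold + margin for s in recent)
```

Requiring every sample in the window to breach is a deliberately conservative choice; a team that prefers faster rollback could trigger on the window average instead.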

Weak answer

"It depends on the use case, and we'd monitor it after launch." This fails on three counts: it gives the interviewer nothing concrete; "we'd monitor it" suggests you haven't thought about what monitoring means operationally (what rate, what sample, what triggers a rollback); and it treats the threshold as a post-launch observation rather than a pre-launch commitment. That is exactly the safety-as-afterthought mindset interviewers are screening against. A variant that also fails: naming a number (say, "5%") with no measurement method and no fallback path. The number without the machinery is theater.

The numbers you need to sound credible

In 2026, frontier models sit at a 3.1 to 19.1% hallucination rate depending on model and task. That is substantially down from the 15 to 45% baselines of 2024, but nowhere near zero. A mathematical constraint also applies: no LLM can simultaneously achieve truthful generation, semantic information preservation, relevance, and knowledge-constrained optimality. Zero hallucination is therefore not a launch goal you can set; the threshold has to be a nonzero rate you can defend.

One counterintuitive fact worth naming: reasoning and thinking models hallucinate 2 to 3 times more than base models on summarization tasks. Adding a reasoning step paradoxically increases hallucination rate for certain task families. “Use the best model” is not a threshold strategy.

The largest reduction lever is not model selection. Web access and RAG reduce hallucination rate by 73 to 86%, independent of model choice. If the model isn’t already grounded in retrieval, that is the first move before debating launch thresholds. Multi-model cross-validation (running the same query through five models) catches 99.1% of turns with issues. A check like that stops most errors before a user sees them, which changes what raw model rate is acceptable in production.
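To see what the retrieval figures imply, here is the arithmetic as a small sketch. It assumes the 73 to 86% reduction applies as a uniform relative cut to the 3.1 to 19.1% baseline range quoted above, which is a simplification; `rate_after_reduction` is a hypothetical helper name.

```python
def rate_after_reduction(baseline_rate: float, relative_reduction: float) -> float:
    """Apply a relative reduction (e.g. 0.73 for 73%) to a baseline rate."""
    return baseline_rate * (1.0 - relative_reduction)


# Baseline range and reduction range as quoted in the text.
baselines = (0.031, 0.191)
reductions = (0.73, 0.86)

for baseline in baselines:
    for reduction in reductions:
        grounded = rate_after_reduction(baseline, reduction)
        print(f"baseline {baseline:.1%}, reduction {reduction:.0%} -> {grounded:.2%}")
```

Under that assumption, a 3.1% baseline lands between roughly 0.4% and 0.8% after grounding, while a 19.1% baseline lands between roughly 2.7% and 5.2%. Grounding alone can bring the best models under a customer-facing line but does not rescue the worst task families.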

Production benchmarks by domain:

  • General Q&A or internal tools: under 10%
  • Customer-facing chatbot or support agent: under 2 to 3%
  • Finance, legal, or medical assistant: under 1%
  • Regulated industry agents: under 0.5%
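The benchmarks above can be turned into an offline ship gate. The sketch below is one possible shape, not a standard: the domain keys and function names are hypothetical, the customer-facing line uses the upper end (3%) of the quoted range, and the Wilson upper bound is an added assumption (it is not part of the benchmarks) used to show how much uncertainty a small golden dataset carries.

```python
import math

# No-ship lines by domain, taken from the list above (as fractions).
NO_SHIP_THRESHOLDS = {
    "general_qa_or_internal": 0.10,
    "customer_facing": 0.03,
    "finance_legal_medical": 0.01,
    "regulated_agent": 0.005,
}


def wilson_upper_bound(errors: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for an observed error rate
    (z = 1.96 corresponds to a two-sided 95% interval)."""
    if n <= 0:
        raise ValueError("golden dataset must contain at least one prompt")
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / denom


def ship_decision(domain: str, errors: int, n: int) -> str:
    """Compare the golden-dataset result against the domain's no-ship line."""
    threshold = NO_SHIP_THRESHOLDS[domain]
    if errors / n >= threshold:
        return "no-ship"
    if wilson_upper_bound(errors, n) >= threshold:
        # Observed rate passes, but the sample is too small to be sure.
        return "inconclusive: enlarge golden dataset"
    return "ship"
```

With 200 prompts and zero observed hallucinations, the upper bound is still about 1.9%. That clears a 3% customer-facing line but cannot confirm a 0.5% regulated line, which is an argument for a larger golden dataset in high-stakes domains.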

The PM judgment

The threshold question is a viability and usability commitment bundled together. Viability: what error rate does the revenue model tolerate? At a given volume, every percentage point of hallucination has a dollar cost in support escalations, legal exposure, or churn. That cost must be legible to stakeholders (legal, engineering, support) because the threshold is a negotiation, not a solo PM call.
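One way to make that cost legible is to price a single percentage point. The sketch below is illustrative only: the function names are hypothetical, and `escalation_fraction` and `cost_per_escalation` are inputs you would have to get from support and finance, not figures from this article.

```python
def hallucinated_chats_per_week(chats_per_week: int, rate: float) -> float:
    """Expected number of chats per week containing a hallucination."""
    return chats_per_week * rate


def weekly_cost(
    chats_per_week: int,
    rate: float,
    escalation_fraction: float,   # share of hallucinated chats that escalate (assumed input)
    cost_per_escalation: float,   # loaded cost of handling one escalation (assumed input)
) -> float:
    """Weekly escalation cost attributable to hallucinations."""
    bad_chats = hallucinated_chats_per_week(chats_per_week, rate)
    return bad_chats * escalation_fraction * cost_per_escalation


def cost_per_point(
    chats_per_week: int,
    escalation_fraction: float,
    cost_per_escalation: float,
) -> float:
    """Weekly cost of one percentage point of hallucination rate."""
    return weekly_cost(chats_per_week, 0.01, escalation_fraction, cost_per_escalation)


# Volume and rate from the strong answer; 18,000 chats at 2.4% is about 432 bad chats a week.
print(round(hallucinated_chats_per_week(18_000, 0.024)))
```

Putting the per-point figure in front of legal, engineering, and support turns the threshold negotiation into a trade-off everyone can read off the same number.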

Usability: when the model is wrong, does the product meet the user where they are with “I’m not sure, here’s a human,” or does it bury them in a confident wrong answer? The confidence-score gate is the mechanism that operationalizes this. The key distinction interviewers probe is offline versus online evaluation: the golden dataset is your pre-launch gate; production sampling is your ongoing signal. Both need explicit thresholds. The PM’s job is to specify both, not just name a number.
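A confidence gate of the kind described here might look like the following. It is a minimal sketch under stated assumptions: `ModelOutput`, `route`, and the 0.8 cutoff are hypothetical, and it presumes the system already produces a usable confidence score between 0 and 1, which is its own calibration problem.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelOutput:
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]


HANDOFF_MESSAGE = "I'm not sure about this one, so I'm connecting you with a person."


def route(output: ModelOutput, min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (destination, message_to_user).

    Low-confidence outputs go to the human-review queue and the user sees
    a handoff message instead of the model's answer.
    """
    if output.confidence < min_confidence:
        return ("human_review", HANDOFF_MESSAGE)
    return ("respond", output.text)
```

The cutoff itself is a second threshold to tune against the golden dataset: set it too low and confident wrong answers get through, set it too high and the review queue absorbs the product's volume.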

Candidates who describe a hand-labeled golden dataset, a specific ship/no-ship threshold, a shadow deployment path, and a human-review queue for low-confidence outputs clear the bar. Candidates who describe none of these do not. See eval harness for PMs for how to build the measurement side, and agent guardrails for the full operational checklist around confidence gates and fallback paths.
