AI hallucination

A hallucination is a confident, plausible-sounding output that is factually incorrect or unfaithful to the source. The danger is the confidence. A model that says “I don’t know” is not hallucinating. A model that fabricates a court citation, invents a drug dosage, or produces a summary that contradicts the document it just read, and states all of it with authority, is. That distinction is what interviewers at AI-native companies listen for. If your definition of hallucination does not separate it from a hedged or refused answer, it is not yet precise enough for an AI PM role.

Two types worth naming separately

Factuality hallucination: the model states something false as true. Typically the correct answer was absent or underrepresented in training data, so the model fills the gap with a plausible fabrication. Classic examples: a non-existent court case in a legal brief, a made-up statistic with a real-sounding source.

Faithfulness hallucination: the underlying source is correct but the model misrepresents, distorts, or omits it. More common in summarization and RAG pipelines: the model has the right document and still produces an answer that does not match it. A contract summary that says “terminable with 30 days notice” when the source says 90 is a faithfulness error.

The distinction matters for mitigation. Factuality errors respond to grounding the model in retrieved sources. Faithfulness errors require stricter output checks: citation verification, faithfulness evals, and sometimes a human review gate. Treating them as the same problem leads to mitigations that only address one type.
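One narrow kind of faithfulness check can be automated cheaply. The sketch below flags numbers that appear in a summary but not in its source, which would catch the 30-versus-90-day example above. The function names are hypothetical. This is a crude illustration, not a full faithfulness eval: it ignores units, paraphrase, and numbers written as words.

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Return the set of numeric tokens (integers or decimals) found in text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def unsupported_numbers(summary: str, source: str) -> set[str]:
    """Numbers that appear in the summary but nowhere in the source."""
    return extract_numbers(summary) - extract_numbers(source)

source = "Either party may terminate this agreement with 90 days notice."
summary = "The contract is terminable with 30 days notice."

# A non-empty result means the summary contains a figure the source does not support.
print(unsupported_numbers(summary, source))  # {'30'}
```

A check like this only covers numeric drift. Distorted or omitted claims still need a labeled eval or human review.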

The rate data you need

Hallucination is not a binary model property. It is a rate, and the rate is task-specific, not model-specific in isolation. In 2026, frontier models span a wide range:

Gemini 2.0 Flash: 0.7% on grounded summarization, the current low-water mark
Frontier models overall: 3.1% to 19.1% across standard benchmarks
Citation tasks, even with extended thinking: 12.4% average hallucination rate across frontier models (the worst-performing task family)
Healthcare queries without mitigation: 64.1%
Legal queries: 17% to 88% depending on model and query type

The reasoning model paradox is counterintuitive and worth citing in an interview: on OpenAI's PersonQA benchmark, o1 hallucinates at 16%, o3 at 33%, and o4-mini at 48%. More reasoning steps can amplify confabulation on factual recall tasks because the model has more opportunity to fill gaps with confident inference rather than admit ignorance. GPT-5 (launched August 2025) centered its headline product claim on a 60% reduction in hallucinations, which signals how mainstream this problem has become as a purchase driver.

The business cost is concrete. AI hallucinations cost businesses $67.4 billion globally in 2024. Knowledge workers spend roughly 4.3 hours per week verifying AI outputs, about $14,200 per employee annually. McKinsey (2025) found 88% of organizations now use AI regularly, and 51% have experienced at least one negative consequence. The verification cost often erases the productivity gain the product was supposed to deliver.
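The per-employee figure can be sanity-checked with simple arithmetic. The sketch below derives the hourly cost implied by the two numbers above. The 52-week year is an assumption and the function names are hypothetical. The source figures may rest on a different working-year definition.

```python
def annual_verification_hours(hours_per_week: float, weeks_per_year: int = 52) -> float:
    """Hours per year spent verifying AI output (52 weeks is an assumption)."""
    return hours_per_week * weeks_per_year

def implied_hourly_cost(annual_cost: float, annual_hours: float) -> float:
    """Loaded hourly cost implied by an annual cost and an annual hour count."""
    return annual_cost / annual_hours

hours = annual_verification_hours(4.3)       # roughly 223.6 hours per year
rate = implied_hourly_cost(14_200, hours)    # roughly 63.5 dollars per hour

print(f"{hours:.1f} hours/year, about ${rate:.2f} per hour")
```

To estimate the cost for a specific team, replace the implied rate with that team's own loaded hourly cost.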

The threshold decision: what rate is acceptable

There is no universal acceptable rate. The decision depends on three variables: stakes, reversibility, and whether a human is in the loop before the output reaches a real consequence.

A 5% rate is acceptable for a creative writing tool where the user expects to edit every draft. It is unacceptable for a clinical decision support tool where a clinician is expected to act directly on the output. A useful diagnostic: if a user catches one wrong answer in ten sessions, does that break trust permanently or prompt a quick correction? In consumer products, one confident wrong answer tends to matter more than ten hedged correct ones.

A practical framing for the threshold decision:

Low stakes, reversible, human review in the loop: 10-15% with visible uncertainty signals is often workable
Medium stakes, partially reversible: 3-7%, require citations, build an eval suite to catch regressions
High stakes, irreversible, or regulated: sub-1% may be required; human review gates are non-negotiable before launch
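The three tiers above can be written down as a small lookup so the threshold is explicit in a spec rather than implied. This is a minimal sketch with hypothetical names. It uses the upper end of each quoted range as the ceiling, and it assumes that any case not clearly low-risk falls into the medium tier. Neither choice comes from the framing itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    max_rate: float                 # upper end of the range quoted above
    requirements: tuple[str, ...]

LOW = Tier("low", 0.15, ("visible uncertainty signals",))
MEDIUM = Tier("medium", 0.07, ("citations", "eval suite for regressions"))
HIGH = Tier("high", 0.01, ("human review gate before launch",))

def pick_tier(stakes: str, reversibility: str, human_review: bool,
              regulated: bool = False) -> Tier:
    """stakes: 'low' | 'medium' | 'high'; reversibility: 'full' | 'partial' | 'none'."""
    if stakes == "high" or reversibility == "none" or regulated:
        return HIGH
    if stakes == "low" and reversibility == "full" and human_review:
        return LOW
    # Assumption: anything not clearly low-risk defaults to the medium tier.
    return MEDIUM

print(pick_tier("low", "full", human_review=True).max_rate)   # 0.15
print(pick_tier("low", "full", human_review=False).name)      # medium
print(pick_tier("medium", "none", human_review=True).name)    # high
```

Note that removing human review moves a low-stakes feature up a tier in this sketch, which mirrors the third variable in the decision.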

The threshold is also a go/no-go gate, not a post-launch metric. Set it before development, not after the launch review when there is organizational pressure to ship.

What the PM owns, and what the ML team owns

This boundary matters. Model-level fixes (RLHF, architectural changes, hallucination-focused fine-tuning) belong to ML engineers. A NAACL 2025 study found that fine-tuning on synthetic datasets reduced hallucination rates by 90-96%. But those changes are slow to iterate and not in a PM’s direct control.

PM-owned mitigations:

Scope to grounded tasks. Open-ended generation from memory is where hallucination rates are highest. Constrain the feature to retrieval-grounded tasks where the model reads provided documents rather than recalling from training. RAG reduces hallucinations by 75-90%, whereas prompting alone, without grounding, tops out at roughly a 15% reduction. The PM decision is choosing which tasks are grounded by design.
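To see what those percentages mean for a feature, apply them to a baseline. The sketch below reads the figures as relative reductions and uses a hypothetical 20% baseline rate. Both the baseline and the function name are illustrative assumptions, not measurements.

```python
def residual_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Hallucination rate left after applying a relative reduction to a baseline."""
    if not 0 <= relative_reduction <= 1:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline_rate * (1 - relative_reduction)

baseline = 0.20  # hypothetical ungrounded rate for the task

rag_best = residual_rate(baseline, 0.90)      # about 0.02
rag_worst = residual_rate(baseline, 0.75)     # about 0.05
prompt_only = residual_rate(baseline, 0.15)   # about 0.17

print(f"RAG: {rag_best:.0%} to {rag_worst:.0%}; prompting alone: {prompt_only:.0%}")
```

On these assumptions, grounding moves the feature from a rate that only suits low-stakes drafting into a range where medium-stakes use becomes plausible. Prompting alone does not.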

Require citations in the UX. If every claim links to a source, errors become visible and auditable. Users can verify; the team can spot regressions in citation accuracy. This is a product design decision, not an ML decision. A 2025 npj Digital Medicine study found GPT-4o’s hallucination rate dropped from 53% to 23% with prompt-based mitigation and structured output constraints.

Build an eval suite before launch. A regression in hallucination rate is silent without automated testing. Without a labeled test set running on every model update, you will not know the rate degraded until users report it. This is the PM’s responsibility. See eval for the mechanics.
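A minimal sketch of such a regression check follows. It runs a labeled test set through two model versions and compares hallucination rates. The models here are stub functions, the substring grader is a crude stand-in for a real faithfulness judge, and every name and the one-point tolerance are hypothetical.

```python
from typing import Callable

Case = tuple[str, str]  # (prompt, reference answer from the labeled test set)

def hallucination_rate(cases: list[Case],
                       model: Callable[[str], str],
                       is_hallucination: Callable[[str, str], bool]) -> float:
    """Fraction of test cases whose model output the grader flags."""
    if not cases:
        raise ValueError("test set is empty")
    flagged = sum(is_hallucination(model(prompt), ref) for prompt, ref in cases)
    return flagged / len(cases)

def has_regressed(baseline: float, current: float, tolerance: float = 0.01) -> bool:
    """True if the current rate exceeds the baseline by more than the tolerance."""
    return current > baseline + tolerance

# Crude stand-in grader: flag any output that does not contain the reference answer.
def missing_reference(output: str, reference: str) -> bool:
    return reference.lower() not in output.lower()

cases = [("Notice period?", "90 days"), ("Governing law?", "Delaware"),
         ("Term length?", "2 years"), ("Renewal?", "automatic")]

old_answers = {p: f"The answer is {r}." for p, r in cases}
new_answers = {**old_answers, "Notice period?": "The answer is 30 days."}

old_rate = hallucination_rate(cases, old_answers.__getitem__, missing_reference)
new_rate = hallucination_rate(cases, new_answers.__getitem__, missing_reference)
print(old_rate, new_rate, has_regressed(old_rate, new_rate))  # 0.0 0.25 True
```

In practice the check would run on every model or prompt update, with a test set large enough that a one-case change is not noise.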

Design uncertainty signals into the UX. “Based on the documents you uploaded” is a signal. “I’m not certain, you may want to verify this” shifts users from blind trust to appropriate skepticism. These are interface decisions that the PM spec controls.

Set the go/no-go gate. Decide what rate in what task context makes the feature unshippable. Commit to that number before engineering starts. The measured rate will move during development, but the gate should be fixed before launch pressure exists.
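One way to make the gate hard to argue with is to test the measured rate conservatively. The sketch below passes the gate only if the upper bound of a 95% Wilson score interval on the eval results sits at or below the committed threshold. Using a confidence bound instead of the raw rate is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def wilson_upper(failures: int, n: int, z: float = 1.96) -> float:
    """Upper bound of the Wilson score interval for a failure proportion."""
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center + half_width

def gate_passes(failures: int, n: int, threshold: float) -> bool:
    """Ship only if the plausible worst-case rate is within the threshold."""
    return wilson_upper(failures, n) <= threshold

# 3 hallucinations in 200 eval cases is an observed rate of 1.5%.
print(gate_passes(3, 200, threshold=0.05))  # True
print(gate_passes(3, 200, threshold=0.03))  # False: 200 cases cannot rule out a rate above 3%
```

The second call shows why sample size matters. An observed 1.5% does not clear a 3% gate when the test set is this small.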

Why this is a viability and lovability problem, not just a safety bug

In 2026, feasibility is effectively free. You can build almost anything. The harder questions are whether users will trust the product enough to keep using it, and whether the unit economics survive once you factor in verification costs. Hallucination attacks both.

A confidently wrong output destroys trust faster than a refusal does. Users who catch one fabricated fact rarely forgive the interface that presented it as settled truth. The retention graph after a hallucination incident tends to show a step-change drop. That is the lovability problem: the moment a user catches a lie stated with authority, the relationship with the product changes.

The viability problem is the cleanup cost. At $14,200 per knowledge worker per year in verification time, and with 51% of organizations reporting negative consequences from AI use, the unit economics of an unmitigated hallucination rate are often worse than not shipping the feature. The PM job is not “can we build this” but “can we build this in a way users trust enough to keep using, without verification costs that erase the efficiency gain.”

Interview expectations

Strong answer

"A hallucination is a confident, plausible-sounding output that is factually wrong or unfaithful to the source. The danger is the confidence: a model that says 'I don't know' is not hallucinating. I separate factuality errors from faithfulness errors because they require different mitigations. On rates: frontier models in 2026 range from 0.7% on grounded summarization to 88% on legal queries. The number is always task-specific. One counterintuitive data point: o3 hallucinates at 33%, o4-mini at 48%, more than o1's 16%, because more reasoning steps can amplify confabulation. For the threshold decision, I'd frame it by stakes, reversibility, and whether there's a human in the loop. For a customer-facing medical tool without human review, even 1-2% may be unshippable. For an internal drafting aid with writer review, 10-15% with clear uncertainty signals may be acceptable. What I own as a PM: scoping to grounded tasks with RAG, requiring citations so errors are auditable, building an eval suite to catch regressions before they reach users, and designing uncertainty signals into the UX. Model-level fine-tuning is for the ML team, but I'd set the threshold that determines whether their improvements are enough to ship."

Weak answer

"AI hallucination is when the AI makes stuff up. You can fix it with better prompting or fine-tuning." This signals a candidate who has used AI products but has not shipped one. No distinction between factuality and faithfulness. No rate data. No threshold judgment. No PM-owned mitigation. Interviewers at AI-native companies read this as a sign the candidate treats quality as someone else's problem.

For the full threshold decision walkthrough, see set a hallucination threshold and when the AI is wrong. For building the eval system that tracks your hallucination rate over time, see eval harness for PMs.