Ground truth (machine learning)

Ground truth is the verified, authoritative reference data against which a model’s outputs are compared. In supervised learning, it is labeled training data: the answer key the model learns from. In generative AI, it shifts meaning: ground truth becomes a curated golden eval set, a collection of prompt and response pairs that represent the correct behavior of the production system. PMs at AI-native companies need both frames. The definition-only answer fails because it says nothing about where ground truth comes from, who decides what “correct” means, or what breaks when it is wrong.

Why quality sets the ceiling

A supervised model rarely outperforms the labels it was trained on. If the labels are noisy, ambiguous, or misaligned with what users actually need, the model learns those same flaws. This is not an engineering problem to solve after launch; it is a product constraint to define before the first label is collected. The PM’s job is to own three things: the labeling schema (the definition of correctness), the quality bar (inter-annotator agreement, which reveals whether the schema is precise enough for two reviewers to reach the same verdict independently), and the monitoring plan (annotation drift, where the meaning of a label quietly shifts over time as annotators calibrate to each other rather than to the original spec).
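One common way to put a number on inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a minimal illustration, not a prescribed method: the function name, the label values, and the sample data are all made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label given their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)


# Illustrative data only: two annotators labeling the same eight transactions.
annotator_1 = ["fraud", "ok", "ok", "fraud", "ok", "ok", "fraud", "ok"]
annotator_2 = ["fraud", "ok", "fraud", "fraud", "ok", "ok", "ok", "ok"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement: 0.75, kappa: {kappa:.2f}")
```

Here the annotators agree on 6 of 8 items, but kappa is noticeably lower than 0.75 because much of that agreement is expected by chance when "ok" dominates. What counts as an acceptable kappa is a product decision the team has to set for its own schema.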

Four ways to collect ground truth

Human annotation: general-purpose labelers classify, rank, or transcribe. Fast and scalable, but quality depends on schema clarity and reviewer selection.
Expert annotation: domain specialists (radiologists, lawyers, licensed engineers) label data where general knowledge is insufficient. Expensive; use for high-stakes domains where wrong labels produce liability, not just lower accuracy.
Behavioral inference: ground truth derived from user actions such as clicks, purchases, skips, and chargebacks. No labeling cost, but it introduces selection and survivorship bias. A fraud model that gets ground truth from chargebacks only ever learns from the cases users bothered to dispute.
Synthetic generation: automated label creation at scale, often via another model. Useful for bootstrapping; must be validated against human review before use in eval sets or fine-tuning.

Failure modes PMs cause

Labeling lag is the gap between a model prediction and when the true outcome is known. A fraud detection model may not receive ground truth until a chargeback resolves weeks later. If you measure model performance against premature or proxy labels, your launch metrics are fiction. The PM must account for lag when defining success metrics and setting the observation window.
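A minimal sketch of what accounting for lag can look like: only score predictions whose observation window has fully elapsed. The 60-day window, the field names, and the sample records are assumptions for illustration; the right window depends on how long outcomes actually take to resolve in your domain.

```python
from datetime import date, timedelta

# Assumption for illustration: chargebacks are treated as resolved after 60 days.
OBSERVATION_WINDOW = timedelta(days=60)


def matured(predictions, as_of, window=OBSERVATION_WINDOW):
    """Keep only predictions old enough that their true outcome should be known."""
    return [p for p in predictions if p["predicted_on"] + window <= as_of]


def precision(predictions):
    """Share of flagged transactions that turned out to be chargebacks."""
    flagged = [p for p in predictions if p["flagged"]]
    if not flagged:
        return None
    return sum(p["chargeback"] for p in flagged) / len(flagged)


# Hypothetical records: "chargeback" is the outcome observed so far, not the final truth.
predictions = [
    {"predicted_on": date(2025, 1, 5), "flagged": True, "chargeback": True},
    {"predicted_on": date(2025, 1, 10), "flagged": True, "chargeback": False},
    {"predicted_on": date(2025, 1, 20), "flagged": True, "chargeback": True},
    {"predicted_on": date(2025, 3, 20), "flagged": True, "chargeback": False},  # too recent
    {"predicted_on": date(2025, 3, 25), "flagged": True, "chargeback": False},  # too recent
]

as_of = date(2025, 4, 1)
naive_precision = precision(predictions)
mature_precision = precision(matured(predictions, as_of))
print(f"naive: {naive_precision:.2f}, matured only: {mature_precision:.2f}")
```

The naive number counts recent transactions as "no chargeback" simply because none has arrived yet, which understates precision. Restricting the metric to matured predictions trades freshness for correctness.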

Distribution shift at launch is when ground truth was collected from one population and the product ships to another. If the training set came from enterprise power users and the first release targets SMBs, the model underperforms in ways that look like an algorithmic failure but trace back to misaligned training data. This is a product decision failure, not an ML failure.
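One simple pre-launch check is to compare the segment mix of the labeled data with the segment mix of the launch population. The sketch below uses total variation distance between the two categorical distributions; that metric, the "segment" field, and the sample counts are illustrative choices, not the only way to measure shift.

```python
from collections import Counter


def segment_shares(records, key="segment"):
    """Fraction of records falling in each segment."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {segment: count / total for segment, count in counts.items()}


def total_variation_distance(p, q):
    """0.0 means identical segment mixes; 1.0 means no overlap at all."""
    segments = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in segments)


# Hypothetical data: labels came mostly from enterprise users, launch targets SMBs.
training_records = [{"segment": "enterprise"}] * 8 + [{"segment": "smb"}] * 2
launch_records = [{"segment": "enterprise"}] * 1 + [{"segment": "smb"}] * 9

shift = total_variation_distance(
    segment_shares(training_records),
    segment_shares(launch_records),
)
print(f"segment shift (total variation distance): {shift:.2f}")
```

A large distance does not prove the model will underperform, but it is a cheap signal that the ground truth may not represent the users the release targets, and that per-segment evaluation is worth doing before launch.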

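A third failure mode worth naming: label leakage (also called target leakage), where training features include information only available after the outcome (for example, using “contract sent” to predict deal closure in a sales model). Offline accuracy inflates, on held-out data as well as on the training set, because the leaked feature is present in both. The model then fails silently in production, where that feature does not yet exist at prediction time.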
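A basic guard against leakage is to record when each feature value became known and compare that with the moment the prediction would be made. This is a sketch under that assumption; the feature names, timestamps, and the `leaky_features` helper are hypothetical.

```python
from datetime import datetime


def leaky_features(feature_recorded_at, prediction_time):
    """Return features whose values were recorded after the prediction was due."""
    return sorted(
        name
        for name, recorded_at in feature_recorded_at.items()
        if recorded_at > prediction_time
    )


# Hypothetical sales deal: the model is supposed to predict closure on 1 February.
prediction_time = datetime(2025, 2, 1)
feature_recorded_at = {
    "company_size": datetime(2025, 1, 10),
    "demo_completed": datetime(2025, 1, 25),
    "contract_sent": datetime(2025, 2, 20),  # only known after the prediction point
}

leaks = leaky_features(feature_recorded_at, prediction_time)
print(leaks)
```

The check only works if feature timestamps are captured in the first place, which is itself a data requirement worth stating early.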

Ground truth in the LLM era

In classic supervised learning, ground truth is typically collected up front and used to train. In generative AI, the relationship is different and ongoing. A golden eval set serves two purposes: it calibrates LLM judges (which grade model output against a rubric), and it is the reference against which hallucination is detected. For evaluation purposes, a hallucination is an output that contradicts the ground truth, so you can only detect it systematically if you have ground truth to compare against. This makes building and maintaining a golden eval set the PM’s first line of defense against trust erosion.
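Calibrating an LLM judge can be sketched as measuring how often the judge's verdict matches the human verdict stored in the golden eval set, and collecting the disagreements for review. Everything named here is an assumption for illustration: the record fields, the "pass"/"fail" verdicts, and especially `stub_judge`, which is a toy keyword rule standing in for a real model call.

```python
def judge_agreement(golden_set, judge):
    """Compare a judge's verdicts with human verdicts on the golden eval set."""
    disagreements = []
    for example in golden_set:
        verdict = judge(example["prompt"], example["response"])
        if verdict != example["human_verdict"]:
            disagreements.append(example["id"])
    agreement_rate = 1 - len(disagreements) / len(golden_set)
    return agreement_rate, disagreements


def stub_judge(prompt, response):
    """Toy stand-in for an LLM judge: fails any response that makes a guarantee."""
    return "fail" if "guarantee" in response.lower() else "pass"


# Hypothetical golden set; human_verdict is the ground truth for each pair.
golden_set = [
    {"id": "g1", "prompt": "What is the refund policy?",
     "response": "Refunds are available within 30 days.", "human_verdict": "pass"},
    {"id": "g2", "prompt": "Can I always get a refund?",
     "response": "We guarantee a refund at any time.", "human_verdict": "fail"},
    {"id": "g3", "prompt": "What is the refund policy?",
     "response": "Refunds are available within 90 days.", "human_verdict": "fail"},
    {"id": "g4", "prompt": "Do you refund shipping?",
     "response": "I can't confirm that; let me check the policy.", "human_verdict": "pass"},
]

rate, missed = judge_agreement(golden_set, stub_judge)
print(f"judge agreement: {rate:.2f}, disagreements: {missed}")
```

The disagreement list is the useful output: each entry is either a judge failure (here the judge misses a factually wrong policy) or a sign that the golden label itself needs revisiting.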

For agentic systems, ground truth is harder still. The correct sequence of actions in a multi-step task is not always observable. Most teams use a combination of outcome-level labels (did the task complete correctly?) and trace-level human review to build eval sets iteratively. Candidates for PM roles at companies like Anthropic, Cognition, and Sierra should understand this complexity before interviewing; it comes up directly.
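One way to picture the combination of outcome-level labels and trace-level review is a record that carries both, with the eval set growing as traces get reviewed. This is a rough sketch only; the `AgentRun` structure, its field names, and the review values are hypothetical and not taken from any particular team's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRun:
    task_id: str
    actions: list                        # recorded trace of the agent's steps
    outcome_correct: bool                # outcome-level label
    trace_review: Optional[str] = None   # "acceptable", "unacceptable", or None if unreviewed


def needs_review(runs):
    """Runs that have an outcome label but no human trace review yet."""
    return [r.task_id for r in runs if r.trace_review is None]


def eval_ready(runs):
    """Runs with both label types, eligible to join the eval set."""
    return [r.task_id for r in runs if r.trace_review is not None]


def right_answer_wrong_path(runs):
    """Outcome looked correct, but a reviewer rejected how the agent got there."""
    return [
        r.task_id
        for r in runs
        if r.outcome_correct and r.trace_review == "unacceptable"
    ]


# Hypothetical runs of a support agent.
runs = [
    AgentRun("t1", ["lookup_order", "issue_refund"], True, "acceptable"),
    AgentRun("t2", ["delete_account", "recreate_account", "issue_refund"], True, "unacceptable"),
    AgentRun("t3", ["lookup_order"], False),
]

print(needs_review(runs), eval_ready(runs), right_answer_wrong_path(runs))
```

The third helper shows why outcome labels alone are not enough: a run can end in the right state after taking steps no reviewer would accept.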

What the interviewer is actually checking

When an interviewer at an AI-first company asks “how did you validate the model?” or “explain ground truth,” the question is a proxy for: does this candidate understand data quality as a product constraint, or just as a technical detail? The weak answer names the definition and stops. The strong answer shows PM ownership of the labeling schema and quality bar, names at least one failure mode the candidate has seen or would watch for, and connects ground truth to a downstream product decision: launch criteria, hallucination thresholds, or eval design.

Strong answer

"Ground truth is the verified reference data you use to measure whether a model's outputs are correct. In supervised learning that's labeled training data; in generative AI it's a curated golden eval set of prompt and response pairs that represent the behavior you want in production. As a PM, I own three things: the labeling schema (who defines what correct means and for whom), annotation quality (inter-annotator agreement tells you if the schema is ambiguous), and labeling lag (for a fraud model, ground truth may not arrive until a chargeback resolves weeks later, so my launch metrics have to account for that observation window). Without a maintained ground truth eval set, I can't set a hallucination threshold, I can't know when the model is good enough to ship, and I can't tell if a model update made things better or worse. For agentic systems specifically, the correct sequence of actions isn't always observable, so we use outcome-level labels combined with trace-level human review to build the eval set incrementally."

Weak answer

"Ground truth is the correct labels we use to train the model. It's like the answer key." This tells the interviewer the candidate has read a glossary. It fails because it shows no understanding of how labels are obtained, who decides what "correct" means, what breaks when labels are stale or leaky, or how ground truth connects to launch decisions. Stopping at the definition signals vocabulary, not judgment.

The 2026 reframe

In 2026, feasibility is nearly free. The bottleneck is not whether a model can do something; it is whether it does it correctly and reliably enough that users trust it. Ground truth is now a viability and lovability constraint owned by the PM. Viability: a model you cannot measure is one you cannot improve, price, or defend to the business. Lovability: hallucinations erode trust faster than any UX failure, and systematic prevention requires a well-maintained golden eval set. The PM who owns the labeling schema and quality bar is the one who ships AI products users actually keep using.

For how to build the eval infrastructure that consumes ground truth, see eval and eval harness for PMs. For the downstream consequence of bad ground truth, see hallucination. For how ground truth is consumed during model improvement, see fine-tuning.