Ground truth

Also known as: ground-truth data, gold label, true label, reference answer, correct answer

The known correct answer used to judge whether a model output is right or wrong. In an eval, you compare the model's output to the ground truth to measure its accuracy. Building ground truth datasets is hard and expensive, and every downstream measurement is only as trustworthy as the ground truth behind it.

The term reached AI practice through machine learning, where it refers to verified labels in a dataset: the definitive 'correct' classification that a model is trying to learn to predict. (The phrase itself is older and comes from fields such as remote sensing, where measurements taken on the ground are used to check data collected from a distance.) In modern AI builder practice, it's used more broadly to mean any authoritative reference point you use to evaluate model outputs. If you're evaluating a customer support AI, your ground truth might be a set of expert-reviewed ideal responses.
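As a rough illustration of comparing model outputs to ground truth, here is a minimal Python sketch of exact-match accuracy. The function names, the normalization rule, and the sample labels are all hypothetical, and exact match is only one possible notion of 'correct'; free-form responses usually need a looser comparison.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Return the fraction of predictions that match their ground-truth answer."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        normalize(pred) == normalize(truth)
        for pred, truth in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


# Hypothetical data: model outputs for three support tickets vs. expert labels.
predictions = ["Refund", "escalate", "refund "]
ground_truth = ["refund", "refund", "refund"]
score = exact_match_accuracy(predictions, ground_truth)
print(f"accuracy: {score:.2f}")
```

Here two of the three outputs match after normalization, so the score is two thirds.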

Building ground truth is one of the less glamorous but most impactful parts of shipping AI products. It requires deciding: what does 'correct' actually mean for your task? Who is qualified to judge? How do you handle edge cases where reasonable people disagree? The answers shape your entire evaluation pipeline. Without ground truth, you can't measure whether your model is getting better or worse.

In agentic settings, ground truth gets harder: evaluating whether an agent completed a multi-step task correctly requires knowing what 'correct' looks like at each step, not just at the final output. This is one reason agent evals are a distinct and thorny subfield. LLM-as-judge (using a model to score other model outputs) is a common workaround when human-labeled ground truth is too expensive to collect at scale.
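To make the step-level point concrete, the sketch below scores an agent trajectory against a reference trajectory, reporting both per-step and final-step correctness. The `Step` type, the tool names, and the strict position-by-position comparison are all assumptions made for illustration; real tasks often allow several valid paths, which this simple comparison would penalize.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    argument: str


def score_trajectory(actual: list[Step], reference: list[Step]) -> dict[str, object]:
    """Compare an agent's steps to reference steps, position by position."""
    if not reference:
        raise ValueError("reference must not be empty")
    # zip stops at the shorter list, so missing steps simply earn no credit.
    matched = sum(a == r for a, r in zip(actual, reference))
    return {
        "step_accuracy": matched / len(reference),
        "final_step_correct": bool(actual) and actual[-1] == reference[-1],
        "length_matches": len(actual) == len(reference),
    }


# Hypothetical refund task: the agent skips the policy check but still issues the refund.
reference = [
    Step("search_orders", "order-123"),
    Step("check_policy", "refund"),
    Step("issue_refund", "order-123"),
]
actual = [
    Step("search_orders", "order-123"),
    Step("issue_refund", "order-123"),
]
result = score_trajectory(actual, reference)
print(result)
```

In this example the final step matches the reference even though only one of the three reference steps was matched in position, which is the kind of gap that checking the final output alone would miss.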

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
