Evals
Also known as: evaluations, AI evaluation, LLM evaluation, model evaluation, evals pipeline
Traditional software has unit tests and integration tests. AI systems need evals. The difference is that there is rarely one correct output: a good response can be phrased in many ways, and a bad one can still look plausible. Evals typically score model outputs against reference answers, against rubrics, or by using another model as a judge, a technique called LLM-as-a-judge.
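As a rough illustration of the reference-answer approach, here is a minimal sketch in Python. The names `normalize`, `run_eval`, `toy_model`, and `CASES` are hypothetical and exist only for this example; a real eval would call an actual model and would usually need a more forgiving comparison than normalized exact match.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def run_eval(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose output matches the reference."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(normalize(model(q)) == normalize(ref) for q, ref in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: canned answers only)."""
    answers = {"Capital of France?": "Paris.", "2 + 2?": "five"}
    return answers.get(prompt, "")


# Hypothetical eval set: (prompt, reference answer) pairs.
CASES = [("Capital of France?", "paris"), ("2 + 2?", "4")]

score = run_eval(toy_model, CASES)
print(f"pass rate: {score:.0%}")
```

Exact match is the strictest possible scorer. Because good answers can be worded in many ways, it tends to undercount correct responses, which is one reason rubrics and model judges are used instead.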
Evals do several jobs: catching hallucinations before they reach users, measuring the impact of prompt changes, comparing one model version to another, and validating that a RAG system retrieves the right documents for representative queries. Without evals, the answer to 'is this better?' after a change is just intuition. With evals, it is a measurement.
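The retrieval check and the before-and-after comparison described above can be sketched together. This example assumes a hit-rate-at-k metric (a query counts as a hit if any relevant document appears in the top k results), which is only one of several possible retrieval metrics. The function `hit_rate_at_k`, the queries, and the document IDs are all made up for illustration.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    if not relevant:
        raise ValueError("no labelled queries")
    hits = sum(
        1
        for query, rel_docs in relevant.items()
        if rel_docs & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)


# Hypothetical labels: which document should come back for each query.
relevant = {"reset password": {"doc_2"}, "refund policy": {"doc_5"}}

# Hypothetical ranked results from a retriever before and after a change.
baseline = {
    "reset password": ["doc_7", "doc_2", "doc_9"],
    "refund policy": ["doc_4", "doc_1", "doc_3"],
}
candidate = {
    "reset password": ["doc_2", "doc_7", "doc_9"],
    "refund policy": ["doc_5", "doc_4", "doc_1"],
}

before = hit_rate_at_k(baseline, relevant, k=3)
after = hit_rate_at_k(candidate, relevant, k=3)
print(f"hit rate@3: {before:.2f} -> {after:.2f}")
```

Running the same labelled queries against both versions turns the question of whether the change helped into two numbers that can be compared directly.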
In the agentic context, evals become more complex. You need to evaluate not just final outputs but intermediate steps: did the agent call the right tool, did it use the result correctly, did it know when to stop? Building eval suites for agents is one of the unsolved scaling problems for teams trying to move from working demos to reliable production systems.
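The three questions above (right tool, correct use of the result, knowing when to stop) can be turned into checks over a recorded agent trace. The sketch below assumes a trace is a simple list of tool calls. The `Step` structure, the `check_trace` function, and the tool names are hypothetical, and the substring test for "used the result" is a crude proxy rather than a reliable measure.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One tool call in a recorded agent run (hypothetical trace format)."""
    tool: str
    result: str
    args: dict = field(default_factory=dict)


def check_trace(
    steps: list[Step], final_answer: str, expected_tool: str, max_steps: int
) -> dict[str, bool]:
    """Score intermediate behaviour, not just the final output."""
    tool_results = [s.result for s in steps if s.tool == expected_tool]
    return {
        # Did the agent call the right tool at all?
        "called_expected_tool": bool(tool_results),
        # Crude proxy: does the tool's output appear in the final answer?
        "used_tool_result": any(bool(r) and r in final_answer for r in tool_results),
        # Did it stop within the step budget?
        "stopped_in_time": len(steps) <= max_steps,
    }


good_trace = [Step(tool="get_weather", result="12 degrees", args={"city": "Oslo"})]
good = check_trace(good_trace, "It is 12 degrees in Oslo.", "get_weather", max_steps=3)

# An agent that loops on the wrong tool and never answers from tool output.
bad_trace = [Step(tool="web_search", result="no results") for _ in range(4)]
bad = check_trace(bad_trace, "It is probably sunny.", "get_weather", max_steps=3)

print(good)
print(bad)
```

Reporting each check separately, rather than a single pass or fail, makes it easier to see which part of the agent's behaviour regressed.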