Concept·Patterns & Practices·Added 1 month ago

Eval

Also known as: evaluation, AI eval, model eval, LLM eval, benchmark eval, evals

Short for evaluation: any test or benchmark used to measure how well a model or AI system performs. An eval can be automated (run a thousand test cases and score the outputs) or human-judged. Evals tell you whether your AI system is getting better or worse as you change prompts, models, or architecture.

Evals are the AI builder's equivalent of unit tests and QA. The basic pattern: you have a set of inputs, you know what good outputs look like (the ground truth), you run the model, and you measure how often the outputs match your expectations. That measurement is your eval score. Iterate on prompts, fine-tuning, or retrieval, and watch whether the score goes up or down.
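The basic pattern can be sketched in a few lines of Python. This is a minimal illustration, not a real harness: `run_eval`, `toy_model`, and `CASES` are hypothetical names, the model is a stand-in lookup table rather than an actual LLM call, and exact string match is assumed as the scoring rule.

```python
from typing import Callable


def run_eval(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model output equals the expected output."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


# Each case pairs an input with its ground-truth output (illustrative data only).
CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("capital of Peru", "Lima"),
]

score = run_eval(toy_model, CASES)
print(f"eval score: {score:.2f}")
```

Swapping `toy_model` for a function that calls a real model keeps the rest of the loop unchanged, which is what makes the score comparable across prompt or model changes.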

Evals come in several forms. Automated evals use predefined test cases with programmatic scoring (is the output valid JSON? does it contain the right entity?). LLM-as-judge evals use another model to score the outputs, which is useful when correctness is subjective or hard to check with a regex. Human evals are expensive and slow, but they are often the gold standard for complex tasks where neither programmatic checks nor judge models fully capture quality.
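The two programmatic checks mentioned above might look like the following sketch. The function names and the case-insensitive substring rule for entity matching are assumptions made for illustration; only the standard-library `json` module is used.

```python
import json


def is_valid_json(output: str) -> bool:
    """Check whether the model output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def contains_entity(output: str, entity: str) -> bool:
    """Check whether the expected entity appears in the output (case-insensitive)."""
    return entity.casefold() in output.casefold()


def score_output(output: str, entity: str) -> dict[str, bool]:
    """Run every programmatic check on one output and report each result by name."""
    return {
        "valid_json": is_valid_json(output),
        "has_entity": contains_entity(output, entity),
    }


good = score_output('{"city": "Paris"}', "paris")
bad = score_output("city: Paris", "Paris")
print(good)
print(bad)
```

Reporting each check separately, rather than collapsing them into one pass/fail flag, makes it easier to see which kind of failure a change introduced.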

The practical lesson most builders learn the hard way: you need evals before you change anything. Without a baseline, you can't tell whether a new prompt made things better or just different, and every deployment is a guess. Evaluation engineering is maturing fast as a discipline, with dedicated tools, the evals-engineer role, and growing recognition that the eval is part of the product, not a last-minute add-on.
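To illustrate why a baseline matters, here is a small sketch that compares per-case results before and after a change. `compare_to_baseline` and the pass/fail lists are hypothetical, and the data is made up to show a change that leaves the score flat while altering which cases pass.

```python
def compare_to_baseline(baseline: list[bool], candidate: list[bool]) -> dict:
    """Compare per-case pass/fail results from two runs over the same eval set."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same non-empty eval set")
    n = len(baseline)
    pairs = list(zip(baseline, candidate))
    return {
        "baseline_score": sum(baseline) / n,
        "candidate_score": sum(candidate) / n,
        # Cases that failed before and pass now.
        "fixed": [i for i, (b, c) in enumerate(pairs) if c and not b],
        # Cases that passed before and fail now.
        "regressed": [i for i, (b, c) in enumerate(pairs) if b and not c],
    }


# Illustrative results only: True means the case passed.
before = [True, True, False, False]
after = [True, False, True, False]

report = compare_to_baseline(before, after)
print(report)
```

In this made-up run both scores are 0.5, yet one case was fixed and another regressed: the new prompt made things different, not better, which only a stored baseline can reveal.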

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.

Related terms