Concept · Patterns & Practices · Added 2 months ago

LLM-as-judge

Also known as: LLM-as-a-judge, model-as-judge, AI evaluator, judge model

Using one language model to evaluate the output of another. Instead of writing hand-coded tests or relying on humans to review responses, you prompt a 'judge' model to score outputs against criteria you define.

LLM-as-judge solves a problem every builder hits quickly: LLM outputs are non-deterministic and often open-ended, so traditional unit tests that check for exact expected values don't work. A judge model can assess things like relevance, tone, factual accuracy, or whether a response stayed on-topic, at a scale and speed no human review team can match.

The basic setup is a two-model loop. One model (your app) generates a response. A second model (the judge) reads that response alongside a rubric you've written and scores it, often with a short explanation. Research suggests a well-configured judge can agree with human raters around 85% of the time, which is comparable to how often human raters agree with one another.
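The judge half of that loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the rubric text, the 1-5 scale, the JSON reply format, and the names `build_judge_prompt`, `parse_verdict`, `judge`, and `fake_judge_model` are all hypothetical. The call to the judge model is passed in as a plain function so the sketch does not assume any particular provider's API.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric; the criteria and the 1-5 scale are illustrative choices.
RUBRIC = """Score the RESPONSE to the QUESTION from 1 (poor) to 5 (excellent).
Criteria: relevance to the question, factual accuracy, appropriate tone.
Reply with JSON only: {"score": <integer 1-5>, "explanation": "<one sentence>"}"""


@dataclass
class Verdict:
    score: int
    explanation: str


def build_judge_prompt(question: str, response: str) -> str:
    """Combine the rubric with the item being judged."""
    return f"{RUBRIC}\n\nQUESTION:\n{question}\n\nRESPONSE:\n{response}"


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply, rejecting out-of-range scores."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return Verdict(score=score, explanation=str(data.get("explanation", "")))


def judge(question: str, response: str, call_model: Callable[[str], str]) -> Verdict:
    """Run one judging step. `call_model` sends a prompt to the judge model
    and returns its text reply; plug in your provider's client here."""
    return parse_verdict(call_model(build_judge_prompt(question, response)))


# Stand-in for a real judge model so the sketch runs offline (an assumption).
def fake_judge_model(prompt: str) -> str:
    return '{"score": 4, "explanation": "Relevant and accurate, slightly terse."}'


verdict = judge("What is the capital of France?", "Paris.", fake_judge_model)
print(verdict.score, verdict.explanation)
```

Asking for a structured reply and validating it before use is one way to keep a malformed or out-of-range judgment from silently entering your metrics.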

The catch is that judge models can have their own biases, like preferring longer responses or outputs that echo the question's framing. Good practice means validating your judge against a small set of human-labeled examples before you trust it at scale. Deterministic checks (format, length, required phrases) still handle structural requirements; the LLM judge handles the semantic quality on top.
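Both ideas in this paragraph can be sketched briefly: measuring how often the judge agrees with human labels, and running deterministic structural checks alongside the judge. The function names, the sample labels, and the 0.8 threshold below are assumptions made for illustration. The labels are made up, not real evaluation data.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of examples where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def passes_structural_checks(
    response: str, max_chars: int, required_phrases: Sequence[str]
) -> bool:
    """Deterministic gate: non-empty, within length, contains required phrases."""
    text = response.strip()
    if not text or len(text) > max_chars:
        return False
    lowered = text.lower()
    return all(phrase.lower() in lowered for phrase in required_phrases)


# Made-up labels for illustration only.
human_sample = ["pass", "fail", "pass", "pass", "fail"]
judge_sample = ["pass", "fail", "fail", "pass", "fail"]

AGREEMENT_THRESHOLD = 0.8  # hypothetical bar; pick one that fits your risk tolerance
rate = agreement_rate(judge_sample, human_sample)
print(f"agreement: {rate:.0%}, trusted: {rate >= AGREEMENT_THRESHOLD}")
```

Raw agreement is the simplest measure. If your labels are heavily skewed toward one class, a chance-corrected statistic such as Cohen's kappa may give a more honest picture.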

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
