Evals engineer
Also known as: AI evaluations engineer, evaluation engineer, LLM evaluator
Evals are test cases for AI. They check things like: does the model answer accurately on the tasks it was designed for, does it stay within policy, does it handle edge cases gracefully, and does it degrade when the underlying model version changes? Writing good evals requires understanding both the technical behavior of models and the business requirements the model is serving.
Evals engineering has matured into a recognizable specialization as teams realized that deploying AI without robust evaluation is essentially flying blind. The job title appears explicitly in postings at AI labs and larger AI product companies. At smaller teams, the work often belongs to AI engineers or product engineers who spend a significant portion of their time on evaluation design.
What makes the role distinct is the combination of skills it requires: software engineering for building test harnesses and data pipelines, product thinking for deciding what to measure and why, and enough model literacy to understand why a model behaves a certain way. Tools like LangSmith and various open-source frameworks have lowered the barrier, but designing evals that actually catch real problems before they reach users is still a craft.