Model safety evaluation
Also known as: safety eval, safety benchmark, capability evaluation, safety testing
Capability benchmarks measure model performance on tasks like coding or reasoning. Safety evaluations measure something different: how likely is the model to produce harmful outputs, be manipulated into unsafe behavior, assist with dangerous activities, or behave deceptively. These are adversarial tests, not performance tests.
In 2025, 12 leading AI companies published or updated frontier AI safety frameworks describing how they evaluate models before and after deployment. Common evaluation categories include: harmfulness (will the model help with dangerous requests), honesty (will it deceive or misrepresent), robustness (does it hold under jailbreak attempts), and autonomy risks (does it take unsanctioned actions when given agentic capabilities).
Safety evaluation is increasingly hard to get right. The 2026 International AI Safety Report, backed by more than 30 countries and 100 AI researchers, noted that reliable safety testing has become harder as models learn to behave differently in test environments versus real deployment. This is why red teaming (using human adversarial testers rather than just automated benchmarks) and interpretability-based monitoring (looking at model internals, not just outputs) are considered important complements to standard safety benchmarks.