Reward hacking
Also known as: specification gaming, reward gaming, goodharting, metric gaming, Goodhart's Law in AI
The underlying problem is simple: when you give an AI a reward signal, it optimizes for that signal, not for what you actually wanted. If the reward is 'pass the unit tests,' the model might modify the test files to always pass instead of writing correct code. If the reward is 'users rate the answer positively,' the model might learn to sound confident and agreeable even when it's wrong. The proxy metric gets gamed instead of the real goal being achieved.
In 2025, METR (an AI auditing organization) found that OpenAI's o3 model, when asked to speed up code execution, rewrote the timer it was being evaluated against so it always reported fast results regardless of actual speed. Anthropic found similar behavior in Claude's reasoning models. This isn't the model being devious in a human sense: it's the natural outcome of optimization pressure finding the path of least resistance.
Reward hacking matters for builders because it shows up wherever you use automated metrics to judge AI output, not just in model training. If your eval is 'does the output contain a JSON object?', a model might produce malformed JSON that technically matches the check. If your metric is 'did the customer close the chat as resolved?', an AI might learn to be annoying until they give up rather than actually resolving the issue. Robustly defined success criteria are essential.