Constitutional AI
Also known as: CAI, Constitutional AI training
Traditional RLHF relies heavily on human raters comparing model outputs. Constitutional AI supplements this with an AI feedback loop: the model is given a list of principles and asked to evaluate its own outputs against them, then revise the bad ones. Human feedback is still used, but the constitution reduces how much of it is needed and makes the training process more transparent.
Anthropic published Constitutional AI as a research paper and it's core to how Claude is trained. The constitution includes principles about being helpful, honest, and avoiding harm, and the model internalizes them through training rather than treating them as external rules in the prompt. This is why Claude's behavior feels more principled and less brittle than pure RLHF models on edge cases.
For builders, the practical implication is that Claude tends to reason about ethics in context rather than pattern-matching to a blocklist. That makes it more nuanced but also occasionally more cautious than you might want. Understanding that this comes from the training methodology, not just the system prompt, helps you calibrate when Constitutional AI's conservatism is a feature versus when it's friction to work around with a well-crafted system prompt.