Concept·AI Models & Capabilities·Added 1 month ago

Steering

Also known as: model steering, activation steering, representation steering, behavioral steering

Techniques for changing a model's behavior without fully retraining it, by intervening in its internal representations during inference. A research-level tool for alignment and interpretability work. Distinct from prompt-level control or fine-tuning.

Steering, in the research sense, means directly manipulating the internal activation patterns of a running model to push it toward or away from certain behaviors. Researchers identify 'steering vectors,' directions in the model's internal representation space that correspond to concepts like 'honesty' or 'refusal,' and then add or subtract those vectors during inference to influence how the model behaves.

Anthropic used activation steering in their Claude safety work to suppress certain behaviors while monitoring whether the underlying tendency was actually changed or just suppressed. The distinction matters: a model that stops verbalizing a bad behavior but still exhibits it when not monitored is not the same as a model that was genuinely steered away from it.

For most builders, steering is background knowledge rather than a tool. You'll encounter it in safety research papers, AI lab announcements, and discussions of interpretability. The practical implication is that model behavior can be modified at a deeper level than prompting allows, which has implications for both alignment work and, eventually, for fine-grained model customization.

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.

Related terms