
DPO

Also known as: Direct Preference Optimization

A training method that teaches a model to prefer better responses over worse ones, without needing a separate reward model. It's a simpler alternative to RLHF that's widely used for aligning models to human preferences.

Traditional RLHF requires two steps: first train a reward model from human preference data, then use reinforcement learning to optimize the main model against that reward model. DPO skips the separate reward model entirely. Instead, it works directly with preference pairs: for each prompt, you provide a 'chosen' (good) response and a 'rejected' (bad) response. The model is trained to raise the likelihood of the chosen response relative to the rejected one, measured against a frozen reference copy of the model so that it does not drift too far from its starting point.
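The per-pair objective can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function names `dpo_loss` and `softplus` are hypothetical, the inputs are assumed to be summed token log-probabilities of each response under the model being trained (the policy) and under a frozen reference model, and `beta=0.1` is just an illustrative default for the strength of the pull toward the reference.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not a recommendation
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair."""
    # How much more likely each response is under the policy
    # than under the frozen reference model (in log space).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss shrinks as the policy favours the chosen response
    # over the rejected one by more than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

When the policy and the reference agree, the margin is zero and the loss equals ln 2 (about 0.693); shifting probability toward the chosen response pushes the loss below that value. In a real setup the log-probabilities would come from forward passes of both models, and the loss would be averaged over a batch and minimized with gradient descent.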

DPO proved popular because it's simpler to implement, requires less compute, and involves less hyperparameter tuning than full RLHF pipelines. It made preference-based alignment accessible to smaller teams. In practice, it's used to shape tone, improve instruction-following style, penalize unsafe outputs, and steer model behavior toward specific preferences.

A key limitation is that DPO is 'off-policy': it learns from a fixed set of preference pairs, and if those responses were generated by a different model, they may not resemble what the model being trained would actually produce, so the training signal is mismatched. More recent DPO variants and the rise of RL-based methods like GRPO (used in DeepSeek-R1) have expanded the post-training toolkit. Most production models today use a combination of SFT (supervised fine-tuning on example responses), DPO, and RL rather than any single method alone.

