DPO
Also known as: Direct Preference Optimization
Traditional RLHF requires two steps: first train a reward model from human preference data, then use reinforcement learning to optimize the main model against that reward model. DPO skips the separate reward model entirely. Instead, it works directly with preference pairs: for each prompt, you provide a 'chosen' (good) response and a 'rejected' (bad) response. The model is trained to raise the probability of the chosen response relative to the rejected one, measured against a frozen reference copy of the model so that it does not drift too far from its starting point.
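To make the preference-pair objective concrete, here is a minimal sketch of the per-pair loss in plain Python. It follows the standard DPO formulation, which compares the model being trained against a frozen reference model and scales the result by a strength parameter beta. The function name, argument names, example pair, and the default beta of 0.1 are illustrative assumptions, not part of this entry or of any particular library. Real implementations compute the log-probabilities with the model itself and average the loss over batches.

```python
import math

# Hypothetical example of one preference pair (field names are illustrative).
pair = {
    "prompt": "Explain what a mutex is.",
    "chosen": "A mutex is a lock that lets only one thread enter a section at a time.",
    "rejected": "It's a computer thing.",
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response
    given the prompt, under either the model being trained ("policy") or
    the frozen reference model ("ref"). beta=0.1 is an assumed default.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive when the policy favors the chosen response over the
    # rejected one more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and reference agree, the margin is zero and the loss equals log 2. Widening the gap in favor of the chosen response pushes the loss toward zero, and widening it the wrong way increases the loss, which is the training pressure described above.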
DPO proved popular because it's simpler to implement, requires less compute, and involves less hyperparameter tuning than full RLHF pipelines. It made preference-based alignment accessible to smaller teams. In practice, it's used to shape tone, improve instruction-following style, penalize unsafe outputs, and steer model behavior toward specific preferences.
A key limitation is that DPO is 'off-policy': it learns from a fixed set of preference pairs rather than from responses the model generates during training. If the pairs were generated by a different model, they may not resemble the outputs your model actually produces, so the training signal is mismatched. More recent DPO variants and the rise of RL-based methods like GRPO (used in DeepSeek-R1) have expanded the post-training toolkit. Many production models today use a combination of SFT (supervised fine-tuning on example responses), DPO, and RL rather than any single method alone.