Direct Preference Optimization (DPO)
Align a model with human preferences using chosen/rejected pairs: no reward model, no PPO, just a classification loss that directly shapes the policy.
Intent & Description
🎯 Intent
Align model behavior with human preferences more simply than RLHF — no reward model to train, no RL instability, just supervised training on preference pairs.
📋 Context
RLHF requires training a separate reward model and then running PPO, which is expensive, unstable, and sensitive to hyperparameters. DPO starts from the same KL-constrained reward-maximization objective and shows that its optimal policy can be written in closed form, so the objective can be optimized directly on preference pairs with a standard supervised, classification-style loss.
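For reference, the loss that results from this derivation is usually written as below. The symbols are not defined in the text above and are introduced here as assumptions: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $x$ a prompt, $y_w$ and $y_l$ the chosen and rejected responses, $\mathcal{D}$ the preference dataset, $\sigma$ the logistic sigmoid, and $\beta > 0$ a temperature that controls how strongly the policy is held near the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Read this as binary classification on the pair: the "logit" is the scaled difference between how much the policy has raised the chosen response and how much it has raised the rejected one, each relative to the reference.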
💡 Solution
Collect preference pairs: for each prompt, a chosen response (human-preferred) and a rejected response (human-dispreferred). Train with the DPO loss, which raises the log probability of the chosen response and lowers that of the rejected one, both measured relative to a frozen reference model (typically the SFT checkpoint). The reference model acts as an implicit KL regularizer: it keeps the policy close to the SFT baseline without an explicit RL step.
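The per-pair loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a training implementation: `PairLogProbs` is a hypothetical container, the four inputs are assumed to be summed token log probabilities of each response under the policy and the reference model, and `beta=0.1` is only an illustrative default.

```python
import math
from dataclasses import dataclass


@dataclass
class PairLogProbs:
    """Hypothetical container: summed token log-probs for one preference pair."""
    policy_chosen: float    # log pi_theta(chosen | prompt)
    policy_rejected: float  # log pi_theta(rejected | prompt)
    ref_chosen: float       # log pi_ref(chosen | prompt)
    ref_rejected: float     # log pi_ref(rejected | prompt)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pair: PairLogProbs, beta: float = 0.1) -> float:
    """DPO loss for a single pair; beta is an assumed illustrative default."""
    # How far the policy has moved from the reference on each response.
    chosen_logratio = pair.policy_chosen - pair.ref_chosen
    rejected_logratio = pair.policy_rejected - pair.ref_rejected
    # Positive margin means the policy favors the chosen response more than
    # the reference does; the loss shrinks as the margin grows.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss equals log(2).
    at_reference = PairLogProbs(-10.0, -12.0, -10.0, -12.0)
    # Policy has shifted probability toward the chosen response.
    improved = PairLogProbs(-8.0, -14.0, -10.0, -12.0)
    print(dpo_loss(at_reference), dpo_loss(improved))
```

In a real training loop the policy log probabilities would come from a differentiable forward pass and the loss would be averaged over a batch; the reference log probabilities are constants because the reference model is frozen.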
Real-world Use Case
📌 TL;DR
RLHF without the RL. Train directly on chosen/rejected pairs — same alignment direction, a fraction of the complexity, none of the PPO instability.
Advantages
- No reward model to train and maintain — dramatically simplifies the alignment pipeline
- Stable training dynamics — standard supervised learning, zero PPO instability
- Alignment quality competitive with RLHF at a fraction of the infrastructure cost
Disadvantages
- Quality depends heavily on preference data quality — noisy or inconsistent labels degrade alignment
- The reference model must remain available during training, since every step needs its log probabilities for the implicit KL term, which adds memory and compute
- May underperform full RLHF on complex multi-dimensional alignment objectives