Reinforcement Learning from Human Feedback (RLHF)
Fine-tune a language model using a reward model trained on human preferences as the training signal: PPO pushes the policy toward high-reward outputs while a KL penalty keeps it close to the SFT model, which limits (but does not eliminate) reward hacking.
Intent & Description
🎯 Intent
Align model behavior with nuanced human preferences — helpfulness, harmlessness, truthfulness, tone — that supervised training data alone can’t capture.
📋 Context
Instruction fine-tuning teaches format and task completion. It doesn’t capture what makes a response genuinely good by human standards — appropriate length, nuanced helpfulness, avoiding subtle harms. A learned reward model captures these preferences; RL optimizes against them.
💡 Solution
Three-stage pipeline. (1) SFT: fine-tune the base model on high-quality demonstrations. (2) Reward Model: train a separate scorer on human preference pairs using the Bradley-Terry model, so that preferred responses receive higher scores than rejected ones. (3) PPO: update the SFT model to maximize reward model scores, with a KL divergence penalty against the SFT checkpoint that limits how far the policy can drift toward exploiting reward model weaknesses.
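The reward-model and PPO stages can be sketched numerically. The following is a minimal illustration in plain Python, not a training implementation: the function names, the `beta` default of 0.1, and the use of summed per-token log-probability differences as the KL estimate are assumptions made for this example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Stage 2: negative log-likelihood of one human preference pair.

    Bradley-Terry models P(chosen beats rejected) as
    sigmoid(score_chosen - score_rejected).
    """
    return -log_sigmoid(score_chosen - score_rejected)


def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Stage 3: the quantity PPO maximizes for one sampled response.

    logp_policy / logp_ref are per-token log-probs of the sampled tokens
    under the current policy and the frozen SFT checkpoint. Their summed
    difference is a single-sample estimate of KL(policy || SFT).
    beta is the KL coefficient (assumed value, tuned in practice).
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate


# The loss is small when the reward model already ranks the pair correctly
# and large when it ranks the pair the wrong way round.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))

# A response the policy finds far more likely than the SFT model does
# has part of its reward model score taken back by the penalty.
print(kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.5], beta=0.1))
```

In a real pipeline the scores come from a neural reward model and the loss is averaged over a batch of preference pairs; the scalar version here only shows the shape of the two objectives.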
Real-world Use Case
📌 TL;DR
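The alignment stack behind most major commercial LLMs. Powerful but expensive: three training stages, PPO instability, and reward hacking are all real costs. Try DPO (Direct Preference Optimization) first, since it trains on preference pairs directly without a separate reward model or RL loop; reach for RLHF when DPO falls short.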
Advantages
- Strong alignment quality: the foundation of most major commercial aligned LLMs
- Reward model captures nuanced human preferences that supervised labels can’t express directly
- Can simultaneously optimize multiple alignment dimensions (helpfulness, safety, honesty)
Disadvantages
- Three-stage pipeline — SFT, reward model, PPO — is expensive and complex to tune
- PPO training is notoriously unstable and sensitive to hyperparameters, especially the KL coefficient
- Reward hacking — model learns to exploit reward model weaknesses rather than genuinely aligning
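The sensitivity to the KL coefficient and the link to reward hacking can be seen in a toy setting. Over a finite set of candidate responses, the policy that maximizes expected reward minus beta times KL(policy || reference) has the closed form pi*(y) ∝ ref(y) · exp(r(y) / beta). The sketch below evaluates that formula; the three candidate responses, their probabilities, and their scores are hypothetical numbers chosen for illustration, with the third standing in for an output the reward model overscores.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref) on a finite set.

    pi*(y) is proportional to ref(y) * exp(r(y) / beta). The max reward is
    subtracted before exponentiating for numerical stability.
    """
    max_r = max(rewards)
    weights = [p * math.exp((r - max_r) / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy setup: three candidate responses.
ref_probs = [0.60, 0.39, 0.01]  # reference (SFT) policy
rewards = [1.0, 1.2, 3.0]       # reward model scores; the third is the "exploit"

for beta in (0.1, 1.0, 10.0):
    pi = kl_regularized_optimum(ref_probs, rewards, beta)
    print(f"beta={beta}: policy={[round(x, 4) for x in pi]}, "
          f"KL from SFT={kl_divergence(pi, ref_probs):.4f}")
```

With a small beta the optimum puts almost all of its mass on the overscored response, while a large beta leaves the policy close to the SFT distribution and gains little reward. Real PPO training only approximates this optimum, but the same trade-off is why the KL coefficient needs careful tuning.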