Reinforcement Learning from Human Feedback (RLHF)

Intent & Description

🎯 Intent

Align model behavior with nuanced human preferences — helpfulness, harmlessness, truthfulness, tone — that supervised training data alone can’t capture.

📋 Context

Instruction fine-tuning teaches format and task completion. It doesn’t capture what makes a response genuinely good by human standards — appropriate length, nuanced helpfulness, avoiding subtle harms. A learned reward model captures these preferences; RL optimizes against them.

💡 Solution

Three-stage pipeline. (1) SFT: fine-tune the base model on high-quality demonstrations. (2) Reward model: train a separate scorer on human preference pairs using the Bradley-Terry model, so that preferred responses score higher than rejected ones. (3) PPO: update the SFT model to maximize reward model scores, with a KL divergence penalty against the SFT checkpoint that keeps the policy close to it and limits how far it can exploit reward model weaknesses.
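
As a rough illustration of stages (2) and (3), the sketch below shows the Bradley-Terry preference loss for a single pair and a KL-penalized reward for a single sampled response. The function names, the coefficient `beta`, and its default of 0.1 are illustrative assumptions, not taken from any particular library. The KL term here is a simple sequence-level estimate from sampled log-probabilities; practical implementations often apply the penalty per token.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a
    # numerically stable form for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(
    reward: float,
    logp_policy: list[float],
    logp_sft: list[float],
    beta: float = 0.1,  # hypothetical default; tuned per setup in practice
) -> float:
    """Reward model score minus beta times a sampled KL estimate.

    logp_policy / logp_sft hold per-token log-probabilities of the same
    sampled response under the current policy and the frozen SFT model.
    """
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return reward - beta * kl_estimate


# A correctly ordered pair is penalized less than a reversed one.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))

# Drifting away from the SFT model lowers the effective reward.
print(kl_penalized_reward(1.0, [-0.5, -1.0], [-1.0, -2.0]))
```

The reward model is trained by minimizing the first function averaged over preference pairs; PPO then maximizes the second quantity, so a response that scores well only by moving far from the SFT distribution gains less than its raw score suggests.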

Real-world Use Case

Producing the final aligned model for safety-critical or user-facing deployment where DPO’s simpler approach is insufficient. Training models that must simultaneously optimize helpfulness, harmlessness, and honesty. The alignment foundation of GPT-4, Claude, and Gemini-class assistants.

📌 TL;DR

The alignment stack behind major commercial LLMs such as GPT-4, Claude, and Gemini-class assistants. Powerful but expensive: three training stages, PPO instability, and reward hacking are all real costs. Try DPO first; reach for RLHF when DPO falls short.