Process Reward Model
Train a verifier that scores each reasoning step, not just the final answer — catching right-answer-wrong-reasoning before it gets reinforced.
Intent & Description
🎯 Intent
Get step-level signal on reasoning quality so you can reject chains that got the right answer the wrong way.
📋 Context
You’re training or evaluating a model on multi-step reasoning (math problems, multi-hop QA, logical deduction). Your outcome reward model only scores the final answer — and the model has learned to shortcut through steps as long as the last number lands right.
💡 Solution
Collect step-level labels (correct / neutral / incorrect / hallucination) for chain-of-thought traces, then train a classifier to predict the label of each step. At inference, score every step and reject candidates whose intermediate steps score poorly. The same step scores can drive test-time search and fine-tuning of the generator.
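The inference half of this solution can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `StepScorer`, `ScoredChain`, `score_chain`, `filter_chains`, the 0.5 threshold, and the `toy_scorer` stand-in are all hypothetical names and values chosen for the example. In practice the scorer would wrap a trained step classifier.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: maps (problem, previous steps, current step) to the
# estimated probability that the current step is correct.
StepScorer = Callable[[str, List[str], str], float]


@dataclass
class ScoredChain:
    steps: List[str]
    step_scores: List[float]

    @property
    def weakest(self) -> float:
        # Treat a chain as only as sound as its worst step.
        return min(self.step_scores) if self.step_scores else 0.0


def score_chain(problem: str, steps: List[str], scorer: StepScorer) -> ScoredChain:
    # Score each step given the problem and the steps that precede it.
    scores = [scorer(problem, steps[:i], step) for i, step in enumerate(steps)]
    return ScoredChain(steps=steps, step_scores=scores)


def filter_chains(
    problem: str,
    candidates: List[List[str]],
    scorer: StepScorer,
    threshold: float = 0.5,  # assumed cutoff; tune on held-out labeled steps
) -> List[ScoredChain]:
    # Keep only chains whose weakest step clears the threshold.
    scored = [score_chain(problem, steps, scorer) for steps in candidates]
    return [chain for chain in scored if chain.weakest >= threshold]


def toy_scorer(problem: str, previous: List[str], step: str) -> float:
    # Stand-in for a trained PRM: penalizes steps that admit to guessing.
    return 0.1 if "guess" in step else 0.9


candidates = [
    ["2 + 2 = 4", "4 * 3 = 12"],
    ["guess that the answer is 12"],
]
kept = filter_chains("What is (2 + 2) * 3?", candidates, toy_scorer)
```

Both candidates reach 12, but only the first survives the filter, because the second contains a step that the scorer rates below the threshold.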
Real-world Use Case
- Outcome-only reward reinforces shortcut reasoning that lands on the right answer the wrong way.
- Step-level labels (correct, neutral, incorrect, hallucination) can be collected at scale.
- Test-time search or fine-tuning can consume step-level scores.
📌 TL;DR
Score every reasoning step, not just the final answer. Catches ‘right answer, wrong method’ before it gets baked in. Annotation-expensive but signal-rich.
Advantages
- Catches wrong-reasoning-right-answer cases that outcome-only reward misses.
- Enables tree-search and best-of-N with finer-grained signal.
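As an illustration of the best-of-N case, the sketch below reranks candidates by an aggregate of their per-step scores. The function names `aggregate` and `best_of_n`, the example scores, and the choice between minimum and product aggregation are assumptions made for this example. Which aggregation works best depends on the task and on how the PRM was trained.

```python
import math
from typing import List, Sequence, Tuple


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    # Collapse per-step scores into one chain-level score.
    if not step_scores:
        return 0.0
    if how == "min":
        return min(step_scores)  # chain is as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # probability-like; penalizes long chains
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: List[Tuple[str, List[float]]], how: str = "min") -> str:
    # Each candidate is (final answer, per-step PRM scores).
    if not candidates:
        raise ValueError("need at least one candidate")
    best = max(candidates, key=lambda c: aggregate(c[1], how))
    return best[0]


# Hypothetical scores for two sampled chains.
samples = [
    ("A", [0.6, 0.6, 0.6, 0.6]),
    ("B", [0.5, 0.99]),
]
pick_min = best_of_n(samples, how="min")
pick_prod = best_of_n(samples, how="prod")
```

With these scores the two aggregations disagree: minimum prefers "A" (0.6 vs 0.5), while product prefers "B" (about 0.50 vs about 0.13), which shows that the aggregation rule is itself a design decision.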
Disadvantages
- Step-level annotation is costly: step labels are harder to collect than outcome labels.
- PRM calibration drifts as the generator's capability improves, so the PRM needs periodic retraining.
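One possible way to notice calibration drift is to track expected calibration error (ECE) on a freshly labeled sample of steps from the current generator. The sketch below is a plain-Python version under stated assumptions: labels are binarized (1 for a step annotated as correct, 0 otherwise, which is only one possible treatment of neutral steps), the 10 equal-width bins are a conventional default, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,  # assumed default; equal-width bins over [0, 1]
) -> float:
    # probs: PRM-predicted probability that each step is correct.
    # labels: 1 if the step was annotated correct, else 0 (assumed mapping).
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("need at least one labeled step")

    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of the data.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical check: the PRM says 0.9 on four steps, but only one is correct.
drifted = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
```

A value that rises across successive generator checkpoints would be one signal that the PRM is due for retraining. The acceptable level is a project-specific choice.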