ReST-EM
Self-improve the model by training on its own high-quality outputs.
Intent & Description
🎯 Intent
Bootstrap a stronger reasoning model by sampling the model’s own outputs, keeping only those a verifier accepts, and fine-tuning on the winners, then repeating the cycle.
📋 Context
High-quality reasoning traces for fine-tuning are expensive to collect when they require human labelers. ReST-EM sidesteps this by using the model itself as the data generator and a separate verifier (or an external oracle) as the quality filter. The result is an EM-style loop: the E-step generates candidate outputs, and the M-step fine-tunes on the ones judged correct.
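One way to make the EM reading concrete is sketched below. The notation is introduced here for illustration and is an assumption, not taken from the text above: a problem set $\mathcal{D}$, model parameters $\theta_t$ at iteration $t$, and a binary verifier reward $r(x, y) \in \{0, 1\}$.

```latex
% E-step: sample from the current model and keep only verified outputs
\mathcal{D}_t = \{(x, y) : x \in \mathcal{D},\ y \sim p_{\theta_t}(\cdot \mid x),\ r(x, y) = 1\}

% M-step: maximise the log-likelihood of the kept outputs
\theta_{t+1} = \arg\max_{\theta} \sum_{(x, y) \in \mathcal{D}_t} \log p_{\theta}(y \mid x)
```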
💡 Solution
(1) Sample many completions from the current model for each training problem. (2) Filter the completions with a verifier (unit tests, a ground-truth checker, or a reward model). (3) Fine-tune the model on the passing completions. (4) Repeat with the improved model. Each iteration is meant to raise the quality floor, because the next round of samples comes from a model already trained on verified outputs. The pattern requires control over fine-tuning, so it is not an API prompting pattern. See also: STaR-bootstrapping, generate-and-test-strategy, reflexion.
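The four steps can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not a reference implementation: `rest_em`, `sample_fn`, `verify_fn`, and `finetune_fn` are hypothetical names, and the "model" is a toy dictionary so the example runs without any ML library. Whether each round continues from the previous checkpoint or restarts from the base model is a design choice the steps above leave open; this sketch continues from the previous model.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (problem, completion)

def rest_em(model, problems, sample_fn, verify_fn, finetune_fn,
            iterations=3, samples_per_problem=8):
    """Hypothetical driver: the three callables are supplied by the caller."""
    for _ in range(iterations):
        # Steps 1-2: sample from the current model, keep verified completions.
        winners: List[Example] = [
            (p, c)
            for p in problems
            for c in sample_fn(model, p, samples_per_problem)
            if verify_fn(p, c)
        ]
        if not winners:  # nothing passed, so there is no training signal
            break
        # Step 3: fine-tune on the winners. Step 4: loop with the new model.
        model = finetune_fn(model, winners)
    return model

# --- Toy stand-ins so the sketch runs end to end (not a real LLM) ---
def toy_sample(model: Dict[str, str], problem: str, k: int) -> List[str]:
    # A "trained" problem is answered from memory; otherwise guess a digit.
    if problem in model:
        return [model[problem]] * k
    return [str(random.randint(0, 9)) for _ in range(k)]

def toy_verify(problem: str, completion: str) -> bool:
    # Ground-truth checker for problems of the form "3+4".
    a, b = problem.split("+")
    return int(a) + int(b) == int(completion)

def toy_finetune(model: Dict[str, str], winners: List[Example]) -> Dict[str, str]:
    # "Fine-tuning" here just memorises the verified answers.
    return {**model, **dict(winners)}

if __name__ == "__main__":
    random.seed(0)
    print(rest_em({}, ["1+2", "3+4", "2+2"], toy_sample, toy_verify, toy_finetune))
```

In a real pipeline, `sample_fn` would call the model's generation routine with a nonzero temperature and `finetune_fn` would run a supervised fine-tuning job. Only verified pairs ever reach the training step, which is the property the loop depends on.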
Real-world Use Case
- Fine-tuning pipelines where you have verifiable tasks but no human reasoning traces.
- Code generation, math, or logic domains where correctness is programmatically checkable.
- Distilling reasoning capability from a large model into a smaller one via self-generated data.
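For the code-generation case, "programmatically checkable" can be as simple as running each candidate against unit tests. The sketch below assumes each completion is Python source defining a function named `solve`; that convention and the helper name `passes_unit_tests` are hypothetical. It uses `exec` directly, which is unsafe for untrusted model output, so a real pipeline would need a sandbox and a timeout.

```python
from typing import Any, List, Tuple

def passes_unit_tests(source: str, cases: List[Tuple[tuple, Any]],
                      entry: str = "solve") -> bool:
    """Return True only if `source` defines `entry` and it passes every case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # UNSAFE outside a sandbox; toy demo only
        fn = namespace[entry]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, a missing function, and runtime errors all count as failures.
        return False

# Each case is (arguments, expected result) for an "add two numbers" task.
cases = [((2, 3), 5), ((-1, 1), 0)]

good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a * b\n"
broken = "def solve(a, b) return a + b"

# Only completions that pass every test would be kept for fine-tuning.
kept = [s for s in (good, bad, broken) if passes_unit_tests(s, cases)]
```

Here `kept` contains only `good`: the multiplying candidate fails the first case and the malformed one fails to compile.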
Source
📌 TL;DR
Generate, verify, fine-tune on winners, repeat — let the model teach itself to reason better.
Advantages
- Generates training data without human annotation — scales cheaply.
- Each iteration can raise the model’s reasoning floor, provided the verifier keeps admitting only correct outputs.
- Works well in domains with strong verifiers (code, math).
Disadvantages
- Requires fine-tuning access — not applicable to API-only deployments.
- Verifier quality gates everything; a weak verifier lets wrong answers into the training set.
- Can reinforce confident-but-wrong reasoning patterns if the verifier has blind spots.
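The blind-spot risk is easiest to see with a verifier that checks only the final answer. The sketch below is illustrative: the `Answer: <value>` last-line convention and the name `final_answer_verifier` are assumptions made for the example, not part of the pattern.

```python
def final_answer_verifier(trace: str, expected: str) -> bool:
    """Accept a trace if its last line is 'Answer: <expected>'.

    The reasoning above the last line is never inspected.
    """
    lines = trace.strip().splitlines()
    if not lines:
        return False
    prefix = "Answer:"
    last_line = lines[-1]
    return last_line.startswith(prefix) and last_line[len(prefix):].strip() == expected

# Task: simplify 16/64.
sound = "gcd(16, 64) = 16, so 16/64 = 1/4\nAnswer: 1/4"
flawed = "Cancel the 6 in 16 and 64, leaving 1/4\nAnswer: 1/4"  # invalid step, right answer
wrong = "16/64 = 1/3\nAnswer: 1/3"

accepted = [t for t in (sound, flawed, wrong) if final_answer_verifier(t, "1/4")]
```

Both `sound` and `flawed` are accepted, so fine-tuning on `accepted` would reward the invalid digit-cancelling step alongside the correct derivation. A verifier that also checks intermediate steps, or a task set where lucky shortcuts rarely land on the right answer, narrows this gap.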