Best-of-N Sampling
Generate N candidates, score them with a reward model or rule-based scorer, return the best — quality lift without retraining.
Intent & Description
🎯 Intent
Trade inference cost for quality by picking the best output from multiple candidates.
📋 Context
Your LLM's output quality varies noticeably from sample to sample on tasks such as code reviews, translations, and customer replies. You have a scorer that can rank candidates, and running the model a few extra times per prompt is affordable.
💡 Solution
Generate N candidates at non-zero temperature so that the samples actually differ. Score each with a reward model or rule-based scorer, then return the highest-scoring candidate (top-1) or the K highest (top-K). The BoNBoN approach goes one step further: it fine-tunes a model to mimic the Best-of-N (BoN) distribution directly, so serving no longer needs N samples per request.
Real-world Use Case
- A scorer or reward model exists that ranks candidates better than the generator selects them.
- Quality lift from selecting the best of N samples justifies the N-fold inference cost.
- Temperature can be raised enough to produce meaningfully diverse candidates.
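A rough way to judge whether the N-fold cost is worth paying is to estimate how quickly extra samples help. The sketch below assumes two idealizations that are not in the text above: each sample is acceptable independently with the same probability `p`, and the scorer always picks an acceptable candidate when one exists. Under those assumptions the figure is an upper bound on what a real scorer delivers. The function names are hypothetical.

```python
from typing import Optional


def p_at_least_one_good(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is acceptable."""
    return 1.0 - (1.0 - p) ** n


def smallest_n(p: float, target: float, max_n: int = 1024) -> Optional[int]:
    """Smallest n whose success chance reaches target, or None if max_n is too small."""
    for n in range(1, max_n + 1):
        if p_at_least_one_good(p, n) >= target:
            return n
    return None


# Cost grows linearly with n, while the success chance flattens out.
for n in (1, 2, 4, 8):
    print(f"N={n}  cost x{n}  P(at least one good) = {p_at_least_one_good(0.3, n):.3f}")

print("N needed for a 90% chance at p = 0.3:", smallest_n(0.3, 0.9))
```

If raising the temperature does not make candidates meaningfully different, the independence assumption fails and the real gain will be smaller than this estimate.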
📌 TL;DR
Sample N times, pick the best. Quality goes up, cost goes up linearly. Works as long as your scorer isn’t gameable.
Advantages
- Quality lift without retraining the base model.
- Simple trade-off knob: increase N for more quality, decrease for less cost.
Disadvantages
- Cost scales linearly with N — expensive at large N.
- Reward hacking: selecting the maximum over N candidates favors outputs that exploit a flawed scorer's blind spots, so scores rise while real quality may not.
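The reward hacking risk can be illustrated with a small simulation. It is a toy model under stated assumptions, not a measurement of any real system: each candidate has a hidden true quality drawn from a standard normal, and the scorer sees that quality plus independent Gaussian error. Real scorer flaws are often systematic rather than random, and `simulate` and its parameters are hypothetical names.

```python
import random
from statistics import mean
from typing import Tuple


def simulate(n: int, noise: float, trials: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Return (mean scorer value, mean true quality) of the best-of-n pick."""
    rng = random.Random(seed)
    picked_proxy, picked_true = [], []
    for _trial in range(trials):
        candidates = []
        for _sample in range(n):
            true_q = rng.gauss(0.0, 1.0)              # hidden true quality
            proxy = true_q + rng.gauss(0.0, noise)    # what the flawed scorer reports
            candidates.append((proxy, true_q))
        proxy, true_q = max(candidates)               # select on the scorer's value
        picked_proxy.append(proxy)
        picked_true.append(true_q)
    return mean(picked_proxy), mean(picked_true)


for n in (1, 4, 16, 64):
    proxy, true_q = simulate(n, noise=2.0)
    print(f"N={n:>2}  scorer says {proxy:+.2f}  actual quality {true_q:+.2f}")
```

In this model the scorer's reported value for the selected candidate climbs faster than its actual quality as N grows, which is why a held-out check (human review or a second, independent scorer) is worth keeping alongside the scorer used for selection.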