Self-Consistency
Run the same prompt N times at non-zero temperature, aggregate by majority vote — higher accuracy on reasoning tasks with variance as a free confidence signal.
Intent & Description
🎯 Intent
Mitigate hallucination on reasoning-heavy tasks by aggregating across multiple independent samples.
📋 Context
Your model is mostly right on math word problems and multi-step logic, but occasionally invents a wrong intermediate chain and confidently produces the wrong answer. You can run the same prompt several times in parallel and extract a comparable answer from each.
💡 Solution
Run the same prompt N times with non-zero temperature. Extract the answer from each. Aggregate: majority vote for discrete answers, median for numeric, judge for free-form. Sample variance across runs is logged as a confidence signal.
Real-world Use Case
- Reasoning-heavy questions where the model is mostly right but sometimes invents a wrong chain.
- Answers are extractable in comparable form (discrete, numeric, or judge-able).
- Cost of N samples is acceptable relative to the quality lift.
Source
📌 TL;DR
Sample N times, majority vote wins. Accuracy goes up, cost goes up proportionally. Variance is a free confidence signal. Good fit for reasoning tasks with comparable answer formats.
Advantages
- Higher accuracy on reasoning benchmarks at moderate cost.
- Variance across samples is a free uncertainty estimate — no extra calls needed.
Disadvantages
- Cost scales linearly with N.
- Free-form aggregation requires a judge model — not truly free.