Generate-and-Test Strategy
Generate candidate solutions, run them against a verifier, keep what passes.
Intent & Description
🎯 Intent
Decouple generation from validation. The model generates candidates; an external or programmatic verifier filters them.
📋 Context
LLMs are probabilistic — they generate plausible outputs, not guaranteed-correct ones. For tasks with checkable outputs (code, math proofs, SQL queries, structured data), running the output through a verifier is far more reliable than asking the model to self-evaluate.
💡 Solution
Build a loop: (1) prompt the model to generate N candidate solutions, (2) run each through a verifier (unit tests, a compiler, a constraint checker, another model), (3) return the passing candidates, or feed the failures back into the next generation round with the verifier's error output, stopping once a fixed retry budget is spent. Combine with best-of-N sampling so candidates can be generated and checked in parallel. See also: best-of-n-sampling, evaluator-optimizer, reflexion.
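The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `generate_and_test`, `Verdict`, `fake_generate`, and `verify_add` are hypothetical names, the generator is a stub standing in for a real model call, and the defaults for `n` and `max_rounds` are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    error: str = ""  # verifier message fed back to the generator on failure


def generate_and_test(
    generate: Callable[[str, int], list[str]],  # hypothetical: (prompt, n) -> candidates
    verify: Callable[[str], Verdict],           # hypothetical: candidate -> Verdict
    task: str,
    n: int = 4,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes, or None once the budget is spent."""
    prompt = task
    for _ in range(max_rounds):
        failures = []
        for candidate in generate(prompt, n):
            verdict = verify(candidate)
            if verdict.passed:
                return candidate
            failures.append((candidate, verdict.error))
        # Feed the concrete failures back into the next round's prompt.
        feedback = "\n".join(f"Candidate:\n{c}\nError: {e}" for c, e in failures)
        prompt = f"{task}\n\nPrevious attempts failed:\n{feedback}"
    return None


def fake_generate(prompt: str, n: int) -> list[str]:
    # Stand-in for a model call: it only "fixes" the bug after seeing feedback.
    body = "return a + b" if "Previous attempts failed" in prompt else "return a - b"
    return [f"def add(a, b):\n    {body}"] * n


def verify_add(source: str) -> Verdict:
    namespace: dict = {}
    try:
        exec(source, namespace)  # toy only; a real verifier should sandbox untrusted code
        assert namespace["add"](2, 3) == 5, "add(2, 3) != 5"
    except Exception as exc:
        return Verdict(False, f"{type(exc).__name__}: {exc}")
    return Verdict(True)


result = generate_and_test(fake_generate, verify_add, "Write add(a, b).", n=2)
```

In this toy run the first round fails, the error text is appended to the prompt, and the second round produces a passing candidate. The `max_rounds` cap is what keeps the loop from running indefinitely when no candidate can pass.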
Real-world Use Case
- Code generation — run the generated code against tests, loop on failures.
- SQL / query generation — execute against a sandbox DB, catch errors.
- Structured output generation — validate schema compliance programmatically.
- Any task where “correct” has a programmatic definition.
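As one illustration of a programmatic verifier, the SQL case might look like the sketch below. It assumes SQLite as the sandbox engine (the original does not name one), and `verify_sql` and `SCHEMA` are hypothetical names chosen for the example.

```python
import sqlite3


def verify_sql(query: str, schema: str) -> tuple[bool, str]:
    """Run `query` against a throwaway in-memory database built from `schema`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)       # build the sandbox tables
        conn.execute(query).fetchall()   # execute the candidate query
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)           # error text to feed back to the generator
    finally:
        conn.close()


SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
ok, err = verify_sql("SELECT name FROM users", SCHEMA)
bad, bad_err = verify_sql("SELECT email FROM users", SCHEMA)
```

This check only confirms that a query executes against the schema. Confirming that it returns the right rows would need seeded data and expected results, which is the same test-quality concern that applies to any verifier.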
📌 TL;DR
Generate multiple candidates, run them, keep what works — don’t trust the model to self-grade.
Advantages
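- Correctness is checked by the verifier rather than assumed, so reliability rests on the verifier and not on the model's self-assessment.
- Failures yield concrete error messages (stack traces, failing assertions, compiler output) that give the next iteration's prompt something specific to fix.
- Scales naturally: run more candidates in parallel to raise the chance that at least one passes.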
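To make the scaling point concrete, here is a rough back-of-the-envelope calculation. It assumes each candidate passes independently with the same probability `p`, which is a simplification (samples from one model on one prompt tend to be correlated, so real gains are usually smaller). `chance_any_passes` is a hypothetical helper name.

```python
def chance_any_passes(p: float, n: int) -> float:
    """Probability that at least one of n candidates passes,
    assuming independent candidates with per-candidate pass rate p."""
    return 1.0 - (1.0 - p) ** n


# Compare a single sample with larger batches at a 30% per-candidate pass rate.
for n in (1, 5, 10):
    print(n, round(chance_any_passes(0.3, n), 3))
```

Under these assumptions the returns diminish as N grows, so the useful batch size depends on how expensive each generation and verification run is.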
Disadvantages
- Requires a verifier — not all tasks have one.
- Test quality gates everything; bad tests pass bad code.
- Can loop forever on unsolvable problems without a step budget or exit condition.