Evaluation-Driven Development
Write the eval before writing the first prompt — freeze what "good" looks like, then let those metrics drive every model, prompt, and tool decision.
Intent & Description
🎯 Intent
Forbid building the LLM application before its evaluation harness exists — freeze the eval set first and let those metrics drive model selection, prompting, and every subsequent change.
📋 Context
The typical LLM project starts with a prompt prototype that “feels right,” then circles back to evaluation when stakeholders ask for numbers. By then, there’s no baseline, no comparator, and every change is judged by vibe. This pattern prevents that.
💡 Solution
Before authoring the first prompt, write the eval. Define what “good” means in a checkable form: an expected-output set, a judge prompt scored against a frozen rubric, a deterministic checker, or a mix. Build the eval set from real user inputs, or from synthetic inputs that span the task dimensions. Pin the rubric and the eval set together as a versioned artifact. Every prompt change, model swap, and tool edit runs through the harness; any drop against the last recorded score blocks the change.
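As an illustration of the steps above, here is a minimal sketch of a harness that exists before any prompt does. Everything in it is an assumption made for the example: the `EvalCase` fields, the three toy cases, the `exact_match` checker, and `stub_system`, which stands in for the prompt and model under test and does not call an LLM. A judge prompt or a richer checker could replace `exact_match` without changing the shape.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    expected: str


# Hypothetical frozen eval set; a real one would be loaded from a versioned file.
EVAL_SET = (
    EvalCase("c1", "What is 2 + 2?", "4"),
    EvalCase("c2", "Capital of France?", "Paris"),
    EvalCase("c3", "Opposite of hot?", "cold"),
)


def fingerprint(cases: Sequence[EvalCase]) -> str:
    """Hash the eval set so each score can be tied to an exact version of it."""
    payload = json.dumps([[c.case_id, c.user_input, c.expected] for c in cases])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def exact_match(output: str, expected: str) -> bool:
    """Deterministic checker: case-insensitive match after trimming whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(system: Callable[[str], str],
             cases: Sequence[EvalCase] = EVAL_SET) -> dict:
    """Run every case through the system and report the fraction that pass."""
    passed = sum(exact_match(system(c.user_input), c.expected) for c in cases)
    return {"eval_set": fingerprint(cases), "score": passed / len(cases)}


def stub_system(user_input: str) -> str:
    """Placeholder for the prompt + model under test (assumption: no LLM call)."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": " paris "}
    return canned.get(user_input, "unknown")


if __name__ == "__main__":
    print(run_eval(stub_system))
```

Recording the fingerprint next to the score makes it visible when two numbers were produced on different versions of the eval set and therefore should not be compared.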
Real-world Use Case
- You are starting an LLM application whose prompts, models, or tools will evolve over its lifetime.
- Multiple engineers will work on the same prompt and need a shared comparator.
- Quality regressions are user-visible and must be caught before deployment.
📌 TL;DR
Write the eval before the prompt — define “good” first, then build toward it. Every change gets a score; no regression ships.
Advantages
- Every model swap, prompt edit, and tool change has a single, objective comparator from day one.
- Surfaces regressions early — every commit is a measurement.
- Forces explicit articulation of “what good looks like” before a single line of prompt is written.
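One way to picture the shared comparator is a small gate that each commit's scores pass through. The sketch below is a hypothetical illustration, not a prescribed implementation: the function name `find_regressions`, the metric names, and the scores are invented for the example. The `tolerance` parameter is also an assumption, included only because judge-based scores can be noisy; its default of 0.0 treats any drop as a regression.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """Return one message per metric where the candidate scores below baseline."""
    problems = []
    for metric, base_score in sorted(baseline.items()):
        if metric not in candidate:
            # A metric that silently disappears is treated as a regression.
            problems.append(f"{metric}: missing from candidate run")
        elif candidate[metric] < base_score - tolerance:
            problems.append(
                f"{metric}: {base_score:.3f} -> {candidate[metric]:.3f}"
            )
    return problems


if __name__ == "__main__":
    # Made-up scores for illustration only.
    baseline = {"exact_match": 0.80, "rubric_score": 0.70}
    candidate = {"exact_match": 0.85, "rubric_score": 0.60}

    regressions = find_regressions(baseline, candidate)
    for line in regressions:
        print("REGRESSION", line)
    # A non-zero exit status lets a CI job block the change.
    raise SystemExit(1 if regressions else 0)
```

Comparing metric by metric, instead of averaging them, keeps an improvement on one metric from hiding a drop on another.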
Disadvantages
- Front-loaded eval work delays the first shippable prototype.
- Eval sets drift away from production traffic if not periodically refreshed.
- A frozen rubric can become a target in itself: prompts tuned repeatedly against the same eval set can overfit to it without improving on real traffic.