Eval Harness
Run a held-out golden dataset against agent versions on every meaningful change — quantify quality, catch regressions, and gate promotions with hard numbers.
Intent & Description
🎯 Intent
Run a held-out dataset against agent versions to detect regressions and measure improvement.
📋 Context
An agent’s output depends on its prompt, model version, retrieval choices, and tool wiring, and unlike an ordinary function it is not guaranteed to return the same output for the same input. A small change anywhere in that stack can silently shift behavior in ways that a few hand-tested examples will not reveal.
💡 Solution
Build a golden dataset of (input, expected output) pairs and keep it held out from prompt tuning. Run each candidate version against the dataset and score every output against its expected output. Compare the champion (current version) with the challenger (proposed version). Promote on quality lift; block on regression. Re-run on every meaningful change.
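The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Example` type, the `exact_match` scorer, the `gate` function, and the two toy agents are all hypothetical names introduced here, and a real harness would likely use a richer scorer and a noise-aware promotion threshold.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Assumption: an agent version can be treated as a function from input text to output text.
Agent = Callable[[str], str]
# Assumption: a scorer maps (actual, expected) to a score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass(frozen=True)
class Example:
    """One golden-dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def exact_match(actual: str, expected: str) -> float:
    """Simplest possible scorer: case- and whitespace-insensitive equality."""
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0


def evaluate(agent: Agent, dataset: list[Example], scorer: Scorer = exact_match) -> list[float]:
    """Run one agent version over the whole dataset; returns one score per example."""
    return [scorer(agent(ex.input), ex.expected) for ex in dataset]


def gate(champion_scores: list[float], challenger_scores: list[float], min_lift: float = 0.0) -> str:
    """Promote only if the challenger's mean score beats the champion's by more than min_lift."""
    lift = mean(challenger_scores) - mean(champion_scores)
    return "promote" if lift > min_lift else "block"


# Toy golden dataset and two stand-in agent versions (lookup tables instead of real model calls).
GOLDEN = [
    Example("2+2", "4"),
    Example("capital of France", "Paris"),
    Example("3*3", "9"),
]
_CHAMPION_ANSWERS = {"2+2": "4", "capital of France": "paris", "3*3": "6"}
_CHALLENGER_ANSWERS = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}


def champion(prompt: str) -> str:
    return _CHAMPION_ANSWERS.get(prompt, "")


def challenger(prompt: str) -> str:
    return _CHALLENGER_ANSWERS.get(prompt, "")


champion_scores = evaluate(champion, GOLDEN)      # [1.0, 1.0, 0.0]
challenger_scores = evaluate(challenger, GOLDEN)  # [1.0, 1.0, 1.0]
decision = gate(champion_scores, challenger_scores)
print(decision)  # prints "promote"
```

A tie is treated as "block" here, which matches the rule that a challenger has to beat the champion to ship. Raising `min_lift` above zero is one way to avoid promoting on differences that are within run-to-run noise.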
Real-world Use Case
- You suspect that a change that “feels better” is silently regressing quality in your system.
- You can construct a golden dataset of (input, expected output) pairs for the task.
- Champion-vs-challenger comparison drives promotion decisions.
📌 TL;DR
Build a golden dataset and make it the arbiter of every change — if the challenger doesn’t beat the champion on the eval, it doesn’t ship.
Advantages
- Quality becomes measurable, comparable, and trendable — not vibes-based.
- Releases gain a quantitative gate.
Disadvantages
- The eval only measures what the dataset covers: a high score can hide real-world failures on inputs the golden examples do not represent.
- LLM-as-judge scoring adds its own cost: the judge has to be calibrated, and it can introduce bias of its own.
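One common way to put a number on judge calibration is to have humans label a small sample and measure how often the judge agrees with them beyond chance. The sketch below computes Cohen's kappa for that purpose. It is an illustrative assumption rather than something this pattern prescribes; the function name and the "pass"/"fail" labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between judge verdicts and human verdicts.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(judge_labels)
    # Fraction of examples where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    labels = set(judge_freq) | set(human_freq)
    expected = sum(judge_freq[label] * human_freq[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere; kappa is undefined,
        # so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on four sampled examples.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)
print(kappa)  # prints 0.5
```

A low kappa on the sample suggests the judge prompt or rubric needs work before its scores are trusted as a promotion gate.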