Eval as Contract
Treat your eval suite as a binding contract — releases ship only if evals pass, and changing evals is an architectural review, not a config tweak.
Intent & Description
🎯 Intent
Treat the eval suite as the contract the agent must satisfy — releases ship only if evals pass.
📋 Context
You ship an agent to real users and are expected to hold a stable quality bar release after release. You already have an eval suite that gives you a numeric read on quality. The problem is that the bar is aspirational: engineers can ship past failing evals if they offer enough justification. Stakeholders need the bar enforced, not just measured.
💡 Solution
Define a tiered eval suite: blocking evals (must pass for release) and advisory evals (tracked but not blocking). Wire blocking evals into CI. Block PRs and releases when blocking evals fail. Treat eval changes as architectural changes — require review and signoff, not just a commit.
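A minimal sketch of the tiered gate described above, assuming each eval produces a single score that is compared against a fixed threshold. The names `Tier`, `EvalResult`, `release_gate`, and `exit_code` are hypothetical and chosen for illustration; a real suite may well score evals differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BLOCKING = "blocking"   # must pass for a release to ship
    ADVISORY = "advisory"   # tracked and reported, never blocks


@dataclass(frozen=True)
class EvalResult:
    name: str
    tier: Tier
    score: float
    threshold: float  # assumption: higher is better, pass means score >= threshold

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


@dataclass(frozen=True)
class GateDecision:
    ship: bool
    blocking_failures: list[str]
    advisory_failures: list[str]


def release_gate(results: list[EvalResult]) -> GateDecision:
    """Ship only if every blocking eval passed; advisory failures are reported only."""
    blocking = [r.name for r in results if r.tier is Tier.BLOCKING and not r.passed]
    advisory = [r.name for r in results if r.tier is Tier.ADVISORY and not r.passed]
    return GateDecision(ship=not blocking, blocking_failures=blocking,
                        advisory_failures=advisory)


def exit_code(results: list[EvalResult]) -> int:
    """Process exit code for a CI step: nonzero fails the job and blocks the release."""
    return 0 if release_gate(results).ship else 1
```

A CI step could end with `sys.exit(exit_code(results))`, so that a failing blocking eval fails the job while a failing advisory eval only shows up in the report.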
Real-world Use Case
- An eval suite exists that can be tiered into blocking and advisory.
- CI can be wired so blocking eval failures actually prevent release.
- The team is willing to treat eval changes as architectural changes (review + signoff).
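One possible way to make the review-and-signoff condition mechanical is to fingerprint the suite definition and compare it against a fingerprint recorded at the last signoff. The sketch below assumes the suite can be expressed as a JSON-serializable dictionary of eval names and thresholds. `suite_fingerprint` and `eval_change_approved` are hypothetical helpers, not part of any existing tool.

```python
import hashlib
import json


def suite_fingerprint(suite: dict) -> str:
    """SHA-256 over a canonical JSON form of the suite (sorted keys, so order is irrelevant)."""
    canonical = json.dumps(suite, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def eval_change_approved(suite: dict, approved_fingerprint: str) -> bool:
    """True only if the suite matches the fingerprint recorded at the last signoff."""
    return suite_fingerprint(suite) == approved_fingerprint
```

Under this assumption, the approved fingerprint would live somewhere only reviewers can change, so loosening a threshold without signoff makes the check fail before any eval runs.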
📌 TL;DR
Wire your evals into CI and make them blocking: if the blocking suite passes you ship, if it fails you don't, and changing the suite requires review and signoff, not a hotfix.
Advantages
- Quality bar is enforced, not aspirational — the gate is real.
- The eval suite earns its seat by being load-bearing infrastructure.
Disadvantages
- Bad or miscalibrated evals block legitimate releases — eval quality matters as much as agent quality.
- Calibration is an ongoing empirical effort, not a one-time setup.
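As one illustration of what ongoing calibration might involve, the sketch below estimates how often a blocking eval would wrongly block a build the team already considers good. It assumes you can rerun the eval several times against such a build and collect the scores. `false_block_rate` is a hypothetical helper, and the pass rule (score at or above threshold) is an assumption.

```python
def false_block_rate(known_good_scores: list[float], threshold: float) -> float:
    """Fraction of runs on a known-good build that would have failed the gate."""
    if not known_good_scores:
        raise ValueError("need at least one run to estimate a rate")
    blocked = sum(1 for score in known_good_scores if score < threshold)
    return blocked / len(known_good_scores)
```

A rate well above zero suggests the threshold is too tight or the eval too noisy to be blocking, and that it may belong in the advisory tier until it is fixed.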