Shadow Canary
Run a candidate agent version in shadow alongside the live champion — compare outputs on real traffic without exposing users to the challenger until it proves itself.
Intent & Description
🎯 Intent
Run a candidate agent version in shadow alongside the champion, comparing outputs on real traffic without affecting users.
📋 Context
You want to roll out a new model, tweaked prompt, or reworked tool wiring to an agent serving real users. You have a trusted champion version and a challenger you want to validate. Pre-release evaluation sets never fully capture the long-tail queries that appear in production.
💡 Solution
Route a fraction of real traffic through both champion and challenger. Only the champion’s output reaches the user; the challenger’s output is logged and never returned. Compare the paired outputs on metrics agreed in advance (judge-model preference, exact match on tool calls, latency, cost). Promote on lift; revert on regression.
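The routing step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Agent` callable interface, `ShadowRecord`, `handle_request`, and the `shadow_fraction` parameter are all hypothetical names introduced here. It assumes the challenger can be called with the same request as the champion and that running it on a background thread is acceptable.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical agent interface: request text in, response text out.
Agent = Callable[[str], str]


@dataclass
class ShadowRecord:
    request: str
    champion_output: str
    challenger_output: Optional[str]  # None if the challenger raised
    challenger_error: Optional[str]
    challenger_latency_s: float


def _run_shadow(request: str, champion_output: str,
                challenger: Agent, log: List[ShadowRecord]) -> None:
    """Run the challenger and log the pair; its failures never propagate."""
    start = time.perf_counter()
    try:
        output, error = challenger(request), None
    except Exception as exc:  # a challenger crash is data, not an outage
        output, error = None, repr(exc)
    latency = time.perf_counter() - start
    log.append(ShadowRecord(request, champion_output, output, error, latency))


def handle_request(request: str, champion: Agent, challenger: Agent,
                   shadow_fraction: float, executor: ThreadPoolExecutor,
                   log: List[ShadowRecord],
                   rng: Callable[[], float] = random.random) -> str:
    """Serve the champion's answer; shadow a sampled fraction of requests."""
    champion_output = champion(request)
    if rng() < shadow_fraction:
        # Submitted to the executor so the challenger does not delay the reply.
        executor.submit(_run_shadow, request, champion_output, challenger, log)
    return champion_output
```

The challenger runs off the request path and its exceptions are caught, so neither its latency nor its failures reach the user. A challenger whose tools have side effects (sending email, writing to a database) would also need those tools stubbed or sandboxed, which this sketch does not attempt.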
Real-world Use Case
- Agent changes are non-deterministic and CI cannot capture real field behavior.
- Real traffic can be replayed through a challenger without affecting users.
- A diff metric (judge model, exact match, latency) can be defined for the comparison.
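One possible shape for the diff metric and the promote-or-revert decision is sketched below. Everything here is an assumption for illustration: `PairedResult`, `summarize`, `decide`, and the threshold defaults are hypothetical, and the judge verdict is taken as an already-computed boolean rather than produced by any particular judge model.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Optional, Sequence, Tuple

ToolCalls = Tuple[Tuple[str, str], ...]  # ((tool_name, serialized_args), ...)


@dataclass(frozen=True)
class PairedResult:
    champion_tool_calls: ToolCalls
    challenger_tool_calls: ToolCalls
    champion_latency_s: float
    challenger_latency_s: float
    judge_prefers_challenger: Optional[bool]  # None = tie or no verdict


def summarize(pairs: Sequence[PairedResult]) -> Dict[str, Optional[float]]:
    """Aggregate paired shadow results into three comparison metrics."""
    if not pairs:
        raise ValueError("no shadowed requests to compare")
    n = len(pairs)
    # Exact match: same tools, same arguments, same order.
    tool_match = sum(
        p.champion_tool_calls == p.challenger_tool_calls for p in pairs) / n
    # Win rate is computed only over requests where the judge gave a verdict.
    decided = [p for p in pairs if p.judge_prefers_challenger is not None]
    win_rate = (sum(p.judge_prefers_challenger for p in decided) / len(decided)
                if decided else None)
    latency_ratio = (median(p.challenger_latency_s for p in pairs)
                     / median(p.champion_latency_s for p in pairs))
    return {"tool_match_rate": tool_match,
            "judge_win_rate": win_rate,
            "latency_ratio": latency_ratio}


def decide(summary: Dict[str, Optional[float]],
           min_win_rate: float = 0.55,       # hypothetical threshold
           max_latency_ratio: float = 1.2,   # hypothetical threshold
           min_tool_match: float = 0.9) -> str:  # hypothetical threshold
    """Return 'promote', 'revert', or 'inconclusive' for a shadow window."""
    if summary["judge_win_rate"] is None:
        return "inconclusive"
    passed = (summary["judge_win_rate"] >= min_win_rate
              and summary["latency_ratio"] <= max_latency_ratio
              and summary["tool_match_rate"] >= min_tool_match)
    return "promote" if passed else "revert"
```

Exact match on tool calls is a strict comparator: it suits changes that should not alter tool behavior, and would need relaxing for a challenger that is expected to call tools differently. The thresholds are placeholders to be set per deployment.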
📌 TL;DR
Run the challenger in the shadows — same real traffic, zero user exposure, real diff metrics. Promote when it wins; revert when it doesn’t.
Advantages
- Catches field-quality regressions that pre-release eval sets miss.
- Gives confidence to roll out non-deterministic changes on production traffic.
Disadvantages
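- Roughly 2× inference cost on every shadowed request, since both versions run on it. Shadowing a fraction f of traffic raises total cost by about a factor of 1 + f, assuming the two versions cost about the same per request.
- Diffs on free-form outputs are noisy: it is hard to tell a real quality change from ordinary sampling variance in the model.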