Compound Error Degradation
Deploying a long-horizon agent while ignoring that per-step accuracy compounds — a 20-step pipeline of 95%-accurate steps succeeds less than 36% of the time.
Intent & Description
🎯 Intent
Treating per-step benchmark accuracy as a forecast for end-to-end pipeline quality.
📋 Context
A team measures 95% per-step accuracy and scales to a 20-step pipeline. The math says 0.95^20 ≈ 36% overall success. They learn this in production.
💡 Solution
Model end-to-end task success as the product of per-step success rates (after any per-step recovery). Either cap step count so the product clears your quality bar, or raise effective per-step success with verifiers, retries, and intermediate checkpoints. Treat raw benchmark accuracy as a ceiling, not a forecast.
Real-world Use Case
- Reviewing a long-horizon agent proposal with no step budget and no per-step verifier.
- Per-step benchmarks look healthy but end-to-end success on production traffic does not.
- Naming this failure mode explicitly when it arises in design review.
Source
📌 TL;DR
Model end-to-end success as per-step accuracy multiplied across all steps — and set step budgets accordingly.
Advantages
- Naming the failure mode forces explicit step budgets and per-step recovery planning
- Surfaces when you need a stronger model versus a shorter pipeline
Disadvantages
- Per-step success on production-shaped tasks is hard to measure; benchmarks rarely transfer cleanly
- Per-step verifiers add their own error rates that also need to be modeled