Scaffold Ablation on Model Upgrade
On each model upgrade, treat every harness component as an encoded assumption about a past model weakness — ablate the ones the new model no longer needs, gated by evals.
Intent & Description
🎯 Intent
On each model upgrade, treat every harness component as an encoded assumption about a model weakness — ablate the components the new model no longer needs, gated by evals.
📋 Context
An agent harness accretes over several model generations: retry wrappers, decomposition scaffolds, format-coercion steps, guardrails, planning constructs. Each was added to compensate for something a past model couldn’t do reliably. A stronger model arrives, and the harness is carried over wholesale because it “works.” The result: scaffolding that was designed to patch weaknesses is now constraining strengths.
💡 Solution
Make each harness component carry the assumption it encodes (“the model cannot keep a long plan straight,” “the model will not emit valid JSON”). On model upgrade, walk the components and stress-test each assumption against the new model: temporarily remove the component and run the eval suite. If the scores hold without the component, the assumption has expired and the component comes out; if they regress, the assumption survives and the component stays. The eval suite is the gate; the anti-pattern is carrying everything over by default.
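The review loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `Component`, `AblationResult`, `ablation_review`, the `tolerance` parameter, and the `run_evals` callback (which takes the set of enabled component names and returns a single score where higher is better) are all hypothetical names introduced here. It ablates one component at a time against the full-harness baseline, so it does not detect interactions between components.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Component:
    name: str
    assumption: str  # the model weakness this component compensates for


@dataclass(frozen=True)
class AblationResult:
    component: str
    assumption: str
    baseline: float  # score with the full harness
    ablated: float   # score with this one component removed
    keep: bool       # True if removing the component caused a regression


def ablation_review(
    components: List[Component],
    run_evals: Callable[[FrozenSet[str]], float],
    tolerance: float = 0.0,
) -> List[AblationResult]:
    """Remove one component at a time and compare against the full harness."""
    all_names = frozenset(c.name for c in components)
    baseline = run_evals(all_names)
    results = []
    for c in components:
        ablated = run_evals(all_names - {c.name})
        # The assumption survives only if the score drops beyond the tolerance.
        keep = ablated < baseline - tolerance
        results.append(
            AblationResult(c.name, c.assumption, baseline, ablated, keep)
        )
    return results


def fake_run_evals(enabled: FrozenSet[str]) -> float:
    # Stand-in for a real eval suite (assumption for this example only):
    # the new model still needs JSON repair but no longer needs the planner.
    return 0.90 if "json_repair" in enabled else 0.70


components = [
    Component("plan_decomposer", "the model cannot keep a long plan straight"),
    Component("json_repair", "the model will not emit valid JSON"),
]
results = ablation_review(components, fake_run_evals, tolerance=0.01)
for r in results:
    verdict = "keep" if r.keep else "remove"
    print(f"{r.component}: {verdict} ({r.baseline:.2f} -> {r.ablated:.2f})")
```

Each `AblationResult` records the assumption next to the scores that decided its fate, which gives the review a record that can be audited later. Because components are tested one at a time, a reasonable follow-up is to re-run the suite once more with all removal candidates taken out together before committing the change.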
Real-world Use Case
- A harness has accreted scaffolding across several model generations.
- A model upgrade is being adopted and the team owns an eval suite to gate changes.
- There is evidence or suspicion that carried-over scaffolding is suppressing the new model’s capability.
Source
📌 TL;DR
When you upgrade the model, audit the scaffolding too: remove every component that compensated for a weakness the new model doesn’t have, with each removal gated by evals. A stale harness suppresses capability.
Advantages
- Harness complexity tracks the current model’s real weaknesses instead of accumulating across generations.
- Capability suppression from scaffolding built for weaker models is removed, not inherited.
- Each removal is evidence-backed — the review is auditable, not a matter of taste.
Disadvantages
- Ablating a component whose assumption hasn’t fully expired causes a regression that goes unnoticed if the eval suite doesn’t cover the edge case the component was still handling.
- The review is only as trustworthy as the eval suite gating it.
- Per-release review is recurring work that a carry-everything-over approach avoids.