Self-Edit Critic Gate
Intent & Description
🎯 Intent
Route every proposed write or delete to the agent’s own load-bearing source and identity files through a separate critic model call that can veto the edit before it lands.
📋 Context
A long-running agent is allowed to edit files inside its own instance, including the source code and identity files it runs on. Most edits are routine, but a few touch load-bearing files (the main config, the dispatcher, the personality or charter) where a single careless rewrite can leave the next restart unable to boot. No human reviews each write in real time.
💡 Solution
The dispatcher’s write and delete branches call a single gated-write check. A cheap deterministic pre-check runs first (syntax validity, an unambiguous-destruction backstop) and can hard-veto without any model call. For paths classified as high-stakes, a separate critic-model call sees the proposed path, a diff summary, and the agent’s stated justification, and returns an approve/reason verdict; a veto blocks the write. Every classification, veto, and approval is appended to the ledger, and low-stakes paths skip the critic entirely. Because the critic is a different call from the one that proposed the edit, a single confident generation cannot both author and bless its own change.
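The flow above can be sketched in Python as follows. This is a minimal illustration rather than the pattern's actual implementation: the names `gated_write`, `precheck`, `is_high_stakes`, `Verdict`, and the `HIGH_STAKES_NAMES` set are all hypothetical, the ledger is modeled as a plain list, and the critic is any callable that takes the path, diff summary, and justification and returns an approve flag plus a reason. Treating a blank write to a load-bearing file as "unambiguous destruction" is likewise an assumption.

```python
import ast
import time
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, List, Optional, Tuple

# Hypothetical load-bearing file names; a real instance would define its own.
HIGH_STAKES_NAMES = {"config.yaml", "dispatcher.py", "charter.md"}

# A critic takes (path, diff_summary, justification) and returns (approve, reason).
Critic = Callable[[str, str, str], Tuple[bool, str]]


@dataclass
class Verdict:
    approve: bool
    reason: str
    stage: str  # "precheck", "low_stakes", or "critic"


def is_high_stakes(path: str) -> bool:
    return PurePosixPath(path).name in HIGH_STAKES_NAMES


def precheck(path: str, new_content: Optional[str], high: bool) -> Optional[str]:
    """Deterministic checks. Returns a veto reason, or None if nothing objects."""
    if new_content is None:  # a delete: nothing to parse
        return None
    if high and not new_content.strip():
        return "write would blank a load-bearing file"  # assumed backstop rule
    if path.endswith(".py"):
        try:
            ast.parse(new_content)
        except SyntaxError as exc:
            return f"syntax error: {exc.msg}"
    return None


def gated_write(path: str, new_content: Optional[str], diff_summary: str,
                justification: str, critic: Critic, ledger: List[dict]) -> Verdict:
    """Decide whether a write (or delete, when new_content is None) may proceed."""
    def record(verdict: Verdict) -> Verdict:
        # Every outcome is appended to the ledger, including skipped-critic approvals.
        ledger.append({"ts": time.time(), "path": path, "stage": verdict.stage,
                       "approve": verdict.approve, "reason": verdict.reason})
        return verdict

    high = is_high_stakes(path)
    veto = precheck(path, new_content, high)
    if veto is not None:
        return record(Verdict(False, veto, "precheck"))  # no model call made
    if not high:
        return record(Verdict(True, "low-stakes path, critic skipped", "low_stakes"))
    approve, reason = critic(path, diff_summary, justification)
    return record(Verdict(approve, reason, "critic"))
```

In this sketch the function only returns a verdict; the dispatcher would perform the actual write or delete when `verdict.approve` is true and surface `verdict.reason` to the agent otherwise.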
Real-world Use Case
- The agent can write or delete its own source or identity files at runtime.
- Some target paths are load-bearing, where a bad edit prevents the agent from restarting.
- A separate, cheaper model is available to review proposed edits before they apply.
Advantages
- A single confident-but-wrong self-edit cannot land on a load-bearing file unreviewed.
- Proposer and reviewer are separate model calls, so an edit is never its own approver.
- Routine low-stakes edits stay fast; only high-stakes paths pay the critic round-trip.
Disadvantages
- If the critic fails open (approves by default when its call errors), the gate is only safe with a separate hard backstop.
- A mis-tuned path-risk classification either blocks legitimate refactors or waves through dangerous edits.
- The critic adds latency and token cost on every high-stakes write.
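The fail-open concern in the first bullet can be illustrated with a short Python sketch. It assumes one possible arrangement: a deterministic backstop that runs before the critic and does not depend on it, so that a critic outage can only wave through edits the backstop already tolerates. The names `hard_backstop` and `review`, and the 50% shrink threshold, are hypothetical choices for illustration; deletes are left out for brevity.

```python
from typing import Callable, Optional, Tuple

# A critic takes (path, diff_summary, justification) and returns (approve, reason).
Critic = Callable[[str, str, str], Tuple[bool, str]]


def hard_backstop(old: str, new: str, max_shrink: float = 0.5) -> Optional[str]:
    """Model-free check. Returns a veto reason, or None if the edit passes."""
    if not new.strip():
        return "file would be blanked"
    if old and len(new) < len(old) * (1 - max_shrink):
        return "file shrinks by more than the allowed fraction"
    return None


def review(path: str, old: str, new: str, diff_summary: str,
           justification: str, critic: Critic) -> Tuple[bool, str]:
    """Backstop first, then a critic call that fails open on error."""
    veto = hard_backstop(old, new)
    if veto is not None:
        return False, f"backstop veto: {veto}"
    try:
        return critic(path, diff_summary, justification)
    except Exception as exc:
        # Fail-open: a critic outage does not block edits the backstop accepted.
        return True, f"critic unavailable ({exc}); fail-open approve"
```

Failing closed instead (returning a veto in the `except` branch) would be the more conservative choice, at the cost of blocking every high-stakes edit while the critic is down.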