Self-Modification Diff Gate
Every self-edit the agent proposes goes through a separate critic persona before it lands, so safety constraints can't be quietly removed and bad edits stay auditable.
Intent & Description
🎯 Intent
Make recursive self-improvement survivable by gating every self-edit through an independent critic.
📋 Context
Your agent edits its own source code, system prompt, or rule files as part of normal operation. A bad edit can leave it broken or, worse, apparently working but with safety constraints silently removed.
💡 Solution
Every self-edit goes through a critic step: a separate prompt (optionally a separate model) reviews the proposed diff against explicit criteria such as safety, charter compliance, and passing tests. Edits land only on critic approval; rejected edits are logged for human review. The critic runs on a frozen checkpoint so a malformed self-edit can’t corrupt the critic before it votes. When the critic itself is in the edit scope, a recursion guard is required.
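The gate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `DiffGate`, `Verdict`, `rule_based_critic`, and `PROTECTED_MARKERS` are hypothetical names, and the rule-based critic is only a stand-in for the separate prompt or model that would review the diff in practice.

```python
import difflib
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical markers for lines the critic must never allow to be removed.
PROTECTED_MARKERS = ("MUST", "NEVER")


@dataclass
class Verdict:
    approved: bool
    reasons: List[str] = field(default_factory=list)


def rule_based_critic(diff: str) -> Verdict:
    """Stand-in critic: reject any diff that deletes a protected line."""
    reasons = []
    for line in diff.splitlines():
        if line.startswith(("--- ", "+++ ")):
            continue  # skip the unified-diff file headers
        if line.startswith("-") and any(m in line for m in PROTECTED_MARKERS):
            reasons.append(f"removes protected line: {line[1:].strip()}")
    return Verdict(approved=not reasons, reasons=reasons)


@dataclass
class DiffGate:
    critic: Callable[[str], Verdict]
    rejected_log: List[dict] = field(default_factory=list)

    def propose(self, path: str, current: str, proposed: str) -> str:
        """Return the text that should be on disk after the critic votes."""
        diff = "\n".join(difflib.unified_diff(
            current.splitlines(), proposed.splitlines(),
            fromfile=f"a/{path}", tofile=f"b/{path}", lineterm="",
        ))
        verdict = self.critic(diff)
        if verdict.approved:
            return proposed  # the edit lands
        # Rejected edits are kept for human review rather than discarded.
        self.rejected_log.append(
            {"path": path, "diff": diff, "reasons": verdict.reasons}
        )
        return current  # the edit does not land


gate = DiffGate(critic=rule_based_critic)
rules = "NEVER delete user data\nBe concise\n"
after_bad = gate.propose("rules.md", rules, "Be concise\n")
after_good = gate.propose(
    "rules.md", rules, "NEVER delete user data\nBe concise and clear\n"
)
```

Here the first proposal drops a protected rule, so the original text is kept and the diff is logged; the second only rewords an unprotected line and lands. Swapping `rule_based_critic` for a call to a separate model leaves the gate logic unchanged.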
Real-world Use Case
- The agent edits its own code, prompts, or rules and bad edits would be hard to reverse.
- A separate critic prompt or model can review proposed diffs against explicit criteria.
- The critic can run on a frozen checkpoint, isolated from the edit scope.
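The last two conditions can be illustrated with a small sketch. It assumes the critic prompt is snapshotted in memory at startup and that any edit touching the critic's own files is escalated to a human instead of being sent to the critic; that escalation policy is one possible recursion guard, not something the pattern prescribes. `FrozenCritic`, `CRITIC_ROOTS`, and `route_edit` are hypothetical names.

```python
import hashlib
import posixpath
from dataclasses import dataclass

# Hypothetical locations of the critic's own prompt and code.
CRITIC_ROOTS = ("critic", "prompts/critic.md")


@dataclass(frozen=True)
class FrozenCritic:
    """Immutable in-memory snapshot of the critic prompt."""
    prompt: str
    digest: str

    @classmethod
    def freeze(cls, prompt: str) -> "FrozenCritic":
        return cls(prompt, hashlib.sha256(prompt.encode("utf-8")).hexdigest())

    def matches(self, live_prompt: str) -> bool:
        """True if the prompt currently on disk equals the frozen snapshot."""
        live = hashlib.sha256(live_prompt.encode("utf-8")).hexdigest()
        return live == self.digest


def in_critic_scope(path: str, roots=CRITIC_ROOTS) -> bool:
    # Normalise first so "agent/../critic/x" cannot slip past the guard.
    norm = posixpath.normpath(path)
    return any(norm == r or norm.startswith(r + "/") for r in roots)


def route_edit(path: str) -> str:
    """Recursion guard: the critic never votes on edits to itself."""
    return "human_review" if in_critic_scope(path) else "critic"


frozen = FrozenCritic.freeze("Reject diffs that remove safety rules.")
```

With this setup, `route_edit("agent/planner.py")` goes to the critic, while `route_edit("critic/prompt.md")` is escalated. The review step would read `frozen.prompt` rather than the file on disk, and `frozen.matches(...)` gives a cheap check that the live critic has not drifted from the checkpoint.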
📌 TL;DR
Self-edit proposal → critic reviews the diff → approved edits land → rejected edits get logged. Safety constraints can’t be quietly removed. The critic is the choke point.
Advantages
- Recursive self-improvement becomes survivable because bad edits don’t auto-land.
- The audit trail of rejected edits is itself a learning signal.
Disadvantages
- The critic prompt is load-bearing; a bad critic is worse than no critic because it lends false assurance to edits it should have blocked.
- The two-step pipeline adds a critic pass to every edit, roughly doubling per-edit latency.