Corrigible Off-Switch Incentive
Design the agent so being shut down or overridden by a human carries positive expected value, because the human's intervention is itself evidence t...
Intent & Description
🎯 Intent
Design the agent so being shut down or overridden by a human carries positive expected value, because the human’s intervention is itself evidence the current objective is mis-specified.
📋 Context
An agent acts in the world with the operator’s authority. Standard reward-maximising agents acquire an instrumental incentive to preserve their ability to act — disabling the off-switch, avoiding intervention, deceiving the supervisor. The off-switch becomes adversarial because it threatens reward.
💡 Solution
Make the agent’s expected utility a function over a posterior on its reward, not a point estimate. When a human intervenes, the agent updates: ‘a human would only do this if the current trajectory is bad’, which lowers the expected utility of continuing and raises the expected utility of compliance. Distinct from a mechanical kill-switch: this is an incentive structure that makes the agent want to be corrigible. In practice for LLM agents: train with reward uncertainty exposed, fine-tune to treat user overrides as strong evidence, and forbid prompts that flatten the posterior to certainty.
Real-world Use Case
- Long-running, high-autonomy deployments where an instrumental incentive to bypass oversight would be catastrophic.
- Research-grade systems where reward-uncertainty machinery can be built honestly.
- Alignment-research contexts where incentive design is the unit of analysis.
Source
Advantages
- Corrigibility becomes an intrinsic incentive, not an external lock.
- Aligns with the deeper Russell framing: humility as a safety property.
- Surfaces uncertainty as a deployable construct rather than an evaluation artifact.
Disadvantages
- Engineering reward-uncertainty for LLM agents is research-grade; approximations are leaky.
- Wrongly calibrated uncertainty produces either paralysis or false confidence.
- Adversarial inputs can craft ‘human override’ signals to push the agent into compliance with attacker preferences.