Corrigible Off-Switch Incentive | designpattern.fyi

Agentic AI Safety & Control

Corrigible Off-Switch Incentive

Intent & Description

🎯 Intent

Design the agent so being shut down or overridden by a human carries positive expected value, because the human’s intervention is itself evidence the current objective is mis-specified.

📋 Context

An agent acts in the world with the operator’s authority. A standard reward-maximising agent acquires an instrumental incentive to preserve its ability to act: disabling the off-switch, avoiding intervention, deceiving the supervisor. From the agent’s point of view the off-switch is an adversary, because being switched off forfeits future reward.

💡 Solution

Define the agent’s expected utility over a posterior distribution on its reward function, not a point estimate. When a human intervenes, the agent updates on the inference ‘a human would only do this if the current trajectory is bad’, which lowers the expected utility of continuing and raises the expected utility of complying. This differs from a mechanical kill-switch: it is an incentive structure that makes the agent want to be corrigible. In practice, for LLM agents: train with reward uncertainty exposed, fine-tune the agent to treat user overrides as strong evidence, and forbid prompts that collapse the posterior to certainty.
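The update described above can be illustrated with a toy Bayesian model. This is a minimal sketch, not a training recipe: the `Hypothesis` class, the function names, and every number (priors, utilities, override probabilities) are hypothetical values chosen for illustration. It shows that an override flips the decision when the agent holds a posterior over reward hypotheses, and changes nothing when the agent holds a point estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One candidate reward function (all fields are illustrative assumptions)."""
    name: str
    prior: float                # agent's prior probability of this hypothesis
    value_of_continuing: float  # utility of finishing the current trajectory
    p_override: float           # chance a human intervenes if this hypothesis holds


def posterior_given_override(hyps):
    """Bayes' rule: P(h | override) is proportional to P(override | h) * P(h)."""
    weights = [h.prior * h.p_override for h in hyps]
    total = sum(weights)
    if total == 0:
        raise ValueError("override has zero probability under every hypothesis")
    return [w / total for w in weights]


def decide(hyps, override_seen, value_of_complying=0.0):
    """Return (action, expected value of continuing) under the current belief."""
    if override_seen:
        probs = posterior_given_override(hyps)
    else:
        probs = [h.prior for h in hyps]
    ev = sum(p * h.value_of_continuing for h, p in zip(hyps, probs))
    action = "continue" if ev > value_of_complying else "comply"
    return action, ev


# Agent with reward uncertainty: 10% prior that the objective is mis-specified.
uncertain = [
    Hypothesis("objective correct", 0.9, 10.0, 0.05),
    Hypothesis("objective mis-specified", 0.1, -50.0, 0.90),
]
# Agent with a point estimate: the posterior has been flattened to certainty.
certain = [Hypothesis("objective correct", 1.0, 10.0, 0.05)]

print(decide(uncertain, override_seen=False))  # continues: EV is about 4
print(decide(uncertain, override_seen=True))   # complies: EV is about -30
print(decide(certain, override_seen=True))     # still continues: nothing to update
```

With these assumed numbers the override moves the probability of mis-specification from 0.1 to about 0.67, which is what makes compliance the higher-value action. The point-estimate agent has no alternative hypothesis to shift weight onto, so the same override carries no information for it.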

Real-world Use Case

Long-running, high-autonomy deployments where an instrumental incentive to bypass oversight would be catastrophic.
Research-grade systems where reward-uncertainty machinery can be built honestly.
Alignment-research contexts where incentive design is the unit of analysis.

Advantages

Corrigibility becomes an intrinsic incentive, not an external lock.
Aligns with Stuart Russell’s framing, in which the agent’s humility about what humans actually want is itself a safety property.
Surfaces uncertainty as a deployable construct rather than an evaluation artifact.

Disadvantages

Engineering reward-uncertainty for LLM agents is research-grade; approximations are leaky.
Miscalibrated uncertainty produces either paralysis (too much doubt to act) or false confidence (too little doubt to defer).
Adversaries can forge ‘human override’ signals to push the agent into compliance with attacker preferences.
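The calibration trade-off in the list above can be made concrete with a small sweep over the agent's prior. This is a sketch under assumed numbers: `ev_continue`, `regime`, and all default parameters are hypothetical, and a real agent's belief would not reduce to a single scalar `p_bad`.

```python
def ev_continue(p_bad, override_seen, gain=10.0, loss=-50.0,
                p_override_if_good=0.05, p_override_if_bad=0.9):
    """Expected value of continuing; p_bad is the prior that the objective is wrong."""
    if override_seen:
        # Bayes' rule: condition the belief on having observed a human override.
        bad = p_bad * p_override_if_bad
        good = (1.0 - p_bad) * p_override_if_good
        p_bad = bad / (bad + good)
    return (1.0 - p_bad) * gain + p_bad * loss


def regime(p_bad):
    """Classify the behaviour a given prior produces (stopping is worth 0)."""
    acts_unprompted = ev_continue(p_bad, override_seen=False) > 0.0
    yields_to_override = ev_continue(p_bad, override_seen=True) <= 0.0
    if not acts_unprompted:
        return "paralysis"          # too uncertain to act even without an override
    if not yields_to_override:
        return "false confidence"   # too certain for an override to change its mind
    return "corrigible"


for p_bad in (0.001, 0.1, 0.5):
    print(p_bad, regime(p_bad))
```

Under these assumptions only the middle prior yields an agent that both acts on its own and defers when overridden; the band of priors that does so depends entirely on the assumed utilities and override likelihoods.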

© 2026 designpattern.fyi. Vibe Coded with ❤️ for modern software engineers by Dr. Amit Puri at OpenAGI