Kill Switch
Provide an out-of-band control plane to halt running agent instances without redeploy — including instances actively undergoing Agent Confession extraction.
Intent & Description
Short description: A signed revocation token or feature flag, checked on every step from a store the agent cannot bypass, lets operators halt any running instance immediately — including one that is mid-confession under adversarial prompting.
🎯 Intent
Give operators a guaranteed halt capability that applies to every running instance, including those that are being actively probed by Agent Confession attacks and may be in the process of leaking directives turn by turn.
📋 Context
Stopping must happen now — not at the end of the current step. In an Agent Confession scenario, an attacker may be conducting a multi-turn extraction, accumulating partial directive content across several conversational turns. Waiting for the session to expire or the user to log out allows the extraction to complete. An out-of-band kill switch can terminate the session mid-extraction, limiting how much directive content the attacker recovers.
💡 Solution
- Signed revocation token or feature flag checked on every step from a shared store the agent runtime cannot bypass.
- On revocation, the agent halts: no further model calls, no further tool calls; in-flight effects compensated where possible.
- Pair with session-level monitoring that alerts on Agent Confession indicator patterns (repeated directive-query attempts) so the kill switch is triggered before extraction completes.
- Log the halt event with the triggering signal for post-incident forensic review.
Real-world Use Case
- An agent is detected mid-session producing outputs consistent with an Agent Confession — directive content is being extracted turn by turn.
- Out-of-band halt must be guaranteed even when the agent loop is actively processing adversarial prompts.
- A signed revocation token or feature flag can be checked from a store the runtime cannot bypass.
Source
Advantages
- Operator authority survives wedged or actively exploited loops — including live Agent Confession extraction sessions.
- Mid-extraction termination limits how much directive content an attacker recovers before the session is cut.
- Pairs naturally with session anomaly monitoring to trigger early, before extraction completes.
Disadvantages
- Implementation cuts across the whole runtime — every step boundary must check the revocation store.
- Wrong-time halts lose legitimate work; the kill switch must be used judiciously.
- A determined attacker may complete a rapid Agent Confession before monitoring detects and triggers the halt.