Incident Response Runbook
Write step-by-step response procedures for your highest-risk agent failure modes in advance, so that when a PII leak or tool exploit fires, the team executes instead of panicking.
Intent & Description
🎯 Intent
Maintain pre-written response procedures for agent failures (PII leak, tool exploit, mass false action) so detected incidents trigger known steps, not improvised reactions.
📋 Context
Production agents can fail badly: leaking PII across tenants, exploiting a tool with real-world side effects, or triggering a cascade of wrong actions before anyone notices. You already have kill-switches, sandbox monitoring, and provenance logs. What you’re missing is a coordinated, pre-practiced response that respects regulatory clocks (GDPR 72-hour breach notification, EU AI Act serious-incident reports).
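The regulatory clock mentioned above is plain date arithmetic, and a runbook can compute it rather than leave it to memory at 2am. Below is a minimal sketch, assuming the clock starts when the team becomes aware of the incident. The names `NOTIFICATION_WINDOWS`, `notification_deadline`, and `time_remaining` are hypothetical. Only the GDPR 72-hour window is filled in; windows for other regimes should be confirmed with counsel before they are added.

```python
from datetime import datetime, timedelta, timezone

# Notification windows, measured from the moment the team became aware
# of the incident. Only the GDPR 72-hour window from the text is listed;
# any other regime is an assumption to verify before adding it here.
NOTIFICATION_WINDOWS = {
    "gdpr_breach": timedelta(hours=72),
}


def notification_deadline(aware_at: datetime, regime: str) -> datetime:
    """Return the latest time a notification for `regime` may be sent."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across regions, so reject them.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + NOTIFICATION_WINDOWS[regime]


def time_remaining(aware_at: datetime, regime: str, now: datetime) -> timedelta:
    """Return how much of the window is left (negative once it is missed)."""
    return notification_deadline(aware_at, regime) - now


if __name__ == "__main__":
    aware = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
    print(notification_deadline(aware, "gdpr_breach").isoformat())
```

Recording the awareness timestamp in the incident ticket and deriving the deadline from it keeps the clock visible to everyone on the call.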
💡 Solution
Maintain a runbook covering: severity levels, on-call paths, containment steps (kill-switch invocation, traffic rerouting), forensic preservation (pin traces beyond normal retention), compensating actions, customer communication templates, regulator notification procedures, and a post-mortem template. Wire monitoring alerts (kill-switch, sandbox-escape, cost anomalies) directly to runbook entries.
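One way to wire alerts to runbook entries is a lookup table kept next to the runbook itself. The sketch below is illustrative only: the severity scheme, the on-call names, the step wording, the alert-to-entry mapping, and the `pii_detector` alert are all assumptions, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # assumed scheme: SEV1 is the most severe
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class RunbookEntry:
    title: str
    severity: Severity
    on_call: str                  # hypothetical on-call rotation name
    containment: tuple[str, ...]  # ordered containment steps
    forensics: tuple[str, ...]    # evidence to preserve
    notify_regulator: bool = False


RUNBOOK = {
    "pii_leak": RunbookEntry(
        title="PII leak",
        severity=Severity.SEV1,
        on_call="privacy-oncall",
        containment=("invoke kill-switch", "reroute traffic"),
        forensics=("pin traces beyond normal retention",),
        notify_regulator=True,
    ),
    "tool_exploit": RunbookEntry(
        title="Tool exploit",
        severity=Severity.SEV1,
        on_call="security-oncall",
        containment=("invoke kill-switch", "revoke tool credentials"),
        forensics=("pin traces beyond normal retention",),
    ),
    "mass_false_action": RunbookEntry(
        title="Mass false action",
        severity=Severity.SEV2,
        on_call="agent-oncall",
        containment=("invoke kill-switch", "queue compensating actions"),
        forensics=("pin traces beyond normal retention",),
    ),
    "unclassified": RunbookEntry(
        title="Unclassified alert",
        severity=Severity.SEV2,
        on_call="agent-oncall",
        containment=("triage manually",),
        forensics=("pin traces beyond normal retention",),
    ),
}

# Illustrative mapping from monitoring alert names to runbook keys.
ALERT_TO_ENTRY = {
    "sandbox_escape": "tool_exploit",
    "kill_switch_triggered": "mass_false_action",
    "cost_anomaly": "mass_false_action",
    "pii_detector": "pii_leak",  # hypothetical alert, not named in the text
}


def route_alert(alert_name: str) -> RunbookEntry:
    """Return the runbook entry for an alert, falling back to manual triage."""
    key = ALERT_TO_ENTRY.get(alert_name, "unclassified")
    return RUNBOOK[key]
```

Routing unknown alerts to an explicit `unclassified` entry, rather than raising an error, means a new alert type still pages someone with a defined first step.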
Real-world Use Case
- An agent runs in production and can plausibly leak PII, exploit a tool, or take mass false actions.
- Detection signals exist but no coordinated response procedure does.
- Regulatory or customer obligations require documented containment and notification steps.
📌 TL;DR
Write the incident playbook before you need it — when a PII leak fires at 2am, your team should be executing known steps, not making them up in Slack.
Advantages
- Detection produces a coordinated response, not panic: the team executes a known playbook.
- Regulator notification timelines are met without scrambling.
Disadvantages
- Runbook drift — failure scenarios evolve faster than documentation updates.
- Drill cadence is hard to tune: run drills too rarely and the steps are forgotten, too often and the team starts ignoring them.