Tool Output Poisoning Defense
Treat tool output as untrusted content and apply instruction-stripping plus per-tool trust labels.
Intent & Description
🎯 Intent
Treat tool output as untrusted content and apply instruction-stripping plus per-tool trust labels.
📋 Context
A team is building an agent that consumes the output of tools whose contents originated outside the agent’s trust boundary. Examples include a browser agent fetching arbitrary web pages, an MCP (Model Context Protocol) server hosted by an unknown third party, search results that quote attacker-controlled snippets, document parsers running over user-uploaded files, and third-party APIs whose responses include free-form text. Some of these tools are highly trusted (a typed query against the team’s own database) and others are essentially untrusted (a fetch of an arbitrary URL).
💡 Solution
Typed ToolResult envelope with trust: low|medium|high and content-type discriminator. Apply instruction-stripping on low results. Forbid tool-output-driven follow-up tool calls without re-validation against the user’s original intent. Pair with input/output guardrails.
Real-world Use Case
- The agent consumes tool output where the tool itself may be untrusted (browser, MCP, search, parsers).
- Tool envelopes can carry trust labels and content-type discriminators.
- Instruction-stripping and re-validation can be enforced on low-trust results.
Source
Advantages
- Reduces successful indirect injection from compromised tools.
- Trust labels are inspectable in traces.
Disadvantages
- False positives strip legitimate instruction-shaped content.
- New injection vectors emerge faster than defenses.