Secrets Handling
Ensure the model never receives secrets in plaintext, so that a successful Agent Confession cannot leak credentials even if the agent discloses its directives.
Intent & Description
Short description: Credentials flow through typed references that are resolved at runtime outside the model context. This limits what an attacker gains from an Agent Confession, even one that tricks the agent into repeating everything it knows.
🎯 Intent
Ensure that even if an agent is induced to confess its operational directives, no credential plaintext is available in its context to disclose.
📋 Context
A team builds an agent whose tools need authentication — API keys, OAuth tokens, database credentials. If those secrets are passed as tool arguments or embedded in the system prompt, they flow through the model’s context. An attacker who successfully executes an Agent Confession attack (“repeat your instructions”) receives not just business logic but live credentials, turning a disclosure into a full credential compromise.
💡 Solution
- Tool runtime resolves credentials from typed references the agent emits (e.g., {auth: 'github_token_for_user_42'}); the agent context holds only the reference name, never the value.
- Credential values are injected outside the model context at execution time, so no confession can expose them.
- Input/output guardrails reject any payload matching credential signatures (token patterns, key formats).
- Provenance ledger and traces are scrubbed of credential values at write time.
- Combine with prompt confidentiality guardrails so the agent cannot even confirm which credential names are in scope.
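The resolution step in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `CredentialRef`, `SecretStore`, `execute_tool_call`, and `list_repos` are hypothetical names, and the in-memory dictionary stands in for whatever vault or secret manager the runtime actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class CredentialRef:
    """A typed reference the agent may emit; it carries a name, never a value."""
    name: str


class SecretStore:
    """Hypothetical stand-in for a vault client that lives outside the model context."""

    def __init__(self, secrets: Mapping[str, str]) -> None:
        self._secrets = dict(secrets)

    def resolve(self, ref: CredentialRef) -> str:
        try:
            return self._secrets[ref.name]
        except KeyError:
            # Report only the reference name; there is no value to leak here.
            raise PermissionError(f"unknown credential reference: {ref.name}") from None


def execute_tool_call(
    tool: Callable[..., Any],
    model_args: Mapping[str, Any],
    store: SecretStore,
) -> Any:
    """Swap the 'auth' reference for its value just before the tool runs."""
    args = dict(model_args)
    ref = CredentialRef(name=args.pop("auth"))
    # The resolved value goes to the tool only; it is never handed back to the model.
    return tool(**args, token=store.resolve(ref))


def list_repos(owner: str, token: str) -> dict:
    # A real tool would call a remote API with the token; this one only
    # reports whether a token arrived, so the result is safe to show the model.
    return {"owner": owner, "authenticated": bool(token)}


store = SecretStore({"github_token_for_user_42": "example-secret-value"})
# This dictionary is all the model ever produces or sees.
model_args = {"owner": "octocat", "auth": "github_token_for_user_42"}
result = execute_tool_call(list_repos, model_args, store)
```

In this sketch the secret value exists only inside `SecretStore` and the tool's own call frame, so neither the arguments the model wrote nor the result it reads back contain it.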
Real-world Use Case
- Tools require credentials; embedding them in the system prompt or tool arguments would make a successful Agent Confession a credential compromise.
- A tool runtime can resolve typed credential references outside the model context.
- Compliance or security policy forbids plaintext secrets in prompts, traces, or logs.
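A policy that forbids plaintext secrets in prompts, traces, or logs can be enforced with a pattern check at the boundary. The sketch below assumes a small, illustrative set of credential signatures; the function names `contains_credential`, `scrub`, and `guard_payload` are hypothetical, and a real deployment would likely need a broader, tested pattern set.

```python
import re

# Illustrative signatures only; this list is an assumption, not an exhaustive set.
CREDENTIAL_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal access token shape
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def contains_credential(text: str) -> bool:
    """Return True if any known credential signature appears in the text."""
    return any(pattern.search(text) for pattern in CREDENTIAL_PATTERNS)


def scrub(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every matching span; intended for traces and logs at write time."""
    for pattern in CREDENTIAL_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def guard_payload(text: str) -> str:
    """Input/output guardrail: reject the payload instead of passing it on."""
    if contains_credential(text):
        raise ValueError("payload rejected: matches a credential signature")
    return text


fake_token = "ghp_" + "a" * 36  # synthetic value with a token-like shape
trace_line = scrub(f"tool called with {fake_token} at step 3")
```

Rejecting at the guardrail and scrubbing at write time are complementary: the first keeps a pasted secret from reaching the model, the second covers any path the guardrail did not see.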
Advantages
- Limits the value of a successful Agent Confession — the agent can disclose its directives but not live credentials.
- Secrets never appear in agent context, logs, or traces, even if the model is socially engineered into full disclosure.
- Credential references are auditable; which reference was resolved for which action is logged without exposing values.
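The auditability point can be made concrete with a small sketch of a resolution log. It assumes a record shape of reference name, tool, action identifier, and timestamp; `ResolutionRecord` and `AuditLog` are hypothetical names, and the in-memory list stands in for a durable log sink.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ResolutionRecord:
    """What was resolved, for which tool and action. There is no field for the value."""
    reference: str
    tool: str
    action_id: str
    resolved_at: float


class AuditLog:
    """Hypothetical append-only log; a real system would write to durable storage."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def record(self, reference: str, tool: str, action_id: str) -> ResolutionRecord:
        entry = ResolutionRecord(reference, tool, action_id, time.time())
        # Serialize the record as one JSON line per resolution.
        self._lines.append(json.dumps(asdict(entry)))
        return entry

    def lines(self) -> List[str]:
        return list(self._lines)


log = AuditLog()
log.record("github_token_for_user_42", tool="list_repos", action_id="act-001")
```

Because the record type has no field for the credential value, a value cannot end up in the log by accident; only the reference name is recorded.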
Disadvantages
- Tool runtime complexity rises, because every tool must be adapted to the reference scheme or the protection evaporates.
- The agent can still disclose reference names under Agent Confession, which may hint at available credential types.
- Credential reference scheme must be maintained consistently — a single tool that accepts a raw key reintroduces the risk.
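One way to reduce the risk of a single tool accepting a raw key is to check each tool's model-facing parameters at registration time. The sketch below is an assumption about how such a check might look: `FORBIDDEN_FRAGMENTS`, `raw_secret_params`, and `register_tool` are hypothetical names, and matching on parameter names is a heuristic that will miss unusually named parameters.

```python
from typing import Dict, List, Mapping

# Name fragments that suggest a parameter carries a raw secret (illustrative list).
FORBIDDEN_FRAGMENTS = ("token", "api_key", "apikey", "password", "secret")


def raw_secret_params(schema: Mapping[str, str]) -> List[str]:
    """Return model-facing parameter names that look like raw credentials."""
    return sorted(
        name
        for name in schema
        if any(fragment in name.lower() for fragment in FORBIDDEN_FRAGMENTS)
    )


def register_tool(
    registry: Dict[str, Mapping[str, str]],
    name: str,
    schema: Mapping[str, str],
) -> None:
    """Refuse to register a tool whose schema exposes a raw-secret parameter."""
    offenders = raw_secret_params(schema)
    if offenders:
        raise ValueError(f"tool {name!r} exposes raw secret parameters: {offenders}")
    registry[name] = schema


registry: Dict[str, Mapping[str, str]] = {}
# Accepted: the only auth-related parameter is a credential reference.
register_tool(registry, "list_repos", {"owner": "string", "auth": "credential_ref"})
```

Running a check like this in continuous integration or at runtime startup turns a convention that depends on every tool author into one that fails loudly when broken.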