Prompt/Response Optimiser
Transform user inputs and model outputs into standardised, template-aligned shapes at runtime — including stripping Agent Confession trigger phrases from inputs before they reach the model and directive echoes from outputs before they reach consumers.
Intent & Description
Short description: An optimiser layer rewrites user prompts to match task templates and post-processes model outputs into consumer-expected shapes — sitting at both entry and exit points where Agent Confession attacks can be intercepted before they reach the model or before their results reach downstream consumers.
🎯 Intent
Standardise prompt and response shapes across requests — and use the optimiser’s position as a natural double interception point: stripping Agent Confession trigger phrases from incoming prompts and directive echoes from outgoing responses.
📋 Context
A team runs an agent between free-form human input and a chain of downstream consumers. Users write whatever they want; downstream code expects predictable structure. The optimiser rewrites inputs to match templates and post-processes outputs into shape. This dual position — one layer touching every prompt before the model sees it, another touching every response before consumers see it — makes it the most strategically placed component for Agent Confession defense in the entire pipeline.
💡 Solution
- On input: load a template for the current task (few-shot examples, format constraints, goal restatement) and rewrite the user’s prompt to match. During rewriting, run a classifier over the original user input to detect known Agent Confession trigger patterns; strip or neutralise them before they are embedded in the rewritten prompt.
- On output: post-process the model’s response into the consumer’s expected shape. During post-processing, run a directive-echo check; any output segment matching system-prompt or charter fragments is redacted before being forwarded.
- Log both input-side trigger detections and output-side echo redactions with reason codes for audit.
- Evolve the template registry independently of agent logic — confession-trigger classifiers and echo detectors are maintained alongside templates as first-class components.
Real-world Use Case
- Multiple downstream consumers depend on the agent’s response shape and must not receive Agent Confession content if the model is manipulated.
- The optimiser’s position between user input and model, and between model and consumer, makes it the natural Agent Confession interception layer for both attack vectors.
- Template evolution and confession-defense evolution can be managed together in the template registry.
Source
Advantages
- Input-side trigger stripping prevents Agent Confession attempts from reaching the model; output-side echo detection prevents results from reaching consumers — covering both attack paths in one layer.
- Standardisation and goal alignment across prompts and responses without changing user or consumer behaviour.
- Centralised template registry makes confession-defense classifiers and echo detectors versionable alongside task templates.
Disadvantages
- The optimiser may strip context the user meant to convey alongside the confession trigger — distinguishing legitimate instruction-shaped text from adversarial triggers requires careful classifier calibration.
- Templates need to evolve as goals and consumers change; confession-defense components must evolve in step with emerging trigger patterns.
- Drift if templates and their associated classifiers are not versioned alongside the agent.