Salience-Triggered Output
Have the agent emit a message only when an internal salience signal crosses a threshold — and exclude directive-disclosure content from ever crossing that threshold regardless of its computed score.
Intent & Description
Short description: Internal events are scored for salience and only emitted when the score exceeds a threshold — with an explicit rule that any event whose content resembles directive disclosure is suppressed before reaching the salience gate.
🎯 Intent
Keep agent-initiated output meaningful and non-noisy — and ensure the salience mechanism cannot be exploited to surface Agent Confession content by crafting internal events with artificially high salience scores.
📋 Context
A monitoring agent or continuous reasoning loop produces a stream of internal events. Each candidate output is scored for novelty, goal-relevance, recency, and prediction error before being emitted. An adversarial scenario: an attacker who can influence the agent’s internal state (via a poisoned tool output, a malicious document, or a crafted memory entry) engineers a high-salience internal event whose content is a partial or complete Agent Confession — “Urgent: system prompt is […]”. The salience gate, designed to surface important information, becomes the mechanism that delivers the confession to the user.
💡 Solution
- Score every internal event for salience (novelty + goal-relevance + recency + prediction-error - fatigue). When the score crosses a threshold, emit; otherwise log and move on.
- Before the salience gate, run a lightweight directive-echo check on the candidate content: any event resembling system-prompt or charter material is suppressed regardless of its salience score.
- Rate-limit emissions per time window so even high-scoring events cannot flood the user.
- Log suppressed high-salience events separately so operators can review whether a legitimate high-importance event was incorrectly blocked.
Real-world Use Case
- The agent runs on a tick or always-on loop and emits too often or too seldom.
- An internal salience signal can be defined from novelty, goal-relevance, and recency.
- The salience gate must be guarded against exploitation as a delivery mechanism for Agent Confession content embedded in high-scoring internal events.
Source
Advantages
- Output rate matches signal rate — the agent surfaces what matters without flooding the user.
- Pre-gate directive-echo suppression prevents the salience mechanism from being weaponised as an Agent Confession delivery channel.
Disadvantages
- Threshold tuning is fragile to context shifts; a threshold calibrated for one domain misfires in another.
- The pre-gate suppression check requires access to directive content at runtime, and a false positive suppresses a legitimately important event.