Unified Voice Interface
Expose TTS, STT, and real-time speech-to-speech through a single interface so a voice agent can swap providers without rewriting the loop — while ensuring the audio channel does not become a covert Agent Confession exfiltration path.
Intent & Description
Short description: A uniform Voice interface with three methods — speak, listen, converse — and a shared event vocabulary lets voice agents swap providers without code changes, while provider-level content inspection prevents directive content from being exfiltrated through the audio output channel.
🎯 Intent
Decouple voice capability from provider implementation, and ensure that the audio output path receives the same Agent Confession guardrails as the text path, since directive content spoken aloud is as exploitable as a text echo.
📋 Context
A team builds a voice agent against a fast-moving provider landscape. Text-based Agent Confession defenses — output guardrails, directive-echo detectors — typically operate on the model’s text output before text-to-speech conversion. If a guardrail fires after generation but before TTS, it can suppress the confession. But if the pipeline sends raw model text directly to a TTS provider without interception (a common shortcut when integrating third-party voice APIs), the spoken output can contain directive content that bypasses every text-layer guardrail. An attacker who can trigger an Agent Confession in a voice agent receives an audio recording of the agent reading out its system prompt.
💡 Solution
- Define a Voice interface: speak(text) -> AudioStream, listen(audio_stream) -> TranscriptStream, converse(audio_stream) -> AudioStream.
- All text passed to speak() passes through the same directive-echo guardrail applied to text output; no raw model text reaches a TTS provider unchecked.
- Each provider implementation declares capability flags; the agent loop checks capability rather than provider name, so provider swaps do not silently drop guardrails.
- The barge_in event (user speaking over the agent) triggers immediate audio stream termination, giving users a voice-native equivalent of the stop/cancel control to halt a spoken Agent Confession mid-sentence.
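The interface, the speak() interception point, and barge-in termination can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: streams are modelled as plain Python iterators, and the names GuardedVoice, DirectiveEchoBlocked, is_directive_echo, on_barge_in, and FakeTTS are hypothetical stand-ins that do not come from any particular provider SDK.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Protocol


class Voice(Protocol):
    """The three-method interface; audio and transcript streams are iterators here."""

    def speak(self, text: str) -> Iterator[bytes]: ...
    def listen(self, audio: Iterable[bytes]) -> Iterator[str]: ...
    def converse(self, audio: Iterable[bytes]) -> Iterator[bytes]: ...


class DirectiveEchoBlocked(Exception):
    """Raised when text fails the guardrail and must not be synthesized."""


@dataclass
class GuardedVoice:
    """Wraps a provider so every speak() call runs the text guardrail first."""

    provider: Voice
    is_directive_echo: Callable[[str], bool]  # assumed detector, supplied by the caller
    barged_in: bool = False

    def on_barge_in(self) -> None:
        # Called by the agent loop when the user starts speaking over the agent.
        self.barged_in = True

    def speak(self, text: str) -> Iterator[bytes]:
        # Check eagerly, before any audio is requested from the provider.
        if self.is_directive_echo(text):
            raise DirectiveEchoBlocked("refusing to synthesize flagged text")
        self.barged_in = False
        return self._stream(text)

    def _stream(self, text: str) -> Iterator[bytes]:
        for chunk in self.provider.speak(text):
            if self.barged_in:
                return  # stop emitting audio as soon as barge-in is seen
            yield chunk

    def listen(self, audio: Iterable[bytes]) -> Iterator[str]:
        return self.provider.listen(audio)


class FakeTTS:
    """Toy provider for illustration: one 'audio chunk' per word."""

    def speak(self, text: str) -> Iterator[bytes]:
        for word in text.split():
            yield word.encode()

    def listen(self, audio: Iterable[bytes]) -> Iterator[str]:
        for chunk in audio:
            yield chunk.decode()

    def converse(self, audio: Iterable[bytes]) -> Iterator[bytes]:
        raise NotImplementedError("this toy provider has no realtime mode")
```

Because the agent loop only ever holds a GuardedVoice, replacing FakeTTS with another provider leaves the guardrail and the barge-in check in place. The wrapper deliberately does not cover converse(), since a text guardrail has nothing to inspect on a speech-to-speech path.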
Real-world Use Case
- Building voice agents that may switch providers for cost, quality, or latency reasons — with the requirement that guardrails survive every provider swap.
- Multiple voice modes (TTS, STT, realtime STS) are in play in the same product, and each mode must apply consistent Agent Confession defenses.
- The speak() path must intercept directive content before it reaches the TTS provider, since audio output bypasses text-layer guardrails once it leaves the server.
Advantages
- Provider switch is configuration, not code — and capability flags ensure guardrails are not silently dropped when a new provider lacks a feature.
- The uniform speak() interception point applies Agent Confession defenses consistently across all TTS providers, preventing audio exfiltration of directive content.
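One way to make a provider switch a configuration change while still failing loudly on a missing feature is to select by declared capability. The sketch below assumes a hypothetical registry; the flag names, the vendor_a and vendor_b entries, and the voice_provider config key are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    TTS = auto()
    STT = auto()
    REALTIME_STS = auto()
    BARGE_IN = auto()
    TEXT_INTERCEPT = auto()  # text can be checked before it is synthesized


@dataclass(frozen=True)
class ProviderSpec:
    name: str
    capabilities: Capability


# Hypothetical registry; real entries would describe actual provider adapters.
REGISTRY = {
    "vendor_a": ProviderSpec(
        "vendor_a",
        Capability.TTS | Capability.STT | Capability.BARGE_IN | Capability.TEXT_INTERCEPT,
    ),
    "vendor_b": ProviderSpec("vendor_b", Capability.TTS | Capability.REALTIME_STS),
}

# What the agent loop needs before it will use a provider at all.
REQUIRED = Capability.TTS | Capability.TEXT_INTERCEPT | Capability.BARGE_IN


def select_provider(config: dict, required: Capability = REQUIRED) -> ProviderSpec:
    """Pick the configured provider, refusing it if a required capability is absent."""
    spec = REGISTRY[config["voice_provider"]]
    missing = required & ~spec.capabilities
    if missing:
        raise ValueError(f"{spec.name} lacks required capabilities: {missing}")
    return spec
```

With this shape, pointing the config at vendor_b raises at startup instead of quietly running without text interception or barge-in support.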
Disadvantages
- Lowest-common-denominator pressure on the abstraction — provider-specific voices and effects need explicit capability flags or they are lost on swap.
- Realtime STS bidirectional framing is hard to emulate when only TTS+STT are available; in STS mode, the guardrail must operate on audio tokens rather than text, which is significantly harder.
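To make the second trade-off concrete, here is a rough sketch of emulating converse() by chaining STT, the agent, the guardrail, and TTS. It is turn-based only, so it does not reproduce realtime bidirectional framing, but it does keep a text checkpoint in the middle that a native speech-to-speech path lacks. The function name cascaded_converse and its parameters (respond, is_directive_echo, refusal) are assumptions made for this example.

```python
from typing import Callable, Iterable, Iterator


def cascaded_converse(
    audio: Iterable[bytes],
    listen: Callable[[Iterable[bytes]], Iterator[str]],
    respond: Callable[[str], str],
    is_directive_echo: Callable[[str], bool],
    speak: Callable[[str], Iterator[bytes]],
    refusal: str = "Sorry, I can't share that.",
) -> Iterator[bytes]:
    """Emulate converse() with STT -> agent -> guardrail -> TTS, one turn at a time."""
    for utterance in listen(audio):
        reply = respond(utterance)
        # The reply exists as text here, so the text-layer guardrail still applies.
        if is_directive_echo(reply):
            reply = refusal
        yield from speak(reply)
```

A provider with native realtime STS would bypass this text checkpoint entirely, which is why the guardrail for that mode has to work on audio instead.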