Multilingual Voice Agent Stack
Compose a low-latency voice agent as a co-located STT→LLM→TTS pipeline where language identity flows end-to-end — no mid-pipeline translation hacks.
Intent & Description
🎯 Intent
Build a voice agent that speaks the user’s actual language with sub-second turn-taking.
📋 Context
You’re building a voice agent for a multilingual market (India’s 22 scheduled languages, Iberian Spanish and Catalan, etc.) on telephony channels where written input is rare and turn latency must be sub-second.
💡 Solution
Co-locate all three pipeline stages and pass language identity through all of them. Use STT models trained on target languages and accents. Pass detected language tags as structured metadata to the LLM. Use TTS voices native to the target language — never translate back to English mid-pipeline. Optimize for streaming at every hop (incremental STT, streaming LLM, streaming TTS). Treat code-switching as first-class.
Real-world Use Case
- The agent serves users in multiple languages or dialects with code-switching.
- Sub-second turn-taking requires streaming at every hop (STT, LLM, TTS).
- One vendor or co-located stack can carry language tags end-to-end.
Source
📌 TL;DR
STT→LLM→TTS with language tags flowing all the way through. Stream at every hop. Never translate mid-pipeline. That’s how you get sub-second multilingual voice.
Advantages
- Linguistic fidelity preserved end-to-end — no dialect mangling at component boundaries.
- Sub-second turn-taking achievable with streaming components.
- Single vendor owns the cross-component quality contract.
Disadvantages
- Language coverage is bounded by the weakest component in the pipeline.
- Streaming everywhere is significantly harder to implement than batch.
- Telephony audio quality is a hard ceiling on STT accuracy.