Delayed Streams Modeling

Requires time-aligned paired data, which is hard to obtain for some language pairs and modalities.
Fixed offset means latency cannot adapt to easy vs hard segments; a learned policy could.
Single model couples STT, LLM, and TTS quality; weakness in one role is hard to isolate.
Long-context behavioural shaping (instruction-following, refusals) is harder to apply cleanly than in a pipeline with a separate LLM stage.
Architecture commits to streaming use; batch tasks gain little from the offset structure.
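
Illustration of the fixed-offset point above, as a minimal sketch rather than the actual implementation. It assumes both streams are already aligned frame by frame and that the output stream is simply shifted right by a constant number of frames. The names `delay_stream`, `align`, and the `<pad>` token are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical padding token, not taken from any real tokenizer


def delay_stream(target: Sequence[str], delay: int, pad: str = PAD) -> List[str]:
    """Shift a frame-aligned target stream right by a fixed number of frames."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return [pad] * delay + list(target)


def align(
    source: Sequence[str], target: Sequence[str], delay: int, pad: str = PAD
) -> List[Tuple[str, str]]:
    """Pair each source frame with the delayed target frame at the same step."""
    delayed = delay_stream(target, delay, pad)
    n = max(len(source), len(delayed))
    # Pad the shorter stream so both cover the same number of steps.
    src = list(source) + [pad] * (n - len(source))
    tgt = delayed + [pad] * (n - len(delayed))
    return list(zip(src, tgt))


# Toy data: four input frames and the output frames aligned to them.
source = ["a0", "a1", "a2", "a3"]
target = ["hi", PAD, "there", PAD]

steps = align(source, target, delay=2)
for t, (s, y) in enumerate(steps):
    print(t, s, y)
```

Every output frame lands exactly `delay` steps after the input frame it is aligned to, whether the segment is easy or hard. A learned policy would instead choose that gap per segment.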