The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues

Published in Proceedings of the 16th International Workshop on Spoken Dialogue System Technology (IWSDS 2026), 2026

Download paper here

We compare end-to-end (E2E) audio language models with traditional modular pipelines (ASR, LLM, TTS) on multi-turn dialogue tasks. Evaluating open-source models on conversational naturalness and dialogue consistency metrics, we find that E2E configurations consistently underperform their modular counterparts. Our analysis shows that E2E models suffer severe degradation in dialogue quality as conversations progress, and that the root cause lies in deficient context maintenance and topic tracking rather than in component quality. These findings highlight a critical gap between the theoretical low-latency benefits of E2E audio language models and their practical ability to maintain coherence in complex multi-turn interactions, suggesting a need for targeted architectural improvements.

Recommended citation: Tam, Z.R., Chang, W.Y., & Chen, Y.N. (2026). "The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues." In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 76–82, Trento, Italy.