TLDR: A new benchmark called C3 has been developed to evaluate Spoken Dialogue Models (SDMs) in complex, real-world conversations, focusing on challenges like ambiguity (phonological and semantic), omission, coreference, and multi-turn interactions in both English and Chinese. The study found that SDMs struggle significantly with ambiguity, particularly semantic ambiguity in Chinese, and that handling omissions is the most difficult aspect of context-dependent dialogues. Overall, SDMs perform better in English than in Chinese.
Spoken Dialogue Models (SDMs) are becoming increasingly common, allowing users to interact with AI through voice. Think of voice assistants or chatbots that you speak to directly. While these models are gaining popularity, there’s been a noticeable gap in understanding how well they truly grasp and mimic the complexities of human conversation, especially compared to text-based AI models, which have been extensively tested.
Human voice interactions are inherently more intricate than text. This is due to unique characteristics of spoken dialogue, such as ambiguity. Ambiguity can arise from the meaning of words (like a word having multiple meanings) or from how words sound (like words that sound the same but are spelled differently). Another major challenge is context-dependency, where understanding a conversation requires knowing what was said before, like when something is left out or pronouns are used to refer to earlier mentions.
To address these challenges and shed light on the current state of SDM development, researchers have introduced a new benchmark dataset called C3. This dataset includes 1,079 conversation instances in both English and Chinese. It also comes with an evaluation method that uses advanced AI models as judges, with verdicts that closely match human judgment. This allows for a thorough exploration of how SDMs handle these real-world conversational difficulties.
The C3 benchmark focuses on five key phenomena that make spoken dialogues complex: phonological ambiguity (related to sound, like pauses, intonation, and stress), semantic ambiguity (related to meaning, like words with multiple interpretations or unclear sentence structures), omission (when parts of a sentence are left out but implied), coreference (when pronouns refer to previously mentioned entities), and multi-turn interaction (sustained conversations over several exchanges).
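To make the taxonomy concrete, here is a minimal Python sketch of how judged results could be grouped by language and phenomenon to produce per-category accuracy. Everything in it is an assumption made for illustration: the `Phenomenon` enum, the `accuracy_by_group` helper, the record layout, and the toy verdicts are hypothetical and do not come from the C3 paper or its released code.

```python
from collections import defaultdict
from enum import Enum


class Phenomenon(Enum):
    """The five phenomena named in the article (enum itself is hypothetical)."""
    PHONOLOGICAL_AMBIGUITY = "phonological ambiguity"
    SEMANTIC_AMBIGUITY = "semantic ambiguity"
    OMISSION = "omission"
    COREFERENCE = "coreference"
    MULTI_TURN = "multi-turn interaction"


def accuracy_by_group(records):
    """Return accuracy per (language, phenomenon) pair.

    `records` is an iterable of (language, Phenomenon, judged_correct)
    tuples, where judged_correct is the judge's True/False verdict.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, phenomenon, judged_correct in records:
        key = (language, phenomenon)
        totals[key] += 1
        correct[key] += int(judged_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Toy verdicts invented for illustration only; NOT results from the paper.
toy_records = [
    ("en", Phenomenon.OMISSION, True),
    ("en", Phenomenon.OMISSION, False),
    ("zh", Phenomenon.SEMANTIC_AMBIGUITY, False),
    ("zh", Phenomenon.SEMANTIC_AMBIGUITY, True),
    ("zh", Phenomenon.SEMANTIC_AMBIGUITY, False),
]

scores = accuracy_by_group(toy_records)
for (language, phenomenon), score in sorted(
    scores.items(), key=lambda item: (item[0][0], item[0][1].value)
):
    print(f"{language} {phenomenon.value}: {score:.2f}")
```

Keying the results on the (language, phenomenon) pair keeps the two axes of comparison separate, so a gap between languages and a gap between phenomena can each be read off directly.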
The study evaluated several popular end-to-end SDMs, including GPT-4o-Audio-Preview, Qwen2.5-Omni, and others. The findings reveal significant insights into their capabilities. One major takeaway is that ambiguity, particularly semantic ambiguity in Chinese, poses a greater challenge for SDMs than context-dependency does. For instance, SDMs scored much lower on Chinese semantic ambiguity tasks than on the English ones.
Another important finding is that processing omissions is generally the most difficult aspect for SDMs within context-dependent dialogues. While some models are better at detecting that something is missing, they struggle more with actually filling in the omitted content. This suggests that generating new, implied information is harder than simply identifying a gap.
Furthermore, the research highlights a notable performance gap between English and Chinese. Overall, SDMs tend to perform better in English dialogues across most phenomena. This indicates a need for improved cross-linguistic capabilities in current SDMs, especially for languages like Chinese, which have unique complexities such as tones.
The C3 benchmark, with its focus on real and complex challenges in spoken dialogues, is expected to be a valuable resource for researchers aiming to develop more natural and intelligent spoken interaction systems. For more details, you can refer to the original research paper.


