TLDR: TELEVAL is a new benchmark for evaluating Spoken Language Models (SLMs) in realistic Chinese interactive scenarios. Unlike previous benchmarks, TELEVAL focuses on how well SLMs act as conversational agents, assessing their ability to understand implicit cues like emotion and dialect, and respond naturally without explicit instructions. The benchmark defines three dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. Initial experiments show that while SLMs have progressed in understanding, they still struggle with natural, human-like conversational interactions, especially in handling implicit cues and maintaining consistent behavior across diverse tasks.
Spoken Language Models (SLMs) have seen significant advancements, leading to the creation of various benchmarks to assess their capabilities. However, many existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks similar to Large Language Models (LLMs), often overlooking how users naturally interact in real-world conversations. This gap highlights a crucial need for evaluation methods that truly reflect authentic user experiences.
To address this, researchers have introduced TELEVAL, a dynamic benchmark specifically designed to evaluate the effectiveness of SLMs as conversational agents in realistic Chinese interactive settings. TELEVAL shifts the focus from mere task completion to the model’s ability to engage in natural, human-like dialogue, particularly emphasizing its capacity to extract implicit cues from user speech and respond appropriately without needing explicit instructions.
Understanding TELEVAL’s Design
TELEVAL is structured around three key evaluation dimensions:
- Explicit Semantics: This dimension assesses the model’s understanding and response to the direct linguistic content of user speech. It includes tasks like Basic Knowledge (general and cultural knowledge), Dialect Comprehension (understanding dialectal audio input), Context Memory (retaining dialogue history in multi-turn conversations), Domain Knowledge (handling specific topics like public services), Chitchat & Human-likeness (engaging in natural, colloquial conversations), and Safety & Values (ensuring responses align with positive social values).
- Paralinguistic and Implicit Semantics: This dimension targets acoustic-level cues such as emotion, age, and non-speech vocalizations (like coughing), as well as implicit user intentions (like expecting a response in a specific dialect or an empathetic reply). Unlike previous benchmarks that might only check if a model can recognize these signals, TELEVAL evaluates the model’s ability to generate appropriate responses based on these subtle cues.
- System Abilities: This dimension examines the model’s performance at a system level, including its robustness under varying acoustic conditions (e.g., noise, reverberation, packet loss), response latency, and handling of user interruptions. The initial version of TELEVAL primarily focuses on Acoustic Robustness.
The benchmark employs a dialogue format consistent with real-world usage and evaluates both text and audio outputs separately. For factual question-answering tasks, it uses a string-matching approach to reduce biases that can arise from LLM-as-judge evaluations. For open-ended conversational tasks, it still utilizes LLM-as-judge but with carefully designed scoring prompts to minimize variability and bias.
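To make the string-matching idea concrete, here is a minimal sketch of how such a check might work. The normalization rules and the `references` answer format are illustrative assumptions, not TELEVAL's published implementation:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, apply NFKC folding (e.g., full-width to half-width),
    and strip whitespace and punctuation so surface variants compare equal."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\s\W_]+", "", text)

def string_match_score(response: str, references: list[str]) -> float:
    """Score 1.0 if any normalized reference answer appears as a substring
    of the normalized model response, else 0.0."""
    resp = normalize(response)
    return float(any(normalize(ref) in resp for ref in references if ref))

# Either surface form of the answer is accepted without invoking a judge model.
print(string_match_score("The capital of France is Paris.", ["Paris"]))  # 1.0
print(string_match_score("I'm not sure about that.", ["Paris"]))         # 0.0
```

Substring matching after normalization tolerates casing, punctuation, and full-width/half-width variation, which keeps factual scoring deterministic in a way an LLM judge cannot guarantee.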
Data for TELEVAL is constructed using a mix of synthesized and real human speech. Real human recordings are specifically used for tasks involving implicit semantics or casual conversation, such as emotional expressions and non-speech vocalizations, to capture authentic nuances that synthesized speech might miss.
Key Findings and Insights
Experiments conducted on TELEVAL with various SLMs, including open-source models and GPT-4o Audio, revealed several important insights:
- No single model consistently outperforms others across all evaluation dimensions, indicating differing training emphases and design priorities among SLMs.
- GPT-4o Audio shows a strong advantage in Basic Knowledge tasks, while Qwen2.5-Omni excels in tasks requiring natural and human-like responses. Kimi-Audio performs well with paralinguistic information but can sometimes be overly ‘task-oriented’.
- While models perform reasonably well on general knowledge and Chinese cultural knowledge, they struggle more with culturally specific knowledge from English-speaking communities.
- A significant challenge for SLMs lies in the Dialect Perception & Response task. Many models can understand dialectal audio but still respond in Mandarin, highlighting a gap in their ability to naturally follow dialectal cues without explicit instructions.
- The presence of non-speech vocalizations (NSVs) like coughing can disrupt model performance, and most models fail to generate empathetic or concerned responses to these cues. Similarly, models generally struggle to produce age-appropriate responses, often defaulting to a generic conversational style.
- Acoustic robustness tests show that various types of noise and packet loss significantly degrade model performance, although some SLMs surprisingly perform better under certain noisy conditions than on the clean baseline, suggesting differences in how they process speech embeddings (see the corruption sketch after this list).
- Regarding audio responses, Qwen2.5-Omni demonstrates the best consistency between its audio output and generated text (the lowest Character Error Rate), while overall audio quality, as measured by DNSMOS, is relatively similar across most models. Empathetic audio responses remain an area for improvement (a minimal CER sketch follows below).
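The paper's exact corruption pipeline is not spelled out here, but a common way to simulate the noisy and lossy conditions above is to mix noise at a target SNR and zero out random frames. The sketch below (NumPy, with an assumed frame size and drop probability) illustrates the idea, not TELEVAL's specific setup:

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech, scaled so the mixture has the target SNR (dB).
    Both arrays are assumed mono, float, and at the same sample rate."""
    if len(noise) < len(speech):  # loop the noise to cover the utterance
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + 1e-12
    # Choose scale so speech_power / (scale**2 * noise_power) = 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def drop_packets(speech: np.ndarray, frame_len: int = 320,
                 drop_prob: float = 0.1, seed: int = 0) -> np.ndarray:
    """Crudely simulate packet loss by zeroing random fixed-size frames
    (320 samples = 20 ms at 16 kHz)."""
    rng = np.random.default_rng(seed)
    out = speech.copy()
    for start in range(0, len(out), frame_len):
        if rng.random() < drop_prob:
            out[start:start + frame_len] = 0.0
    return out
```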
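The text-audio consistency check above boils down to a Character Error Rate between an ASR transcript of the model's audio and the text it generated in the same turn. A minimal CER implementation via Levenshtein distance, assuming the ASR step has already produced the hypothesis string:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance between reference and
    hypothesis characters, normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] = distance(reference[:i], hypothesis[:j])
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev_diag + cost)   # substitution (or match)
            prev_diag = cur
    return dp[n] / max(m, 1)

# Compare the model's generated text against an ASR transcript of its audio.
print(cer("今天天气很好", "今天天气很好"))  # 0.0  (perfectly consistent)
print(cer("今天天气很好", "今天天汽很好"))  # ≈0.17 (one substituted character)
```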
In conclusion, despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. They often prioritize completing predefined tasks over naturally incorporating user paralinguistic signals into their responses. This indicates that SLMs are still far from being fully autonomous conversational agents capable of handling the complexities of natural human interaction. The TELEVAL benchmark aims to serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs. For more details, you can refer to the full research paper: TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios.


