TLDR: TELEVAL is a new benchmark for evaluating Spoken Language Models (SLMs) in realistic Chinese interactive scenarios. Unlike previous benchmarks, TELEVAL focuses on how well SLMs act as conversational agents, assessing their ability to understand implicit cues like emotion and dialect, and respond naturally without explicit instructions. The benchmark defines three dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. Initial experiments show that while SLMs have progressed in understanding, they still struggle with natural, human-like conversational interactions, especially in handling implicit cues and maintaining consistent behavior across diverse tasks.
Spoken Language Models (SLMs) have seen significant advancements, leading to the creation of various benchmarks to assess their capabilities. However, many existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks similar to Large Language Models (LLMs), often overlooking how users naturally interact in real-world conversations. This gap highlights a crucial need for evaluation methods that truly reflect authentic user experiences.
To address this, researchers have introduced TELEVAL, a dynamic benchmark specifically designed to evaluate the effectiveness of SLMs as conversational agents in realistic Chinese interactive settings. TELEVAL shifts the focus from mere task completion to the model’s ability to engage in natural, human-like dialogue, particularly emphasizing its capacity to extract implicit cues from user speech and respond appropriately without needing explicit instructions.
Understanding TELEVAL’s Design
TELEVAL is structured around three key evaluation dimensions:
- Explicit Semantics: This dimension assesses the model’s understanding and response to the direct linguistic content of user speech. It includes tasks like Basic Knowledge (general and cultural knowledge), Dialect Comprehension (understanding dialectal audio input), Context Memory (retaining dialogue history in multi-turn conversations), Domain Knowledge (handling specific topics like public services), Chitchat & Human-likeness (engaging in natural, colloquial conversations), and Safety & Values (ensuring responses align with positive social values).
- Paralinguistic and Implicit Semantics: This dimension targets acoustic-level cues such as emotion, age, and non-speech vocalizations (like coughing), as well as implicit user intentions (like expecting a response in a specific dialect or an empathetic reply). Unlike previous benchmarks that might only check if a model can recognize these signals, TELEVAL evaluates the model’s ability to generate appropriate responses based on these subtle cues.
- System Abilities: This dimension examines the model’s performance at a system level, including its robustness under varying acoustic conditions (e.g., noise, reverberation, packet loss), response latency, and handling of user interruptions. The initial version of TELEVAL primarily focuses on Acoustic Robustness.
The benchmark employs a dialogue format consistent with real-world usage and evaluates both text and audio outputs separately. For factual question-answering tasks, it uses a string-matching approach to reduce biases that can arise from LLM-as-judge evaluations. For open-ended conversational tasks, it still utilizes LLM-as-judge but with carefully designed scoring prompts to minimize variability and bias.
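To make the string-matching idea concrete, here is a minimal sketch of how such a check might work. The normalization rules and the `references` answer format are illustrative assumptions, not TELEVAL's published implementation:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, apply NFKC folding (e.g., full-width to half-width),
    and strip whitespace and punctuation so surface variants compare equal."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\s\W_]+", "", text)

def string_match_score(response: str, references: list[str]) -> float:
    """Score 1.0 if any normalized reference answer appears as a substring
    of the normalized model response, else 0.0."""
    resp = normalize(response)
    return float(any(normalize(ref) in resp for ref in references if ref))

# Either surface form of the answer is accepted without invoking a judge model.
print(string_match_score("The capital of France is Paris.", ["Paris"]))  # 1.0
print(string_match_score("I'm not sure about that.", ["Paris"]))         # 0.0
```

Substring matching after normalization tolerates casing, punctuation, and full-width/half-width variation, which keeps factual scoring deterministic in a way an LLM judge cannot guarantee.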
Data for TELEVAL is constructed using a mix of synthesized and real human speech. Real human recordings are specifically used for tasks involving implicit semantics or casual conversation, such as emotional expressions and non-speech vocalizations, to capture authentic nuances that synthesized speech might miss.
Key Findings and Insights
Experiments conducted on TELEVAL with various SLMs, including open-source models and GPT-4o Audio, revealed several important insights:
- No single model consistently outperforms others across all evaluation dimensions, indicating differing training emphases and design priorities among SLMs.
- GPT-4o Audio shows a strong advantage in Basic Knowledge tasks, while Qwen2.5-Omni excels in tasks requiring natural and human-like responses. Kimi-Audio performs well with paralinguistic information but can sometimes be overly ‘task-oriented’.
- While models perform reasonably well on general knowledge and Chinese cultural knowledge, they struggle more with culturally specific knowledge from English-speaking communities.
- A significant challenge for SLMs lies in the Dialect Perception & Response task. Many models can understand dialectal audio but still respond in Mandarin, highlighting a gap in their ability to naturally follow dialectal cues without explicit instructions.
- The presence of non-speech vocalizations (NSVs) like coughing can disrupt model performance, and most models fail to generate empathetic or concerned responses to these cues. Similarly, models generally struggle to produce age-appropriate responses, often defaulting to a generic conversational style.
- Acoustic robustness tests show that various types of noise and packet loss significantly degrade model performance, although some SLMs surprisingly perform better under certain noisy conditions than on the clean baseline, suggesting differences in how they process speech embeddings (see the corruption sketch after this list).
- Regarding audio responses, Qwen2.5-Omni demonstrates the best consistency between its audio output and generated text (the lowest Character Error Rate), while overall audio quality, as measured by DNSMOS, is relatively similar across most models. Empathetic audio responses remain an area for improvement (a minimal CER sketch follows below).
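The paper's exact corruption pipeline is not spelled out here, but a common way to simulate the noisy and lossy conditions above is to mix noise at a target SNR and zero out random frames. The sketch below (NumPy, with an assumed frame size and drop probability) illustrates the idea, not TELEVAL's specific setup:

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech, scaled so the mixture has the target SNR (dB).
    Both arrays are assumed mono, float, and at the same sample rate."""
    if len(noise) < len(speech):  # loop the noise to cover the utterance
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + 1e-12
    # Choose scale so speech_power / (scale**2 * noise_power) = 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def drop_packets(speech: np.ndarray, frame_len: int = 320,
                 drop_prob: float = 0.1, seed: int = 0) -> np.ndarray:
    """Crudely simulate packet loss by zeroing random fixed-size frames
    (320 samples = 20 ms at 16 kHz)."""
    rng = np.random.default_rng(seed)
    out = speech.copy()
    for start in range(0, len(out), frame_len):
        if rng.random() < drop_prob:
            out[start:start + frame_len] = 0.0
    return out
```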
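The text-audio consistency check above boils down to a Character Error Rate between an ASR transcript of the model's audio and the text it generated in the same turn. A minimal CER implementation via Levenshtein distance, assuming the ASR step has already produced the hypothesis string:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance between reference and
    hypothesis characters, normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # dp[j] = distance(reference[:i], hypothesis[:j])
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev_diag + cost)   # substitution (or match)
            prev_diag = cur
    return dp[n] / max(m, 1)

# Compare the model's generated text against an ASR transcript of its audio.
print(cer("今天天气很好", "今天天气很好"))  # 0.0  (perfectly consistent)
print(cer("今天天气很好", "今天天汽很好"))  # ≈0.17 (one substituted character)
```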
In conclusion, despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. They often prioritize completing predefined tasks over naturally incorporating user paralinguistic signals into their responses. This indicates that SLMs are still far from being fully autonomous conversational agents capable of handling the complexities of natural human interaction. The TELEVAL benchmark aims to serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs. For more details, you can refer to the full research paper: TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios.


