TLDR: The C3T (Cross-modal Capabilities Conservation Test) is a new benchmark designed to evaluate speech-aware large language models (LLMs). It assesses how well language understanding capabilities are preserved when LLMs receive speech input, focusing on fairness across diverse speakers (age, gender, dialect) and robustness between text and speech modalities. Using voice cloning for diverse audio prompts and specific fairness metrics, C3T revealed significant performance drops and inconsistencies in leading speech-aware LLMs, highlighting the need for more robust and equitable AI development.
Large Language Models (LLMs) are evolving rapidly, and the rise of multimodal models, especially those that can understand speech, has introduced new evaluation challenges. While LLMs have been rigorously tested on factual knowledge, reasoning, and mathematics through text, assessing the same capabilities when the model receives speech input is considerably harder. This is where a new benchmark, the Cross-modal Capabilities Conservation Test (C3T), comes into play, aiming to provide a comprehensive evaluation of speech-aware LLMs.
The core problem C3T addresses is whether the language understanding capabilities of an LLM are truly preserved when it receives input through speech, rather than text. It’s not just about accurate speech recognition; it’s about ensuring the model’s understanding doesn’t degrade or become unfair to certain groups of speakers, even if the transcription is perfect. Previous evaluation methods often fell short because they didn’t account for the nuances of spoken language, the diversity of human voices, or the potential for discriminatory behavior.
C3T stands out by focusing on several key aspects. Firstly, it uses textual tasks that are carefully filtered to be plausible for spoken interaction. Many traditional LLM benchmarks, filled with complex exam questions or lengthy mathematical equations, are simply not realistic for a voice interface. C3T selects tasks that a user would genuinely speak aloud.
Secondly, to address the need for diverse speakers without the massive effort of recording countless individuals, C3T employs a sophisticated voice cloning text-to-speech model. This technology allows the benchmark to synthesize a wide range of voices, simulating different ages, genders, and dialects. This is crucial for evaluating the model’s fairness across various demographic groups.
The benchmark introduces a set of detailed metrics beyond simple accuracy. It measures 'Overall Fairness,' which checks whether the model gives the same answer to a task regardless of who is speaking. It also tracks 'Conditional Fairness' for specific speaker characteristics such as age, gender, or dialect, applying the same consistency check within each demographic group to surface bias. Finally, 'Cross-modal Robustness' verifies that the model not only behaves fairly across speakers but also yields consistent results whether the same task arrives as text or as speech.
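To make these notions concrete, here is a minimal sketch of how answer-consistency metrics of this kind can be computed. The function names and data layout are my own illustration; the paper's exact metric definitions may differ. The idea in both cases is the same: a task counts as "fair" or "robust" only if every rendition of it elicits the same answer.

```python
def overall_fairness(answers_by_speaker):
    """Fraction of tasks for which every synthesized speaker elicits
    the same answer from the model.

    answers_by_speaker: dict mapping task_id -> {speaker_id: answer}
    """
    consistent = sum(
        1 for answers in answers_by_speaker.values()
        if len(set(answers.values())) == 1
    )
    return consistent / len(answers_by_speaker)


def cross_modal_robustness(text_answers, answers_by_speaker):
    """Fraction of tasks where the answer to the text version matches
    the answer given for every spoken rendition of the same task.

    text_answers: dict mapping task_id -> answer (text modality)
    """
    consistent = 0
    for task_id, text_answer in text_answers.items():
        spoken = answers_by_speaker[task_id].values()
        if all(answer == text_answer for answer in spoken):
            consistent += 1
    return consistent / len(text_answers)
```

Conditional Fairness would apply `overall_fairness` restricted to speakers sharing one attribute value (e.g. only speakers of a given dialect), allowing per-group comparison.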
In experiments, C3T evaluated several prominent speech-aware LLMs, including Audio Flamingo 3, Qwen2-Audio, Ultravox, and Voxtral Mini. The findings revealed a significant drop in exact match accuracy, ranging from 4% to 13%, when models moved from text to speech input. More critically, the results highlighted substantial fairness issues. Even on tasks the models could solve, a high percentage showed inconsistent answers across different speakers: in over 98% of tasks that models could solve via speech, at least one speaker received an incorrect answer. In other words, a model that performs well on average can still behave inconsistently, and potentially unfairly, toward specific demographic groups.
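The "over 98%" statistic can be understood as the share of speech-solvable tasks that show speaker-dependent failures. A rough illustration of how such a number would be computed follows; the function name and data layout are hypothetical, not taken from the paper.

```python
def speech_solvable_inconsistency(correct_by_speaker):
    """Among tasks where at least one speaker's rendition is answered
    correctly ("solvable via speech"), return the fraction where at
    least one other speaker's rendition is answered incorrectly.

    correct_by_speaker: dict mapping task_id -> {speaker_id: bool}
    """
    solvable = {
        task: results for task, results in correct_by_speaker.items()
        if any(results.values())
    }
    if not solvable:
        return 0.0
    inconsistent = sum(
        1 for results in solvable.values() if not all(results.values())
    )
    return inconsistent / len(solvable)
```

A value near 1.0, as reported in the paper, means that almost every task a model can solve through speech fails for at least one speaker.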
The research paper, available at arxiv.org/pdf/2509.12171, concludes that C3T offers a more fine-grained and realistic evaluation of speech-aware LLMs. By moving beyond raw accuracy and focusing on the preservation of language understanding, fairness across speakers, and robustness across modalities, C3T provides critical insights into the real-world performance and ethical considerations of these advanced AI systems. It confirms that even top-performing models can exhibit inconsistent behavior across modalities, underscoring the importance of such specialized benchmarks for future development.