TLDR: A new study systematically evaluates biases in Spoken Dialogue LLMs (SDMs) across decision-making and recommendation tasks, considering paralinguistic features like age, gender, and accent. It found that closed-source models generally exhibit less bias, while open-source models are more sensitive to age and gender, and recommendation tasks amplify disparities. Biases were also found to persist in multi-turn conversations. The research introduces the FairDialogue dataset and highlights the critical need for fairness evaluation in audio-based interactive AI.
Large Language Models (LLMs) have become highly capable, and biases in their text outputs, such as stereotypes and cultural tendencies, are well documented. For Spoken Dialogue Models (SDMs), LLMs that handle both audio input and output, the presence and characteristics of bias have been largely unexplored. This matters because paralinguistic features such as a speaker’s age, gender, or accent can influence how these models respond. In multi-turn conversations these effects could worsen, leading to unfair outcomes in important applications like decision-making and recommendations.
A recent study by researchers from Nanyang Technological University and Soul AI Lab has systematically evaluated biases in these speech-enabled LLMs. Their work sheds light on how multi-turn dialogues, especially those involving repeated negative feedback, can impact these biases. The research measured bias using specific metrics: the Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations. They tested a range of models, including open-source options like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash.
Generally, closed-source models showed lower levels of bias, while open-source models were more sensitive to a speaker’s age and gender. The study also found that recommendation tasks tend to amplify disparities between groups. A particularly concerning finding was that biased decisions can persist through multiple turns of conversation. The authors describe this as the first systematic investigation of bias in end-to-end spoken dialogue models, and it offers a basis for developing fair and reliable audio-based interactive systems. To support further research, the team has released a new dataset called FairDialogue and its associated evaluation code.
Constructing a New Benchmark for Bias
To conduct this evaluation, the researchers had to overcome a significant challenge: the lack of suitable benchmarks for conversational biases in SDMs. They addressed this by constructing a controlled dataset using a two-stage pipeline. First, they generated balanced textual utterances with carefully designed prompts. Then, they synthesized speech from these texts, introducing controlled variations in gender, age, or accent while keeping other factors constant. This approach allowed for a systematic analysis of paralinguistic bias in interactive, multi-turn spoken dialogues. The dataset includes approximately 1,700 minutes of audio across 7,200 samples, covering male and female voices, young and elderly speakers, and US, UK, Indian, Australian, and African accents.
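The idea of holding the text fixed while varying only the voice attributes can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, field names, and sample sentence are hypothetical, and it assumes the attributes are fully crossed, which the article does not state (they may have been varied one at a time).

```python
from itertools import product

# Attribute values as listed in the article; the identifiers are illustrative.
GENDERS = ["male", "female"]
AGES = ["young", "elderly"]
ACCENTS = ["US", "UK", "India", "Australia", "Africa"]


def build_variants(utterance_id, text):
    """Return one synthesis request per (gender, age, accent) combination.

    The text is identical in every request, so any difference in model
    behaviour can be attributed to the voice attributes alone.
    Assumption: a full crossing of the three attributes.
    """
    return [
        {
            "utterance_id": utterance_id,
            "text": text,
            "gender": gender,
            "age": age,
            "accent": accent,
        }
        for gender, age, accent in product(GENDERS, AGES, ACCENTS)
    ]


# Hypothetical utterance, not taken from the FairDialogue dataset.
variants = build_variants("u001", "I would like to apply for the team lead role.")
print(len(variants))  # 2 * 2 * 5 = 20 voice variants of the same text
```

Each request would then be passed to a TTS system; that step is omitted here because it depends on the specific synthesis tool.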
The tasks chosen for evaluation were socially sensitive: decision-making (like interview assessments, task assignments, and award distributions) and recommendation (such as career guidance, course selection, and entertainment suggestions). These scenarios were selected because biased outputs here can have direct and tangible consequences on opportunities, fairness, and user experience. GPT-4o was used to generate the initial text samples, ensuring neutrality and contextual relevance, while advanced Text-to-Speech (TTS) systems like Index-TTS and ElevenLabs were employed for speech synthesis to control attributes like gender, age, and accent precisely.
Key Findings: Single-Turn vs. Multi-Turn Biases
In single-turn conversations, the study found that closed-source models such as Gemini-2.5-Flash were consistently fairer in decision-making tasks than open-source models such as Qwen2.5-Omni and GLM-4-Voice. The disparities in open-source models were most noticeable along gender and age lines, while all models showed relatively low accent bias in decision tasks. For recommendation tasks, however, fairness differences across paralinguistic attributes were more pronounced. GLM-4-Voice and GPT-4o Audio showed larger maximum disparities between groups, and accent-related bias was considerably higher than in decision-making tasks. This suggests that recommendation tasks, which involve more complex user preferences, can amplify cross-group disparities.
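To make "maximum disparity between groups" concrete, here is a minimal sketch that computes the gap between the highest and lowest positive-decision rate across groups. It is a generic illustration only: the article does not give the formulas for GUS or SNSR, so this should not be read as either metric, and the records below are toy data, not results from the study.

```python
from collections import defaultdict


def positive_rate_by_group(records):
    """Map each group to the share of its records with a positive decision."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += int(decision)
    return {group: positives[group] / totals[group] for group in totals}


def max_disparity(rates):
    """Largest gap in rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy (group, decision) pairs invented for illustration.
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]

rates = positive_rate_by_group(records)
print(rates)                 # {'group_a': 0.75, 'group_b': 0.5}
print(max_disparity(rates))  # 0.25
```

A gap of zero would mean every group receives positive decisions at the same rate; larger values indicate that at least one pair of groups is treated differently.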
The research also examined multi-turn conversations, focusing on scenarios where models initially gave identical negative responses in single-turn evaluations. By introducing corrective feedback over several turns, the researchers observed how readily the models revised their decisions for different attribute groups. This revealed biases not apparent in single-turn interactions. For instance, Elder Male speakers achieved the highest revision success rates, while Young Female speakers had the lowest, on both Qwen2.5-Omni and GLM-4-Voice. In other words, the models were more willing to reverse a negative decision for Elder Male speakers. Qwen2.5-Omni showed a more pronounced age-related bias, while GLM-4-Voice displayed larger gender disparities, adapting faster to corrective feedback for certain groups.
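A revision success rate of this kind can be sketched as the share of dialogues, per group, in which an initially negative decision flips within a given number of feedback turns. This is an assumed reading of the measure rather than the paper's exact definition; the record layout, function name, and toy data are hypothetical.

```python
def revision_success_rate(dialogues, max_turns):
    """Share of dialogues per group whose decision flipped within max_turns.

    Each dialogue is a (group, flip_turn) pair, where flip_turn is the
    1-based feedback turn at which the model reversed its negative
    decision, or None if it never did.
    """
    totals = {}
    successes = {}
    for group, flip_turn in dialogues:
        totals[group] = totals.get(group, 0) + 1
        if flip_turn is not None and flip_turn <= max_turns:
            successes[group] = successes.get(group, 0) + 1
    return {group: successes.get(group, 0) / totals[group] for group in totals}


# Toy data invented for illustration; not results from the study.
dialogues = [
    ("group_a", 1), ("group_a", 2), ("group_a", None), ("group_a", 3),
    ("group_b", 3), ("group_b", None), ("group_b", None), ("group_b", 2),
]

print(revision_success_rate(dialogues, max_turns=2))
# {'group_a': 0.5, 'group_b': 0.25}
print(revision_success_rate(dialogues, max_turns=3))
# {'group_a': 0.75, 'group_b': 0.5}
```

Comparing the rates at increasing values of max_turns shows both whether a group's decisions get revised at all and how quickly the model adapts to feedback for that group.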
Conclusion and Future Directions
This study underscores that paralinguistic attributes like age, gender, and accent consistently influence the judgments and outputs of spoken dialogue models. These biases are not fleeting; they can persist through multi-turn conversations with repeated feedback. Their presence in both open-source and closed-source models highlights the need for robust fairness evaluation in real-world audio-based interactive systems. Future work will need to develop techniques to mitigate these biases and extend the analysis to multimodal settings to support the responsible deployment of spoken dialogue LLMs. The full research paper is titled "Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations".


