Unpacking Bias in Voice AI: A New Study on Spoken Dialogue Models

TLDR: A new study systematically evaluates biases in Spoken Dialogue LLMs (SDMs) across decision-making and recommendation tasks, considering paralinguistic features like age, gender, and accent. It found that closed-source models generally exhibit less bias, while open-source models are more sensitive to age and gender, and recommendation tasks amplify disparities. Biases were also found to persist in multi-turn conversations. The research introduces the FairDialogue dataset and highlights the critical need for fairness evaluation in audio-based interactive AI.

Bias in the text outputs of Large Language Models (LLMs), such as stereotypes and cultural tendencies, is a well-known concern. For Spoken Dialogue Models (SDMs), which are LLMs that handle both audio input and output, the presence and characteristics of bias have been largely unexplored. This matters because paralinguistic features such as a speaker’s age, gender, or accent can influence how these models respond. In multi-turn conversations these effects could worsen, leading to unfair outcomes in important applications like decision-making and recommendations.

A recent study by researchers from Nanyang Technological University and Soul AI Lab has systematically evaluated biases in these speech-enabled LLMs. Their work sheds light on how multi-turn dialogues, especially those involving repeated negative feedback, can impact these biases. The research measured bias using specific metrics: the Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations. They tested a range of models, including open-source options like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash.
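The article does not reproduce the formula for the Group Unfairness Score, so the following is only a minimal sketch of the general idea behind group-level decision metrics: compare how often each speaker group receives a positive decision. The function name `positive_rate_gap`, the max-minus-min gap, and the toy records are all assumptions made for illustration and should not be read as the paper's definition of GUS.

```python
from collections import defaultdict


def positive_rate_gap(records):
    """Return (gap, rates) for an iterable of (group, decision) pairs.

    `decision` is True for a positive outcome (e.g. "hire") and False otherwise.
    The gap is the largest positive rate minus the smallest one; this is an
    illustrative disparity measure, NOT the paper's GUS formula.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += int(decision)
    if not totals:
        raise ValueError("records must contain at least one decision")
    rates = {group: positives[group] / totals[group] for group in totals}
    gap = max(rates.values()) - min(rates.values())
    return gap, rates


# Hypothetical toy data: the same prompts spoken by two synthetic voice groups.
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]
gap, rates = positive_rate_gap(records)
print(rates, gap)
```

A gap of zero would mean every group received positive decisions at the same rate on identical content; larger values indicate that the voice attribute alone shifted the outcome.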

Closed-source models generally showed lower levels of bias, while open-source models were more sensitive to a speaker’s age and gender. Recommendation tasks tended to amplify disparities between groups, and biased decisions persisted across multiple turns of conversation. The authors describe the work as the first systematic investigation of bias in end-to-end spoken dialogue models, intended to inform the development of fair and reliable audio-based interactive systems. To support further research, the team has released a new dataset called FairDialogue along with its evaluation code.

Constructing a New Benchmark for Bias

To conduct this evaluation, the researchers had to address the lack of suitable benchmarks for conversational bias in SDMs. They constructed a controlled dataset using a two-stage pipeline. First, they generated balanced textual utterances with carefully designed prompts. Then, they synthesized speech from these texts, introducing controlled variations in gender, age, or accent while keeping other factors constant. Because the spoken content is identical across voices, differences in model output can be attributed to the paralinguistic attribute rather than to the wording. The dataset includes approximately 1,700 minutes of audio across 7,200 samples, covering male/female genders, young/elderly ages, and US, UK, India, Australia, and African accents.
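The controlled design can be pictured as a manifest in which each text utterance is paired with a set of voice conditions. The sketch below enumerates those conditions from the attribute values listed above. Fully crossing gender, age, and accent is an assumption made for illustration (the study may vary one attribute at a time), and `build_manifest` and the sample utterance are hypothetical. No speech is synthesized here; each row would be handed to a TTS system.

```python
from itertools import product

# Attribute values as listed in the article.
GENDERS = ["male", "female"]
AGES = ["young", "elderly"]
ACCENTS = ["US", "UK", "India", "Australia", "African"]


def build_manifest(utterances):
    """Pair every utterance with every voice condition.

    The text is held constant within an utterance, so only the voice
    attributes differ between rows. Full crossing is an assumption.
    """
    manifest = []
    for utterance_id, text in enumerate(utterances):
        for gender, age, accent in product(GENDERS, AGES, ACCENTS):
            manifest.append({
                "utterance_id": utterance_id,
                "text": text,
                "gender": gender,
                "age": age,
                "accent": accent,
            })
    return manifest


# Hypothetical example utterance, not taken from the dataset.
utterances = ["I would like to be considered for the team lead role."]
manifest = build_manifest(utterances)
print(len(manifest), manifest[0])
```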

The tasks chosen for evaluation were socially sensitive: decision-making (such as interview assessments, task assignments, and award distributions) and recommendation (such as career guidance, course selection, and entertainment suggestions). These scenarios were selected because biased outputs here can have direct consequences for opportunities, fairness, and user experience. GPT-4o was used to generate the initial text samples, with the aim of keeping them neutral and contextually relevant, while Text-to-Speech (TTS) systems such as Index-TTS and ElevenLabs were used for speech synthesis so that gender, age, and accent could be controlled.

Key Findings: Single-Turn vs. Multi-Turn Biases

In single-turn conversations, the study found that closed-source models such as Gemini-2.5-Flash consistently demonstrated better fairness in decision-making tasks than open-source models such as Qwen2.5-Omni and GLM-4-Voice. The disparities in open-source models were most noticeable along gender and age lines, and all models showed relatively low bias concerning accent in decision tasks. For recommendation tasks, however, fairness differences across paralinguistic attributes were more pronounced. GLM-4-Voice and GPT-4o Audio showed larger maximum disparities between groups, and accent-related bias was markedly higher than in decision-making tasks. This suggests that recommendation tasks, which involve more complex user preferences, can amplify cross-group disparities.
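To make "maximum disparity between groups" concrete for recommendation outputs, one simple approach is to compare the recommended item sets that different speaker groups receive for the same request. The sketch below uses Jaccard overlap and reports the most dissimilar pair of groups. This is an illustrative stand-in and not the paper's SNSR definition; the function names and the toy recommendation lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def max_pairwise_disparity(recommendations):
    """Return (disparity, pair) for the least similar pair of groups.

    `recommendations` maps a group name to the set of items recommended to it.
    Disparity is 1 - Jaccard similarity, so 0 means identical lists.
    """
    if len(recommendations) < 2:
        raise ValueError("need at least two groups to compare")
    scored = [
        (1.0 - jaccard(recommendations[g1], recommendations[g2]), (g1, g2))
        for g1, g2 in combinations(recommendations, 2)
    ]
    return max(scored, key=lambda item: item[0])


# Hypothetical toy data: career suggestions returned for three accent groups.
recommendations = {
    "accent_a": {"engineer", "doctor", "teacher"},
    "accent_b": {"engineer", "doctor", "lawyer"},
    "accent_c": {"engineer", "nurse", "clerk"},
}
disparity, pair = max_pairwise_disparity(recommendations)
print(disparity, pair)
```

Set overlap ignores ranking order, so a rank-aware similarity would be needed if the position of each recommendation matters.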

The research also examined multi-turn conversations, focusing on scenarios where models initially gave identical negative responses in single-turn evaluations. By introducing corrective feedback over several turns, the researchers observed how readily the models revised their decisions for each attribute group. This revealed biases not apparent in single-turn interactions. For instance, elderly male speakers achieved the highest revision success rates, while young female speakers had the lowest, across both Qwen2.5-Omni and GLM-4-Voice. In other words, the models were more willing to change a negative decision when the feedback came from an elderly male voice. Qwen2.5-Omni showed a more pronounced age-related bias, while GLM-4-Voice displayed larger gender disparities, adapting faster to corrective feedback for certain groups.
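A revision success rate of this kind can be computed from per-dialogue logs that record the turn at which the model changed its initial negative decision, if it ever did. The sketch below is a minimal illustration under that assumption. The log format, the `max_turns` cutoff, the function name `revision_stats`, and the toy data are hypothetical and not taken from the study.

```python
from collections import defaultdict


def revision_stats(dialogues, max_turns=3):
    """Summarize how often and how quickly decisions were revised per group.

    `dialogues` is an iterable of (group, flip_turn) pairs, where flip_turn is
    the 1-based feedback turn at which the model revised its decision, or None
    if it never did. Flips after `max_turns` count as not revised.
    """
    by_group = defaultdict(list)
    for group, flip_turn in dialogues:
        by_group[group].append(flip_turn)

    stats = {}
    for group, flips in by_group.items():
        revised = [t for t in flips if t is not None and t <= max_turns]
        stats[group] = {
            "success_rate": len(revised) / len(flips),
            "mean_turns": sum(revised) / len(revised) if revised else None,
        }
    return stats


# Hypothetical toy logs for two speaker groups.
dialogues = [
    ("group_a", 1), ("group_a", 2), ("group_a", None), ("group_a", 1),
    ("group_b", 3), ("group_b", None), ("group_b", None), ("group_b", 2),
]
stats = revision_stats(dialogues)
print(stats)
```

Reporting the mean number of turns alongside the success rate separates two effects: whether a group's feedback is eventually accepted, and how much repetition it takes.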

Conclusion and Future Directions

The study shows that paralinguistic attributes like age, gender, and accent consistently influence the judgments and outputs of spoken dialogue models. These biases are not fleeting; they can persist through multi-turn conversations with repeated feedback. Their presence across both open-source and closed-source models highlights the need for robust fairness evaluation in real-world audio-based interactive systems. Future work will need to develop techniques to mitigate these biases and extend the analysis to multimodal settings to support the responsible deployment of spoken dialogue LLMs. The full research paper is titled "Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations".
