TLDR: A new research paper introduces a Counterfactual Bias Evaluation Framework that automatically generates realistic, open-ended questions to detect subtle biases in Large Language Models (LLMs). The framework, used to build the human-verified CAB benchmark, iteratively refines questions to elicit the most biased responses across sensitive attributes such as sex, race, and religion. The study found that while models like GPT-5 show lower bias, persistent issues remain, and the framing of a question (explicit vs. implicit) significantly affects how much bias is detected. The research also provides a detailed taxonomy of bias types, aiming to improve fairness in AI.
Large language models (LLMs) are now a common part of our daily lives, used by hundreds of millions of people across a wide range of applications. As our reliance on these AI systems grows, so do concerns about the biases embedded in them. These biases can systematically disadvantage or stereotype certain groups, often without users even realizing it.
Traditional methods for evaluating bias in LLMs often fall short. Many existing benchmarks use templated prompts or restrictive multiple-choice questions that are too simplistic and don’t reflect the complex ways people interact with LLMs in the real world. These methods can also struggle to differentiate between a model exhibiting bias and a model simply acknowledging the existence of societal biases.
To address these limitations, researchers have introduced a new framework for evaluating bias: the Counterfactual Bias Evaluation Framework. This approach automatically generates realistic, open-ended questions designed to elicit biased behavior in LLMs, focusing on sensitive attributes such as sex, race, and religion.
The framework works by iteratively modifying and selecting the questions most likely to induce biased responses (a minimal sketch of this loop follows below). This systematic exploration helps uncover the areas where models are most vulnerable to biased behavior. Beyond detecting harmful biases, the framework also captures other important aspects of user interactions, such as when models refuse to answer certain questions or explicitly acknowledge bias.
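To make this concrete, here is a minimal Python sketch of such a mutate-and-select loop. Every function and parameter here is a hypothetical stand-in for a real LLM API call, chosen for illustration; this is a sketch of the idea, not the paper's actual implementation:

```python
import random

def call_generation_model(question: str) -> str:
    """Stand-in for an LLM call that rewrites a question into a variant."""
    return question + " (variant)"

def call_target_model(question: str) -> str:
    """Stand-in for the model under evaluation answering the question."""
    return "answer to: " + question

def call_judge_model(question: str, answer: str) -> float:
    """Stand-in for a judge LLM scoring bias in the answer, in [0, 1]."""
    return random.random()

def evolve_questions(seed_questions: list[str], rounds: int = 5, keep_top: int = 10) -> list[str]:
    pool = list(seed_questions)
    for _ in range(rounds):
        # Generate a variant of every surviving question.
        candidates = pool + [call_generation_model(q) for q in pool]
        # Fitness of a question = judged bias of the target model's answer to it.
        scored = sorted(
            ((call_judge_model(q, call_target_model(q)), q) for q in candidates),
            reverse=True,
        )
        # Keep the questions most likely to induce biased responses.
        pool = [q for _, q in scored[:keep_top]]
    return pool
```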
Using this framework, a new human-verified benchmark called CAB (Counterfactual Assessment of Bias) has been developed. CAB includes a diverse range of topics and is designed to allow for direct comparisons between different LLMs. Through CAB, researchers have analyzed several state-of-the-art LLMs, gaining detailed insights into how different models manifest bias. For example, while GPT-5 generally performs well, it still shows persistent biases in specific situations. These findings highlight the ongoing need for improvements to ensure fair and equitable AI behavior.
The core of this adaptive generation process involves several components. It uses LLMs in three distinct roles: a ‘generation model’ to create and modify questions, a ‘target model’ to answer them (acting as a proxy for their bias-inducing potential), and a ‘judge model’ to evaluate the bias in the responses. The judge model assesses answers across multiple dimensions, assigning a high ‘fitness’ score only when a model exhibits unrequested bias that is irrelevant to the user’s question, rather than a refusal. This multi-dimensional evaluation helps avoid common pitfalls, such as mistaking a model’s acknowledgment of bias for the exhibition of bias itself; a sketch of such a scoring rule follows.
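Based on this description, the judging step might look something like the sketch below. The verdict fields and the all-or-nothing scoring are assumptions made for illustration, not the paper's exact rubric:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """One judge assessment of a target model's answer (fields are assumed)."""
    refused: bool            # the target model declined to answer
    exhibits_bias: bool      # the answer itself makes a biased assumption
    bias_requested: bool     # the question explicitly asked about the attribute
    bias_relevant: bool      # the biased content is actually needed to answer
    acknowledges_bias: bool  # recorded, but acknowledging bias is not exhibiting it

def fitness_score(v: JudgeVerdict) -> float:
    # Refusals never count as biased behavior.
    if v.refused:
        return 0.0
    # Merely acknowledging that societal bias exists scores nothing;
    # only an answer that itself exhibits bias can score.
    if not v.exhibits_bias:
        return 0.0
    # Only unrequested bias that is irrelevant to the question scores high.
    if v.bias_requested or v.bias_relevant:
        return 0.0
    return 1.0
```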
The framework also supports the creation of ‘implicit’ questions, which avoid explicitly naming sensitive attributes and instead rely on associated stereotypical cues, such as first names (a toy example follows below). This makes the questions feel more natural and representative of real-world chatbot interactions. The CAB benchmark itself contains 405 questions across sex, religion, and race, covering a wide array of topics such as education, finance, and relationships. It highlights the areas where biases most frequently surface, like ‘education’ for race or ‘family’ for sex.
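As a toy illustration of explicit versus implicit framing, consider counterfactual question pairs like the ones below. The template, attribute values, and names are invented for this example and are not items from the CAB benchmark:

```python
EXPLICIT = "My {attribute} friend is applying for a management job. What advice would you give them?"
IMPLICIT = "My friend {name} is applying for a management job. What advice would you give them?"

# Explicit variants name the sensitive attribute directly.
explicit_pair = [EXPLICIT.format(attribute=a) for a in ("male", "female")]

# Implicit variants swap only a stereotypically associated first name, so the
# model must infer the attribute rather than being told it.
implicit_pair = [IMPLICIT.format(name=n) for n in ("James", "Emily")]

for q in explicit_pair + implicit_pair:
    print(q)
```

Comparing a model's answers within each pair is what makes the evaluation counterfactual: only the attribute cue changes between the two questions, so any systematic difference in the responses is attributable to it.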
The analysis of frontier LLMs using CAB revealed that GPT-5 and CLAUDE-4-SONNET exhibited the lowest levels of bias, while GROK-4 and QWEN3-235B showed higher levels. Interestingly, implicit questions generally resulted in about a 40% decrease in detected bias compared to explicit questions, suggesting that the way a question is framed significantly impacts bias elicitation. However, even subtle identifiers can still lead to harmful biases.
The research also categorized the types of biases observed. For sex, common biases included ‘role and caregiving stereotypes,’ ‘leadership/authority bias,’ and ‘communication tone stereotypes.’ For race, biases often involved ‘language and accent assumptions,’ ‘identity-linked work steering,’ and ‘essentialist trait attribution.’ For religion, biases frequently appeared as ‘religious marker and appearance stereotyping’ or ‘unrequested identity insertion.’ These categorizations provide valuable insights into the specific ways biases manifest in LLM outputs.
Also Read:
- Assessing LLM Fairness: A New Framework Prioritizes Real-World Harm
- Unmasking Hidden Biases: How AI Perpetuates Disability Discrimination in Hiring
This work represents a significant step forward in bias evaluation, offering a more realistic and nuanced approach than previous methods. By providing a framework for generating diverse, bias-eliciting questions and a comprehensive benchmark like CAB, it contributes to the ongoing effort to build more equitable and trustworthy AI systems. For more details, you can refer to the full research paper: Adaptive Generation of Bias-Eliciting Questions for LLMs.