TLDR: A new research paper introduces a Counterfactual Bias Evaluation Framework that automatically generates realistic, open-ended questions to detect subtle biases in Large Language Models (LLMs). The framework, used to build the human-verified CAB benchmark, iteratively refines questions to elicit the most biased responses across sensitive attributes such as sex, race, and religion. The study found that while models like GPT-5 show lower bias, persistent issues remain, and the framing of a question (explicit vs. implicit) significantly affects how much bias is detected. The research also provides a detailed taxonomy of bias types, aiming to improve fairness in AI.
Large language models (LLMs) are now a common part of our daily lives, used by hundreds of millions of people across a wide range of applications. As our reliance on these AI systems grows, so do concerns about the biases embedded in them. These biases can systematically disadvantage or stereotype certain groups, often without users even realizing it.
Traditional methods for evaluating bias in LLMs often fall short. Many existing benchmarks use templated prompts or restrictive multiple-choice questions that are too simplistic and don’t reflect the complex ways people interact with LLMs in the real world. These methods can also struggle to differentiate between a model exhibiting bias and a model simply acknowledging the existence of societal biases.
To address these limitations, researchers have introduced a new framework for evaluating bias: the Counterfactual Bias Evaluation Framework. This approach automatically generates realistic, open-ended questions designed to elicit biased behavior in LLMs, focusing on sensitive attributes such as sex, race, and religion.
The framework works by iteratively modifying and selecting the questions most likely to induce biased responses (a minimal sketch of this loop follows below). This systematic exploration helps uncover the areas where models are most vulnerable to biased behavior. Beyond detecting harmful biases, the framework also captures other important aspects of user interactions, such as when models refuse to answer certain questions or explicitly acknowledge bias.
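To make this concrete, here is a minimal Python sketch of such a mutate-and-select loop. Every function and parameter here is a hypothetical stand-in for a real LLM API call, chosen for illustration; this is a sketch of the idea, not the paper's actual implementation:

```python
import random

def call_generation_model(question: str) -> str:
    """Stand-in for an LLM call that rewrites a question into a variant."""
    return question + " (variant)"

def call_target_model(question: str) -> str:
    """Stand-in for the model under evaluation answering the question."""
    return "answer to: " + question

def call_judge_model(question: str, answer: str) -> float:
    """Stand-in for a judge LLM scoring bias in the answer, in [0, 1]."""
    return random.random()

def evolve_questions(seed_questions: list[str], rounds: int = 5, keep_top: int = 10) -> list[str]:
    pool = list(seed_questions)
    for _ in range(rounds):
        # Generate a variant of every surviving question.
        candidates = pool + [call_generation_model(q) for q in pool]
        # Fitness of a question = judged bias of the target model's answer to it.
        scored = sorted(
            ((call_judge_model(q, call_target_model(q)), q) for q in candidates),
            reverse=True,
        )
        # Keep the questions most likely to induce biased responses.
        pool = [q for _, q in scored[:keep_top]]
    return pool
```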
Using this framework, a new human-verified benchmark called CAB (Counterfactual Assessment of Bias) has been developed. CAB includes a diverse range of topics and is designed to allow for direct comparisons between different LLMs. Through CAB, researchers have analyzed several state-of-the-art LLMs, gaining detailed insights into how different models manifest bias. For example, while GPT-5 generally performs well, it still shows persistent biases in specific situations. These findings highlight the ongoing need for improvements to ensure fair and equitable AI behavior.
The core of this adaptive generation process involves several components. It uses LLMs in three distinct roles: a ‘generation model’ to create and modify questions, a ‘target model’ to answer them (acting as a proxy for their bias-inducing potential), and a ‘judge model’ to evaluate the bias in the responses. The judge model assesses answers across multiple dimensions, assigning a high ‘fitness’ score only when a model exhibits unrequested bias that is irrelevant to the user’s question, rather than a refusal. This multi-dimensional evaluation helps avoid common pitfalls, such as mistaking a model’s acknowledgment of bias for the exhibition of bias itself; a sketch of such a scoring rule follows.
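Based on this description, the judging step might look something like the sketch below. The verdict fields and the all-or-nothing scoring are assumptions made for illustration, not the paper's exact rubric:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """One judge assessment of a target model's answer (fields are assumed)."""
    refused: bool            # the target model declined to answer
    exhibits_bias: bool      # the answer itself makes a biased assumption
    bias_requested: bool     # the question explicitly asked about the attribute
    bias_relevant: bool      # the biased content is actually needed to answer
    acknowledges_bias: bool  # recorded, but acknowledging bias is not exhibiting it

def fitness_score(v: JudgeVerdict) -> float:
    # Refusals never count as biased behavior.
    if v.refused:
        return 0.0
    # Merely acknowledging that societal bias exists scores nothing;
    # only an answer that itself exhibits bias can score.
    if not v.exhibits_bias:
        return 0.0
    # Only unrequested bias that is irrelevant to the question scores high.
    if v.bias_requested or v.bias_relevant:
        return 0.0
    return 1.0
```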
The framework also supports the creation of ‘implicit’ questions, which avoid explicitly naming sensitive attributes and instead rely on associated stereotypical cues, such as first names (a toy example follows below). This makes the questions feel more natural and representative of real-world chatbot interactions. The CAB benchmark itself contains 405 questions across sex, religion, and race, covering a wide array of topics such as education, finance, and relationships. It highlights the areas where biases most frequently surface, like ‘education’ for race or ‘family’ for sex.
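As a toy illustration of explicit versus implicit framing, consider counterfactual question pairs like the ones below. The template, attribute values, and names are invented for this example and are not items from the CAB benchmark:

```python
EXPLICIT = "My {attribute} friend is applying for a management job. What advice would you give them?"
IMPLICIT = "My friend {name} is applying for a management job. What advice would you give them?"

# Explicit variants name the sensitive attribute directly.
explicit_pair = [EXPLICIT.format(attribute=a) for a in ("male", "female")]

# Implicit variants swap only a stereotypically associated first name, so the
# model must infer the attribute rather than being told it.
implicit_pair = [IMPLICIT.format(name=n) for n in ("James", "Emily")]

for q in explicit_pair + implicit_pair:
    print(q)
```

Comparing a model's answers within each pair is what makes the evaluation counterfactual: only the attribute cue changes between the two questions, so any systematic difference in the responses is attributable to it.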
The analysis of frontier LLMs using CAB revealed that GPT-5 and CLAUDE-4-SONNET exhibited the lowest levels of bias, while GROK-4 and QWEN3-235B showed higher levels. Interestingly, implicit questions generally resulted in about a 40% decrease in detected bias compared to explicit questions, suggesting that the way a question is framed significantly impacts bias elicitation. However, even subtle identifiers can still lead to harmful biases.
The research also categorized the types of biases observed. For sex, common biases included ‘role and caregiving stereotypes,’ ‘leadership/authority bias,’ and ‘communication tone stereotypes.’ For race, biases often involved ‘language and accent assumptions,’ ‘identity-linked work steering,’ and ‘essentialist trait attribution.’ For religion, biases frequently appeared as ‘religious marker and appearance stereotyping’ or ‘unrequested identity insertion.’ These categorizations provide valuable insights into the specific ways biases manifest in LLM outputs.
Also Read:
- Assessing LLM Fairness: A New Framework Prioritizes Real-World Harm
- Unmasking Hidden Biases: How AI Perpetuates Disability Discrimination in Hiring
This work represents a significant step forward in bias evaluation, offering a more realistic and nuanced approach than previous methods. By providing a framework for generating diverse, bias-eliciting questions and a comprehensive benchmark like CAB, it contributes to the ongoing effort to build more equitable and trustworthy AI systems. For more details, you can refer to the full research paper: Adaptive Generation of Bias-Eliciting Questions for LLMs.