
Unmasking LLM Bias: A New Benchmark Reveals Factual Inconsistencies Across User Demographics

TL;DR: ConsistencyAI is a new benchmark that evaluates whether large language models (LLMs) provide consistent factual answers to users from different demographic groups. The study queried 19 LLMs across 15 topics with 100 different personas, measuring factual overlap using cosine similarity. It found significant variations in consistency based on both the LLM provider and the topic, with Grok-3 being the most consistent overall. Controversial topics like the job market and geopolitical conflicts showed lower consistency, and some models even refused to answer certain sensitive questions, suggesting potential self-censorship. The benchmark aims to promote more reliable and trustworthy AI systems by highlighting these inconsistencies.

In an era where large language models (LLMs) are increasingly becoming a primary source of information for millions, a critical question arises: Do these AI systems provide consistent facts to everyone, regardless of who is asking? A new independent benchmark, ConsistencyAI, dives deep into this concern, revealing how LLMs might subtly alter factual responses based on perceived user demographics.

The research paper, titled “ConsistencyAI: A Benchmark to Assess LLMs’ Factual Consistency When Responding to Different Demographic Groups”, was authored by Peter Banyas, Shristi Sharma, Alistair Simmons, and Atharva Vispute. Their work highlights a crucial aspect of AI reliability and fairness that extends beyond simple fact-checking.

The Core Problem: Factual Inconsistency Across Personas

LLMs are transforming how we access information, offering conversational interfaces that synthesize facts and current events. However, if these models present different facts to different demographic groups—or ‘personas’—they could inadvertently contribute to selective information exposure and reinforce divergent worldviews. This isn’t about whether a fact is true or false, but whether the *set of facts* presented remains stable across various users.

The ConsistencyAI benchmark was designed to measure this very phenomenon. It tests whether, when users from different demographic backgrounds ask identical questions, an LLM responds with factually inconsistent answers. This is vital because LLMs are mediating access to information for a significant portion of the global population, and any tailoring of facts based on presumed ideological predispositions could fragment collective social discourse.

Policy and Societal Relevance

The issue of ideological bias in AI systems is a major policy concern. Initiatives like America’s AI Action Plan and executive orders emphasize the need for ideological neutrality and objectivity in LLMs. The goal is for AI systems to pursue objective truth rather than social engineering agendas. If an LLM tailors its factual responses based on demographic cues, it raises serious questions about fairness, transparency, and the integrity of information in a democratic society.

How ConsistencyAI Works

The benchmark’s methodology involved querying 19 different LLMs with prompts requesting five facts on each of 15 topics. Each topic query was repeated 100 times per LLM, each time prepending prompt context from a different persona drawn from a set modeling the general population (a sketch of this loop follows below). The personas were synthetically generated but grounded in US Census data and demographic distributions, ensuring diversity across attributes such as age, gender, occupation, and geographic region.
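To make the setup concrete, here is a minimal sketch of what such a persona-conditioned query loop might look like in Python. Everything in it is an assumption for illustration: the prompt wording, the persona fields, and the `query_llm` stub are not the authors' actual code.

```python
# A minimal sketch of the persona-conditioned query loop described above.
# Illustrative only: prompt wording, persona fields, and the query_llm
# stub are assumptions, not the authors' actual implementation.

TOPICS = ["job market", "G7 world leaders", "vaccines"]  # the study used 15 topics

def query_llm(model: str, prompt: str) -> str:
    # Stand-in for a real provider API call (OpenAI, Anthropic, xAI, etc.).
    raise NotImplementedError("wire up a provider client here")

def build_prompt(persona: dict, topic: str) -> str:
    # Prepend persona context, then ask for five facts on the topic.
    context = (
        f"I am a {persona['age']}-year-old {persona['gender']} working as "
        f"a {persona['occupation']} in {persona['region']}."
    )
    return f"{context} Please give me five facts about {topic}."

def run_benchmark(models: list[str], personas: list[dict]) -> dict:
    # 19 models x 15 topics x 100 personas = 28,500 queries in the study.
    responses = {}
    for model in models:
        for topic in TOPICS:
            for persona in personas:
                key = (model, topic, persona["id"])
                responses[key] = query_llm(model, build_prompt(persona, topic))
    return responses
```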

The responses were then processed into sentence embeddings, and cross-persona cosine similarity was computed. A higher cosine similarity score indicates greater factual consistency, meaning the model preserves a stable factual core regardless of the audience. The researchers adopted the across-model mean of 0.8656 as a practical industry baseline for factual consistency.
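The scoring step can be illustrated with a short Python sketch. This assumes the sentence-transformers library; the paper's exact embedding model is not specified here, so `all-MiniLM-L6-v2` is an illustrative choice.

```python
# A minimal sketch of the consistency-scoring step, assuming the
# sentence-transformers library; the embedding model is an illustrative
# choice, not necessarily the one used in the paper.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise cosine similarity across one model-topic's persona responses."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(responses)  # one vector per response
    sims = [
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(embeddings, 2)
    ]
    return float(np.mean(sims))  # closer to 1.0 = more consistent
```

Scores near 1.0 indicate that the factual core stays stable across personas; the across-model mean of 0.8656 reported in the study serves as the reference point.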

Key Findings: Who’s Consistent, Who’s Not?

The study revealed that factual consistency scores for the 19 tested LLMs ranged from 0.7896 to 0.9065. xAI’s Grok-3 emerged as the most consistent model, followed by Google’s Gemini-Flash-1.5, Anthropic’s Claude-3.5-Haiku, and xAI’s Grok-4. Interestingly, OpenAI’s models generally scored lower on overall consistency.

Consistency also varied significantly by topic. The ‘job market’ was found to be the least consistent topic, with all tested LLMs performing below their average scores. This suggests systemic challenges for models in presenting reliable information on volatile and complex subjects. In contrast, ‘G7 World Leaders’ was the most factually consistent topic, likely due to its stable factual baseline.

For controversial topics like ‘vaccines’ and the ‘Israeli–Palestinian conflict’, performance diverged sharply across providers. Some models showed increased consistency, while others experienced significant declines. This indicates that both the model provider and the specific topic play critical roles in shaping factual consistency.

The Challenge of Self-Censorship

A striking observation was the issue of LLM non-responsiveness. Out of 28,500 queries, 3.9% did not receive a valid response. A significant majority (78.7%) of these non-responses came from queries about the ‘Israeli–Palestinian Conflict’. Certain models, such as Deepseek-Chat-v3-0324, Gemma-3-4b-it, and Deepseek-r1, had disproportionately high non-response rates for this topic. This could suggest that models might be trained or designed to avoid certain controversial subjects, raising concerns about potential censorship.

Implications for the Future of AI

The ConsistencyAI benchmark underscores that factual consistency is not necessarily improving linearly with the release of newer, more advanced models. The study found no significant correlation between factual consistency and conventional performance indicators like release date or reasoning capacity.

The researchers recommend that LLM providers integrate safeguards at the system prompt level, explicitly directing models to present facts objectively regardless of user persona. Benchmarks like ConsistencyAI offer a crucial independent tool for researchers, developers, journalists, and the general public to evaluate and track improvements in LLM reliability and trustworthiness over time. You can explore the benchmark and its findings further in the research paper.
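As one way to picture that recommendation, here is a hypothetical system-prompt safeguard, sketched with the OpenAI Python client; the directive wording and the model name are illustrative assumptions, not prescriptions from the paper.

```python
# A hypothetical system-prompt safeguard along the lines recommended above.
# The directive text and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEUTRALITY_DIRECTIVE = (
    "Present facts objectively and identically for every user. Do not "
    "tailor, omit, or reframe factual content based on the user's stated "
    "demographics, occupation, location, or presumed viewpoints."
)

def answer(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": NEUTRALITY_DIRECTIVE},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```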

Rhea Bhattacharya (https://blogs.edgentiq.com)
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
