
Unmasking LLM Bias: A New Benchmark Reveals Factual Inconsistencies Across User Demographics

TL;DR: ConsistencyAI is a new benchmark that evaluates whether large language models (LLMs) provide consistent factual answers to users from different demographic groups. The study queried 19 LLMs across 15 topics with 100 different personas, measuring factual overlap using cosine similarity. It found significant variations in consistency based on both the LLM provider and the topic, with Grok-3 being the most consistent overall. Controversial topics like the job market and geopolitical conflicts showed lower consistency, and some models even refused to answer certain sensitive questions, suggesting potential self-censorship. The benchmark aims to promote more reliable and trustworthy AI systems by highlighting these inconsistencies.

In an era where large language models (LLMs) are increasingly becoming a primary source of information for millions, a critical question arises: Do these AI systems provide consistent facts to everyone, regardless of who is asking? A new independent benchmark, ConsistencyAI, dives deep into this concern, revealing how LLMs might subtly alter factual responses based on perceived user demographics.

The research paper, titled “ConsistencyAI: A Benchmark to Assess LLMs’ Factual Consistency When Responding to Different Demographic Groups”, was authored by Peter Banyas, Shristi Sharma, Alistair Simmons, and Atharva Vispute. Their work highlights a crucial aspect of AI reliability and fairness that extends beyond simple fact-checking.

The Core Problem: Factual Inconsistency Across Personas

LLMs are transforming how we access information, offering conversational interfaces that synthesize facts and current events. However, if these models present different facts to different demographic groups—or ‘personas’—they could inadvertently contribute to selective information exposure and reinforce divergent worldviews. This isn’t about whether a fact is true or false, but whether the *set of facts* presented remains stable across various users.

The ConsistencyAI benchmark was designed to measure this very phenomenon. It tests whether, when users from different demographic backgrounds ask identical questions, an LLM responds with factually inconsistent answers. This is vital because LLMs are mediating access to information for a significant portion of the global population, and any tailoring of facts based on presumed ideological predispositions could fragment collective social discourse.

Policy and Societal Relevance

The issue of ideological bias in AI systems is a major policy concern. Initiatives like America’s AI Action Plan and executive orders emphasize the need for ideological neutrality and objectivity in LLMs. The goal is for AI systems to pursue objective truth rather than social engineering agendas. If an LLM tailors its factual responses based on demographic cues, it raises serious questions about fairness, transparency, and the integrity of information in a democratic society.

How ConsistencyAI Works

The benchmark’s methodology involved querying 19 different LLMs with prompts requesting five facts on each of 15 topics. Each topic query was repeated 100 times per LLM, each time prepending prompt context from a different persona drawn from a set modeling the general population (a sketch of this loop follows below). The personas were synthetically generated but grounded in US Census data and demographic distributions, ensuring diversity across attributes such as age, gender, occupation, and geographic region.
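To make the setup concrete, here is a minimal sketch of what such a persona-conditioned query loop might look like in Python. Everything in it is an assumption for illustration: the prompt wording, the persona fields, and the `query_llm` stub are not the authors' actual code.

```python
# A minimal sketch of the persona-conditioned query loop described above.
# Illustrative only: prompt wording, persona fields, and the query_llm
# stub are assumptions, not the authors' actual implementation.

TOPICS = ["job market", "G7 world leaders", "vaccines"]  # the study used 15 topics

def query_llm(model: str, prompt: str) -> str:
    # Stand-in for a real provider API call (OpenAI, Anthropic, xAI, etc.).
    raise NotImplementedError("wire up a provider client here")

def build_prompt(persona: dict, topic: str) -> str:
    # Prepend persona context, then ask for five facts on the topic.
    context = (
        f"I am a {persona['age']}-year-old {persona['gender']} working as "
        f"a {persona['occupation']} in {persona['region']}."
    )
    return f"{context} Please give me five facts about {topic}."

def run_benchmark(models: list[str], personas: list[dict]) -> dict:
    # 19 models x 15 topics x 100 personas = 28,500 queries in the study.
    responses = {}
    for model in models:
        for topic in TOPICS:
            for persona in personas:
                key = (model, topic, persona["id"])
                responses[key] = query_llm(model, build_prompt(persona, topic))
    return responses
```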

The responses were then processed into sentence embeddings, and cross-persona cosine similarity was computed. A higher cosine similarity score indicates greater factual consistency, meaning the model preserves a stable factual core regardless of the audience. The researchers adopted the across-model mean of 0.8656 as a practical industry baseline for factual consistency.
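The scoring step can be illustrated with a short Python sketch. This assumes the sentence-transformers library; the paper's exact embedding model is not specified here, so `all-MiniLM-L6-v2` is an illustrative choice.

```python
# A minimal sketch of the consistency-scoring step, assuming the
# sentence-transformers library; the embedding model is an illustrative
# choice, not necessarily the one used in the paper.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise cosine similarity across one model-topic's persona responses."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(responses)  # one vector per response
    sims = [
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(embeddings, 2)
    ]
    return float(np.mean(sims))  # closer to 1.0 = more consistent
```

Scores near 1.0 indicate that the factual core stays stable across personas; the across-model mean of 0.8656 reported in the study serves as the reference point.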

Key Findings: Who’s Consistent, Who’s Not?

The study revealed that factual consistency scores for the 19 tested LLMs ranged from 0.7896 to 0.9065. xAI’s Grok-3 emerged as the most consistent model, followed by Google’s Gemini-Flash-1.5, Anthropic’s Claude-3.5-Haiku, and xAI’s Grok-4. Interestingly, OpenAI’s models generally scored lower on overall consistency.

Consistency also varied significantly by topic. The ‘job market’ was found to be the least consistent topic, with all tested LLMs performing below their average scores. This suggests systemic challenges for models in presenting reliable information on volatile and complex subjects. In contrast, ‘G7 World Leaders’ was the most factually consistent topic, likely due to its stable factual baseline.

For controversial topics like ‘vaccines’ and the ‘Israeli–Palestinian conflict’, performance diverged sharply across providers. Some models showed increased consistency, while others experienced significant declines. This indicates that both the model provider and the specific topic play critical roles in shaping factual consistency.

The Challenge of Self-Censorship

A striking observation was the issue of LLM non-responsiveness. Out of 28,500 queries, 3.9% did not receive a valid response. A significant majority (78.7%) of these non-responses came from queries about the ‘Israeli–Palestinian Conflict’. Certain models, such as Deepseek-Chat-v3-0324, Gemma-3-4b-it, and Deepseek-r1, had disproportionately high non-response rates for this topic. This could suggest that models might be trained or designed to avoid certain controversial subjects, raising concerns about potential censorship.

Implications for the Future of AI

The ConsistencyAI benchmark underscores that factual consistency is not necessarily improving linearly with the release of newer, more advanced models. The study found no significant correlation between factual consistency and conventional performance indicators like release date or reasoning capacity.

The researchers recommend that LLM providers integrate safeguards at the system prompt level, explicitly directing models to present facts objectively regardless of user persona. Benchmarks like ConsistencyAI offer a crucial independent tool for researchers, developers, journalists, and the general public to evaluate and track improvements in LLM reliability and trustworthiness over time. You can explore the benchmark and its findings further in the research paper.
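As one way to picture that recommendation, here is a hypothetical system-prompt safeguard, sketched with the OpenAI Python client; the directive wording and the model name are illustrative assumptions, not prescriptions from the paper.

```python
# A hypothetical system-prompt safeguard along the lines recommended above.
# The directive text and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEUTRALITY_DIRECTIVE = (
    "Present facts objectively and identically for every user. Do not "
    "tailor, omit, or reframe factual content based on the user's stated "
    "demographics, occupation, location, or presumed viewpoints."
)

def answer(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": NEUTRALITY_DIRECTIVE},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```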

Rhea Bhattacharya (https://blogs.edgentiq.com)
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
