
Evaluating Large Language Models’ Grasp of Nonbinary Pronouns

TL;DR: A new study, MISGENDERED+, updates the evaluation of Large Language Models (LLMs) on handling nonbinary and gender-neutral pronouns. Benchmarking GPT-4o, Claude 4, DeepSeek-V3, and Qwen models, the research introduces a novel task: inferring gender identity from pronoun usage. Findings show significant improvements in modern LLMs, especially with few-shot prompting, but highlight persistent challenges with rare neopronouns and name-based biases, underscoring the need for more inclusive training data and evaluation methods.

Large Language Models, or LLMs, are becoming increasingly common in various applications, including those where fairness and inclusivity are paramount. A significant challenge for these systems is their handling of pronouns, especially gender-neutral pronouns and neopronouns. Previous evaluations, such as the MISGENDERED benchmark, highlighted considerable limitations in older LLMs regarding inclusive pronoun usage. However, that work relied on now-outdated models and had a limited scope.

A new study introduces MISGENDERED+, an updated and expanded benchmark designed to thoroughly evaluate how well LLMs handle pronouns. This research benchmarks five prominent LLMs: GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5. The evaluation covers various scenarios, including zero-shot (no examples provided), few-shot (a few examples provided), and a novel gender identity inference task.
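
To make the setup concrete, here is a minimal sketch of how zero-shot and few-shot prompts for the pronoun-filling task might be assembled. The template wording, pronoun declarations, and demonstration examples are illustrative assumptions, not the exact prompts used in MISGENDERED+.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction for the
# masked-pronoun task. Template wording and demonstrations are illustrative
# assumptions, not the exact prompts used in MISGENDERED+.

TEMPLATE = (
    "{name}'s pronouns are {declaration}. "
    "Fill in the blank with the correct pronoun: {sentence}"
)

# Hypothetical demonstrations for the few-shot condition.
FEW_SHOT = [
    (TEMPLATE.format(name="Sam", declaration="they/them/their/themself",
                     sentence="Sam said ___ would be late."), "they"),
    (TEMPLATE.format(name="Ari", declaration="xe/xem/xyr/xemself",
                     sentence="I waved at Ari and ___ waved back."), "xe"),
]

def build_prompt(name: str, declaration: str, sentence: str, few_shot: bool) -> str:
    """Assemble one evaluation prompt, optionally prefixed with demonstrations."""
    query = TEMPLATE.format(name=name, declaration=declaration, sentence=sentence)
    if not few_shot:
        return query
    demos = "\n".join(f"{q}\nAnswer: {a}" for q, a in FEW_SHOT)
    return f"{demos}\n{query}\nAnswer:"

print(build_prompt("Alex", "ze/zir/zir/zirself",
                   "Alex packed ___ bag before the trip.", few_shot=True))
```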

Understanding Pronoun Bias in LLMs

Pronoun bias in AI systems isn’t just a technical issue; it reflects broader societal inequities in how pronouns are used to acknowledge or disregard a person’s identity. Misgendering, the use of pronouns or names inconsistent with someone’s gender identity, can be a form of microaggression, causing distress and marginalization. The study distinguishes between three types of pronouns (a small lookup table of their grammatical forms follows the list):

  • Binary pronouns: Such as he/him and she/her, which are traditionally associated with male and female genders. These are common in training data, leading to higher accuracy but also reinforcing stereotypes.
  • Gender-neutral pronouns: Primarily singular they/them. While these are widely accepted for individuals outside the male/female binary or when gender is unknown, LLMs often struggle with their ambiguity because the same forms serve both singular and plural functions.
  • Neopronouns: Newer forms like xe/xem or ze/zir. These are used by individuals who feel existing pronouns don’t adequately express their identity. They are rare in training data, leading to significantly lower performance in LLMs.
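
For reference, the sketch below encodes the grammatical forms evaluated later in the study (nominative, accusative, possessive, reflexive) as a simple lookup table. The paradigms shown are commonly cited ones and are assumptions for illustration, not the benchmark's full pronoun inventory.

```python
# Illustrative lookup table of grammatical forms for a few pronoun families.
# Paradigms are commonly cited ones, included as illustrative assumptions,
# not the benchmark's full inventory.

PRONOUN_FORMS = {
    "he":   ("he", "him", "his", "himself"),
    "she":  ("she", "her", "her", "herself"),
    "they": ("they", "them", "their", "themself"),
    "xe":   ("xe", "xem", "xyr", "xemself"),
    "ze":   ("ze", "zir", "zir", "zirself"),
}

FORM_NAMES = ("nominative", "accusative", "possessive", "reflexive")

def forms_of(family: str) -> dict:
    """Return a form-name -> surface-form mapping for one pronoun family."""
    return dict(zip(FORM_NAMES, PRONOUN_FORMS[family]))

print(forms_of("xe"))
# {'nominative': 'xe', 'accusative': 'xem', 'possessive': 'xyr', 'reflexive': 'xemself'}
```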

The original MISGENDERED benchmark focused on testing LLMs’ ability to fill in masked pronouns based on explicit declarations. It revealed very low accuracy for neopronouns (as low as 8%) in older models. However, this benchmark had limitations: it only tested one-directional pronoun prediction and used models that are now considered outdated, lacking modern alignment techniques.
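
As a sketch of how such a prediction could be scored, the snippet below applies a simple exact-match check to a model completion; the normalization choices are our assumptions rather than the paper's exact protocol.

```python
# Sketch of exact-match scoring for a masked-pronoun prediction, assuming
# the model returns a short free-text completion. Normalization steps
# (lowercasing, stripping punctuation) are our assumptions.

import string

def is_correct(model_output: str, gold_form: str) -> bool:
    """True if the first word of the completion matches the gold pronoun form."""
    words = model_output.strip().split()
    if not words:
        return False
    normalized = words[0].lower().strip(string.punctuation)
    return normalized == gold_form.lower()

assert is_correct("Xem.", "xem")
assert not is_correct("them", "xem")  # defaulting to they/them is a common error
```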

Introducing MISGENDERED+

To address these limitations, MISGENDERED+ expands the original dataset with new templates and diverse pronoun forms. A key innovation is the Gender Identity Inference task. Instead of predicting a pronoun, this task reverses the challenge: given a sentence with a pronoun and a name, the model must infer the most likely gender identity of the subject. This helps reveal if models correctly respect pronoun usage, even when a name might suggest a different gender, for example, inferring “non-binary” for “Alex” when “Xe” is used.
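
A minimal sketch of what a prompt for this reversed task could look like appears below; the label set and wording are illustrative assumptions, not the paper's exact phrasing.

```python
# Illustrative prompt for the gender identity inference task: the model
# must read the pronoun, not the name. Labels and wording are assumptions.

IDENTITY_LABELS = ("male", "female", "non-binary")

def inference_prompt(sentence: str) -> str:
    """Build a multiple-choice query asking for the subject's gender identity."""
    options = ", ".join(IDENTITY_LABELS)
    return (
        "Infer the subject's most likely gender identity from the pronoun "
        f"used in the sentence, not from the name.\nSentence: {sentence}\n"
        f"Options: {options}\nAnswer:"
    )

# A mismatched name-pronoun pair: the pronoun "xe" should drive the answer.
print(inference_prompt("Alex smiled because xe had passed the exam."))
```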

Key Findings from the Evaluation

The experiments revealed several important trends (a short scoring sketch follows the list):

  • Zero-shot vs. Few-shot Performance: Providing just a few examples (few-shot prompting) dramatically improved model performance across the board, especially for models like DeepSeek-V3 and Qwen variants. For instance, DeepSeek-V3’s accuracy on common pronouns jumped from around 20% to over 70% with few-shot examples. Top models like GPT-4o and Claude-4-Sonnet already performed exceptionally well in zero-shot settings but still saw marginal gains.
  • Grammatical Forms: Modern LLMs showed improved consistency across different grammatical forms (nominative, accusative, possessive, reflexive). Few-shot prompting was particularly crucial for boosting the performance of lower-performing models and achieving more balanced pronoun handling across these forms.
  • Gender Identity Inference: GPT-4o and Claude-4-Sonnet demonstrated near-perfect accuracy in inferring gender identity from pronoun usage, even with mismatched name-pronoun combinations. However, other models like Qwen-Turbo struggled with these mismatches, indicating a persistent bias towards name-based gender assumptions.
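
As a rough sketch of how such breakdowns can be computed, the snippet below aggregates exact-match correctness by pronoun category and by grammatical form; the record fields are assumed for illustration.

```python
# Rough sketch of the reported breakdowns: exact-match accuracy grouped by
# pronoun category and by grammatical form. Record fields are assumptions.

from collections import defaultdict

def accuracy_by(records: list[dict], key: str) -> dict[str, float]:
    """Mean correctness of records, grouped by the value of `key`."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {group: hits[group] / totals[group] for group in totals}

records = [
    {"category": "neopronoun", "form": "accusative", "correct": True},
    {"category": "neutral",    "form": "nominative", "correct": False},
    {"category": "binary",     "form": "reflexive",  "correct": True},
]
print(accuracy_by(records, "category"))  # per-pronoun-type accuracy
print(accuracy_by(records, "form"))      # per-grammatical-form accuracy
```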

Comparing with Past Studies and Future Outlook

Compared to the 2023 MISGENDERED study, the new evaluation shows significant improvements in LLMs’ handling of neopronouns and grammatical consistency. Modern models like GPT-4o and Claude-4-Sonnet now achieve over 95% accuracy on most neopronouns, a substantial leap from the 75% accuracy seen in the best models of 2023. This progress is attributed to larger model sizes, improved training datasets, and better instruction tuning.

Despite these advancements, challenges remain. Generalization across all pronoun types is still incomplete, especially for very rare neopronouns. Name-based gender biases persist in some models, overriding explicit pronoun cues. Furthermore, diverse, high-quality training data for inclusive pronoun usage remains scarce, and evaluation metrics are still ambiguous.

Future research should focus on augmenting training data with more gender-diverse narratives, developing probabilistic models for pronoun preferences, and involving queer, trans, and non-binary communities in designing benchmarks. This will ensure that LLMs are not only technically accurate but also socially respectful and inclusive. For more details, see the full research paper.

Rhea Bhattacharya
https://blogs.edgentiq.com
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach out to her at: [email protected]
