
Evaluating Large Language Models’ Grasp of Nonbinary Pronouns

TL;DR: A new study, MISGENDERED+, updates the evaluation of Large Language Models (LLMs) on handling nonbinary and gender-neutral pronouns. Benchmarking GPT-4o, Claude 4, DeepSeek-V3, and Qwen models, the research introduces a novel task: inferring gender identity from pronoun usage. Findings show significant improvements in modern LLMs, especially with few-shot prompting, but highlight persistent challenges with rare neopronouns and name-based biases, underscoring the need for more inclusive training data and evaluation methods.

Large Language Models, or LLMs, are becoming increasingly common in various applications, including those where fairness and inclusivity are paramount. A significant challenge for these systems is their handling of pronouns, especially gender-neutral pronouns and neopronouns. Previous evaluations, such as the MISGENDERED benchmark, highlighted considerable limitations in older LLMs regarding inclusive pronoun usage. However, that work relied on now-outdated models and had a limited scope.

A new study introduces MISGENDERED+, an updated and expanded benchmark designed to thoroughly evaluate how well LLMs handle pronouns. This research benchmarks five prominent LLMs: GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5. The evaluation covers various scenarios, including zero-shot (no examples provided), few-shot (a few examples provided), and a novel gender identity inference task.
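
To make the setup concrete, here is a minimal sketch of how zero-shot and few-shot prompts for the pronoun-filling task might be assembled. The template wording, pronoun declarations, and demonstration examples are illustrative assumptions, not the exact prompts used in MISGENDERED+.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction for the
# masked-pronoun task. Template wording and demonstrations are illustrative
# assumptions, not the exact prompts used in MISGENDERED+.

TEMPLATE = (
    "{name}'s pronouns are {declaration}. "
    "Fill in the blank with the correct pronoun: {sentence}"
)

# Hypothetical demonstrations for the few-shot condition.
FEW_SHOT = [
    (TEMPLATE.format(name="Sam", declaration="they/them/their/themself",
                     sentence="Sam said ___ would be late."), "they"),
    (TEMPLATE.format(name="Ari", declaration="xe/xem/xyr/xemself",
                     sentence="I waved at Ari and ___ waved back."), "xe"),
]

def build_prompt(name: str, declaration: str, sentence: str, few_shot: bool) -> str:
    """Assemble one evaluation prompt, optionally prefixed with demonstrations."""
    query = TEMPLATE.format(name=name, declaration=declaration, sentence=sentence)
    if not few_shot:
        return query
    demos = "\n".join(f"{q}\nAnswer: {a}" for q, a in FEW_SHOT)
    return f"{demos}\n{query}\nAnswer:"

print(build_prompt("Alex", "ze/zir/zir/zirself",
                   "Alex packed ___ bag before the trip.", few_shot=True))
```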

Understanding Pronoun Bias in LLMs

Pronoun bias in AI systems isn’t just a technical issue; it reflects broader societal inequities in how pronouns are used to acknowledge or disregard a person’s identity. Misgendering, the use of pronouns or names inconsistent with someone’s gender identity, can be a form of microaggression, causing distress and marginalization. The study distinguishes between three types of pronouns (a small lookup table of their grammatical forms follows the list):

  • Binary pronouns: Such as he/him and she/her, which are traditionally associated with male and female genders. These are common in training data, leading to higher accuracy but also reinforcing stereotypes.
  • Gender-neutral pronouns: Primarily singular they/them. While these are widely accepted for individuals outside the male/female binary or when gender is unknown, LLMs often struggle with their ambiguity because the same forms serve both singular and plural functions.
  • Neopronouns: Newer forms like xe/xem or ze/zir. These are used by individuals who feel existing pronouns don’t adequately express their identity. They are rare in training data, leading to significantly lower performance in LLMs.
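
For reference, the sketch below encodes the grammatical forms evaluated later in the study (nominative, accusative, possessive, reflexive) as a simple lookup table. The paradigms shown are commonly cited ones and are assumptions for illustration, not the benchmark's full pronoun inventory.

```python
# Illustrative lookup table of grammatical forms for a few pronoun families.
# Paradigms are commonly cited ones, included as illustrative assumptions,
# not the benchmark's full inventory.

PRONOUN_FORMS = {
    "he":   ("he", "him", "his", "himself"),
    "she":  ("she", "her", "her", "herself"),
    "they": ("they", "them", "their", "themself"),
    "xe":   ("xe", "xem", "xyr", "xemself"),
    "ze":   ("ze", "zir", "zir", "zirself"),
}

FORM_NAMES = ("nominative", "accusative", "possessive", "reflexive")

def forms_of(family: str) -> dict:
    """Return a form-name -> surface-form mapping for one pronoun family."""
    return dict(zip(FORM_NAMES, PRONOUN_FORMS[family]))

print(forms_of("xe"))
# {'nominative': 'xe', 'accusative': 'xem', 'possessive': 'xyr', 'reflexive': 'xemself'}
```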

The original MISGENDERED benchmark focused on testing LLMs’ ability to fill in masked pronouns based on explicit declarations. It revealed very low accuracy for neopronouns (as low as 8%) in older models. However, this benchmark had limitations: it only tested one-directional pronoun prediction and used models that are now considered outdated, lacking modern alignment techniques.
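
As a sketch of how such a prediction could be scored, the snippet below applies a simple exact-match check to a model completion; the normalization choices are our assumptions rather than the paper's exact protocol.

```python
# Sketch of exact-match scoring for a masked-pronoun prediction, assuming
# the model returns a short free-text completion. Normalization steps
# (lowercasing, stripping punctuation) are our assumptions.

import string

def is_correct(model_output: str, gold_form: str) -> bool:
    """True if the first word of the completion matches the gold pronoun form."""
    words = model_output.strip().split()
    if not words:
        return False
    normalized = words[0].lower().strip(string.punctuation)
    return normalized == gold_form.lower()

assert is_correct("Xem.", "xem")
assert not is_correct("them", "xem")  # defaulting to they/them is a common error
```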

Introducing MISGENDERED+

To address these limitations, MISGENDERED+ expands the original dataset with new templates and diverse pronoun forms. A key innovation is the Gender Identity Inference task. Instead of predicting a pronoun, this task reverses the challenge: given a sentence with a pronoun and a name, the model must infer the most likely gender identity of the subject. This helps reveal if models correctly respect pronoun usage, even when a name might suggest a different gender, for example, inferring “non-binary” for “Alex” when “Xe” is used.
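
A minimal sketch of what a prompt for this reversed task could look like appears below; the label set and wording are illustrative assumptions, not the paper's exact phrasing.

```python
# Illustrative prompt for the gender identity inference task: the model
# must read the pronoun, not the name. Labels and wording are assumptions.

IDENTITY_LABELS = ("male", "female", "non-binary")

def inference_prompt(sentence: str) -> str:
    """Build a multiple-choice query asking for the subject's gender identity."""
    options = ", ".join(IDENTITY_LABELS)
    return (
        "Infer the subject's most likely gender identity from the pronoun "
        f"used in the sentence, not from the name.\nSentence: {sentence}\n"
        f"Options: {options}\nAnswer:"
    )

# A mismatched name-pronoun pair: the pronoun "xe" should drive the answer.
print(inference_prompt("Alex smiled because xe had passed the exam."))
```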

Key Findings from the Evaluation

The experiments revealed several important trends (a short scoring sketch follows the list):

  • Zero-shot vs. Few-shot Performance: Providing just a few examples (few-shot prompting) dramatically improved model performance across the board, especially for models like DeepSeek-V3 and Qwen variants. For instance, DeepSeek-V3’s accuracy on common pronouns jumped from around 20% to over 70% with few-shot examples. Top models like GPT-4o and Claude-4-Sonnet already performed exceptionally well in zero-shot settings but still saw marginal gains.
  • Grammatical Forms: Modern LLMs showed improved consistency across different grammatical forms (nominative, accusative, possessive, reflexive). Few-shot prompting was particularly crucial for boosting the performance of lower-performing models and achieving more balanced pronoun handling across these forms.
  • Gender Identity Inference: GPT-4o and Claude-4-Sonnet demonstrated near-perfect accuracy in inferring gender identity from pronoun usage, even with mismatched name-pronoun combinations. However, other models like Qwen-Turbo struggled with these mismatches, indicating a persistent bias towards name-based gender assumptions.
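
As a rough sketch of how such breakdowns can be computed, the snippet below aggregates exact-match correctness by pronoun category and by grammatical form; the record fields are assumed for illustration.

```python
# Rough sketch of the reported breakdowns: exact-match accuracy grouped by
# pronoun category and by grammatical form. Record fields are assumptions.

from collections import defaultdict

def accuracy_by(records: list[dict], key: str) -> dict[str, float]:
    """Mean correctness of records, grouped by the value of `key`."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {group: hits[group] / totals[group] for group in totals}

records = [
    {"category": "neopronoun", "form": "accusative", "correct": True},
    {"category": "neutral",    "form": "nominative", "correct": False},
    {"category": "binary",     "form": "reflexive",  "correct": True},
]
print(accuracy_by(records, "category"))  # per-pronoun-type accuracy
print(accuracy_by(records, "form"))      # per-grammatical-form accuracy
```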

Comparing with Past Studies and Future Outlook

Compared to the 2023 MISGENDERED study, the new evaluation shows significant improvements in LLMs’ handling of neopronouns and grammatical consistency. Modern models like GPT-4o and Claude-4-Sonnet now achieve over 95% accuracy on most neopronouns, a substantial leap from the 75% accuracy seen in the best models of 2023. This progress is attributed to larger model sizes, improved training datasets, and better instruction tuning.

Despite these advancements, challenges remain. Generalization across all pronoun types is still incomplete, especially for very rare neopronouns. Name-based gender biases persist in some models, overriding explicit pronoun cues. Furthermore, diverse, high-quality training data for inclusive pronoun usage remains scarce, and evaluation metrics are still ambiguous.

Future research should focus on augmenting training data with more gender-diverse narratives, developing probabilistic models for pronoun preferences, and involving queer, trans, and non-binary communities in designing benchmarks. This will ensure that LLMs are not only technically accurate but also socially respectful and inclusive. For more details, see the full research paper.

Rhea Bhattacharya
https://blogs.edgentiq.com
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach out to her at: [email protected]
