
Unpacking Linguistic Bias: A New Game-Based Approach to Evaluating LLM Abstract Reasoning Across Languages

TLDR: A new research paper introduces GLOBALGROUP, a multilingual word grouping game inspired by NYT Connections, to evaluate large language models’ (LLMs) abstract reasoning abilities across English, Spanish, Chinese, Hindi, and Arabic. The study found a significant bias towards English performance, better results on non-culturally-related topics, and that multilingual-focused training (like in Aya-8B) can enable smaller open-source models to compete with larger ones. It also validated new metrics for game difficulty, providing a clearer understanding of LLM reasoning in diverse linguistic settings.

Large language models, or LLMs, are increasingly powerful, but their performance can vary with the language they operate in. This phenomenon, known as linguistic bias, means a model may perform better on a task in one language than in another, even when the content is similar. While many studies have examined this bias in tasks requiring specific knowledge or strategies, such as math or commonsense reasoning, there has been less focus on abstract reasoning: the kind of ‘out-of-the-box’ thinking people use to identify patterns and solve problems without relying on fixed formulas.

To address this gap, researchers César Guerra-Solano, Zhuochun Li, and Xiang Lorraine Li from the University of Pittsburgh introduced a new evaluation method called GLOBALGROUP. This word grouping game, inspired by the popular New York Times Connections puzzle, is designed to assess LLMs’ abstract reasoning abilities across multiple languages. The full research paper is titled “Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games.”

What is GLOBALGROUP?

In GLOBALGROUP, an LLM is given a pool of words and must sort them into equal-sized groups, providing a unifying topic for each group. Unlike tasks that might rely on a model’s existing knowledge, this game emphasizes inductive reasoning and pattern recognition. The challenge comes from constraints, such as limiting group size, which forces the model to optimize its groupings by finding commonalities without disrupting other potential groups. This game-based format encourages lateral thinking, making it an ideal tool for evaluating abstract reasoning.
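The game format is easy to picture as a validity check: a legal answer must use every word in the pool exactly once, split into groups of the required size. A minimal Python sketch, where the function name and the toy words are illustrative (not from the paper's dataset):

```python
def is_valid_solution(word_pool, groups, group_size):
    """Check that a proposed grouping is a legal answer in a
    GLOBALGROUP-style game: every group has the required size, and
    together the groups use each word in the pool exactly once."""
    if any(len(g) != group_size for g in groups):
        return False
    proposed = [w for g in groups for w in g]
    return sorted(proposed) == sorted(word_pool)

# A toy 2-group, 3-word game (hypothetical words):
pool = ["red", "green", "blue", "dog", "cat", "fox"]
answer = [["red", "green", "blue"], ["dog", "cat", "fox"]]
print(is_valid_solution(pool, answer, 3))  # True
```

The equal-size constraint is what forces the lateral thinking described above: a model cannot dump leftover words into an oversized group, so it must find groupings that work globally.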

A Multilingual Benchmark

The researchers constructed a comprehensive benchmark with five linguistic backgrounds: English, Spanish, Chinese, Hindi, and Arabic. For comparison, they also created English translations of the non-English groupings. The dataset includes both author-created games and games derived from 511 New York Times Connections puzzles, allowing for diverse experimental settings.

To ensure a fair and controlled comparison, especially in reasoning evaluations, the study also proposed methods to measure game difficulty. Three key factors were identified:

  • Group Count: The number of groups in a game. More groups generally mean higher difficulty.
  • Adjusted Rand Index (ARI): a clustering-agreement score, used here as a proxy for the semantic similarity of the words within a group. Lower ARI scores (less semantic similarity) indicate a more difficult game, since the connections are more abstract.
  • Word Overlap: This refers to words that could reasonably fit into multiple potential groups. Higher overlap makes a game more challenging, as models need to discern the optimal grouping.
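As a concrete illustration of the ARI metric, the sketch below computes the standard pair-counting Adjusted Rand Index between two clusterings of the same words. How the paper derives the compared clusterings (for example, embedding-based clusters versus the gold groups) is an assumption here, so treat this as a generic implementation rather than the authors' exact pipeline:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items,
    computed from the pair-counting contingency table."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: trivial agreement
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

gold = [0, 0, 0, 1, 1, 1]    # gold group labels for six words
embed = [0, 0, 1, 1, 0, 1]   # clusters from hypothetical embeddings
print(adjusted_rand_index(gold, gold))   # 1.0: perfect agreement
print(adjusted_rand_index(gold, embed))  # lower: groups harder to recover
```

Under this reading, a low ARI means semantic similarity alone does not recover the gold groups, which matches the intuition that such games demand more abstract connections.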

Key Findings from the Evaluation

The study evaluated six LLMs: the closed-source GPT-3.5-Turbo and GPT-4, and the open-source Llama3-8B, Llama3.1-70B, Mistral-7B, and Aya-8B. Here are some of the significant observations:

  • English Bias: Across most models, performance significantly improved when non-English groupings were translated into English. This highlights a strong bias towards English language representations, suggesting limitations in multilingual reasoning capabilities even for similar content.
  • Cultural Impact: Models generally performed better on groupings labeled as “non-culturally-related” compared to “culturally-related” ones. This indicates a potential cultural influence on model performance and reasoning.
  • Difficulty of NYT Connections: Games derived from the New York Times Connections were consistently more challenging for all models than the custom-created English games, especially when evaluated on topic prediction. This suggests the NYT games incorporate more intricate design elements, possibly including intentional word overlap.
  • Open vs. Closed-Source Models: The research demonstrated the immense value of multilingual-focused training. Aya-8B, an open-source model of similar size to others, achieved comparable results to much larger closed- and open-source LLMs due to its multilingual training paradigm. Additionally, model size played a significant role in overall performance and in reducing linguistic bias. Interestingly, for non-English tasks like Arabic games, smaller open-source LLMs sometimes performed on par with larger, often more expensive, closed-source models.
  • Difficulty Metrics Validated: The proposed difficulty metrics (group count, Adjusted Rand Index, and word overlap) were found to correlate well with model performance, confirming their utility in assessing game difficulty.
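The reported link between difficulty metrics and model performance can be checked with ordinary statistics. The sketch below uses a plain Pearson correlation on hypothetical per-game numbers; both the data and the choice of Pearson are illustrative, and the paper's exact analysis may differ:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-game data: as word overlap rises, accuracy falls.
overlap = [0, 1, 2, 3, 4]
accuracy = [0.9, 0.8, 0.6, 0.5, 0.3]
print(pearson(overlap, accuracy))  # strongly negative
```

A strongly negative coefficient on data like this is the kind of evidence that would support word overlap as a valid difficulty signal.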


Conclusion

The GLOBALGROUP benchmark offers a valuable new tool for evaluating LLM abstract reasoning across languages. It effectively identifies language-related biases and highlights the importance of diverse language choices in training data for developing more equitable and capable language systems. The findings underscore that while LLMs have advanced significantly, there’s still a need to improve their abstract reasoning in diverse linguistic and cultural contexts.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
