Uncovering Linguistic Bias: How Language Affects LLM Performance in Math Education

TLDR: A new study reveals consistent linguistic bias in LLM-generated math solutions, with English outputs rated highest and Arabic lowest. Researchers developed an automated pipeline to evaluate GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus across English, German, and Arabic math problems. The findings highlight the need for more equitable multilingual AI systems in education, as the performance disparities appear to be influenced by linguistic complexity and training data biases.

Large Language Models (LLMs) are rapidly becoming integral to educational support, offering explanations and problem-solving guidance, especially in subjects like mathematics. However, a significant concern arises from the fact that most commercial LLMs are predominantly trained on English-centric data. This can lead to varying response quality depending on the language of interaction, potentially creating an imbalance in educational support for students globally.

A recent study, titled “Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs,” by Mariam Mahran and Katharina Simbeck, delves into this critical issue. The researchers developed an automated multilingual pipeline to generate, solve, and evaluate math problems aligned with the German K–10 curriculum. This comprehensive framework allowed for consistent comparison of LLM performance across different languages.

The study involved generating 628 math exercises, which were then translated into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions for these problems in each language. To assess solution quality impartially, a panel of LLM judges, including Claude 3.5 Haiku, evaluated the solutions using a comparative framework. The panel followed a “held-out” strategy: the model being evaluated was excluded from the judging panel for its own outputs.
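The held-out idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the model identifier strings, the `held_out_panel` helper, and the assumption that the judge pool consists of the three solver models plus Claude 3.5 Haiku are all hypothetical.

```python
# Hypothetical identifiers for the three solver models named in the article.
SOLVERS = ("gpt-4o-mini", "gemini-2.5-flash", "qwen-plus")

# Assumed judge pool: the solvers plus Claude 3.5 Haiku. The article only says
# the panel "included" Claude 3.5 Haiku, so the exact composition is a guess.
JUDGE_POOL = SOLVERS + ("claude-3.5-haiku",)


def held_out_panel(solver, pool=JUDGE_POOL):
    """Return every judge in the pool except the model that wrote the solution."""
    return [judge for judge in pool if judge != solver]


for solver in SOLVERS:
    print(solver, "is judged by", held_out_panel(solver))
```

Under these assumptions, each solver's outputs are scored by three judges, and no model ever grades its own work.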

The findings revealed a notable and recurring gap in performance: English solutions were consistently rated highest, German solutions typically occupied a middle ground, and Arabic solutions often ranked lowest. For instance, GPT-4o-mini showed the strongest language preference, heavily favoring English and consistently rating Arabic lowest. While Qwen-plus followed a similar trend, Gemini 2.5 Flash exhibited a more balanced distribution, though English still led in top rankings.
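To make "led in top rankings" concrete, the following sketch tallies how often each language lands in each rank position. It assumes each judgment is an ordered list of language codes from best to worst, which is one plausible reading of the comparative framework; the `rank_distribution` helper and the toy judgments are invented for illustration and are not data from the study.

```python
from collections import Counter


def rank_distribution(rankings):
    """Count how often each language lands in each rank position (1 = best)."""
    counts = {}
    for ranking in rankings:
        for position, language in enumerate(ranking, start=1):
            counts.setdefault(language, Counter())[position] += 1
    return counts


# Toy judgments invented for illustration; NOT results from the study.
toy_rankings = [
    ["en", "de", "ar"],
    ["en", "ar", "de"],
    ["de", "en", "ar"],
]

dist = rank_distribution(toy_rankings)
print(dist["en"][1])  # first-place count for English in the toy data: 2
print(dist["ar"][3])  # last-place count for Arabic in the toy data: 2
```

Comparing these per-language counts across solver models is one simple way to see whether a model's distribution is skewed toward one language or more balanced.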

The evaluation criteria extended beyond mere correctness, focusing on clarity, structured reasoning, and appropriate use of mathematical terminology. English solutions frequently received praise for being “comprehensive,” “well-structured,” and “clear.” German responses often received mixed feedback, sometimes described as adequate but occasionally critiqued for being “less detailed” or assuming prior knowledge. Arabic solutions, unfortunately, had the highest concentration of negative sentiment, with justifications often noting a lack of clarity or depth in explanation.

The researchers suggest that variations in linguistic complexity likely contribute to these observed performance patterns. German, with its complex sentence structure and compound words, can pose challenges for LLMs. Arabic, with its rich morphology and right-to-left script, presents even greater difficulties. The morphological richness of Arabic can lead to shorter outputs that might be perceived as less elaborated compared to English or German responses. Furthermore, LLMs may lack sufficient exposure to formal, school-style Arabic instructional texts, leading them to misinterpret concise Arabic solutions as less educationally complete.

The implications of these findings are significant for education. Teachers cannot assume that LLMs will provide equal support across all languages; stronger guidance may be available in English than in other languages. Therefore, LLMs should be viewed as supplementary tools, ideally with teacher oversight or cross-lingual checks. The automated pipeline developed in this study also serves as a valuable analytical resource, helping educators identify curriculum areas where linguistic gaps in AI support pose the greatest risk. This allows for adaptive instruction to promote more equitable learning in multilingual classrooms.

In conclusion, this study underscores the importance of developing more linguistically inclusive AI systems. As LLMs become more integrated into educational platforms, addressing these disparities is crucial to prevent the reinforcement of existing educational inequalities. Future work should prioritize improving LLM performance in underrepresented languages through targeted fine-tuning, diverse training data, and culturally informed prompt design. For more details, see the full research paper.
