TLDR: A new study finds consistent language bias in LLM-generated math solutions: English outputs were rated highest and Arabic outputs often ranked lowest. Researchers developed an automated pipeline to evaluate GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus on math problems in English, German, and Arabic. The findings highlight the need for more equitable multilingual AI systems in education, as the performance disparities appear to be influenced by linguistic complexity and training data biases.
Large Language Models (LLMs) are rapidly becoming integral to educational support, offering explanations and problem-solving guidance, especially in subjects like mathematics. However, most commercial LLMs are trained predominantly on English-centric data. Response quality can therefore vary with the language of interaction, potentially creating an imbalance in educational support for students globally.
A recent study, titled “Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs,” by Mariam Mahran and Katharina Simbeck, delves into this critical issue. The researchers developed an automated multilingual pipeline to generate, solve, and evaluate math problems aligned with the German K–10 curriculum. This comprehensive framework allowed for consistent comparison of LLM performance across different languages.
The study involved generating 628 math exercises, which were then translated into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions for these problems in each language. To keep the assessment of solution quality impartial, a panel of LLM judges, including Claude 3.5 Haiku, compared the solutions under a “held-out” strategy: the model whose outputs were being evaluated was excluded from the judging panel for those outputs.
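The held-out idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the model names come from the article, but the exact panel composition, the function names `held_out_panel` and `top_language`, and the majority-vote aggregation are assumptions made here for illustration only.

```python
from collections import Counter

# Model names are taken from the article. Assumption: every solver also
# serves as a judge, alongside Claude 3.5 Haiku. The real panel may differ.
MODELS = ["GPT-4o-mini", "Gemini 2.5 Flash", "Qwen-plus", "Claude 3.5 Haiku"]


def held_out_panel(solver, judges=MODELS):
    """Return the judges allowed to rate `solver`'s outputs (all but itself)."""
    return [judge for judge in judges if judge != solver]


def top_language(votes):
    """Hypothetical aggregation: majority vote over each judge's top-ranked
    language. Returns None when there are no votes or the top spot is tied."""
    counts = Counter(votes.values()).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


if __name__ == "__main__":
    # Qwen-plus never judges its own solutions.
    print(held_out_panel("Qwen-plus"))
    # Placeholder votes for illustration, not data from the study.
    print(top_language({"judge_a": "English", "judge_b": "English", "judge_c": "German"}))
```

Excluding the solver from its own panel is one simple way to keep a model from grading its own work. A real pipeline would also need the judging prompts and the comparison format, which the article does not specify.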
The findings revealed a consistent gap in performance: English solutions were rated highest, Arabic solutions often ranked lowest, and German solutions typically fell in between. GPT-4o-mini showed the strongest language effect, with its English solutions heavily favored and its Arabic solutions consistently ranked lowest. Qwen-plus followed a similar trend, while Gemini 2.5 Flash exhibited a more balanced distribution, though English still led in top rankings.
The evaluation criteria extended beyond mere correctness, focusing on clarity, structured reasoning, and appropriate use of mathematical terminology. English solutions frequently received praise for being “comprehensive,” “well-structured,” and “clear.” German responses often received mixed feedback, sometimes described as adequate but occasionally critiqued for being “less detailed” or assuming prior knowledge. Arabic solutions, unfortunately, had the highest concentration of negative sentiment, with justifications often noting a lack of clarity or depth in explanation.
The researchers suggest that variations in linguistic complexity likely contribute to these observed performance patterns. German, with its complex sentence structure and compound words, can pose challenges for LLMs. Arabic, with its rich morphology and right-to-left script, presents even greater difficulties. The morphological richness of Arabic can lead to shorter outputs that might be perceived as less elaborated compared to English or German responses. Furthermore, LLMs may lack sufficient exposure to formal, school-style Arabic instructional texts, leading them to misinterpret concise Arabic solutions as less educationally complete.
The implications of these findings are significant for education. Teachers cannot assume that LLMs will provide equal support across all languages; stronger guidance may be available in English than in other languages. Therefore, LLMs should be viewed as supplementary tools, ideally with teacher oversight or cross-lingual checks. The automated pipeline developed in this study also serves as a valuable analytical resource, helping educators identify curriculum areas where linguistic gaps in AI support pose the greatest risk. This allows for adaptive instruction to promote more equitable learning in multilingual classrooms.
In conclusion, this study underscores the importance of developing more linguistically inclusive AI systems. As LLMs become more integrated into educational platforms, addressing these disparities is crucial to prevent the reinforcement of existing educational inequalities. Future work should prioritize improving LLM performance in underrepresented languages through targeted fine-tuning, diverse training data, and culturally informed prompt design.


