Bridging the Language Gap: Evaluating LLM Morality Across Global Contexts

TLDR: A study evaluated leading LLMs like GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4 on their moral and safety responses across five categories and six languages (English, Chinese, Spanish, Arabic, Hindi, Swahili). It found that GPT-5 performed best overall, while other models showed inconsistencies, especially in “trick” questions designed to bypass safety features. Notably, models often performed worse in high-resource languages for these trick questions, potentially because they were “tricked” by context, whereas in low-resource languages, they might simply refuse. The study highlights the critical need for better multilingual datasets and evaluation frameworks to ensure consistent LLM safety and ethical reasoning globally.

As large language models (LLMs) become an integral part of daily life across the globe, from drafting emails to generating recipes, a crucial question arises: how consistently and ethically do they respond across different languages and cultural contexts? A recent study, “Measuring Moral LLM Responses in Multilingual Capacities,” by Kimaya Basu, Savi Kolari, and Allison Yu, delves into this very challenge, revealing significant insights into the multilingual capabilities and safety features of leading AI models.

Understanding the Challenge

The widespread adoption of LLMs means that people worldwide rely on them for information. However, much of the existing data used to train and test these models is predominantly in English. This linguistic imbalance can lead to inconsistencies in LLM responses when prompts are given in other languages, particularly concerning sensitive areas like morality, ethics, and safety. Previous research has shown that LLM safety protocols, while improving, often falter when faced with non-English queries, struggling to detect malicious content. This study aimed to bridge this knowledge gap by examining how LLMs handle ethical, legal, and safety questions, and how consistent their responses are across a spectrum of languages.

How the Study Was Conducted

To rigorously evaluate LLM performance, the researchers developed a dataset of 500 English questions, categorized into five key areas: Biases & Stereotypes, Consent & Autonomy, Harm Prevention & Safety, Legality, and Moral Judgment. These categories were chosen to cover domains where ethical sensitivity and responsible decision-making are paramount. To test multilingual capacities, the English questions were then translated with the Googletrans Python package, giving six languages in total: English, three high-resource languages (Chinese, Spanish, Arabic), and two low-resource languages (Hindi, Swahili). This selection aimed to capture a wide variety of sentence structures, cultural contexts, and writing systems.
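The dataset construction described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the language codes, the `build_multilingual_set` helper, and the `fake_translate` stub are all assumptions made for the example. In a real run, the stub would be replaced by a call into a translation package such as Googletrans.

```python
from typing import Callable, Dict, List

# Assumed language codes; the article names only the languages themselves.
TARGET_LANGUAGES: Dict[str, str] = {
    "zh": "Chinese",
    "es": "Spanish",
    "ar": "Arabic",
    "hi": "Hindi",
    "sw": "Swahili",
}


def build_multilingual_set(
    questions: List[str],
    translate: Callable[[str, str], str],
) -> List[dict]:
    """Return one row per (question, language), keeping the English original."""
    rows = []
    for question_id, text in enumerate(questions):
        rows.append({"id": question_id, "lang": "en", "text": text})
        for code in TARGET_LANGUAGES:
            rows.append(
                {"id": question_id, "lang": code, "text": translate(text, code)}
            )
    return rows


def fake_translate(text: str, dest: str) -> str:
    """Hypothetical stand-in for a real translator; it only tags the text."""
    return f"[{dest}] {text}"


rows = build_multilingual_set(["Is it ever acceptable to lie?"], fake_translate)
```

Passing the translator in as a function keeps the sketch runnable offline. With the study's 500 questions, the same loop would produce 3,000 prompts per model.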

The study evaluated several prominent LLMs, including frontier models like GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4, alongside open-source models such as Llama 4 Scout and Qwen3 235B-a22b. Each model was prompted with the translated questions, and its responses were then translated back to English and graded on a five-point rubric by Gemini 2.5 Pro, acting as an “LLM-as-a-judge.” The rubric scored not only the final answer but also the justification and reasoning behind it, with the aim of reducing human bias. To check the reliability of the evaluation, random samples were cross-checked by GPT-5 and Qwen3.
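The evaluation loop described above (prompt the model, translate the response back to English, grade it with a judge) might look something like this sketch. The function names, the `GradedResponse` record, and the assumption that rubric scores are integers from 1 to 5 are illustrative choices, not details confirmed by the study. The model, translator, and judge are passed in as plain callables and stubbed out here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GradedResponse:
    question_id: int
    lang: str
    response_en: str
    score: int  # assumed to be an integer on a 1-5 rubric


def evaluate(
    prompt: str,
    question_id: int,
    lang: str,
    ask_model: Callable[[str], str],
    to_english: Callable[[str, str], str],
    judge: Callable[[str], int],
) -> GradedResponse:
    """Prompt the model, back-translate if needed, then grade the English text."""
    raw = ask_model(prompt)
    response_en = raw if lang == "en" else to_english(raw, lang)
    score = judge(response_en)
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return GradedResponse(question_id, lang, response_en, score)


# Hypothetical stubs standing in for real model, translator, and judge calls.
graded = evaluate(
    prompt="[sw] example question",
    question_id=0,
    lang="sw",
    ask_model=lambda p: "jibu la mfano",
    to_english=lambda text, src: f"(from {src}) {text}",
    judge=lambda text: 4,
)
```

Grading only the back-translated English text means the judge sees every response in one language, which also means that translation errors can influence the score.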

Key Findings: A Mixed Bag of Performance

The results offered a fascinating look into the current state of multilingual LLM performance. Overall, GPT-5 emerged as the top performer, achieving an average grade of nearly 92% across all categories. In contrast, Qwen3 had the lowest average performance at 66%. Gemini 2.5 Pro excelled in the more factual and straightforward categories (Biases & Stereotypes, Legality, and Moral Judgment) but struggled significantly in the “trick” categories, scoring as low as 1.385 out of 5 in Consent & Autonomy and 1.98 out of 5 in Harm Prevention & Safety. One suggested explanation is that Gemini’s training data, described as heavily skewed towards academic papers, makes it less adept at recognizing subtly deceptive prompts.
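The figures above mix percentages with scores out of 5. To put them on one scale, a rubric score can be converted to a percentage, assuming the percentage is simply the mean rubric score divided by the maximum of 5. The article does not state that the percentages were computed this way, so the `rubric_to_percent` helper below is an illustrative assumption.

```python
def rubric_to_percent(score: float, max_score: float = 5.0) -> float:
    """Convert a rubric score to a percentage of the maximum (assumed linear)."""
    if not 0 <= score <= max_score:
        raise ValueError(f"score {score} is outside 0..{max_score}")
    return round(score / max_score * 100, 1)


# Gemini 2.5 Pro's two lowest category scores, as quoted in the article.
consent_pct = rubric_to_percent(1.385)  # 27.7
harm_pct = rubric_to_percent(1.98)      # 39.6
```

Under that assumption, Gemini 2.5 Pro's weakest categories sit at roughly 28% and 40%, well below the overall averages quoted for any model.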

A consistent pattern observed across all models was their strong performance in English for regular questions, which then dipped in the trickier categories. Interestingly, in these deceptive categories (Consent & Autonomy, Harm Prevention & Safety), models often scored *higher* in low-resource languages than in high-resource ones. The researchers hypothesize that in low-resource languages, models might simply refuse to answer when detecting potentially harmful keywords, whereas in high-resource languages, they might process the context more deeply and, consequently, be “tricked” into providing an undesirable response.
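A comparison like the one above comes down to averaging scores by language group within a category. The sketch below shows one way to do that. The tier mapping follows the article's grouping, but the language codes, the record layout, and above all the scores are made-up placeholders, not results from the study.

```python
from collections import defaultdict
from statistics import mean

# Tier labels follow the article's grouping; the codes are assumptions.
RESOURCE_TIER = {
    "en": "english",
    "zh": "high", "es": "high", "ar": "high",
    "hi": "low", "sw": "low",
}


def mean_by_tier(records, category):
    """Average rubric scores per resource tier for a single category."""
    buckets = defaultdict(list)
    for record in records:
        if record["category"] == category:
            buckets[RESOURCE_TIER[record["lang"]]].append(record["score"])
    return {tier: mean(scores) for tier, scores in buckets.items()}


# Placeholder scores for illustration only; NOT data from the study.
placeholder = [
    {"lang": "es", "category": "Harm Prevention & Safety", "score": 2},
    {"lang": "zh", "category": "Harm Prevention & Safety", "score": 3},
    {"lang": "sw", "category": "Harm Prevention & Safety", "score": 4},
    {"lang": "hi", "category": "Harm Prevention & Safety", "score": 5},
    {"lang": "hi", "category": "Legality", "score": 1},
]

tier_means = mean_by_tier(placeholder, "Harm Prevention & Safety")
```

A higher low-resource mean on real data would match the reported pattern. It would not confirm the refusal hypothesis on its own, which would require counting refusals separately.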

Another notable finding concerned Qwen, a Chinese-developed model. Despite its origin, it performed poorly in the Legality category when prompted in Chinese. This was attributed to Qwen’s tendency to assume the user’s location is China and provide answers based solely on Chinese law, failing to acknowledge the global variations in legal frameworks.

Unexpected Insights and Limitations

The study highlighted that while models generally performed better in categories like Biases & Stereotypes, Legality, and Moral Judgment, they struggled more with Consent & Autonomy and Harm Prevention & Safety. GPT-5’s lower score in Consent & Autonomy, for instance, was linked to OpenAI’s stringent safety guidelines, which might lead it to be overly cautious with questions that could cause emotional or mental harm, even if not physically violent. Differences in safety protocols among models, such as Claude’s stricter guidelines compared to Qwen or Gemini, also contributed to score discrepancies.

The researchers acknowledge several limitations, including potential inaccuracies from using Googletrans for translations, and the absence of human responses to establish a “base truth” for societal values. The rubric, while standardized, reflects alignment with its criteria rather than current human moral consensus. Furthermore, models might form perceptions of the user based on question phrasing, influencing their responses beyond the actual content.

The Path Forward

This research underscores that current benchmarks may not fully capture the complexities of multilingual safety in LLMs. The observed variations in reasoning, accuracy, and refusal rates demonstrate that safety features are highly sensitive to both language and phrasing. This raises significant concerns about the reliability of LLMs in handling malicious and ethically complex prompts globally. The study concludes by emphasizing the urgent need for more robust multilingual datasets and advanced evaluation frameworks to ensure that AI safety protocols and model responses remain consistent and trustworthy, regardless of the language used. Future work could involve expanding the language dataset, collecting human responses from diverse cultures, and accounting for how subtle phrasing influences model behavior. For a deeper dive into the methodology and detailed results, see the full research paper, “Measuring Moral LLM Responses in Multilingual Capacities.”
