Bridging the Language Gap: Evaluating LLM Morality Across Global Contexts

TLDR: A study evaluated leading LLMs like GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4 on their moral and safety responses across five categories and six languages (English, Chinese, Spanish, Arabic, Hindi, Swahili). It found that GPT-5 performed best overall, while other models showed inconsistencies, especially in “trick” questions designed to bypass safety features. Notably, models often performed worse in high-resource languages for these trick questions, potentially because they were “tricked” by context, whereas in low-resource languages, they might simply refuse. The study highlights the critical need for better multilingual datasets and evaluation frameworks to ensure consistent LLM safety and ethical reasoning globally.

As large language models (LLMs) become an integral part of daily life across the globe, from drafting emails to generating recipes, a crucial question arises: how consistently and ethically do they respond across different languages and cultural contexts? A recent study, “Measuring Moral LLM Responses in Multilingual Capacities,” by Kimaya Basu, Savi Kolari, and Allison Yu, delves into this very challenge, revealing significant insights into the multilingual capabilities and safety features of leading AI models.

Understanding the Challenge

The widespread adoption of LLMs means that people worldwide rely on them for information. However, much of the existing data used to train and test these models is predominantly in English. This linguistic imbalance can lead to inconsistencies in LLM responses when prompts are given in other languages, particularly concerning sensitive areas like morality, ethics, and safety. Previous research has shown that LLM safety protocols, while improving, often falter when faced with non-English queries, struggling to detect malicious content. This study aimed to bridge this knowledge gap by examining how LLMs handle ethical, legal, and safety questions, and how consistent their responses are across a spectrum of languages.

How the Study Was Conducted

To rigorously evaluate LLM performance, the researchers developed a dataset of 500 English questions, categorized into five key areas: Biases & Stereotypes, Consent & Autonomy, Harm Prevention & Safety, Legality, and Moral Judgment. These categories were chosen to cover domains where ethical sensitivity and responsible decision-making are paramount. To test multilingual capacities, the questions were then translated from English into five other languages using the Googletrans Python package, giving six languages in total: English, three high-resource languages (Chinese, Spanish, Arabic), and two low-resource languages (Hindi, Swahili). This selection aimed to capture a wide variety of sentence structures, cultural contexts, and writing systems.
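The translation step can be pictured with a minimal Python sketch. It is not the authors' code: the language codes, the record fields, the sample question, and the `fake_translate` stand-in are all assumptions made for illustration, and a real run would plug a translation library call into the `translate` argument.

```python
from typing import Callable, Dict, List

# Assumed language codes; the article names the languages, not the codes.
TARGET_LANGS: Dict[str, str] = {
    "Chinese": "zh-cn",
    "Spanish": "es",
    "Arabic": "ar",
    "Hindi": "hi",
    "Swahili": "sw",
}


def build_multilingual_set(
    questions: List[dict],
    translate: Callable[[str, str], str],
) -> List[dict]:
    """Return one record per (question, language), English included."""
    records = []
    for q in questions:
        # Keep the English original unchanged.
        records.append({**q, "lang": "en"})
        for code in TARGET_LANGS.values():
            records.append({**q, "lang": code, "text": translate(q["text"], code)})
    return records


def fake_translate(text: str, dest: str) -> str:
    """Offline stand-in for a translation call (hypothetical helper)."""
    return f"[{dest}] {text}"


# Hypothetical sample question, not taken from the study's dataset.
sample = [{"id": 1, "category": "Legality", "text": "Is jaywalking illegal?"}]
dataset = build_multilingual_set(sample, fake_translate)
print(len(dataset))  # 6 records: English plus five translations
```

Passing the translator in as a function keeps the sketch runnable offline and makes it easy to swap translation backends, which matters given that the authors list translation accuracy as a limitation.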

The study evaluated several prominent LLMs, including frontier models like GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4, alongside open-source models such as Llama 4 Scout and Qwen3 235B-a22b. Each model was prompted with the translated questions, and its responses were then translated back to English and graded on a five-point rubric by Gemini 2.5 Pro, acting as an “LLM-as-a-judge.” The rubric, intended to reduce human bias, assessed not just the definitive answer but also the justification and reasoning provided. To check the reliability of the evaluation, random samples were cross-checked by GPT-5 and Qwen3.
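The prompt, back-translate, and grade sequence described here might look roughly like the following Python sketch. Every name in it (`evaluate_record`, the record fields, the three callables) is hypothetical, the 1-to-5 score range is an assumption about the rubric, and the lambdas are offline stand-ins for the model, the translator, and the judge.

```python
from typing import Callable, Dict


def evaluate_record(
    record: Dict[str, str],
    ask_model: Callable[[str], str],
    back_translate: Callable[[str], str],
    judge: Callable[[str, str], int],
) -> Dict[str, object]:
    """Run one translated question through a model, then grade the answer."""
    answer = ask_model(record["text"])
    # English answers need no back-translation.
    answer_en = answer if record["lang"] == "en" else back_translate(answer)
    score = judge(record["text_en"], answer_en)
    # Assumption: the five-point rubric runs from 1 to 5.
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return {"lang": record["lang"], "answer_en": answer_en, "score": score}


# Offline stand-ins so the sketch runs without any API access.
record = {"lang": "sw", "text": "[sw] question", "text_en": "question"}
result = evaluate_record(
    record,
    ask_model=lambda prompt: "[sw] answer",
    back_translate=lambda text: text.replace("[sw] ", ""),
    judge=lambda question, answer: 4,
)
print(result)
```

Cross-checking a random sample with a second judge, as the study did, would amount to calling the same function again with a different `judge` and comparing the two scores.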

Key Findings: A Mixed Bag of Performance

The results show an uneven picture of multilingual LLM performance. Overall, GPT-5 emerged as the top performer, with an average grade of nearly 92% across all categories. In contrast, Qwen had the lowest average performance at 66%. Gemini 2.5 Pro excelled in the more factual and straightforward categories (Biases & Stereotypes, Legality, and Moral Judgment) but struggled significantly in the two “trick” categories, Consent & Autonomy and Harm Prevention & Safety, scoring as low as 1.385 and 1.98 out of 5, respectively. One proposed explanation is that Gemini’s training data may be heavily skewed towards academic papers, which could make it less adept at recognizing subtly deceptive prompts.
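The figures above mix percentages with scores out of 5. To put them on one scale, a small Python sketch can convert rubric scores to percentages. It assumes the percentage is simply the score divided by the maximum of 5; the article does not state how the study derived its percentages, so the outputs are a rough comparison only.

```python
def to_percent(score: float, max_score: float = 5.0) -> float:
    """Convert a rubric score to a percentage (assumed linear scaling)."""
    return round(score / max_score * 100, 1)


print(to_percent(1.385))  # 27.7
print(to_percent(1.98))   # 39.6
```

Under the same assumption, an average of about 92% corresponds to roughly 4.6 out of 5.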

A consistent pattern observed across all models was their strong performance in English for regular questions, which then dipped in the trickier categories. Interestingly, in these deceptive categories (Consent & Autonomy, Harm Prevention & Safety), models often scored *higher* in low-resource languages than in high-resource ones. The researchers hypothesize that in low-resource languages, models might simply refuse to answer when detecting potentially harmful keywords, whereas in high-resource languages, they might process the context more deeply and, consequently, be “tricked” into providing an undesirable response.
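A comparison of this kind comes down to averaging rubric scores by language tier within the trick categories. The Python sketch below shows one way to do that. The tier mapping follows the article's grouping, while the language codes, field names, and every score in `rows` are placeholder values for illustration, not results from the study.

```python
from collections import defaultdict
from statistics import mean

# Tier grouping as described in the article; codes are assumed.
RESOURCE_TIER = {"zh-cn": "high", "es": "high", "ar": "high", "hi": "low", "sw": "low"}
TRICK_CATEGORIES = {"Consent & Autonomy", "Harm Prevention & Safety"}


def tier_means(rows, categories):
    """Average score per resource tier, restricted to the given categories."""
    buckets = defaultdict(list)
    for row in rows:
        if row["category"] in categories:
            buckets[RESOURCE_TIER[row["lang"]]].append(row["score"])
    return {tier: mean(scores) for tier, scores in buckets.items()}


# Placeholder scores only; these are NOT figures from the paper.
rows = [
    {"lang": "es", "category": "Consent & Autonomy", "score": 2},
    {"lang": "zh-cn", "category": "Harm Prevention & Safety", "score": 3},
    {"lang": "sw", "category": "Consent & Autonomy", "score": 4},
    {"lang": "hi", "category": "Harm Prevention & Safety", "score": 4},
    {"lang": "es", "category": "Legality", "score": 5},  # ignored: not a trick category
]
means = tier_means(rows, TRICK_CATEGORIES)
print(means)
```

Tracking refusals separately from low-quality answers would be needed to test the researchers' hypothesis, since a refusal and a careful answer can both receive a high safety score.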

Another notable finding concerned Qwen, a Chinese-developed model. Despite its origin, it performed poorly in the Legality category when prompted in Chinese. This was attributed to Qwen’s tendency to assume the user’s location is China and provide answers based solely on Chinese law, failing to acknowledge the global variations in legal frameworks.

Unexpected Insights and Limitations

The study highlighted that while models generally performed better in categories like Biases & Stereotypes, Legality, and Moral Judgment, they struggled more with Consent & Autonomy and Harm Prevention & Safety. GPT-5’s lower score in Consent & Autonomy, for instance, was linked to OpenAI’s stringent safety guidelines, which might lead it to be overly cautious with questions that could cause emotional or mental harm, even if not physically violent. Differences in safety protocols among models, such as Claude’s stricter guidelines compared to Qwen or Gemini, also contributed to score discrepancies.

The researchers acknowledge several limitations, including potential inaccuracies from using Googletrans for translations, and the absence of human responses to establish a “base truth” for societal values. The rubric, while standardized, reflects alignment with its criteria rather than current human moral consensus. Furthermore, models might form perceptions of the user based on question phrasing, influencing their responses beyond the actual content.

The Path Forward

This research underscores that current benchmarks may not fully capture the complexities of multilingual safety in LLMs. The observed variations in reasoning, accuracy, and refusal rates demonstrate that safety features are highly sensitive to both language and phrasing. This raises significant concerns about the reliability of LLMs in handling malicious and ethically complex prompts globally. The study concludes by emphasizing the urgent need for more robust multilingual datasets and advanced evaluation frameworks to ensure that AI safety protocols and model responses remain consistent and trustworthy, regardless of the language used. Future work could involve expanding the language dataset, collecting human responses from diverse cultures, and carefully accounting for how subtle phrasing influences model behavior. The full research paper covers the methodology and detailed results in more depth.

Rhea Bhattacharya (https://blogs.edgentiq.com)
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
