Bridging the Language Gap: New Multilingual AI Models Evaluate LLMs Across 72 Languages

TLDR: MR3 is a new family of multilingual, rubric-agnostic reward reasoning models trained on 72 languages. It achieves state-of-the-art performance in evaluating LLMs across diverse languages, outperforming models up to nine times its size. The research identifies effective training strategies, including target-language reasoning, and releases its models, data, and code as open source.

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities, especially as they become more sophisticated. While automatic evaluation methods using LLM judges have proven effective in English, their performance often falls short in non-English languages. This gap highlights a significant challenge: how to effectively train these judges for multilingual settings.

A new research paper introduces MR3, a family of massively multilingual, rubric-agnostic reward reasoning models. The family is trained on 72 languages, the broadest language coverage in reward modeling to date. The researchers behind MR3, including David Anugraha and Genta Indra Winata, studied data and curriculum selection to identify the most effective strategies for building high-quality reward models. This included integrating reasoning datasets specifically tailored for target languages.

MR3 addresses several key challenges in multilingual LLM evaluation. Previous research has largely focused on English, leaving non-English evaluation underexplored. Building reward models that generalize across many languages, particularly in low-resource settings, is inherently difficult. Additionally, collecting human judgments for preference alignment is expensive and time-consuming. While existing human evaluation data offers an alternative, it often lacks standardization and consistent criteria, and is subject to privacy and proprietary restrictions.

The MR3 framework is designed to overcome these hurdles. It provides a unified, open-ended multilingual reasoning evaluation system that assesses candidate responses against a human-defined rubric. The model generates a reasoning trace, a concise explanation for its judgment, and a final scalar score. This framework supports various evaluation settings, including point-wise, pair-wise, and binary evaluations, making rubrics more versatile.
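
To make the inputs and outputs of such a judge concrete, here is a minimal Python sketch of how a rubric-based evaluation request and its result might be represented. The class names, field names, and prompt layout are assumptions made for illustration; they are not MR3's actual schema or prompt template.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class JudgeInput:
    # Hypothetical container; field names are illustrative, not MR3's schema.
    mode: Literal["pointwise", "pairwise", "binary"]
    rubric: str                        # human-defined evaluation criteria
    prompt: str                        # the task given to the evaluated model
    response_a: str
    response_b: Optional[str] = None   # only needed for pairwise evaluation


@dataclass
class JudgeOutput:
    reasoning: str     # the reasoning trace
    explanation: str   # concise justification of the judgment
    score: float       # final scalar score


def build_judge_prompt(item: JudgeInput) -> str:
    """Assemble a plain-text judge prompt (layout is an assumption)."""
    if item.mode == "pairwise" and item.response_b is None:
        raise ValueError("pairwise evaluation needs two responses")
    parts = [
        f"Rubric:\n{item.rubric}",
        f"Task:\n{item.prompt}",
        f"Response A:\n{item.response_a}",
    ]
    if item.mode == "pairwise":
        parts.append(f"Response B:\n{item.response_b}")
    parts.append(f"Evaluation mode: {item.mode}")
    return "\n\n".join(parts)
```

A judge model would receive the string from `build_judge_prompt` and its reply would be parsed into a `JudgeOutput`; the parsing step depends on the output format and is left out here.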

A significant aspect of MR3’s development involved meticulous dataset construction. The team curated a vast collection of over 3 million examples across 125 languages from diverse public sources. For datasets lacking explicit evaluation rubrics, GPT-4.1 was used to automatically generate them in English. The researchers then distilled expected natural language outputs using a powerful open-source reasoning model, GPT-OSS-120B. A rigorous filtering process was applied to ensure high-quality supervision, retaining only challenging examples that smaller models couldn’t consistently solve. This resulted in a final dataset of 100,000 high-quality examples spanning 72 languages.
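
The difficulty filter described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes each example stores several predictions from a smaller model under a hypothetical `small_model_predictions` key, and it treats an example as "consistently solved" when the pass rate exceeds a hypothetical `max_pass_rate` threshold.

```python
def filter_hard_examples(examples, max_pass_rate=0.5):
    """Keep examples that a smaller model does not consistently solve.

    Each example is a dict with a gold "label" and a list of
    "small_model_predictions" (both key names are assumptions).
    """
    kept = []
    for ex in examples:
        attempts = ex["small_model_predictions"]
        if not attempts:
            continue  # no evidence about difficulty, so skip the example
        # Fraction of attempts in which the smaller model matched the label.
        pass_rate = sum(p == ex["label"] for p in attempts) / len(attempts)
        if pass_rate <= max_pass_rate:
            kept.append(ex)
    return kept
```

Lowering `max_pass_rate` keeps only the hardest examples, at the cost of a smaller dataset.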

The training of MR3 models, primarily using the Qwen3 model family, involved supervised fine-tuning to enhance reasoning capabilities. The researchers experimented with various curriculum strategies, finding that an “easy-to-hard” ordering of samples yielded the best performance. This approach involved sorting the dataset from simpler to more difficult examples.
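
As a rough illustration of an easy-to-hard curriculum, the sketch below sorts training examples by a per-example difficulty score before they are fed to fine-tuning. How MR3 measures difficulty is not specified here, so the `difficulty` field is a hypothetical stand-in (for instance, one minus a smaller model's pass rate).

```python
def easy_to_hard(examples, difficulty_key="difficulty"):
    """Return examples ordered from lowest to highest difficulty.

    `difficulty_key` names a numeric field on each example; the field
    itself is an assumption made for this sketch.
    """
    # sorted() is stable, so examples with equal difficulty keep their order.
    return sorted(examples, key=lambda ex: ex[difficulty_key])
```

The ordered list would then be consumed without shuffling, so the model sees simpler examples first.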

The results are impressive. MR3 achieves state-of-the-art performance on multilingual reward model benchmarks, even outperforming much larger models like GPT-OSS-120B while being up to nine times smaller. For instance, MR3-QWEN3-14B achieved an average accuracy of 84.94% on pairwise preference benchmarks, surpassing the strongest multilingual baselines. This demonstrates the effectiveness of MR3’s multilingual supervision dataset and training pipeline.
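
For readers unfamiliar with the metric, pairwise preference accuracy is the percentage of comparisons in which the judge picks the same response as the reference label. The helper below is a generic sketch of that calculation, not the benchmark's official scoring script.

```python
def pairwise_accuracy(predictions, gold):
    """Percentage of pairwise comparisons where the judge matches the label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)
```

For example, a judge that agrees with the label on three of four comparisons scores 75.0.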

The research also examined the impact of different prompting and reasoning language strategies. While the English-prompt, English-reasoning setting (eng-eng) remained the strongest in absolute terms, fine-tuning significantly improved performance across all strategies, especially for target-language reasoning (tgt-tgt). This is crucial for interpretability, allowing users to understand model decisions in their preferred language. The study found that explicitly generating reasoning in the target language through “language forcing” was more effective than post-hoc translation.
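
One way to picture the two named settings is as a pair of choices: the language of the prompt and the language the judge is told to reason in. The sketch below encodes that idea; the function name, the returned keys, and the wording of the forcing instruction are all assumptions, and the paper's actual language-forcing mechanism may differ.

```python
def language_settings(strategy: str, target_language: str) -> dict:
    """Map a strategy name to prompt and reasoning languages (illustrative)."""
    settings = {
        "eng-eng": ("English", "English"),
        "tgt-tgt": (target_language, target_language),
    }
    if strategy not in settings:
        raise ValueError(f"unknown strategy: {strategy}")
    prompt_lang, reasoning_lang = settings[strategy]
    return {
        "prompt_language": prompt_lang,
        "reasoning_language": reasoning_lang,
        # Hypothetical instruction, shown in English for readability; in a
        # tgt-tgt setup it would presumably be written in the target language.
        "forcing_instruction": f"Write your reasoning and explanation in {reasoning_lang}.",
    }
```

Under this framing, post-hoc translation would correspond to running eng-eng and translating the output afterward, whereas language forcing asks for target-language reasoning directly.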

Furthermore, an evaluation of reasoning faithfulness showed that MR3 consistently improved reasoning quality compared to its baseline, particularly in low-resource languages. This indicates that the model not only performs well but also produces more plausible and logically coherent explanations.

The MR3 models, data, and code are available as open source, fostering further research and development in multilingual LLM evaluation. You can find more details in the original research paper.
