
Bridging the Language Gap: New Multilingual AI Models Evaluate LLMs Across 72 Languages

TLDR: MR3 is a new family of multilingual, rubric-agnostic reward reasoning models trained on 72 languages. It achieves state-of-the-art performance in evaluating LLMs across diverse languages, outperforming larger models while being significantly smaller. The research highlights effective training strategies, including target-language reasoning, and makes its models, data, and code open source.

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities, especially as they become more sophisticated. While automatic evaluation methods using LLM judges have proven effective in English, their performance often falls short in non-English languages. This gap highlights a significant challenge: how to effectively train these judges for multilingual settings.

A new research paper introduces MR3, a family of massively multilingual, rubric-agnostic reward reasoning models. The family stands out for its language coverage: it is trained on 72 languages, the broadest in reward modeling to date. The researchers behind MR3, including David Anugraha and Genta Indra Winata, studied data and curriculum selection to identify the most effective strategies for building high-quality reward models, including the integration of reasoning datasets tailored to target languages.

MR3 addresses several key challenges in multilingual LLM evaluation. Previous research has largely focused on English, leaving non-English evaluation underexplored. Building reward models that generalize across many languages, particularly in low-resource settings, is inherently difficult. Additionally, collecting human judgments for preference alignment is expensive and time-consuming. While existing human evaluation data offers an alternative, it often lacks standardization and consistent criteria, and is subject to privacy and proprietary restrictions.

The MR3 framework is designed to overcome these hurdles. It provides a unified, open-ended multilingual reasoning evaluation system that assesses candidate responses against a human-defined rubric. The model generates a reasoning trace, a concise explanation for its judgment, and a final scalar score. This framework supports various evaluation settings, including point-wise, pair-wise, and binary evaluations, making rubrics more versatile.
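To make the three evaluation settings concrete, here is a minimal Python sketch of how a rubric-based request and its output might be represented. The class names, field names, and prompt layout are illustrative assumptions and are not taken from the MR3 code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical mapping: how many candidate responses each setting expects.
EXPECTED_RESPONSES = {"pointwise": 1, "binary": 1, "pairwise": 2}


@dataclass
class EvalRequest:
    task: str             # instruction given to the model being judged
    rubric: str           # human-defined evaluation criteria
    responses: List[str]  # candidate response(s) to assess
    mode: str             # "pointwise", "pairwise", or "binary"


@dataclass
class EvalResult:
    reasoning: str    # the judge's reasoning trace
    explanation: str  # concise justification for the judgment
    score: float      # final scalar score


def build_prompt(req: EvalRequest) -> str:
    """Assemble a plain-text judge prompt (assumed layout, for illustration)."""
    if req.mode not in EXPECTED_RESPONSES:
        raise ValueError(f"unknown mode: {req.mode}")
    if len(req.responses) != EXPECTED_RESPONSES[req.mode]:
        raise ValueError(f"{req.mode} expects {EXPECTED_RESPONSES[req.mode]} response(s)")
    parts = [f"Task:\n{req.task}", f"Rubric:\n{req.rubric}"]
    for i, response in enumerate(req.responses, start=1):
        parts.append(f"Response {i}:\n{response}")
    parts.append("Give your reasoning, a short explanation, and a final score.")
    return "\n\n".join(parts)
```

Because the rubric is passed in as free text, the same request shape can serve all three settings; only the number of candidate responses changes.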

A significant aspect of MR3’s development involved meticulous dataset construction. The team curated a vast collection of over 3 million examples across 125 languages from diverse public sources. For datasets lacking explicit evaluation rubrics, GPT-4.1 was used to automatically generate them in English. The researchers then distilled expected natural language outputs using a powerful open-source reasoning model, GPT-OSS-120B. A rigorous filtering process was applied to ensure high-quality supervision, retaining only challenging examples that smaller models couldn’t consistently solve. This resulted in a final dataset of 100,000 high-quality examples spanning 72 languages.
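The difficulty filter can be sketched in a few lines of Python. The article only says that examples smaller models "couldn't consistently solve" were kept; measuring this as a pass rate over repeated attempts, and the 0.5 cutoff, are assumptions made here for illustration.

```python
from typing import Dict, List


def pass_rate(attempts: List[bool]) -> float:
    """Fraction of attempts in which a smaller model reached the reference verdict.

    An example with no recorded attempts is treated as unsolved (rate 0.0).
    """
    return sum(attempts) / len(attempts) if attempts else 0.0


def keep_challenging(
    attempts_by_id: Dict[str, List[bool]],
    max_pass_rate: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Return IDs of examples the smaller model solves at most max_pass_rate of the time."""
    return [
        example_id
        for example_id, attempts in attempts_by_id.items()
        if pass_rate(attempts) <= max_pass_rate
    ]
```

Under this sketch, an example solved in four out of four attempts is dropped as too easy, while one solved in one out of four is retained.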

The training of MR3 models, primarily using the Qwen3 model family, involved supervised fine-tuning to enhance reasoning capabilities. The researchers experimented with various curriculum strategies, finding that an “easy-to-hard” ordering of samples yielded the best performance. This approach involved sorting the dataset from simpler to more difficult examples.
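An easy-to-hard curriculum amounts to sorting the training set by a difficulty score before fine-tuning. The sketch below assumes difficulty is defined as one minus a smaller model's pass rate; the article does not say how difficulty was measured, so that definition and the `pass_rate` field are hypothetical.

```python
from typing import Any, Callable, Dict, List, TypeVar

T = TypeVar("T")


def easy_to_hard(examples: List[T], difficulty: Callable[[T], float]) -> List[T]:
    """Return a new list ordered from lowest to highest difficulty.

    sorted() is stable, so examples with equal difficulty keep their original order.
    """
    return sorted(examples, key=difficulty)


def difficulty_from_pass_rate(example: Dict[str, Any]) -> float:
    """Assumed difficulty signal: examples solved less often count as harder."""
    return 1.0 - example["pass_rate"]
```

A training loop would then iterate over `easy_to_hard(dataset, difficulty_from_pass_rate)` in order, without shuffling, so that the model sees simpler examples first.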

MR3 achieves state-of-the-art performance on multilingual reward model benchmarks, outperforming much larger models such as GPT-OSS-120B while being up to nine times smaller. For instance, MR3-QWEN3-14B achieved an average accuracy of 84.94% on pairwise preference benchmarks, surpassing the strongest multilingual baselines. This demonstrates the effectiveness of MR3’s multilingual supervision dataset and training pipeline.
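For readers unfamiliar with the metric, pairwise accuracy is the percentage of comparisons in which the judge picks the same response as the reference label. The sketch below also shows an unweighted mean across benchmarks; whether the reported average is unweighted is an assumption, and the function names are illustrative.

```python
from typing import Dict, List


def pairwise_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Percentage of pairs where the judge's preferred response matches the label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one comparison is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def average_accuracy(per_benchmark: Dict[str, float]) -> float:
    """Unweighted mean of per-benchmark accuracies (assumed averaging scheme)."""
    if not per_benchmark:
        raise ValueError("at least one benchmark is required")
    return sum(per_benchmark.values()) / len(per_benchmark)
```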

The study also examined how the choice of prompt language and reasoning language affects performance. English prompting with English reasoning (eng-eng) remained the strongest in absolute terms, but fine-tuning significantly improved performance across all strategies, especially for target-language reasoning (tgt-tgt). This matters for interpretability, because it lets users read the model's reasoning in their preferred language. The study found that explicitly generating reasoning in the target language through “language forcing” was more effective than post-hoc translation.
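The strategy labels can be read as "prompt language, then reasoning language". The following sketch decodes a label such as eng-eng or tgt-tgt and builds an instruction asking for reasoning in a given language. The instruction wording is only one assumed way to realize language forcing; the article does not describe the mechanism MR3 uses, and the function names are hypothetical.

```python
from typing import Tuple


def resolve_strategy(strategy: str, target_language: str) -> Tuple[str, str]:
    """Map a label like 'tgt-tgt' to (prompt language, reasoning language)."""
    codes = {"eng": "English", "tgt": target_language}
    parts = strategy.split("-")
    if len(parts) != 2 or any(part not in codes for part in parts):
        raise ValueError(f"unrecognized strategy label: {strategy}")
    return codes[parts[0]], codes[parts[1]]


def forcing_instruction(reasoning_language: str) -> str:
    """Assumed form of language forcing: an explicit instruction in the prompt."""
    return f"Write your reasoning in {reasoning_language}."
```

For example, `resolve_strategy("tgt-tgt", "Indonesian")` yields Indonesian for both the prompt and the reasoning, whereas post-hoc translation would instead reason in English and translate the finished explanation afterwards.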

Furthermore, an evaluation of reasoning faithfulness showed that MR3 consistently improved reasoning quality compared to its baseline, particularly in low-resource languages. This indicates that the model not only performs well but also produces more plausible and logically coherent explanations.

The MR3 models, data, and code are available as open source, fostering further research and development in multilingual LLM evaluation. You can find more details in the original research paper.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
