TLDR: MultiNRC is a new benchmark with over 1,000 native, culturally and linguistically grounded reasoning questions in French, Spanish, and Chinese, designed to evaluate LLMs’ true multilingual capabilities. It reveals that current LLMs struggle significantly with native multilingual reasoning (none score above 50%) and perform better on math problems translated into English, but not on cultural reasoning, highlighting persistent challenges with culturally specific knowledge in non-English contexts.
While Large Language Models (LLMs) have shown impressive progress in English reasoning, evaluating their capabilities across diverse languages and cultural contexts has remained a significant challenge. Many existing multilingual benchmarks are simply translations of English ones, which can inadvertently bias the evaluation towards English-centric reasoning problems and cultural contexts.
Introducing MultiNRC: A Native Multilingual Benchmark
To address this gap, researchers Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, and Chen Xing from Scale AI have introduced the Multilingual Native Reasoning Challenge (MultiNRC). The benchmark assesses LLMs on over 1,000 native, linguistically and culturally grounded reasoning questions. The questions were written by native speakers of French, Spanish, and Chinese, so each one reflects the nuances of its own language rather than those of an English original.
MultiNRC covers four core reasoning categories:
- Language-specific linguistic reasoning
- Wordplay & riddles
- Cultural/tradition reasoning
- Math reasoning with cultural relevance
For the cultural/tradition and culturally relevant math reasoning categories, the benchmark also provides English equivalent translations. These translations, manually created by native speakers fluent in English, allow for a direct comparison of LLM reasoning capacity in other languages versus English on the exact same questions. The full research paper can be found here: MultiNRC Research Paper.
Key Findings from LLM Evaluation
The researchers systematically evaluated 14 leading LLMs, covering most major LLM families, on MultiNRC and its English equivalent set. The results highlight several important insights:
- Persistent Challenges: Current LLMs are still not proficient in native multilingual reasoning, with none of the tested models scoring above 50% accuracy on MultiNRC. This underscores the high difficulty of the benchmark and the significant room for improvement in LLM multilingual capabilities.
- Varied Strengths: LLMs exhibit distinct strengths and weaknesses across different linguistic, cultural, and logical reasoning tasks. For instance, some models might perform better in French wordplay, while others excel in Chinese cultural prompts.
- English Advantage in Math: Most models performed substantially better on math reasoning when questions were presented in English rather than in their original languages (+10% on average). This suggests that LLMs are better able to retrieve and apply culturally grounded knowledge for math problems when the context is provided in English.
- Limited Cultural Improvement: In contrast, for cultural/tradition reasoning, there was no significant performance difference between English equivalent prompts and the original multilingual prompts. This indicates that the cultural context in these questions is often too specific and nuanced, and may be absent from the LLM’s knowledge base regardless of the language.
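The comparison behind the last two findings, accuracy on original-language prompts versus their English equivalents for each category, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' evaluation code: the record fields (`category`, `prompt_language`, `correct`), the function names, and the toy records are all assumptions made for the example, and the numbers are placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(category, prompt_language): accuracy} for graded records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["category"], r["prompt_language"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {key: correct[key] / totals[key] for key in totals}


def english_gap(acc, category):
    """English-prompt accuracy minus native-prompt accuracy, in percentage points."""
    return 100 * (acc[(category, "english")] - acc[(category, "native")])


# Toy placeholder records (hypothetical, NOT data from the paper).
toy_records = [
    {"category": "math", "prompt_language": "native", "correct": True},
    {"category": "math", "prompt_language": "native", "correct": False},
    {"category": "math", "prompt_language": "english", "correct": True},
    {"category": "math", "prompt_language": "english", "correct": True},
    {"category": "cultural", "prompt_language": "native", "correct": True},
    {"category": "cultural", "prompt_language": "native", "correct": False},
    {"category": "cultural", "prompt_language": "english", "correct": False},
    {"category": "cultural", "prompt_language": "english", "correct": True},
]

acc = accuracy_by_group(toy_records)
for category in ("math", "cultural"):
    print(category, english_gap(acc, category))
```

A positive gap for a category means the model did better when the same question was asked in English; a gap near zero, as reported for cultural/tradition reasoning, means the prompt language made little difference.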
Implications and Future Directions
The MultiNRC benchmark serves as a robust testbed for future advancements in multilingual LLM development. The findings highlight the sensitivity of large language models to linguistic and cultural nuances, especially when reasoning in languages other than English. The research also acknowledges limitations: it did not evaluate models specifically fine-tuned for multilingual tasks, and future work needs to include more languages and consider regional or dialectal variations. Overall, the evaluation reinforces the need for more diverse training data and tailored model evaluation to ensure robust and equitable progress in multilingual LLMs.


