TLDR: MERA Code is a new benchmark for evaluating large language models (LLMs) in code generation, particularly for the Russian language. It features 11 tasks across 8 programming languages, focusing on practical coding skills, code quality, and real-world scenarios, unlike many existing English-centric benchmarks. It offers an open-source platform with a scoring system and leaderboard to standardize evaluation and guide future research in multilingual code generation.
A new research paper introduces MERA Code, a comprehensive framework designed to evaluate the capabilities of large language models (LLMs) in generating code. This initiative addresses a significant gap in current evaluation methods, which often overlook crucial aspects like code quality, real-world applicability, and support for multiple human languages, especially non-English ones.
The paper, titled “MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks,” was authored by a large team of researchers including Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, and many others from institutions like SberAI, ITMO University, and Skoltech. Their work highlights that while LLMs have made impressive strides in automating software engineering tasks, existing benchmarks tend to focus more on natural language understanding or are limited to English programming contexts.
MERA Code stands out by specifically targeting the evaluation of code generation LLMs in Russian, providing a much-needed resource for multilingual software development scenarios. It features a robust set of 11 evaluation tasks spanning 8 popular programming languages: Python, Java, C#, JavaScript, Go, C, C++, and Scala. This broad coverage enables a more holistic assessment of a model’s practical coding skills.
A Deeper Look at MERA Code’s Approach
The framework introduces a unique evaluation methodology, including a detailed taxonomy that outlines the practical coding skills necessary for models to successfully complete tasks. This taxonomy breaks down skills into foundational categories like Perception (input understanding), Reasoning and Knowledge (internal processing), and Generation (output creation), with more niche skills branching out from these.
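As a rough illustration of how such a taxonomy can be encoded, the sketch below represents the three top-level categories as a simple mapping. The category names follow the paper’s split; the leaf skills shown are hypothetical placeholders, not the paper’s exact skill list.

```python
# Illustrative sketch only: the three category names follow the paper's top-level
# taxonomy, but the leaf skills listed here are hypothetical placeholders rather
# than MERA Code's exact skill list.
SKILL_TAXONOMY = {
    "Perception": [               # understanding the input
        "read the task instructions",
        "parse existing source code",
    ],
    "Reasoning and Knowledge": [  # internal processing
        "recall language and library knowledge",
        "plan an algorithm",
    ],
    "Generation": [               # producing the output
        "emit syntactically valid code",
        "write documentation text",
    ],
}

def skills_for_task(categories: list[str]) -> list[str]:
    """Collect the leaf skills exercised by a task, given its top-level categories."""
    return [skill for category in categories for skill in SKILL_TAXONOMY[category]]
```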
MERA Code also provides an open-source codebase, allowing users to conduct their own assessments. It features a sophisticated scoring system compatible with various programming environments and a public platform with a leaderboard and submission system. This transparency and accessibility are crucial for fostering collaboration and standardizing evaluation procedures within the research and industrial communities.
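To give a sense of what execution-based scoring involves, here is a minimal, hypothetical sketch that runs a generated Python solution against a test file in an isolated directory and reports pass/fail. It is not MERA Code’s actual scoring implementation, and it assumes pytest is installed in the environment.

```python
# Hypothetical functional-correctness check: execute generated code against tests
# in a temporary directory via a subprocess. Not MERA Code's actual scoring code.
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_tests(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Return True if the candidate solution passes the provided tests."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as failures
        return result.returncode == 0
```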
Tasks and Metrics
The benchmark includes diverse tasks, among them:
- CodeLinterEval: correcting code based on reported linter errors
- RealCode and RealCodeJava: synthesizing function bodies within real-world projects
- JavaTestGen: generating JUnit 5 unit tests
- StRuCom: writing structured Russian-language code documentation
- RuCodeReviewer: generating code review comments in Russian

Each task employs metrics suited to its output, such as Pass@k (functional correctness), Compile@k (compilation success), chrF (a character-level metric well suited to Russian’s morphological complexity), and CodeBLEU (syntactic and semantic similarity of generated code).
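Pass@k is typically computed with the standard unbiased estimator used by earlier code generation benchmarks: given n generated samples per problem, of which c pass the tests, Pass@k = 1 - C(n-c, k)/C(n, k). Below is a small illustrative calculation of that formula, not MERA Code’s exact scoring code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples
    drawn from n generations (of which c are correct) passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of them pass the hidden tests
print(pass_at_k(10, 3, 1))  # 0.30
print(pass_at_k(10, 3, 5))  # ~0.917
```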
Key Findings from Evaluations
The researchers evaluated both open-source and proprietary LLMs, including OpenAI GPT-4, Gemini 2.5, DeepSeek Coder V2, and Mixtral. GPT-4o and Gemini 2.5 Flash delivered the strongest overall performance, excelling at multilingual documentation and a range of code completion tasks. However, all models showed weaknesses in areas such as generating unit tests in Python and automated comment generation, indicating room for future improvement.
MERA Code is a foundational resource for the research and industrial community, promoting collaboration to enhance task coverage and adapt to evolving LLM capabilities. By combining natural and programming language evaluation, it supports more relevant assessments of LLMs in software engineering. For more detailed information, you can refer to the full research paper available at arXiv.org.


