TLDR: MERA Code is a new benchmark for evaluating large language models (LLMs) in code generation, particularly for the Russian language. It features 11 tasks across 8 programming languages, focusing on practical coding skills, code quality, and real-world scenarios, unlike many existing English-centric benchmarks. It offers an open-source platform with a scoring system and leaderboard to standardize evaluation and guide future research in multilingual code generation.
A new research paper introduces MERA Code, a comprehensive framework designed to evaluate the capabilities of large language models (LLMs) in generating code. This initiative addresses a significant gap in current evaluation methods, which often overlook crucial aspects like code quality, real-world applicability, and support for multiple human languages, especially non-English ones.
The paper, titled “MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks,” was authored by a large team of researchers including Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, and many others from institutions like SberAI, ITMO University, and Skoltech. Their work highlights that while LLMs have made impressive strides in automating software engineering tasks, existing benchmarks tend to focus more on natural language understanding or are limited to English programming contexts.
MERA Code stands out by specifically targeting the evaluation of code generation LLMs in Russian, providing a much-needed resource for multilingual software development scenarios. It features a robust set of 11 evaluation tasks spanning 8 popular programming languages: Python, Java, C#, JavaScript, Go, C, C++, and Scala. This broad coverage enables a more holistic assessment of a model’s practical coding skills.
A Deeper Look at MERA Code’s Approach
The framework introduces a unique evaluation methodology, including a detailed taxonomy that outlines the practical coding skills necessary for models to successfully complete tasks. This taxonomy breaks down skills into foundational categories like Perception (input understanding), Reasoning and Knowledge (internal processing), and Generation (output creation), with more niche skills branching out from these.
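As a rough illustration of how such a taxonomy can be encoded, the sketch below represents the three top-level categories as a simple mapping. The category names follow the paper’s split; the leaf skills shown are hypothetical placeholders, not the paper’s exact skill list.

```python
# Illustrative sketch only: the three category names follow the paper's top-level
# taxonomy, but the leaf skills listed here are hypothetical placeholders rather
# than MERA Code's exact skill list.
SKILL_TAXONOMY = {
    "Perception": [               # understanding the input
        "read the task instructions",
        "parse existing source code",
    ],
    "Reasoning and Knowledge": [  # internal processing
        "recall language and library knowledge",
        "plan an algorithm",
    ],
    "Generation": [               # producing the output
        "emit syntactically valid code",
        "write documentation text",
    ],
}

def skills_for_task(categories: list[str]) -> list[str]:
    """Collect the leaf skills exercised by a task, given its top-level categories."""
    return [skill for category in categories for skill in SKILL_TAXONOMY[category]]
```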
MERA Code also provides an open-source codebase, allowing users to conduct their own assessments. It features a sophisticated scoring system compatible with various programming environments and a public platform with a leaderboard and submission system. This transparency and accessibility are crucial for fostering collaboration and standardizing evaluation procedures within the research and industrial communities.
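To give a sense of what execution-based scoring involves, here is a minimal, hypothetical sketch that runs a generated Python solution against a test file in an isolated directory and reports pass/fail. It is not MERA Code’s actual scoring implementation, and it assumes pytest is installed in the environment.

```python
# Hypothetical functional-correctness check: execute generated code against tests
# in a temporary directory via a subprocess. Not MERA Code's actual scoring code.
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_tests(solution_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Return True if the candidate solution passes the provided tests."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as failures
        return result.returncode == 0
```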
Tasks and Metrics
The benchmark includes diverse tasks, among them:
- CodeLinterEval: correcting code based on reported linter errors
- RealCode and RealCodeJava: synthesizing function bodies within real-world projects
- JavaTestGen: generating JUnit 5 unit tests
- StRuCom: writing structured Russian-language code documentation
- RuCodeReviewer: generating code review comments in Russian

Each task employs metrics suited to its output, such as Pass@k (functional correctness), Compile@k (compilation success), chrF (a character-level metric well suited to Russian’s morphological complexity), and CodeBLEU (syntactic and semantic similarity of generated code).
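Pass@k is typically computed with the standard unbiased estimator used by earlier code generation benchmarks: given n generated samples per problem, of which c pass the tests, Pass@k = 1 - C(n-c, k)/C(n, k). Below is a small illustrative calculation of that formula, not MERA Code’s exact scoring code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples
    drawn from n generations (of which c are correct) passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of them pass the hidden tests
print(pass_at_k(10, 3, 1))  # 0.30
print(pass_at_k(10, 3, 5))  # ~0.917
```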
Key Findings from Evaluations
The researchers evaluated both open-source and proprietary LLMs, including OpenAI GPT-4, Gemini 2.5, DeepSeek Coder V2, and Mixtral. GPT-4o and Gemini 2.5 Flash delivered the strongest overall performance, excelling at multilingual documentation and a range of code completion tasks. However, all models showed weaknesses in areas such as generating unit tests in Python and automated comment generation, indicating room for future improvement.
MERA Code is a foundational resource for the research and industrial community, promoting collaboration to enhance task coverage and adapt to evolving LLM capabilities. By combining natural and programming language evaluation, it supports more relevant assessments of LLMs in software engineering. For more detailed information, you can refer to the full research paper available at arXiv.org.


