TLDR: SWE-MERA is a new, dynamic benchmark for evaluating Large Language Models (LLMs) on software engineering tasks. Unlike previous static benchmarks like SWE-bench, SWE-MERA continuously collects and validates real-world GitHub issues through an automated seven-stage pipeline, minimizing data contamination and ensuring fair, up-to-date evaluations. It uses the Aider coding agent to test LLMs and provides reliable performance baselines, addressing critical limitations of existing evaluation methods.
The world of artificial intelligence, particularly Large Language Models (LLMs), is rapidly changing how we approach software engineering. These advanced models are becoming increasingly capable of assisting with coding tasks, from generating code to fixing bugs. However, evaluating their true performance has been a significant challenge, largely due to limitations in existing benchmarks.
One of the most widely used benchmarks, SWE-bench, has faced critical issues. Recent studies have revealed significant data contamination, meaning that a substantial portion of successful solutions were either directly leaked or passed due to inadequate test cases. Furthermore, SWE-bench is static; its tasks were collected once and never updated. This leads to problems like data leakage, where models might memorize solutions, and benchmark saturation, where top models achieve near-perfect scores, making it difficult to gauge real progress.
To address these fundamental challenges, researchers have introduced a new dynamic benchmark called SWE-MERA. This innovative benchmark is designed to be continuously updated, ensuring that LLMs are evaluated on fresh, real-world software engineering problems. SWE-MERA aims to provide a more reliable and fair assessment of LLM capabilities by minimizing contamination risks and implementing rigorous quality validation.
How SWE-MERA Works: A Seven-Stage Pipeline
The core of SWE-MERA is a robust, automated seven-stage pipeline that systematically collects and validates evaluation tasks from real-world GitHub issues. This pipeline runs monthly to ensure the dataset remains current and reflective of ongoing developments in software engineering. Here’s a simplified look at the steps:
- Repository Selection: GitHub repositories are chosen based on criteria like a minimum number of stars and forks, recent activity, Python as the primary language, and an open-source license.
- PR-Issue Mapping Construction: The system links pull requests (PRs) to their corresponding issues, ensuring a one-to-one relationship and that the PR is merged and the issue closed.
- Metadata Extraction and Filtering: Issue titles, descriptions, and comments are downloaded, and issues whose text is too short are filtered out.
- Patch Extraction and Validation: The actual code changes (patches) from pull requests are extracted and validated. Only changes that modify both source code and test files, and involve fewer than 15 source files, are kept.
- Repository Build Validation: For each task, a suitable environment is built in a Docker container, and validation succeeds if at least one test passes.
- End-to-End Task Execution: Each generated task is executed in a controlled Docker environment to verify its reproducibility and correctness.
- LLM-based Pipeline Evaluation: Finally, a powerful LLM (Qwen3-32B) evaluates the task description, patch, and associated tests based on criteria like task correctness, test correctness, test completeness, and complexity. Tasks falling into the bottom 25% for correctness and completeness are filtered out, ensuring high-quality problems.
This meticulous process results in approximately 10,000 potential tasks, with 300 samples currently available for evaluation.
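Two of the filters above are concrete enough to sketch in code: the patch check (both source and test files touched, fewer than 15 source files) and the removal of the bottom 25% of tasks by LLM-assigned scores. The following is a minimal illustration, not the authors' implementation. The `Task` fields, the `is_test_file` heuristic, and the choice to drop a task when it ranks in the bottom quartile on either score are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    changed_files: list[str]
    correctness: float    # hypothetical LLM-judge score, higher is better
    completeness: float   # hypothetical LLM-judge score, higher is better


def is_test_file(path: str) -> bool:
    # Assumed heuristic: a file is a test if it sits in a test/tests
    # directory or follows the usual pytest naming patterns.
    parts = path.split("/")
    name = parts[-1]
    in_test_dir = any(p in ("test", "tests") for p in parts[:-1])
    return in_test_dir or name.startswith("test_") or name.endswith("_test.py")


def patch_is_valid(changed_files: list[str], max_source_files: int = 15) -> bool:
    # Keep a patch only if it touches both tests and source code,
    # and changes fewer than max_source_files source files.
    tests = [f for f in changed_files if is_test_file(f)]
    sources = [f for f in changed_files if not is_test_file(f)]
    return bool(tests) and bool(sources) and len(sources) < max_source_files


def bottom_quartile_ids(tasks: list[Task], key) -> set[str]:
    # Rank-based cut: the lowest floor(n / 4) tasks by the given score.
    ranked = sorted(tasks, key=key)
    return {t.task_id for t in ranked[: len(ranked) // 4]}


def filter_tasks(tasks: list[Task]) -> list[Task]:
    valid = [t for t in tasks if patch_is_valid(t.changed_files)]
    # Assumption: a task is dropped if it is in the bottom quartile
    # for correctness OR for completeness.
    dropped = bottom_quartile_ids(valid, lambda t: t.correctness)
    dropped |= bottom_quartile_ids(valid, lambda t: t.completeness)
    return [t for t in valid if t.task_id not in dropped]
```

A rank-based cut is used here instead of an interpolated percentile so that the behavior on small samples is easy to reason about; ties are broken by input order.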
Evaluating LLMs with SWE-MERA
To assess LLMs in issue-solving scenarios, SWE-MERA employs the Aider coding agent, a popular framework known for its performance. Models are given six attempts to fix a given issue, with each attempt allowing for reflections based on linting or test outputs. The benchmark reports two key metrics: pass@1 (success on the first attempt) and pass@6 (success within six attempts).
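As described here, both metrics are simple fractions over tasks: a task counts toward pass@k if any of its first k sequential attempts succeeds. The sketch below follows that reading. The function name and the input layout (a mapping from a task ID to an ordered list of attempt outcomes) are assumptions for illustration, and this is not the unbiased sampling estimator of pass@k used in some code-generation papers.

```python
def pass_at_k(attempts_by_task: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks solved within the first k attempts.

    attempts_by_task maps a task ID to attempt outcomes in order;
    a run that stops early after a success may have fewer than k entries.
    """
    if not attempts_by_task:
        raise ValueError("no tasks to score")
    solved = sum(any(attempts[:k]) for attempts in attempts_by_task.values())
    return solved / len(attempts_by_task)


results = {
    "task-1": [True],                   # solved on the first attempt
    "task-2": [False, False, True],     # solved on the third attempt
    "task-3": [False] * 6,              # never solved
    "task-4": [False, True],            # solved on the second attempt
}
print(pass_at_k(results, 1))  # 0.25
print(pass_at_k(results, 6))  # 0.75
```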
The researchers evaluated a dozen recent LLMs, including Codestral, Qwen2.5-Coder, Llama-3.3, DeepSeek-R1, and Devstral-Small. The results, collected on tasks dated between September 2024 and June 2025, show that SWE-MERA clearly separates state-of-the-art models by performance. For instance, DeepSeek-R1-0528 showed the highest pass@6 rate at 40.2%, followed by Devstral-Small-2505 at 28.2% and Qwen3-32B at 26.1%. Interestingly, some models, such as DeepSeek-R1, performed better on tasks from 2024 than on tasks from 2025, which suggests that the benchmark’s continuously refreshed tasks capture evolving challenges.
Key Insights and Future Directions
The development of SWE-MERA has yielded several important observations. Data collection fits comfortably within the GitHub API rate limits: relevant tasks from the past month can be gathered within two days using a single GitHub token. The LLM-based evaluation step is crucial for maintaining task quality, filtering out problems that are either too hard because the issue provides insufficient information or too trivial because the issue states the solution explicitly.
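For a rough sense of scale, the two-day figure can be turned into a request budget. This back-of-the-envelope sketch assumes the collection uses GitHub's REST API under its documented primary limit of 5,000 requests per hour for an authenticated user token. The article does not say which endpoints the pipeline calls, and search or GraphQL endpoints are limited differently.

```python
# Assumption: primary REST API limit for one authenticated user token.
REQUESTS_PER_HOUR = 5_000


def request_budget(hours: int, per_hour: int = REQUESTS_PER_HOUR) -> int:
    # Upper bound on requests one token can issue in the given time window.
    return hours * per_hour


print(request_budget(48))  # 240000 requests over two days
```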
A notable finding during the security assessment was the discovery of two repositories, suitable for benchmarking, that also exhibited known virus signatures. This highlights the importance of integrating basic virus signature checking into such systems to ensure the integrity and safety of collected repositories.
The SWE-MERA evaluation platform offers a reproducible and transparent environment for benchmarking software engineering agents. It features an interactive web interface where users can visualize evaluation metrics across different dates and inspect potential data contamination events. The dataset is updated monthly and is available via Hugging Face, encouraging community participation and submissions.
While SWE-MERA offers significant advantages, the authors acknowledge certain limitations. Dynamically collected tasks might sometimes lack the nuanced complexity of human-authored problems, potentially resulting in unnaturally phrased prompts or incomplete specifications. Ensuring the quality and fairness of these problems is also an ongoing challenge, as biases could be introduced. Furthermore, while dynamic generation reduces memorization risks, it doesn’t eliminate them entirely. The infrastructure for dynamic problem generation also adds technical complexity and potential instability. Lastly, the current benchmark primarily focuses on programming correctness, leaving other crucial aspects like code readability, maintainability, efficiency, and security for future work.
In conclusion, SWE-MERA represents a significant step forward in evaluating LLMs for software engineering tasks. By embracing dynamic data collection, automated quality validation, and continuous updates, it addresses key limitations of traditional static benchmarks and provides a more reliable, contamination-resistant assessment of AI capabilities in this critical domain. For more details, you can refer to the full research paper: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks.


