TLDR: OpenCodeEdit is an open-source pipeline that uses multiple LLMs to synthesize high-quality, diverse datasets for code editing. It generates realistic code-edit triplets with both concise (“lazy”) and detailed (“descriptive”) instructions, applying diff- and topic-based filtering for quality and diversity. The resulting 20,000-sample OCEDataFT dataset significantly boosts open-source LLM performance on code editing benchmarks, narrowing the gap to proprietary models like GPT-4 without relying on closed-source resources.
Code editing is a fundamental aspect of software engineering, where developers modify existing code based on natural language instructions. This task is crucial for bug fixes, refactoring, API migrations, and adding new features. Unlike generating code from scratch, editing requires a deep understanding of the existing program’s context and dependencies to make precise, localized changes while preserving functionality.
However, a significant challenge in advancing automated code editing has been the lack of high-quality training data. Traditional datasets, often derived from commit messages, tend to be noisy, lack diversity, and don’t accurately reflect the varied styles of real-world editing instructions. This limitation has hindered the performance of even advanced language models, especially open-source ones, in instruction-guided code editing tasks.
Introducing OpenCodeEdit: A New Approach to Data Synthesis
To address this critical gap, researchers have introduced OpenCodeEdit, an innovative open-source pipeline designed to synthesize realistic code-edit datasets. This pipeline leverages multiple open-source large language models (LLMs) to create high-quality “code-edit triplets,” which consist of a pre-edit code snippet, a natural language instruction for the edit, and the corresponding post-edit code. A key innovation of OpenCodeEdit is its ability to generate two distinct instruction styles: concise “lazy” instructions, similar to quick developer prompts, and more detailed “descriptive” ones, which provide comprehensive context.
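To make the triplet format concrete, here is a minimal illustrative sample in Python. The field names and the snippet itself are hypothetical stand-ins for exposition, not the released dataset's actual schema:

```python
# A hypothetical code-edit triplet; field names are illustrative,
# not the dataset's actual schema.
triplet = {
    "pre_edit_code": (
        "def average(nums):\n"
        "    return sum(nums) / len(nums)\n"
    ),
    # Concise "lazy" instruction, like a quick developer prompt.
    "lazy_instruction": "handle empty list",
    # Detailed "descriptive" instruction with full context.
    "descriptive_instruction": (
        "Modify the average function so that it returns 0.0 when the "
        "input list is empty, instead of raising ZeroDivisionError."
    ),
    "post_edit_code": (
        "def average(nums):\n"
        "    if not nums:\n"
        "        return 0.0\n"
        "    return sum(nums) / len(nums)\n"
    ),
}
```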
The OpenCodeEdit pipeline ensures data quality and variety through a two-step filtering process: “diff filtering” removes overly complex or trivial edits, while “topic filtering” keeps the range of editing scenarios balanced and diverse, preventing redundancy and strengthening the training signal.
OCEDataFT: The Curated Dataset
Using the OpenCodeEdit pipeline, the team constructed OCEDataFT, a carefully curated dataset comprising 20,000 samples. This dataset is specifically designed for instruction tuning, meaning it helps LLMs learn to follow instructions for code editing more effectively. Unlike datasets derived from a single model, OCEDataFT combines data from multiple large models, which enhances task diversity and leads to better generalization.
Remarkable Performance Gains
The impact of OCEDataFT is significant. When three advanced base models (Qwen3-8B-Base, Qwen2.5-Coder-7B-Base, and DeepSeek-Coder-6.7B-Base) were fine-tuned on this dataset, they showed substantial performance gains on the rigorous CanItEdit benchmark. Relative improvements in pass@1 (the fraction of problems solved by a model's first generated edit) ranged from 4.50% to an impressive 20.79%. Notably, the models trained with OCEDataFT achieved performance levels very close to closed-source systems, narrowing the gap to GPT-4 to just 3.54%. This achievement is particularly significant because it was accomplished without relying on proprietary resources or extensive manual annotation, making it a truly open-source advancement.
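For reference, a relative pass@1 improvement is simply the fine-tuned score's percentage gain over the base score. A one-liner makes the arithmetic explicit; the numbers below are placeholders for illustration, not the paper's actual scores:

```python
def relative_improvement(base_pass1: float, tuned_pass1: float) -> float:
    """Relative pass@1 gain of a fine-tuned model over its base, in percent."""
    return (tuned_pass1 - base_pass1) / base_pass1 * 100

# Placeholder values, for illustration only.
print(f"{relative_improvement(40.0, 48.3):.2f}%")  # -> 20.75%
```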
How OpenCodeEdit Works: A Closer Look
The pipeline operates in four main stages:
1. Seed Code Snippet Extraction: Authentic code fragments are sampled from open-source codebases to serve as the foundation for generating realistic editing tasks.
2. Pre-edit Code and Instruction Generation: Two complementary open-source LLMs generate the pre-edit code and the corresponding natural-language edit instructions (in both lazy and descriptive styles). Using more than one model increases diversity and reduces bias.
3. Post-edit Code Generation: In a second round, the LLMs generate the revised code based on the pre-edit code and instruction. A self-checking mechanism ensures the task is reasonable before generating the solution.
4. Data Filtering: A two-step process, DT-Filtering, is applied. Diff filtering analyzes the complexity of each change (counting modified lines and “hunks”) to remove overly simple or overly complex edits, while topic filtering models the topics of the generated tasks to ensure topical diversity and reduce redundancy, yielding a more balanced dataset. A sketch of the diff-filtering step follows this list.
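As a rough sketch of what the diff-filtering step could look like, the snippet below uses Python's standard difflib to count changed lines and hunks. The function names and thresholds are illustrative assumptions, not the paper's actual values:

```python
import difflib

def diff_stats(pre: str, post: str) -> tuple[int, int]:
    """Count changed lines and contiguous change regions ("hunks")."""
    matcher = difflib.SequenceMatcher(a=pre.splitlines(), b=post.splitlines())
    changed_lines = 0
    hunks = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            hunks += 1
            changed_lines += max(i2 - i1, j2 - j1)
    return changed_lines, hunks

def passes_diff_filter(pre: str, post: str,
                       min_lines: int = 2, max_lines: int = 40,
                       max_hunks: int = 5) -> bool:
    """Keep edits that are neither trivial nor overly complex.

    Thresholds are illustrative guesses, not the paper's settings.
    """
    changed_lines, hunks = diff_stats(pre, post)
    return min_lines <= changed_lines <= max_lines and hunks <= max_hunks
```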
Why Synthetic Data Outperforms Raw Commits
The research also highlighted why synthesizing data is superior to using raw commit data. While commit data might seem relevant, it often contains overly simplistic edits (e.g., single-line changes) and lacks the diverse instruction styles needed for robust model training. OpenCodeEdit, by contrast, generates tasks that better reflect real-world complexity and instruction variety, leading to significantly better model performance.
The Power of Diversity: Multi-LLM and Instruction Styles
Integrating data from multiple LLMs proved to be a powerful strategy, as different models generate distinct types of editing tasks and linguistic styles, leading to a richer and more flexible dataset. Furthermore, combining both descriptive and lazy instruction styles in training data significantly improved the models’ ability to generalize across different query types, with training on lazy instructions proving particularly effective for developing strong inference capabilities.
“Less Is More” with DT-Filtering
The DT-Filtering method demonstrated a “less-is-more” effect. Despite reducing the dataset size by two-thirds (from 60,000 to 20,000 samples), the filtered dataset yielded superior fine-tuning performance. This indicates that removing redundant and noisy samples significantly enhances data quality, reduces training costs, and boosts efficiency.
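To illustrate the redundancy-pruning idea behind the topic-filtering half of DT-Filtering, here is a minimal cluster-and-cap sketch. The `embed` function is a stand-in for any sentence-embedding model, and the cluster count and per-cluster cap are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def topic_filter(instructions: list[str], embed,
                 n_topics: int = 50, per_topic_cap: int = 400) -> list[int]:
    """Cluster instruction embeddings, then cap the samples kept per cluster.

    `embed` maps a list of strings to an (n, d) array; it is a placeholder
    for whatever embedding model one chooses.
    """
    vectors = np.asarray(embed(instructions))  # shape (n, d)
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(vectors)
    kept, counts = [], {}
    for idx, label in enumerate(labels):
        if counts.get(label, 0) < per_topic_cap:
            counts[label] = counts.get(label, 0) + 1
            kept.append(idx)  # indices of samples to retain
    return kept
```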
In conclusion, OpenCodeEdit represents a significant step forward in the field of automated code editing. By providing a robust, open-source pipeline for generating high-quality, diverse, and realistic code editing datasets, it empowers developers and researchers to build more capable LLMs for software engineering tasks. The dataset, code, and fine-tuned models are openly available for replication and further research. More details are in the research paper: Generating High-Quality Datasets for Code Editing via Open-Source Language Models.


