TLDR: SPICE is an automated pipeline that labels software engineering datasets for issue clarity, test coverage, and effort estimation. It cuts the cost of labeling 1,000 instances from an estimated $100,000 (manual) to just $5.10, while staying in close agreement with labels from human experts. SPICE also generates detailed rationales for its labels, and the researchers used it to build SPICE-bench, a new dataset more than 13 times larger than the human-labeled SWE-bench Verified, with the aim of accelerating AI development in software engineering.
Creating high-quality labeled datasets is a critical, yet often prohibitively expensive and labor-intensive, step in training and evaluating artificial intelligence models for software engineering tasks. These datasets, like the widely used SWE-bench Verified (SWE-V), are essential for benchmarking model performance and serving as valuable resources for pretraining and fine-tuning large language models (LLMs) in the software domain.
However, the manual process of labeling these datasets comes with significant challenges. For instance, constructing SWE-V, which contains 500 instances, is estimated to have required over 2,200 engineer hours and cost more than $170,000. Beyond the financial burden, manual labeling is prone to subjectivity, leading to low agreement among human annotators, and simply doesn’t scale to the massive datasets needed for modern AI.
Introducing SPICE: An Automated Labeling Solution
A new automated pipeline called SPICE (Scalable Pipeline for Issue Clarity, Test Coverage, and Effort Estimation) has been introduced to address these limitations. SPICE is designed to label SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation, offering a scalable and cost-effective alternative to manual annotation.
SPICE combines three techniques: context-aware code navigation, rationale-driven prompting, and a multi-pass consensus mechanism. Its design was heavily influenced by the frustrations and insights the researchers gained from manually labeling over 800 instances from SWE-Gym, an experience that made the practical need for automation clear.
How SPICE Works
SPICE operates through two primary modular pipelines: the Issue Clarity Assessment (ICA) pipeline and the Test Coverage Assessment (TCA) pipeline. Both pipelines run multiple times, and SPICE uses majority voting to determine the final label, mimicking human consensus.
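The multi-pass consensus step can be illustrated with a minimal Python sketch. The function names, the default of three passes, and the tie-breaking rule are assumptions made for illustration; the article does not state how many passes SPICE runs or how it resolves ties.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def majority_vote(labels: List[Hashable]) -> Hashable:
    """Return the most frequent label.

    Ties go to the label seen first (an assumption, not SPICE's documented rule).
    """
    if not labels:
        raise ValueError("need at least one label to vote on")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(labels).most_common(1)[0][0]


def label_with_consensus(
    run_once: Callable[[dict], Hashable],
    instance: dict,
    passes: int = 3,  # hypothetical default
) -> Tuple[Hashable, List[Hashable]]:
    """Run a labeling pipeline several times and vote on the results."""
    votes = [run_once(instance) for _ in range(passes)]
    return majority_vote(votes), votes


if __name__ == "__main__":
    # Stand-in for a real pipeline run: replays three canned answers.
    canned = iter(["well-specified", "underspecified", "well-specified"])
    label, votes = label_with_consensus(lambda _inst: next(canned), {"id": "demo"})
    print(label, votes)
```

Returning the individual votes alongside the final label keeps disagreements visible, which is useful when a reviewer wants to audit borderline instances.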
The ICA pipeline takes the issue title and description of a GitHub issue and classifies it as either ‘well-specified’ or ‘underspecified’. It uses a specially designed ‘rationale-informed prompt’ that was developed by analyzing human rationales from SWE-V, allowing the AI to generate both a label and a supporting explanation.
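As a rough illustration of the ICA input and output, the sketch below builds a prompt from the issue title and description and parses a reply into a label plus rationale. The prompt wording, the JSON reply format, and all names here are placeholders invented for this example; the actual rationale-informed prompt used by SPICE is not reproduced in this article.

```python
import json
from typing import NamedTuple

ALLOWED_LABELS = {"well-specified", "underspecified"}


class ClarityResult(NamedTuple):
    label: str
    rationale: str


# Placeholder wording, NOT the real SPICE prompt.
PROMPT_TEMPLATE = (
    "You are assessing whether a GitHub issue is clear enough to act on.\n"
    "Title: {title}\n"
    "Description: {description}\n"
    'Reply with JSON: {{"label": "well-specified" or "underspecified", '
    '"rationale": "<short explanation>"}}'
)


def build_prompt(title: str, description: str) -> str:
    """Fill the placeholder template with the issue text."""
    return PROMPT_TEMPLATE.format(title=title.strip(), description=description.strip())


def parse_reply(reply: str) -> ClarityResult:
    """Parse a JSON reply (assumed format) into a validated label and rationale."""
    data = json.loads(reply)
    label = str(data["label"]).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return ClarityResult(label, str(data.get("rationale", "")).strip())


if __name__ == "__main__":
    print(build_prompt("Crash on startup", "The app crashes."))
    print(parse_reply('{"label": "underspecified", "rationale": "No repro steps."}'))
```

Validating the label against a fixed set means a malformed model reply fails loudly instead of silently entering the dataset.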
The TCA pipeline is more complex, requiring a deep understanding of the codebase. It utilizes Aider, an open-source AI pair programming tool, which can create a concise representation of an entire Git repository. SPICE instructs Aider to focus on relevant files (those included in the ‘gold patch’ and ‘test patch’) to assess test coverage. A structured prompt is then executed, and an auxiliary model parses Aider’s output to retrieve the test label score.
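One concrete piece of this step is working out which files the gold patch and test patch touch. The sketch below extracts those paths from git-style unified diffs; it is an illustration under the assumption that patches use standard `diff --git a/... b/...` headers, and it does not show how SPICE actually hands the files to Aider. The function names are hypothetical.

```python
import re
from typing import List

# Matches git-style headers such as: diff --git a/src/app.py b/src/app.py
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def files_in_patch(patch: str) -> List[str]:
    """Return the post-change paths named in a patch, in order, without duplicates."""
    seen: List[str] = []
    for match in _DIFF_HEADER.finditer(patch):
        path = match.group(2)
        if path not in seen:
            seen.append(path)
    return seen


def relevant_files(gold_patch: str, test_patch: str) -> List[str]:
    """Combine files from the gold patch and the test patch, gold patch first."""
    merged = files_in_patch(gold_patch)
    for path in files_in_patch(test_patch):
        if path not in merged:
            merged.append(path)
    return merged


if __name__ == "__main__":
    gold = (
        "diff --git a/src/app.py b/src/app.py\n"
        "--- a/src/app.py\n+++ b/src/app.py\n"
        "@@ -1 +1 @@\n-x = 1\n+x = 2\n"
    )
    test = (
        "diff --git a/tests/test_app.py b/tests/test_app.py\n"
        "--- a/tests/test_app.py\n+++ b/tests/test_app.py\n"
        "@@ -1 +1 @@\n-assert x == 1\n+assert x == 2\n"
    )
    print(relevant_files(gold, test))
```

Restricting attention to these files keeps the context small, which matters when the full repository would not fit in a model's context window.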
Remarkable Results: Accuracy and Cost Savings
Evaluations of SPICE demonstrate its effectiveness across several key metrics:
- Accuracy: SPICE shows strong agreement with human-labeled SWE-V data. For Issue Clarity Assessment, it achieves an accuracy of up to 87.3%. For Test Coverage Assessment, the accuracy reaches 68.5%. TCA is the harder task, so its accuracy is lower, but SPICE’s generated rationales can help human reviewers quickly identify and correct any errors.
- Rationale Quality: SPICE-generated rationales are semantically similar to those written by human experts, with median similarities around 0.72 for ICA and 0.743 for TCA. Notably, SPICE’s rationales are often more detailed, providing roughly 1.5 times more explanation for ICA and nearly 7 times more for TCA, which is invaluable for debugging and understanding the AI’s reasoning.
- Cost Efficiency: This is where SPICE truly shines. Manual labeling of 1,000 instances by expert software engineers is estimated to cost around $100,000. SPICE labels the same 1,000 instances for $5.10 (using GPT-4o-mini for ICA and DeepSeek Reasoner for TCA), a cost reduction by a factor of roughly 19,600. SPICE is also about 53 times faster than human annotators.
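The cost figures above can be checked with a few lines of arithmetic. The helper names below are made up for this example; the dollar amounts are the ones quoted in the article.

```python
def reduction_factor(manual_cost: float, automated_cost: float) -> float:
    """How many times cheaper the automated approach is."""
    return manual_cost / automated_cost


def cost_per_instance(total_cost: float, instances: int) -> float:
    """Average cost of labeling a single instance."""
    return total_cost / instances


if __name__ == "__main__":
    manual_per_1000 = 100_000.00  # estimated manual cost for 1,000 instances
    spice_per_1000 = 5.10         # reported SPICE cost for 1,000 instances

    factor = reduction_factor(manual_per_1000, spice_per_1000)
    print(f"reduction factor: {factor:,.0f}x")                                # 19,608x
    print(f"manual per instance: ${cost_per_instance(manual_per_1000, 1000):.2f}")  # $100.00
    print(f"SPICE per instance:  ${cost_per_instance(spice_per_1000, 1000):.4f}")   # $0.0051
```

The exact quotient is about 19,608, which the article rounds to roughly 19,600.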
Supporting the Community
To further support the software engineering and AI communities, the researchers are releasing both the SPICE tool and SPICE-bench, a new dataset. SPICE-bench comprises 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym, making it over 13 times larger than the 500-instance, human-labeled SWE-bench Verified dataset. This resource is intended to accelerate fine-tuning, benchmarking, and tool development for software-focused foundation models.
SPICE represents a significant step forward in automating the creation of high-quality software engineering datasets. By drastically reducing costs and time, it paves the way for more extensive and diverse datasets, ultimately fostering the development of more capable and reliable AI systems for real-world software maintenance tasks. You can learn more about this groundbreaking work by reading the full research paper: SPICE: An Automated SWE-Bench Labeling Pipeline.


