Automating Software Engineering Dataset Labeling: Introducing SPICE for Cost-Effective AI Training

TLDR: SPICE is an automated pipeline that labels software engineering datasets for issue clarity, test coverage, and effort estimation. It cuts the estimated cost of labeling 1,000 instances from about $100,000 (manual) to $5.10 while agreeing closely with human expert labels. SPICE also generates detailed rationales, and its authors used it to build SPICE-bench, a new dataset more than 13 times larger than the human-labeled SWE-bench Verified, with the aim of accelerating AI development in software engineering.

Creating high-quality labeled datasets is a critical, yet often prohibitively expensive and labor-intensive, step in training and evaluating artificial intelligence models for software engineering tasks. These datasets, like the widely used SWE-bench Verified (SWE-V), are essential for benchmarking model performance and serving as valuable resources for pretraining and fine-tuning large language models (LLMs) in the software domain.

However, the manual process of labeling these datasets comes with significant challenges. For instance, constructing SWE-V, which contains 500 instances, is estimated to have required over 2,200 engineer hours and cost more than $170,000. Beyond the financial burden, manual labeling is prone to subjectivity, leading to low agreement among human annotators, and simply doesn’t scale to the massive datasets needed for modern AI.

Introducing SPICE: An Automated Labeling Solution

A new automated pipeline called SPICE (Scalable Pipeline for Issue Clarity, Test Coverage, and Effort Estimation) has been introduced to address these limitations. SPICE is designed to label SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation, offering a scalable and cost-effective alternative to manual annotation.

SPICE combines three techniques: context-aware code navigation, rationale-driven prompting, and a multi-pass consensus mechanism. Its design was shaped by the frustrations and insights of manually labeling over 800 instances from SWE-Gym, an experience that underscored the practical need for automation.

How SPICE Works

SPICE operates through two primary modular pipelines: the Issue Clarity Assessment (ICA) pipeline and the Test Coverage Assessment (TCA) pipeline. Both pipelines run multiple times, and SPICE uses majority voting to determine the final label, mimicking human consensus.
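The multi-pass consensus step can be illustrated with a minimal sketch. The function names, the default of three passes, and the first-seen tie-break are all assumptions made for illustration; the article does not say how many passes SPICE runs or how it resolves ties.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_label(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first (an assumption)."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    top = max(counts.values())
    # Walk the labels in their original order so tie-breaking is deterministic.
    for label in labels:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable")


def label_with_consensus(run_once: Callable[[], str], passes: int = 3) -> str:
    """Call a single-pass labeler several times and vote on the result.

    `run_once` is a hypothetical stand-in for one run of either pipeline.
    """
    return majority_label([run_once() for _ in range(passes)])
```

An odd number of passes avoids ties for a two-way label such as well-specified versus underspecified, which is why the sketch defaults to three.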

The ICA pipeline takes the issue title and description of a GitHub issue and classifies it as either ‘well-specified’ or ‘underspecified’. It uses a specially designed ‘rationale-informed prompt’ that was developed by analyzing human rationales from SWE-V, allowing the AI to generate both a label and a supporting explanation.
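To make the label-plus-rationale idea concrete, here is a rough sketch of how such a prompt and its reply could be handled. The prompt wording, the two-line reply format, and every name below are hypothetical; the actual rationale-informed prompt used by SPICE is not reproduced in this article.

```python
from dataclasses import dataclass

LABELS = ("well-specified", "underspecified")


@dataclass
class ClarityResult:
    label: str
    rationale: str


def build_clarity_prompt(title: str, description: str) -> str:
    """Assemble a prompt asking for a label plus a rationale (wording is illustrative)."""
    return (
        "Decide whether the GitHub issue below is well-specified or underspecified.\n"
        "Reply with exactly two lines:\n"
        "LABEL: <well-specified|underspecified>\n"
        "RATIONALE: <one short paragraph>\n\n"
        f"Title: {title}\n"
        f"Description:\n{description}\n"
    )


def parse_clarity_reply(reply: str) -> ClarityResult:
    """Parse the two-field reply format requested by build_clarity_prompt."""
    label, rationale = None, ""
    for line in reply.splitlines():
        # Split on the first colon only, so colons inside the rationale survive.
        key, _, value = line.partition(":")
        key = key.strip().upper()
        if key == "LABEL":
            label = value.strip().lower()
        elif key == "RATIONALE":
            rationale = value.strip()
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return ClarityResult(label, rationale)
```

Rejecting any label outside the two allowed values keeps a malformed model reply from silently entering the vote.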

The TCA pipeline is more complex, requiring a deep understanding of the codebase. It utilizes Aider, an open-source AI pair programming tool, which can create a concise representation of an entire Git repository. SPICE instructs Aider to focus on relevant files (those included in the ‘gold patch’ and ‘test patch’) to assess test coverage. A structured prompt is then executed, and an auxiliary model parses Aider’s output to retrieve the test label score.
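One way to find the files touched by the gold patch and test patch is to read the file headers of each diff. The sketch below assumes the patches are git-style unified diffs; it does not call Aider and is not taken from the SPICE code, and the function names are hypothetical.

```python
import re
from typing import List

# Matches the "+++ b/<path>" header that git-style unified diffs emit per changed file.
# Deleted files use "+++ /dev/null" and are therefore skipped.
_NEW_FILE_HEADER = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)


def files_in_patch(patch: str) -> List[str]:
    """Return the paths touched by a diff, in order of appearance, without duplicates."""
    seen: List[str] = []
    for path in _NEW_FILE_HEADER.findall(patch):
        path = path.strip()
        if path not in seen:
            seen.append(path)
    return seen


def relevant_files(gold_patch: str, test_patch: str) -> List[str]:
    """Combine the files from both patches, source files first, then tests."""
    combined = files_in_patch(gold_patch)
    for path in files_in_patch(test_patch):
        if path not in combined:
            combined.append(path)
    return combined
```

The resulting list is the kind of input a code-navigation tool could be told to focus on, which keeps the context small compared with loading the whole repository.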

Remarkable Results: Accuracy and Cost Savings

Evaluations of SPICE demonstrate its effectiveness across several key metrics:

Accuracy: SPICE shows strong agreement with human-labeled SWE-V data. For Issue Clarity Assessment, it achieves an accuracy of up to 87.3%. For Test Coverage Assessment, the accuracy reaches 68.5%. TCA is the harder task because it depends on understanding the codebase, and where a SPICE label is wrong, the generated rationale can help a human reviewer spot and correct the error quickly.
Rationale Quality: SPICE-generated rationales are semantically similar to those written by human experts, with median similarities around 0.72 for ICA and 0.743 for TCA. Notably, SPICE’s rationales are often more detailed, providing roughly 1.5 times more explanation for ICA and nearly 7 times more for TCA, which is invaluable for debugging and understanding the AI’s reasoning.
Cost Efficiency: This is where SPICE stands out most. The estimated cost of having expert software engineers manually label 1,000 instances is around $100,000. SPICE reduces this to $5.10 per 1,000 instances (using GPT-4o-mini for ICA and DeepSeek Reasoner for TCA), a cost reduction of roughly 19,600-fold. SPICE is also much faster, completing the labeling process approximately 53 times quicker than human annotators.
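The cost figures above can be checked with a few lines of arithmetic. The two dollar amounts come from the article; the variable and function names are only illustrative.

```python
# Figures quoted in the article, in USD per 1,000 labeled instances.
MANUAL_COST_PER_1000 = 100_000.00
SPICE_COST_PER_1000 = 5.10


def reduction_factor(manual: float, automated: float) -> float:
    """How many times cheaper the automated cost is than the manual cost."""
    if automated <= 0:
        raise ValueError("automated cost must be positive")
    return manual / automated


factor = reduction_factor(MANUAL_COST_PER_1000, SPICE_COST_PER_1000)
manual_per_instance = MANUAL_COST_PER_1000 / 1000
spice_per_instance = SPICE_COST_PER_1000 / 1000

print(f"reduction factor: {factor:,.0f}x")
print(f"per instance: ${manual_per_instance:.2f} manual vs ${spice_per_instance:.4f} automated")
```

The division gives about 19,608, which the article rounds to 19,600. Per instance, that is roughly $100 for manual labeling against about half a cent for SPICE.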

Supporting the Community

To further support the software engineering and AI communities, the researchers are releasing both the SPICE tool and SPICE-bench, a new dataset. SPICE-bench comprises 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym, making it over 13 times larger than the human-labeled SWE-bench Verified dataset. This vast new resource is intended to accelerate the fine-tuning, benchmarking, and tool development for software-focused foundation models.

SPICE represents a significant step forward in automating the creation of high-quality software engineering datasets. By drastically reducing costs and time, it paves the way for more extensive and diverse datasets, ultimately fostering the development of more capable and reliable AI systems for real-world software maintenance tasks. You can learn more about this groundbreaking work by reading the full research paper: SPICE: An Automated SWE-Bench Labeling Pipeline.
