TLDR: ReCatcher is a new framework for regression testing Large Language Models (LLMs) used in code generation. It systematically compares two LLMs across logical correctness, static code quality, and execution performance to identify regressions introduced by model updates like fine-tuning, merging, or new releases. The study found that updates can lead to significant regressions in areas like syntax errors, missing imports, and execution time, emphasizing the need for thorough testing before adopting new LLM versions.
Large Language Models, or LLMs, are rapidly transforming how we generate code. These powerful AI models are constantly evolving through updates like fine-tuning, merging with other models, or entirely new versions being released. While these updates aim to improve performance, they can sometimes introduce unexpected problems, known as regressions. These regressions aren’t just about whether the code works correctly; they can also affect its quality and how fast it runs.
To tackle this challenge, researchers have introduced a new framework called ReCatcher. This innovative tool is designed specifically for regression testing Python code generated by LLMs. ReCatcher works by systematically comparing two LLMs: typically, the version currently in use and a potential new update. It evaluates them across three crucial dimensions: logical correctness (does the code do what it’s supposed to?), static code quality (is the code well-structured and free of common issues?), and execution performance (how fast and efficiently does the code run?).
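The article doesn't reproduce ReCatcher's implementation, but the idea of a three-dimensional comparison can be illustrated with a rough, self-contained Python sketch. Everything below is hypothetical: the toy check functions and the hard-coded "model outputs" merely stand in for the kind of logical, static, and performance signals such a regression comparison would collect.

```python
import ast
import time


def syntax_ok(code: str) -> bool:
    """Static quality (simplified): does the snippet parse as valid Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def logic_ok(code: str, func_name: str, tests) -> bool:
    """Logical correctness (simplified): run the snippet against I/O test cases."""
    namespace = {}
    try:
        exec(code, namespace)
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False


def avg_runtime(code: str, call: str, repeats: int = 100) -> float:
    """Execution performance (simplified): average wall-clock time of one call."""
    namespace = {}
    exec(code, namespace)
    start = time.perf_counter()
    for _ in range(repeats):
        eval(call, namespace)
    return (time.perf_counter() - start) / repeats


# Stand-ins for code produced by the current model and a candidate update
# for the same prompt ("sum a list of numbers").
current_output = "def total(xs):\n    return sum(xs)\n"
candidate_output = (
    "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
)

tests = [(([1, 2, 3],), 6), (([],), 0)]
for label, code in [("current", current_output), ("candidate", candidate_output)]:
    print(f"{label:9s} syntax={syntax_ok(code)} "
          f"logic={logic_ok(code, 'total', tests)} "
          f"time={avg_runtime(code, 'total(list(range(1000)))'):.6f}s")
```

In a real comparison, per-snippet signals like these would be aggregated across a benchmark of prompts so that the two models' results can be contrasted side by side.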
The creators of ReCatcher applied their framework to assess regressions in three common LLM update scenarios. They looked at how fine-tuning, merging models, and releasing new model versions impacted code generation using popular LLMs like CodeLlama, DeepSeek-Coder, and GPT-4o.
Key Findings from ReCatcher’s Evaluation
The evaluation revealed several important insights into how LLM updates can introduce regressions:
- Fine-tuning Impact: Fine-tuning an LLM on data from a different programming language (e.g., Kotlin data for Python code generation) increased syntax errors by up to 12%. While fine-tuning improved logical correctness, the language mismatch during training disrupted the model's ability to generate syntactically correct code.
- Merging Impact: Combining LLMs can have mixed results. Merging a code-generation model with a general-purpose model like Llama2 led to regressions in logical correctness by up to 18%. This indicates that if the merged model isn't optimized for code, it can introduce inconsistencies. Conversely, merging with code-specific models, like OpenCodeInterpreter, showed overall improvements in code quality and performance.
- Model Release Impact: New model releases within the same family can also introduce regressions. For instance, GPT-4o showed a significant regression of up to 50% in handling missing imports compared to its predecessor, GPT-3.5-turbo, especially in diverse real-world tasks. Additionally, GPT-4o-mini, a smaller version, experienced a substantial performance degradation of up to 80% in execution time when compared to GPT-4o, particularly for algorithmic tasks.
Overall, the study found that logical correctness, execution performance, and error handling (such as syntax errors and missing imports) are the areas most prone to regressions. Readability and maintainability, on the other hand, tended to remain relatively stable across updates.
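To make the missing-import category concrete, here is a small heuristic sketch (not ReCatcher's actual analysis) that flags names a generated snippet uses but never defines or imports:

```python
import ast
import builtins


def missing_names(code: str) -> set[str]:
    """Heuristically flag names used but never imported or defined.

    A rough sketch only: it ignores attribute access, scoping subtleties,
    and star imports, so it can over- or under-report in edge cases.
    """
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return used - defined


# A generated snippet that calls NumPy without importing it.
snippet = "def mean(xs):\n    return np.mean(xs)\n"
print(missing_names(snippet))  # {'np'}
```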
ReCatcher also demonstrated superior and more consistent accuracy in detecting these regressions compared to traditional baseline solutions, especially in logical and performance aspects. This highlights its robustness in identifying subtle quality differences that other methods might miss.
The development of ReCatcher underscores the critical need for systematic regression evaluation before adopting new LLM versions for code generation. By providing a comprehensive report on how updates affect various code aspects, ReCatcher empowers researchers and practitioners to make more informed decisions, ensuring that new models genuinely improve software quality rather than introducing new problems. The framework is open-sourced as a reusable and extensible tool for the community; you can learn more about ReCatcher and access the replication package in the research paper.


