RECODE-H: A New Benchmark for Interactive Research Code Generation with Human Feedback

TLDR: RECODE-H is a novel benchmark designed to evaluate large language models (LLMs) in generating and refining scientific research code through multi-turn interactions with simulated human feedback. It features 102 tasks from real research papers, structured instructions, unit tests, and a five-level feedback hierarchy. Experiments with leading LLMs demonstrate significant performance gains with richer feedback, particularly for GPT-5 and DeepSeek-V3.1, underscoring the importance of interactive refinement for complex research code. The benchmark also shows that the main difficulties for LLMs lie in interpreting research descriptions and bridging domain knowledge gaps, not in basic coding errors.

Large language models, or LLMs, are becoming increasingly common in scientific research, helping with everything from brainstorming ideas to writing papers. However, generating accurate, executable code for research purposes remains a significant challenge for them. Current methods often evaluate these models in a ‘one-shot’ setting, meaning they are expected to produce perfect code in a single attempt. This approach overlooks the real-world process of scientific code development, which is typically iterative and relies heavily on human feedback.

To bridge this gap, a new benchmark called RECODE-H has been introduced. This benchmark is designed to evaluate how LLM agents perform when generating and refining research code through multi-turn interactions, simulating the kind of feedback a human researcher would provide. RECODE-H comprises 102 tasks derived from actual research papers and their associated code repositories, spanning fields like machine learning, natural language processing, computer vision, and computational science.

Each task in RECODE-H comes with structured instructions, unit tests, and a unique five-level feedback hierarchy. This hierarchy allows for a systematic evaluation of LLMs under progressively richer forms of guidance, from minimal feedback (just a failure notification) to highly detailed feedback, including explicit code snippets for correction. This structured approach mirrors realistic collaboration between researchers and AI agents.
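As a rough illustration of how a multi-turn evaluation of this kind might run, the sketch below alternates between generating code, running unit tests, and passing feedback back to the model. Every name in it (`run_episode`, `toy_model`, `toy_tests`, the two feedback functions) and the turn limit are hypothetical and are not taken from RECODE-H. Only the two ends of the hierarchy described above are modeled: a bare failure notice and feedback that includes an explicit code snippet.

```python
from typing import Callable, List, Optional


def run_episode(
    generate: Callable[[Optional[str]], str],
    run_tests: Callable[[str], List[str]],
    make_feedback: Callable[[List[str]], str],
    max_turns: int = 3,  # hypothetical turn budget
) -> Optional[int]:
    """Return the turn on which all tests pass, or None if they never do."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)          # model proposes code
        failures = run_tests(code)         # unit tests judge it
        if not failures:
            return turn
        feedback = make_feedback(failures) # simulated human feedback
    return None


def toy_tests(code: str) -> List[str]:
    """Toy unit test: the submitted code must define scale(x) = 2 * x."""
    namespace: dict = {}
    exec(code, namespace)  # acceptable here only because the inputs are toy strings
    ok = namespace["scale"](2.0) == 4.0
    return [] if ok else ["scale(2.0) should return 4.0"]


def minimal_feedback(failures: List[str]) -> str:
    """Weakest level in this sketch: only a failure notification."""
    return "Tests failed."


def detailed_feedback(failures: List[str]) -> str:
    """Richest level in this sketch: includes an explicit code snippet."""
    return "Tests failed. Suggested fix:\ndef scale(x):\n    return x * 2.0"


def toy_model(feedback: Optional[str]) -> str:
    """Stand-in for an LLM: it only recovers when handed a code snippet."""
    marker = "Suggested fix:\n"
    if feedback and marker in feedback:
        return feedback.split(marker, 1)[1]
    return "def scale(x):\n    return x + 1.0"  # deliberately wrong first attempt


if __name__ == "__main__":
    print("minimal feedback:", run_episode(toy_model, toy_tests, minimal_feedback))
    print("detailed feedback:", run_episode(toy_model, toy_tests, detailed_feedback))
```

The toy model is deliberately simplistic; a real agent would use the failure messages themselves, which is what the intermediate feedback levels are meant to probe.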

The creators of RECODE-H also developed ReCodeAgent, a framework specifically designed to integrate this iterative feedback into the code generation process. ReCodeAgent operates in four stages: Observation (gathering repository state, execution logs, and feedback), Reflection (analyzing failures and integrating feedback), Planning (formulating next steps), and Action (executing operations like reading or writing files). It also includes memory management to keep context relevant across multiple interaction turns.
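The four stages can be pictured as one step of a loop. The sketch below is a minimal, rule-based stand-in and not ReCodeAgent's actual implementation: the class `ToyAgent`, its method names, the file name `model.py`, and the bounded-window memory are all assumptions made for illustration. In a real agent, an LLM would carry out the reflection and planning.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class Observation:
    files: Dict[str, str]      # snapshot of the repository state
    log: str                   # execution log from the last run
    feedback: Optional[str]    # feedback from the (simulated) human


@dataclass
class ToyAgent:
    # Assumed memory policy: keep only the 3 most recent step summaries.
    memory: Deque[str] = field(default_factory=lambda: deque(maxlen=3))

    def observe(self, repo: Dict[str, str], log: str,
                feedback: Optional[str]) -> Observation:
        return Observation(dict(repo), log, feedback)

    def reflect(self, obs: Observation) -> str:
        # Feedback takes priority over raw logs in this toy rule.
        if obs.feedback:
            return f"feedback: {obs.feedback}"
        if obs.log:
            return f"log: {obs.log}"
        return "no failures observed"

    def plan(self, reflection: str) -> Tuple[str, str]:
        # Read when nothing is wrong, otherwise rewrite the target file.
        op = "read" if reflection == "no failures observed" else "write"
        return (op, "model.py")

    def act(self, repo: Dict[str, str], plan: Tuple[str, str],
            new_content: str) -> str:
        op, path = plan
        if op == "write":
            repo[path] = new_content
        return repo.get(path, "")

    def step(self, repo: Dict[str, str], log: str,
             feedback: Optional[str], new_content: str) -> Tuple[str, str]:
        obs = self.observe(repo, log, feedback)        # Observation
        reflection = self.reflect(obs)                 # Reflection
        plan = self.plan(reflection)                   # Planning
        self.act(repo, plan, new_content)              # Action
        self.memory.append(f"{plan[0]} {plan[1]} ({reflection})")
        return plan


if __name__ == "__main__":
    agent = ToyAgent()
    repo = {"model.py": "old"}
    print(agent.step(repo, "", None, "new"))
    print(agent.step(repo, "AssertionError", "normalize the logits", "new"))
    print(repo["model.py"], list(agent.memory))
```

A fixed-size window is only one possible memory policy; summarizing older turns would be another way to keep context relevant.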

Experiments on RECODE-H with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, showed that all models improved substantially when given richer feedback. For instance, GPT-5’s recall rate rose from 29.4% without feedback to 71.6% with the most detailed feedback, and DeepSeek-V3.1 saw an even larger jump, from 10.8% to 70.6%. According to the study, even minimal diagnostic information can nearly double success rates, and the benefits continue to grow with more granular feedback.
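To put these rates in perspective, the short calculation below converts them into gains in percentage points, improvement ratios, and approximate task counts. The task counts rest on an assumption that the article does not state explicitly: that each rate is a fraction of all 102 benchmark tasks. The function name `summarize` is illustrative.

```python
TOTAL_TASKS = 102  # benchmark size given in the article

# (rate without feedback, rate with the most detailed feedback), in percent
rates = {
    "GPT-5": (29.4, 71.6),
    "DeepSeek-V3.1": (10.8, 70.6),
}


def summarize(before: float, after: float, total: int = TOTAL_TASKS) -> dict:
    """Derive gains and approximate task counts from two percentage rates."""
    return {
        # Assumption: rates are fractions of all `total` tasks.
        "tasks_before": round(before / 100 * total),
        "tasks_after": round(after / 100 * total),
        "gain_points": round(after - before, 1),
        "ratio": round(after / before, 2),
    }


if __name__ == "__main__":
    for model, (before, after) in rates.items():
        print(model, summarize(before, after))
```

Under that assumption, the rates correspond to roughly 30 and 73 solved tasks for GPT-5 (a gain of 42.2 points) and 11 and 72 for DeepSeek-V3.1 (a gain of 59.8 points).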

Larger models like GPT-5 and DeepSeek-V3.1 demonstrated a stronger ability to adapt to and leverage progressively richer feedback, often solving tasks in fewer turns. Other models, such as Claude-Sonnet-4 and some Gemini variants, showed more modest gains, suggesting that model architecture and training play a crucial role in how effectively LLMs can incorporate iterative guidance.

An in-depth error analysis revealed that most failures were not due to basic syntax or runtime errors, which modern LLMs generally handle well. Instead, the majority of issues stemmed from higher-level semantic problems, such as misinterpreting paper instructions, gaps in domain knowledge, or overlooking implicit assumptions. This indicates that while LLMs are proficient at basic coding, they still struggle with the nuanced understanding required to faithfully implement complex research methods and integrate them into existing codebases.

The study also examined feedback adoption rates, finding that models generally adopt feedback that leads to successful corrections. Stronger feedback guidance increased the likelihood of adoption, particularly for models that showed significant performance gains. Feedback related to code correctness and repository integration errors had the highest adoption rates, while addressing subtle logical errors requiring a deeper understanding of research intent proved more challenging.

RECODE-H establishes a vital foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation. It moves beyond one-shot evaluations to capture the iterative, human-centric nature of real-world code development, paving the way for more capable and collaborative AI assistants in science. You can learn more about this benchmark and its findings in the full research paper.
