RECODE-H: A New Benchmark for Interactive Research Code Generation with Human Feedback

TLDR: RECODE-H is a benchmark for evaluating how well large language models (LLMs) generate and refine scientific research code through multi-turn interactions with simulated human feedback. It features 102 tasks from real research papers, each with structured instructions, unit tests, and a five-level feedback hierarchy. Experiments with leading LLMs show significant performance gains with richer feedback, particularly for GPT-5 and DeepSeek-V3.1, underscoring the importance of interactive refinement for complex research code. The results also indicate that the main obstacles for LLMs are interpreting research descriptions and bridging domain knowledge gaps, not basic coding errors.

Large language models, or LLMs, are becoming increasingly common in scientific research, helping with everything from brainstorming ideas to writing papers. However, generating accurate, executable code for research purposes remains a significant challenge for them. Current methods often evaluate these models in a ‘one-shot’ setting, meaning they are expected to produce correct code in a single attempt. This approach overlooks the real-world process of scientific code development, which is typically iterative and relies heavily on human feedback.

To bridge this gap, a new benchmark called RECODE-H has been introduced. This benchmark is designed to evaluate how LLM agents perform when generating and refining research code through multi-turn interactions, simulating the kind of feedback a human researcher would provide. RECODE-H comprises 102 tasks derived from actual research papers and their associated code repositories, spanning fields like machine learning, natural language processing, computer vision, and computational science.

Each task in RECODE-H comes with structured instructions, unit tests, and a unique five-level feedback hierarchy. This hierarchy allows for a systematic evaluation of LLMs under progressively richer forms of guidance, from minimal feedback (just a failure notification) to highly detailed feedback, including explicit code snippets for correction. This structured approach mirrors realistic collaboration between researchers and AI agents.
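To make the multi-turn setup concrete, here is a minimal Python sketch of a generate, test, and feedback loop. It is not the benchmark's actual harness: `run_episode`, `toy_model`, `toy_tests`, and the two feedback functions are hypothetical names invented for illustration, and only the two endpoints of the hierarchy described above (a bare failure notification and feedback with an explicit code snippet) are modeled. The toy "model" is deliberately contrived and says nothing about how real LLMs behave.

```python
from typing import Callable, Optional, Tuple

# Hypothetical type aliases for this sketch (not part of RECODE-H).
Generator = Callable[[Optional[str]], str]       # feedback -> candidate code
TestRunner = Callable[[str], Tuple[bool, str]]   # code -> (passed, log)
Feedback = Callable[[str, str], str]             # (code, log) -> feedback text


def run_episode(generate: Generator, run_tests: TestRunner,
                give_feedback: Feedback, max_turns: int = 5) -> Optional[int]:
    """Return the 1-based turn at which the tests pass, or None if they never do."""
    feedback: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)            # model proposes code
        passed, log = run_tests(code)        # unit tests judge it
        if passed:
            return turn
        feedback = give_feedback(code, log)  # simulated human responds
    return None


# --- Toy stand-ins so the sketch runs end to end ---
BUGGY = "def mean(xs):\n    return sum(xs)\n"
FIXED = "def mean(xs):\n    return sum(xs) / len(xs)\n"


def toy_model(feedback: Optional[str]) -> str:
    # Contrived: repairs the bug only when the feedback spells out the fix.
    if feedback and "len(xs)" in feedback:
        return FIXED
    return BUGGY


def toy_tests(code: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a toy; never exec untrusted code
    ok = namespace["mean"]([1, 2, 3]) == 2
    return ok, "" if ok else "mean([1, 2, 3]) returned the wrong value"


def minimal_feedback(code: str, log: str) -> str:
    # Lowest level described in the article: just a failure notification.
    return "Tests failed."


def detailed_feedback(code: str, log: str) -> str:
    # Highest level described in the article: includes an explicit code snippet.
    return f"{log}. Suggested fix: return sum(xs) / len(xs)"


if __name__ == "__main__":
    print(run_episode(toy_model, toy_tests, minimal_feedback))   # None
    print(run_episode(toy_model, toy_tests, detailed_feedback))  # 2
```

Keeping the feedback function as a swappable argument is what lets the same loop be rerun at each level of guidance, which is the comparison the hierarchy is meant to support.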

The creators of RECODE-H also developed ReCodeAgent, a framework specifically designed to integrate this iterative feedback into the code generation process. ReCodeAgent operates in four stages: Observation (gathering repository state, execution logs, and feedback), Reflection (analyzing failures and integrating feedback), Planning (formulating next steps), and Action (executing operations like reading or writing files). It also includes memory management to keep context relevant across multiple interaction turns.
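The four stages can be pictured as a simple loop. The following Python sketch is an illustration under stated assumptions, not ReCodeAgent's implementation: the class `LoopSketch`, the `Observation` fields, the `Action` tuple, and the `NOTES.txt` file are all hypothetical. The reflection and planning steps are rule-based placeholders where a real agent would call an LLM, and memory management is approximated by a bounded queue that keeps only the most recent reflections.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple

# Hypothetical action format: (kind, path, content); kind is "read" or "write".
Action = Tuple[str, str, str]


@dataclass
class Observation:
    repo: Dict[str, str]   # path -> file contents (repository state)
    execution_log: str     # output of the last test run
    feedback: str          # feedback from the (simulated) human


class LoopSketch:
    def __init__(self, memory_turns: int = 3) -> None:
        # Bounded memory: older reflections fall off automatically.
        self.memory: Deque[str] = deque(maxlen=memory_turns)

    def reflect(self, obs: Observation) -> str:
        # Placeholder for failure analysis; a real agent would use an LLM here.
        if not obs.execution_log and not obs.feedback:
            return "no failure reported"
        return (f"log: {obs.execution_log or 'none'}; "
                f"feedback: {obs.feedback or 'none'}")

    def plan(self, reflection: str, obs: Observation) -> List[Action]:
        # Placeholder policy: reread every file, then record the reflection.
        actions: List[Action] = [("read", path, "") for path in sorted(obs.repo)]
        actions.append(("write", "NOTES.txt", reflection))
        return actions

    def act(self, actions: List[Action], repo: Dict[str, str]) -> Dict[str, str]:
        updated = dict(repo)  # leave the caller's repository untouched
        for kind, path, content in actions:
            if kind == "write":
                updated[path] = content
        return updated

    def step(self, obs: Observation) -> Dict[str, str]:
        """One turn: Observation -> Reflection -> Planning -> Action."""
        reflection = self.reflect(obs)
        self.memory.append(reflection)
        actions = self.plan(reflection, obs)
        return self.act(actions, obs.repo)


if __name__ == "__main__":
    agent = LoopSketch(memory_turns=2)
    obs = Observation({"model.py": "x = 1"}, "AssertionError", "check the loss")
    print(agent.step(obs))
```

Separating the stages into methods makes it easy to swap in a real model for `reflect` and `plan` while keeping the file operations in `act` deterministic and testable.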

Experiments conducted on RECODE-H with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, revealed some compelling insights. All models showed substantial performance improvements when provided with richer feedback. For instance, GPT-5’s recall rate rose from 29.4% without feedback to 71.6% with the most detailed feedback, and DeepSeek-V3.1 saw an even larger jump, from 10.8% to 70.6%. The study also finds that even minimal diagnostic information can nearly double success rates, and that the benefits continue to grow with more granular feedback.
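To put those percentages in perspective, the short Python calculation below converts them into approximate task counts and gains. It assumes each rate is a fraction of the benchmark's 102 tasks; the resulting counts are inferred from the reported percentages, not figures stated in the article, and the function name `summarize` is just for illustration.

```python
TOTAL_TASKS = 102  # benchmark size stated in the article

# (rate without feedback, rate with the most detailed feedback), in percent
reported = {
    "GPT-5": (29.4, 71.6),
    "DeepSeek-V3.1": (10.8, 70.6),
}


def summarize(before_pct: float, after_pct: float, total: int = TOTAL_TASKS) -> dict:
    """Derive approximate task counts and gains from two percentage rates."""
    return {
        # Assumption: each rate is (tasks solved / total tasks).
        "tasks_before": round(before_pct / 100 * total),
        "tasks_after": round(after_pct / 100 * total),
        "gain_points": round(after_pct - before_pct, 1),  # percentage points
        "ratio": round(after_pct / before_pct, 2),        # relative improvement
    }


if __name__ == "__main__":
    for model, (before, after) in reported.items():
        print(model, summarize(before, after))
```

Under that assumption, GPT-5 goes from roughly 30 to 73 solved tasks (a gain of 42.2 percentage points, about 2.4 times its starting rate), while DeepSeek-V3.1 goes from roughly 11 to 72 (59.8 points, about 6.5 times).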

Larger models like GPT-5 and DeepSeek-V3.1 demonstrated a stronger ability to adapt to and leverage progressively richer feedback, often solving tasks in fewer turns. Other models, such as Claude-Sonnet-4 and some Gemini variants, showed more modest gains, suggesting that model architecture and training play a crucial role in how effectively LLMs can incorporate iterative guidance.

An in-depth error analysis revealed that most failures were not due to basic syntax or runtime errors, which modern LLMs generally handle well. Instead, the majority of issues stemmed from higher-level semantic problems, such as misinterpreting paper instructions, gaps in domain knowledge, or overlooking implicit assumptions. This indicates that while LLMs are proficient at basic coding, they still struggle with the nuanced understanding required to faithfully implement complex research methods and integrate them into existing codebases.

The study also examined feedback adoption rates, finding that models generally adopt feedback that leads to successful corrections. Stronger feedback guidance increased the likelihood of adoption, particularly for models that showed significant performance gains. Feedback related to code correctness and repository integration errors had the highest adoption rates, while addressing subtle logical errors requiring a deeper understanding of research intent proved more challenging.

RECODE-H establishes a vital foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation. It moves beyond one-shot evaluations to capture the iterative, human-centric nature of real-world code development, paving the way for more capable and collaborative AI assistants in science. You can learn more about this benchmark and its findings in the full research paper.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
