Decoding Code Agent Decisions: An Analysis of Success and Failure Paths

TLDR: This paper empirically studies the problem-solving processes (trajectories) of three LLM-based code agents (OpenHands, SWE-agent, Prometheus) on the SWE-Bench benchmark. It finds that successful and failed attempts follow distinct patterns, with failed trajectories tending to be longer and more variable in length. The study highlights the importance of repository-aware context, the need for agents to abandon unproductive reasoning, and the finding that approximate fault localization is often sufficient for success, without a perfect, line-by-line match to the reference patch. These insights can inform the development of more robust and interpretable AI software engineering systems.

Large Language Models (LLMs) are rapidly transforming software engineering, moving beyond simple code completion to tackle complex, repository-level problems. This has led to the rise of ‘code agents’ – systems that combine LLMs with tools and reasoning loops to autonomously resolve software issues. While these agents show impressive capabilities, their internal decision-making processes often remain a mystery, making it hard to understand why they succeed or fail.

A recent empirical study, “Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories,” addresses this opacity by analyzing ‘trajectories’, the detailed logs of every step an agent takes during problem-solving. The research, conducted by Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, and He Ye, examines the behavior of three state-of-the-art code agents: OpenHands, SWE-agent, and Prometheus, using the SWE-Bench benchmark. The goal is to move beyond simple success rates and understand the pathways agents follow, comparing both successful and failed attempts.

Unpacking Agent Problem-Solving Strategies

The study first investigated which bugs each agent alone could resolve, offering a glimpse into their distinct capabilities. For instance, Prometheus demonstrated a strong understanding of existing architectural patterns within a repository, which helped it localize problems at the right abstraction level. In a Django issue, Prometheus correctly identified an enum representation problem, while the other agents pursued more complex, misguided paths and failed. This highlights the importance of ‘repository-aware context gathering’ for effective problem-solving.

OpenHands, in a Pylint issue, successfully identified that ignore logic was being applied too late. However, the study also revealed a critical insight: simply limiting recursion isn’t enough to prevent agents from getting stuck in unproductive reasoning loops. Agents need mechanisms to recognize when a path is fruitless and abandon it.
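As a rough illustration of what such an abandonment signal could look like, the sketch below flags an agent that keeps repeating the same action within a short window of recent steps. This is not a mechanism described in the study: the `StallDetector` class, its `window` and `max_repeats` parameters, and the action strings are all hypothetical.

```python
from collections import Counter, deque


class StallDetector:
    """Hypothetical helper: flags an action that recurs too often recently."""

    def __init__(self, window=6, max_repeats=3):
        # Only the last `window` actions are remembered.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action):
        """Record an action; return True if it now appears at least
        `max_repeats` times within the remembered window."""
        self.recent.append(action)
        return Counter(self.recent)[action] >= self.max_repeats


detector = StallDetector()
# Invented action log, for illustration only.
actions = ["grep foo", "open a.py", "grep foo", "open b.py", "grep foo"]
flags = [detector.observe(a) for a in actions]
```

A controller could treat a raised flag as a cue to backtrack or re-plan. A real agent would likely need a richer notion of "the same action" than exact string equality.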

SWE-agent, on the other hand, often succeeded using a ‘defensive programming’ approach. In a Sympy issue involving polynomial division, SWE-agent’s patch, while not mathematically elegant, achieved the desired fix by adding checks to retain the original field type. This suggests that even without deep domain knowledge, a robust, defensive approach can lead to success, compensating for limited problem-specific reasoning.

The Anatomy of Success and Failure

One of the study’s key findings concerns the length and variability of agent trajectories. Consistently, failed trajectories were longer and varied more widely in step count than successful ones. This suggests that agents often get stuck on unsuccessful paths, wasting computational resources. For example, OpenHands and Prometheus showed dramatic increases in trajectory length for failures, while SWE-agent’s failed trajectories grew more modestly, indicating that it may fail faster, which is desirable for efficiency.
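The comparison of length and variability can be sketched as follows, assuming each trajectory is reduced to an outcome flag and a step count. The records are invented for illustration and are not figures from the study; `length_summary` is a hypothetical helper.

```python
from statistics import mean, pstdev

# Hypothetical trajectory records: (resolved?, number of steps).
# These numbers are made up for illustration, not taken from the study.
trajectories = [
    (True, 18), (True, 22), (True, 20),
    (False, 35), (False, 60), (False, 25),
]


def length_summary(records, resolved):
    """Return (mean, population standard deviation) of step counts
    for trajectories with the given outcome."""
    steps = [n for ok, n in records if ok == resolved]
    return mean(steps), pstdev(steps)


success_mean, success_sd = length_summary(trajectories, True)
failure_mean, failure_sd = length_summary(trajectories, False)

print(f"success: mean={success_mean:.1f} sd={success_sd:.1f}")
print(f"failure: mean={failure_mean:.1f} sd={failure_sd:.1f}")
```

In this toy data the failed runs are both longer on average and more spread out, which is the pattern the study reports.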

Interestingly, agents that took more steps in their successful trajectories (like SWE-agent) tended to have failures that diverged less in length from their successes. This implies that a more granular, incremental problem-solving strategy, involving many small steps, might lead to more robust behavior and fewer catastrophic failures than shorter, more aggressive approaches.

Where Agents Look for Faults

The research also explored how well agents localize faults before producing a fix. It measured fault localization at three levels: correct file, correct function, and correct ‘hunk’ (a block of changed lines). The findings show that successful patches almost always find the correct file (over 90% of the time). However, success doesn’t necessarily mean modifying the exact same function or hunk as a ‘gold patch’ (the ideal fix). Agents can still succeed with approximate edits, suggesting that aiming for a flawless, line-by-line match might be an unproductive goal.
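The three levels can be illustrated with a minimal sketch. It assumes a patch is represented as a list of edited regions, and that a hunk counts as matched when line ranges in the same file overlap. The article does not spell out the study's matching criteria, so the `Edit` type, the overlap rule, and the file and function names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One edited region of a patch (hypothetical representation)."""
    file: str
    function: str
    start: int  # first changed line, inclusive
    end: int    # last changed line, inclusive


def localization(agent, gold):
    """Report whether the agent patch matches the gold patch at each level."""
    pairs = [(a, g) for a in agent for g in gold]
    file_hit = any(a.file == g.file for a, g in pairs)
    func_hit = any(
        (a.file, a.function) == (g.file, g.function) for a, g in pairs
    )
    # Assumption: a hunk matches when the line ranges overlap in the same file.
    hunk_hit = any(
        a.file == g.file and a.start <= g.end and g.start <= a.end
        for a, g in pairs
    )
    return {"file": file_hit, "function": func_hit, "hunk": hunk_hit}


# Invented example: right file, different function, no line overlap.
gold = [Edit("pkg/enums.py", "serialize", 40, 48)]
agent = [Edit("pkg/enums.py", "describe", 90, 95)]
result = localization(agent, gold)
```

Here `result` reports a file-level hit only. According to the study's findings, a patch like this can still pass the tests.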

Even in failed attempts, agents often correctly locate the problematic file (72-81% in some cases). This indicates that failures frequently stem from fine-grained reasoning within the correct file, rather than a complete misunderstanding of the repository structure. A surprising observation from the SWE-Bench Verified dataset was that failing solutions spent the same amount of time exploring both wrong and correct file paths. This highlights a need for stronger signals to help agents abandon unproductive paths earlier, especially when they’ve identified the wrong file.
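One simple way to quantify how exploration effort is split between correct and wrong files is the share of file-viewing steps that touch a file changed by the gold patch. The sketch below assumes a trajectory has already been reduced to a list of visited paths. The `exploration_share` function and the paths are hypothetical, and the study may measure exploration differently.

```python
def exploration_share(visited_files, gold_files):
    """Fraction of file-viewing steps that touched a file changed
    by the gold patch (0.0 if no files were visited)."""
    if not visited_files:
        return 0.0
    hits = sum(1 for path in visited_files if path in gold_files)
    return hits / len(visited_files)


# Invented trajectory: four file views, two of them on the gold file.
visited = ["a.py", "b.py", "a.py", "c.py"]
share = exploration_share(visited, {"a.py"})
```

A share near 0.5, as in this toy example, would correspond to effort split evenly between correct and wrong files.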

Towards More Interpretable and Robust Agents

This study provides a foundation for understanding and improving code agents. It emphasizes that evaluating agents solely on binary success metrics overlooks rich behavioral insights. How agents gather context, recognize architectural patterns, abandon unproductive searches, and achieve approximate fault localization all matter for success. The findings advocate for moving beyond leaderboard-based evaluations and developing frameworks that prioritize interpretability and robustness in autonomous software engineering systems. The full research paper is titled “Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories.”
