TLDR: This paper empirically studies the problem-solving processes (trajectories) of three LLM-based code agents (OpenHands, SWE-agent, Prometheus) on the SWE-Bench benchmark. It reveals that successful and failed attempts have distinct patterns, with failures often involving longer, more variable trajectories. The study highlights the importance of repository-aware context, the need for agents to abandon unproductive reasoning, and that approximate fault localization is often sufficient for success, rather than perfect, line-by-line matches. These insights are crucial for developing more robust and interpretable AI software engineering systems.
Large Language Models (LLMs) are rapidly transforming software engineering, moving beyond simple code completion to tackle complex, repository-level problems. This has led to the rise of ‘code agents’ – systems that combine LLMs with tools and reasoning loops to autonomously resolve software issues. While these agents show impressive capabilities, their internal decision-making processes often remain a mystery, making it hard to understand why they succeed or fail.
A recent empirical study, “Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories,” delves into this opacity by analyzing the ‘trajectories’ – detailed logs of every step an agent takes during problem-solving. The research, conducted by Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, and He Ye, examines the behavior of three state-of-the-art code agents: OpenHands, SWE-agent, and Prometheus, using the SWE-Bench benchmark. The goal is to move beyond simple success rates and understand the pathways agents follow, comparing both successful and failed attempts.
Unpacking Agent Problem-Solving Strategies
The study first investigated what bugs each agent could uniquely resolve, offering a glimpse into their distinct capabilities. For instance, Prometheus demonstrated a strong understanding of existing architectural patterns within a repository, which helped it localize problems at the right abstraction level. In a Django issue, Prometheus successfully identified an enum representation problem, while other agents struggled, going down complex, misguided paths. This highlights the importance of ‘repository-aware context gathering’ for effective problem-solving.
OpenHands, in a Pylint issue, successfully identified that ignore logic was being applied too late. However, the study also revealed a critical insight: simply limiting recursion isn’t enough to prevent agents from getting stuck in unproductive reasoning loops. Agents need mechanisms to recognize when a path is fruitless and abandon it.
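As a purely illustrative sketch of what such an abandonment mechanism might look like, the snippet below flags an agent as stuck when the same action recurs too often within a short window. The `StallDetector` class, its thresholds, and the string representation of actions are all hypothetical; none of them come from the study or from any of the three agents.

```python
from collections import deque


class StallDetector:
    """Hypothetical heuristic: flag a stall when one action repeats too often."""

    def __init__(self, window=5, max_repeats=3):
        # Only the most recent `window` actions are remembered.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action):
        """Record an action; return True if the agent looks stuck."""
        self.recent.append(action)
        return self.recent.count(action) >= self.max_repeats


detector = StallDetector(window=5, max_repeats=3)
for step in ["open a.py", "grep foo", "open a.py", "open a.py"]:
    if detector.observe(step):
        print(f"possible loop detected at: {step}")
```

A real agent would likely need a richer signal than exact string repeats, for example whether recent steps produced any new information, but the structure of the check would be similar.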
SWE-agent, on the other hand, often succeeded using a ‘defensive programming’ approach. In a Sympy issue involving polynomial division, SWE-agent’s patch, while not mathematically elegant, achieved the desired fix by adding checks to retain the original field type. This suggests that even without deep domain knowledge, a robust, defensive approach can lead to success, compensating for limited problem-specific reasoning.
The Anatomy of Success and Failure
One of the study’s key findings concerns the length and variability of agent trajectories. Failed trajectories were consistently longer than successful ones, and their step counts were more widely spread. This suggests that agents often get stuck on unproductive paths, wasting computational resources. For example, OpenHands and Prometheus showed dramatic increases in trajectory length for failures, while SWE-agent’s failed trajectories grew only modestly, indicating that it may fail faster, which is desirable for efficiency.
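The comparison described above can be sketched in a few lines. This is a minimal illustration, assuming each trajectory is a dictionary with a `resolved` flag and a `steps` list. That layout and the toy numbers are invented here and are not the study’s data or format.

```python
from statistics import mean, pstdev


def summarize_lengths(trajectories):
    """Return count, mean, and standard deviation of step counts per outcome."""
    groups = {"resolved": [], "unresolved": []}
    for traj in trajectories:
        key = "resolved" if traj["resolved"] else "unresolved"
        groups[key].append(len(traj["steps"]))
    return {
        outcome: {
            "n": len(lengths),
            "mean": mean(lengths),
            "stdev": pstdev(lengths),  # population standard deviation
        }
        for outcome, lengths in groups.items()
        if lengths  # skip outcomes with no trajectories
    }


# Toy data only: step contents are placeholders.
toy = [
    {"resolved": True, "steps": ["s"] * 10},
    {"resolved": True, "steps": ["s"] * 12},
    {"resolved": False, "steps": ["s"] * 20},
    {"resolved": False, "steps": ["s"] * 40},
]
print(summarize_lengths(toy))
```

A larger mean and a larger standard deviation in the unresolved group would correspond to the pattern the study reports: longer and more variable failures.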
Interestingly, agents with a higher absolute number of steps in successful trajectories (like SWE-agent) tended to have less divergent failures. This implies that a more granular, incremental problem-solving strategy, involving many small steps, might lead to more robust behavior and less catastrophic failures compared to shorter, more aggressive approaches.
Where Agents Look for Faults
The research also explored how well agents localize faults before producing a fix. It measured fault localization at three levels: correct file, correct function, and correct ‘hunk’ (a block of changed lines). The findings show that successful patches almost always modify the correct file (over 90% of the time). However, success doesn’t necessarily mean modifying the exact same function or hunk as the ‘gold patch’ (the reference fix). Agents can still succeed with approximate edits, suggesting that aiming for a flawless, line-by-line match might be an unproductive goal.
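To make the three levels concrete, here is a minimal sketch of how an agent patch could be compared against a gold patch. The `Hunk` type, the matching rules (same path, same function name, overlapping line range), and the example paths are assumptions made for illustration. The article does not specify the study’s exact matching criteria.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hunk:
    path: str      # file the hunk belongs to
    function: str  # enclosing function name
    start: int     # first changed line
    end: int       # last changed line, inclusive


def localization_levels(agent_hunks, gold_hunks):
    """Report whether the agent patch matches the gold patch at each level."""
    pairs = list(product(agent_hunks, gold_hunks))
    same_file = [(a, g) for a, g in pairs if a.path == g.path]
    return {
        "file": bool(same_file),
        "function": any(a.function == g.function for a, g in same_file),
        # Two hunks overlap when their line ranges intersect.
        "hunk": any(a.start <= g.end and g.start <= a.end for a, g in same_file),
    }


# Hypothetical example: right file, different function, no line overlap.
gold = [Hunk("pkg/core.py", "parse", 40, 48)]
agent = [Hunk("pkg/core.py", "validate", 10, 15)]
print(localization_levels(agent, gold))
```

Under these assumed rules, the example patch counts as a file-level match only, which is the kind of approximate localization the study finds can still yield a passing fix.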
Even in failed attempts, agents often correctly locate the problematic file (72-81% in some cases). This indicates that failures frequently stem from fine-grained reasoning within the correct file, rather than a complete misunderstanding of the repository structure. A surprising observation from the SWE-Bench Verified dataset was that failing solutions spent the same amount of time exploring both wrong and correct file paths. This highlights a need for stronger signals to help agents abandon unproductive paths earlier, especially when they’ve identified the wrong file.
Towards More Interpretable and Robust Agents
This study provides a crucial foundation for understanding and improving code agents. It emphasizes that evaluating agents solely on binary success metrics overlooks rich behavioral insights. Factors like how agents gather context, recognize architectural patterns, abandon unproductive searches, and achieve approximate fault localization are critical for success. The findings advocate for moving beyond leaderboard-based evaluations and developing frameworks that prioritize interpretability and robustness in autonomous software engineering systems. You can read the full research paper here: Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories.


