TLDR: ToM-SWE introduces a dual-agent architecture that pairs a software engineering (SWE) agent with a theory-of-mind (ToM) agent. The ToM agent models user goals, preferences, and constraints from instructions and interaction history, maintaining a persistent memory of the user. This framework significantly improves task success rates and user satisfaction on both ambiguous and stateful software engineering benchmarks, outperforming state-of-the-art SWE agents. A human study also confirmed its practical usefulness for professional developers, demonstrating that effective human-AI collaboration in software engineering requires agents that can proactively model and adapt to user mental states.
AI coding agents can now handle complex software engineering tasks, from generating code to debugging and system design. However, a persistent challenge remains: these agents often struggle to understand and adapt to human developers’ intentions, especially when instructions are vague or context-dependent. This communication gap is a critical hurdle for effective human-AI partnership in software development.
Addressing this challenge, researchers have introduced ToM-SWE, a novel framework designed to integrate ‘theory-of-mind’ (ToM) reasoning into software engineering agents. ToM, in this context, refers to an AI’s ability to model a user’s mental state, including their goals, preferences, and intentions, based on their instructions and past interactions. ToM-SWE employs a unique dual-agent architecture: a primary Software Engineering (SWE) agent focuses on the coding tasks, while a dedicated, lightweight ToM partner agent is solely responsible for modeling the user’s mental state.
The ToM agent plays a crucial role by inferring user goals, constraints, and preferences from instructions and interaction history. It maintains a persistent memory of the user, allowing it to track evolving needs across multiple sessions, and provides user-related suggestions to the SWE agent. This separation of concerns is a key innovation, as it allows the SWE agent to maintain its coding performance without being overwhelmed by extensive user history, while enabling specialized and persistent user modeling.
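The separation of concerns described above can be illustrated with a minimal sketch. It is not the authors’ implementation: the class names, the `suggest` and `build_prompt` methods, and the keyword rule standing in for an LLM call are all assumptions made for illustration. The point it shows is that the ToM side holds the full history while the SWE side receives only a compact hint.

```python
from dataclasses import dataclass, field


@dataclass
class ToMAgent:
    """Hypothetical user-modeling partner; it keeps the full history to itself."""

    history: list[str] = field(default_factory=list)

    def suggest(self, instruction: str) -> str:
        # Toy stand-in for an LLM call: look for a recurring preference
        # in earlier instructions, then record the new instruction.
        prefers_tests = any("test" in past.lower() for past in self.history)
        self.history.append(instruction)
        return "User tends to ask for tests; include them." if prefers_tests else ""


@dataclass
class SWEAgent:
    """Hypothetical coding agent; it sees the task plus a short hint only."""

    def build_prompt(self, instruction: str, hint: str) -> str:
        parts = [f"Task: {instruction}"]
        if hint:
            parts.append(f"User hint: {hint}")
        return "\n".join(parts)


tom, swe = ToMAgent(), SWEAgent()
first = swe.build_prompt("Add unit tests for parser", tom.suggest("Add unit tests for parser"))
second = swe.build_prompt("Fix the login bug", tom.suggest("Fix the login bug"))
print(second)
```

In this sketch the second prompt carries a one-line hint derived from the first request, but none of the earlier conversation text, which is the property the dual-agent design is meant to preserve.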
The ToM agent operates in two modes: ‘in-session ToM’ infers immediate user intent during active coding, and ‘after-session ToM’ consolidates interaction history to refine its beliefs about the user’s mental state in a hierarchical memory system. This system stores raw session data, session-based user analyses, and an overall user model that aggregates cross-session patterns.
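One way to picture the three memory tiers and the two modes is the sketch below. It is an illustration under stated assumptions, not the paper’s data model: `UserMemory`, its method names, and the `keyword_extract` function (a toy replacement for LLM-based analysis) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


def keyword_extract(messages: list[str]) -> list[str]:
    """Toy analysis step (assumption): map raw messages to preference labels."""
    text = " ".join(messages).lower()
    labels = []
    if "test" in text:
        labels.append("wants tests")
    if "type hint" in text:
        labels.append("wants type hints")
    return labels


@dataclass
class UserMemory:
    raw_sessions: dict[str, list[str]] = field(default_factory=dict)      # tier 1: raw session data
    session_analyses: dict[str, list[str]] = field(default_factory=dict)  # tier 2: per-session analyses
    user_model: Counter = field(default_factory=Counter)                  # tier 3: cross-session patterns

    def log(self, session_id: str, message: str) -> None:
        self.raw_sessions.setdefault(session_id, []).append(message)

    def after_session(self, session_id: str, extract: Callable[[list[str]], list[str]]) -> None:
        # After-session mode: analyze one session, then rebuild the overall
        # model from all analyses so repeated consolidation never double counts.
        self.session_analyses[session_id] = extract(self.raw_sessions.get(session_id, []))
        self.user_model = Counter(
            label for labels in self.session_analyses.values() for label in labels
        )

    def in_session_hints(self, top_k: int = 3) -> list[str]:
        # In-session mode: surface the most frequent cross-session preferences.
        return [label for label, _ in self.user_model.most_common(top_k)]


mem = UserMemory()
mem.log("s1", "Please add tests")
mem.log("s1", "use type hints")
mem.after_session("s1", keyword_extract)
mem.log("s2", "write a test for this")
mem.after_session("s2", keyword_extract)
print(mem.in_session_hints(1))
```

Rebuilding the top tier from the per-session analyses, rather than incrementing it, is a design choice of this sketch; it keeps consolidation idempotent if a session is processed more than once.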
To evaluate ToM-SWE, the researchers introduced a new benchmark, the ‘Stateful SWE benchmark’. Unlike previous benchmarks that primarily assess technical problem-solving, it evaluates an agent’s ability to sustain meaningful interactions over time, using realistic conversation histories and an LLM-powered user simulator. The framework was also tested on ‘Ambiguous SWE-bench’, which focuses on resolving underspecified instructions.
ToMCodeAct, the implementation of ToM-SWE, outperformed state-of-the-art SWE agents such as OpenHands CodeAct on both benchmarks. On Ambiguous SWE-bench, ToMCodeAct achieved a 63.4% issue resolved rate compared to CodeAct’s 51.9%. More strikingly, on the Stateful SWE benchmark, ToMCodeAct achieved a 57.4% task resolved rate against CodeAct’s 13.5%, a gain of 43.9 percentage points. ToMCodeAct also earned significantly higher user satisfaction scores.
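To put the reported rates side by side, the short calculation below derives the absolute gain (in percentage points) and the relative ratio for each benchmark. Only the four percentages come from the article; the `gains` helper and the dictionary layout are illustrative.

```python
# Resolved rates (%) as reported in the article.
results = {
    "Ambiguous SWE-bench": {"ToMCodeAct": 63.4, "CodeAct": 51.9},
    "Stateful SWE benchmark": {"ToMCodeAct": 57.4, "CodeAct": 13.5},
}


def gains(tom: float, base: float) -> tuple[float, float]:
    """Return (percentage-point gain, ratio of ToM rate to baseline rate)."""
    return round(tom - base, 1), round(tom / base, 2)


for name, rates in results.items():
    points, ratio = gains(rates["ToMCodeAct"], rates["CodeAct"])
    print(f"{name}: +{points} points, {ratio}x the baseline rate")
```

This works out to a gain of 11.5 points (about 1.22x) on the ambiguous benchmark and 43.9 points (about 4.25x) on the stateful one.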
Beyond offline benchmarks, a three-week human study with professional developers using ToM-SWE in their daily work revealed its practical utility. Participants found the ToM agent’s suggestions useful 86% of the time, underscoring the value of stateful user modeling for real-world coding agents. The study highlighted that ToM agents excel with moderately underspecified queries that have sufficient technical context, and that higher confidence levels in ToM suggestions correlated with higher acceptance rates.
The computational overhead introduced by the ToM agent is modest, adding only a small fraction to the overall session cost, making it a cost-effective enhancement. While the research acknowledges limitations such as potential biases in LLM-powered user simulators, computational costs, user privacy concerns, and generalization across domains, ToM-SWE represents a significant step forward in making AI coding assistants more collaborative and user-centric.
This approach paves the way for more intuitive and effective human-AI collaboration in software development, with coding agents that not only perform tasks but also understand and adapt to the human behind the keyboard. You can read the full research paper here: ToM-SWE: User Mental Modeling for Software Engineering Agents.


