Bridging the Communication Gap: How ToM-SWE Empowers Coding Agents to Understand User Intent

TLDR: ToM-SWE introduces a dual-agent architecture that pairs a software engineering (SWE) agent with a theory-of-mind (ToM) agent. The ToM agent models user goals, preferences, and constraints from instructions and interaction history, maintaining a persistent memory of the user. This framework significantly improves task success rates and user satisfaction on both ambiguous and stateful software engineering benchmarks, outperforming state-of-the-art SWE agents. A human study also confirmed its practical usefulness for professional developers, demonstrating that effective human-AI collaboration in software engineering requires agents that can proactively model and adapt to user mental states.

In the rapidly evolving landscape of artificial intelligence, coding agents have made remarkable strides and can now tackle complex software engineering tasks, from generating code to debugging and system design. However, a persistent challenge remains: these agents often struggle to understand and adapt to human developers’ intentions, especially when instructions are vague or context-dependent. This communication gap is a critical hurdle for effective human-AI partnership in software development.

Addressing this challenge, researchers have introduced ToM-SWE, a novel framework designed to integrate ‘theory-of-mind’ (ToM) reasoning into software engineering agents. ToM, in this context, refers to an AI’s ability to model a user’s mental state, including their goals, preferences, and intentions, based on their instructions and past interactions. ToM-SWE employs a unique dual-agent architecture: a primary Software Engineering (SWE) agent focuses on the coding tasks, while a dedicated, lightweight ToM partner agent is solely responsible for modeling the user’s mental state.

The ToM agent plays a crucial role by inferring user goals, constraints, and preferences from instructions and interaction history. It maintains a persistent memory of the user, allowing it to track evolving needs across multiple sessions, and provides user-related suggestions to the SWE agent. This separation of concerns is a key innovation, as it allows the SWE agent to maintain its coding performance without being overwhelmed by extensive user history, while enabling specialized and persistent user modeling.
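The separation of concerns described above can be illustrated with a minimal sketch. This is not the paper's implementation: the class names `ToMAgent` and `SWEAgent`, their methods, and the keyword-matching rule (standing in for an LLM) are all assumptions made for illustration. The point it shows is that the coding agent receives only a short, user-specific hint rather than the full user history.

```python
from dataclasses import dataclass, field


@dataclass
class ToMAgent:
    """Hypothetical user-modeling agent; a keyword rule stands in for an LLM."""

    # Persistent per-user memory: preferences learned in earlier sessions.
    preferences: dict[str, str] = field(default_factory=dict)

    def remember(self, topic: str, preference: str) -> None:
        """Store a preference under a topic keyword."""
        self.preferences[topic] = preference

    def suggest(self, instruction: str) -> str:
        """Return a short hint built from preferences whose topic appears in the instruction."""
        text = instruction.lower()
        hints = [
            f"{topic}: {preference}"
            for topic, preference in sorted(self.preferences.items())
            if topic in text
        ]
        return "; ".join(hints) if hints else "no stored preference applies"


@dataclass
class SWEAgent:
    """Hypothetical coding agent that never reads the raw user history itself."""

    tom: ToMAgent

    def build_prompt(self, instruction: str) -> str:
        """Combine the task with the ToM agent's compact suggestion."""
        hint = self.tom.suggest(instruction)
        return f"Task: {instruction}\nUser context: {hint}"


tom = ToMAgent()
tom.remember("tests", "prefers pytest with fixtures")
tom.remember("style", "wants type hints everywhere")

agent = SWEAgent(tom)
prompt = agent.build_prompt("Add tests for the parser")
print(prompt)
```

Only the preference relevant to the instruction reaches the coding agent's prompt, which is one plausible way to keep its context small while the user model grows across sessions.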

The ToM agent operates in two modes: ‘in-session ToM’ infers immediate user intent during active coding, and ‘after-session ToM’ consolidates interaction history to refine its beliefs about the user’s mental state in a hierarchical memory system. This system stores raw session data, session-based user analyses, and an overall user model that aggregates cross-session patterns.
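The three memory tiers and the two modes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the names `HierarchicalMemory`, `after_session`, `in_session`, and the `CUES` table are hypothetical, and simple keyword matching replaces the LLM-based analysis a real system would use.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical cue words mapped to user traits (a stand-in for LLM analysis).
CUES = {"concise": "prefers_concise", "tests": "wants_tests"}


@dataclass
class HierarchicalMemory:
    # Tier 1: raw session data (the messages of each session).
    raw_sessions: list[list[str]] = field(default_factory=list)
    # Tier 2: session-based analyses (traits observed in each session).
    session_analyses: list[set[str]] = field(default_factory=list)
    # Tier 3: overall user model (number of sessions in which each trait appeared).
    user_model: Counter = field(default_factory=Counter)


def analyze(messages: list[str]) -> set[str]:
    """Extract the set of traits hinted at by one session's messages."""
    return {
        trait
        for message in messages
        for cue, trait in CUES.items()
        if cue in message.lower()
    }


def after_session(memory: HierarchicalMemory, messages: list[str]) -> None:
    """After-session mode: store the session and update all three tiers."""
    memory.raw_sessions.append(list(messages))
    analysis = analyze(messages)
    memory.session_analyses.append(analysis)
    memory.user_model.update(analysis)  # each trait counts once per session


def in_session(memory: HierarchicalMemory, instruction: str) -> str:
    """In-session mode: annotate the instruction with traits seen in 2+ sessions."""
    stable = sorted(t for t, n in memory.user_model.items() if n >= 2)
    if not stable:
        return instruction
    return f"{instruction} [likely user traits: {', '.join(stable)}]"


memory = HierarchicalMemory()
after_session(memory, ["Please keep it concise", "Add tests too"])
after_session(memory, ["A concise answer, please"])
print(in_session(memory, "Fix the login bug"))
```

Counting a trait once per session, rather than once per message, is a design choice made here so that the top tier reflects cross-session patterns instead of one unusually long conversation.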

To evaluate ToM-SWE, the researchers introduced a new benchmark, the Stateful SWE benchmark. Unlike previous benchmarks that primarily assess technical problem-solving, it evaluates an agent’s ability to sustain meaningful interactions over time, using realistic conversation histories and an LLM-powered user simulator. The framework was also tested on Ambiguous SWE-bench, which focuses on resolving underspecified instructions.

ToMCodeAct, the implementation of ToM-SWE, significantly outperformed state-of-the-art SWE agents such as OpenHands CodeAct on both benchmarks. On Ambiguous SWE-bench, ToMCodeAct achieved a 63.4% issue resolved rate compared to CodeAct’s 51.9%. More strikingly, on the Stateful SWE benchmark, ToMCodeAct achieved a 57.4% task resolved rate, a substantial improvement over CodeAct’s 13.5%. ToMCodeAct also received significantly higher user satisfaction scores.
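To put the reported rates in perspective, the short calculation below derives the absolute gain (in percentage points) and the ratio between the two agents on each benchmark. It uses only the four figures quoted above; the helper name `compare` is an assumption introduced for this example.

```python
def compare(baseline: float, improved: float) -> tuple[float, float]:
    """Return (gain in percentage points, improved / baseline ratio)."""
    gain = round(improved - baseline, 1)
    ratio = round(improved / baseline, 2)
    return gain, ratio


# Resolved rates (%) as quoted in the article: CodeAct first, ToMCodeAct second.
ambiguous = compare(51.9, 63.4)
stateful = compare(13.5, 57.4)

print(f"Ambiguous SWE-bench: +{ambiguous[0]} points, {ambiguous[1]}x")
print(f"Stateful SWE benchmark: +{stateful[0]} points, {stateful[1]}x")
```

The gap is far larger on the stateful benchmark (43.9 points, roughly 4.25 times the baseline) than on the ambiguous one (11.5 points, roughly 1.22 times), which matches the article's emphasis on persistent user modeling.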

Beyond offline benchmarks, a three-week human study with professional developers using ToM-SWE in their daily work revealed its practical utility. Participants found the ToM agent’s suggestions useful 86% of the time, underscoring the value of stateful user modeling for real-world coding agents. The study highlighted that ToM agents excel with moderately underspecified queries that have sufficient technical context, and that higher confidence levels in ToM suggestions correlated with higher acceptance rates.

The computational overhead introduced by the ToM agent is modest, adding only a small fraction to the overall session cost, making it a cost-effective enhancement. While the research acknowledges limitations such as potential biases in LLM-powered user simulators, computational costs, user privacy concerns, and generalization across domains, ToM-SWE represents a significant step forward in making AI coding assistants more collaborative and user-centric.

This innovative approach paves the way for more intuitive and effective human-AI collaboration in software development, ensuring that coding agents not only perform tasks but also truly understand and adapt to the human behind the keyboard. You can read the full research paper here: ToM-SWE: User Mental Modeling for Software Engineering Agents.
