TLDR: A new framework called PULSE is proposed for efficiently assessing human-agent interactions in software agent design. It combines user feedback with machine learning to predict satisfaction, deployed on a large scale with the OpenHands agent. Case studies reveal that the choice of LLM backbone significantly impacts user satisfaction, more so than planning or memory strategies. Crucially, the study found a disconnect between traditional benchmark performance and actual user satisfaction, underscoring the need for human-centric evaluation.
Large Language Model (LLM)-powered agents are rapidly emerging as a transformative technology, yet their inherent complexity makes assessing their real-world usefulness a significant challenge. Traditional benchmarks often fall short because they primarily focus on full automation, overlooking the crucial collaborative nature of how humans and agents actually work together.
A recent research paper, titled “How can we assess human-agent interactions? Case studies in software agent design,” by Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, and Graham Neubig, introduces a novel framework to address this gap. The paper, available at arXiv:2510.09801, proposes PULSE (Prediction-powered User Label Synthesis and Evaluation), a three-step framework designed for more efficient human-centric evaluation of agent designs.
Understanding PULSE: A Three-Step Approach
PULSE operates through three main stages:
1. Collecting User Feedback: The framework begins by setting up an interface to gather direct user feedback from human-agent interactions. In the context of software agents, users are prompted to rate the agent’s performance on a five-star scale after each “work segment” – a period where the agent takes actions between user commands.
2. Training a Model to Predict Satisfaction: Given that explicit user ratings can be sparse, PULSE trains a machine learning model to predict user satisfaction. This model extracts important features about the user, the agent’s actions, and the task completion status. Interestingly, the study found that traditional ML models trained on these specific features significantly outperformed state-of-the-art LLMs when used as a judge for predicting satisfaction.
3. Computing Results with Enhanced Confidence: Finally, PULSE extends prediction-powered inference to combine human satisfaction ratings with model-generated predictions (pseudo-labels) for unlabeled interactions. This approach leads to more robust conclusions about agent design, reducing confidence intervals by an average of 40% compared to a standard A/B test.
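To make the third step concrete, here is a minimal sketch of plain prediction-powered inference for a mean satisfaction score. It is not the paper's extended procedure, whose details this summary does not give. It assumes a small set of sessions with both human ratings and model predictions, plus a larger set with predictions only. The function names and the example data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def classical_mean_ci(labels, alpha=0.05):
    """Normal-approximation CI for the mean, using human ratings only."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = fmean(labels)
    half_width = z * sqrt(variance(labels) / len(labels))
    return center - half_width, center + half_width


def ppi_mean_ci(labels, preds_labeled, preds_unlabeled, alpha=0.05):
    """Prediction-powered CI for the mean.

    The mean prediction on unlabeled sessions is corrected by the average
    prediction error ("rectifier") measured on the labeled sessions.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rectifier = [y - f for y, f in zip(labels, preds_labeled)]
    center = fmean(preds_unlabeled) + fmean(rectifier)
    half_width = z * sqrt(
        variance(preds_unlabeled) / len(preds_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    return center - half_width, center + half_width


# Hypothetical data: 40 rated sessions, 2,000 sessions with predictions only.
labels = [1, 2, 3, 4, 5] * 8
preds_labeled = [y + (0.2 if i % 2 == 0 else -0.2) for i, y in enumerate(labels)]
preds_unlabeled = [1.0, 2.0, 3.0, 4.0, 5.0] * 400

classical = classical_mean_ci(labels)
ppi = ppi_mean_ci(labels, preds_labeled, preds_unlabeled)
print("ratings only:", classical)
print("prediction-powered:", ppi)
```

The interval only narrows when the predictor is accurate on the labeled sessions. If its errors are large, the rectifier term dominates the variance and the gain over a ratings-only estimate shrinks or disappears.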
Real-World Deployment with OpenHands
To validate PULSE, the researchers deployed the framework on a large-scale web platform built around OpenHands, an open-source software engineering agent. This extensive study collected “in-the-wild” usage data from over 15,000 users across more than 36,000 sessions. Users engaged in a diverse range of coding tasks, from fixing bugs to creating new programs, using various programming and natural languages.
Key Insights from Case Studies
The study conducted three case studies to understand how different agent design decisions impact developer satisfaction:
1. LLM Model Backbone: This case study compared three state-of-the-art LLMs: Claude-3.7-sonnet, Claude-4-sonnet, and GPT-5. The findings revealed that the choice of LLM backbone had the most significant impact on user satisfaction. Users consistently preferred agents powered by Claude-4-sonnet over the other two. For instance, there was a 5.86% difference in satisfaction between Claude-3.7-sonnet and Claude-4-sonnet, and a -7.8% difference between Claude-4-sonnet and GPT-5 (meaning users preferred Claude-4-sonnet over GPT-5).
2. Planning Strategy: The researchers investigated whether showing users the agent’s plan of attack influenced their experience. A small but statistically significant positive difference (3.1%) in user satisfaction was observed when plans were shown. Behavioral features indicated that showing plans led to less misunderstanding and improved user engagement.
3. Memory Management: This study explored how summarizing older interactions to manage context length and cost affected user satisfaction. Decreasing the maximum steps for memory from 120 to 80, which offered cost savings, had no significant negative impact on user experience.
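For reference, the standard A/B baseline behind a claim such as "statistically significant" can be sketched as a confidence interval on the difference in mean ratings between two agent variants. This is a generic two-sample normal approximation, assumed here for illustration. The summary does not state which test the authors used, and the ratings below are made up.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ab_difference_ci(ratings_a, ratings_b, alpha=0.05):
    """CI for mean(ratings_b) - mean(ratings_a) using a normal approximation.

    Returns (difference, lower, upper, significant), where `significant`
    is True when the interval excludes zero.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = fmean(ratings_b) - fmean(ratings_a)
    std_err = sqrt(
        variance(ratings_a) / len(ratings_a)
        + variance(ratings_b) / len(ratings_b)
    )
    lower, upper = diff - z * std_err, diff + z * std_err
    return diff, lower, upper, (lower > 0 or upper < 0)


# Hypothetical five-star ratings for two variants of an agent.
baseline = [3, 4, 3, 4] * 50
variant = [4, 5, 4, 5] * 50
print(ab_difference_ci(baseline, variant))
```

A design change counts as significant under this sketch only when the whole interval lies on one side of zero. By the same logic, a finding of "no significant negative impact" corresponds to an interval that does not lie entirely below zero.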
Benchmarks Versus Reality: A Crucial Discrepancy
One of the most striking findings was the substantial discrepancy between in-the-wild results and traditional benchmark performance. GPT-5 outperformed Claude-4-sonnet on six of seven code-related benchmarks, yet human users preferred Claude-4-sonnet over GPT-5 on four of seven task subsets. This divergence underscores the limitations of relying solely on benchmark-driven evaluation for collaborative LLM agents and highlights the importance of human-in-the-loop assessment.
Conclusion
The PULSE framework offers a rigorous and efficient method for evaluating human-agent interactions, providing practical insights for software agent design. The research emphasizes that while LLM backbone quality is a primary driver of user satisfaction, real-world human preferences often diverge from static benchmark scores. This work paves the way for optimizing LLMs for interactivity and developing better models of user satisfaction and engagement in collaborative AI systems.


