Evaluating Human-Agent Collaboration: A New Framework for Software Agent Design

TLDR: A new framework called PULSE is proposed for efficiently assessing human-agent interactions in software agent design. It combines user feedback with a machine learning model that predicts satisfaction, and it was deployed at scale with the OpenHands agent. Case studies show that the choice of LLM backbone affects user satisfaction more than planning or memory strategies do. Crucially, the study found a disconnect between traditional benchmark performance and actual user satisfaction, underscoring the need for human-centric evaluation.

Large Language Model (LLM)-powered agents are rapidly emerging as a transformative technology, yet their inherent complexity makes assessing their real-world usefulness a significant challenge. Traditional benchmarks often fall short because they primarily focus on full automation, overlooking the crucial collaborative nature of how humans and agents actually work together.

A recent research paper, titled “How can we assess human-agent interactions? Case studies in software agent design,” by Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, and Graham Neubig, introduces a novel framework to address this gap. The paper, available at arXiv:2510.09801, proposes PULSE (Prediction-powered User Label Synthesis and Evaluation), a three-step framework designed for more efficient human-centric evaluation of agent designs.

Understanding PULSE: A Three-Step Approach

PULSE operates through three main stages:

1. Collecting User Feedback: The framework begins by setting up an interface to gather direct user feedback from human-agent interactions. In the context of software agents, users are prompted to rate the agent’s performance on a five-star scale after each “work segment” – a period where the agent takes actions between user commands.

2. Training a Model to Predict Satisfaction: Given that explicit user ratings can be sparse, PULSE trains a machine learning model to predict user satisfaction. This model extracts important features about the user, the agent’s actions, and the task completion status. Interestingly, the study found that traditional ML models trained on these specific features significantly outperformed state-of-the-art LLMs when used as a judge for predicting satisfaction.

3. Computing Results with Enhanced Confidence: Finally, PULSE extends prediction-powered inference to combine human satisfaction ratings with model-generated predictions (pseudo-labels) for unlabeled interactions. This approach leads to more robust conclusions about agent design, narrowing confidence intervals by an average of 40% compared to a standard A/B test.
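To make step 3 more concrete, here is a minimal sketch of the basic prediction-powered mean estimator that this line of work builds on. It is not the paper's extended method: the function names, the 0-to-1 satisfaction scale, and the toy numbers are all assumptions made for illustration. The idea is to average the model's pseudo-labels over the many unlabeled interactions, then correct that average using the prediction error measured on the few interactions that do have human ratings.

```python
import math
import statistics
from typing import Sequence, Tuple


def ppi_mean_ci(
    labeled_y: Sequence[float],
    labeled_pred: Sequence[float],
    unlabeled_pred: Sequence[float],
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Prediction-powered estimate of mean satisfaction with a normal CI.

    labeled_y      -- human ratings (hypothetical 0..1 scale)
    labeled_pred   -- model predictions for those same rated interactions
    unlabeled_pred -- model predictions (pseudo-labels) for unrated interactions
    """
    n, big_n = len(labeled_y), len(unlabeled_pred)
    # Rectifier: how far the model is off, measured where ratings exist.
    rectifier = [y - p for y, p in zip(labeled_y, labeled_pred)]
    estimate = statistics.fmean(unlabeled_pred) + statistics.fmean(rectifier)
    std_err = math.sqrt(
        statistics.variance(unlabeled_pred) / big_n
        + statistics.variance(rectifier) / n
    )
    return estimate, estimate - z * std_err, estimate + z * std_err


def classical_mean_ci(
    labeled_y: Sequence[float], z: float = 1.96
) -> Tuple[float, float, float]:
    """Baseline: mean and normal CI from the human ratings alone."""
    mean = statistics.fmean(labeled_y)
    std_err = math.sqrt(statistics.variance(labeled_y) / len(labeled_y))
    return mean, mean - z * std_err, mean + z * std_err


# Toy, made-up values purely for illustration (not data from the study).
ratings = [1, 0, 1, 1, 0, 1, 0, 1]
preds_for_rated = [0.9, 0.2, 0.8, 0.7, 0.1, 0.9, 0.3, 0.8]
preds_for_unrated = [0.9, 0.1, 0.8, 0.2] * 50

print("prediction-powered:", ppi_mean_ci(ratings, preds_for_rated, preds_for_unrated))
print("ratings only:      ", classical_mean_ci(ratings))
```

When the predictor tracks the human ratings reasonably well, the rectifier has low variance and the resulting interval is narrower than the ratings-only interval. To compare two agent variants, one would apply the same estimator to each arm and look at the difference.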

Real-World Deployment with OpenHands

To validate PULSE, the researchers deployed the framework on a large-scale web platform built around OpenHands, an open-source software engineering agent. This extensive study collected “in-the-wild” usage data from over 15,000 users across more than 36,000 sessions. Users engaged in a diverse range of coding tasks, from fixing bugs to creating new programs, using various programming and natural languages.

Key Insights from Case Studies

The study conducted three case studies to understand how different agent design decisions impact developer satisfaction:

1. LLM Backbone: This case study compared three state-of-the-art LLMs: Claude-3.7-sonnet, Claude-4-sonnet, and GPT-5. Of the three design decisions studied, the choice of LLM backbone had the largest impact on user satisfaction. Users consistently preferred agents powered by Claude-4-sonnet over the other two: satisfaction was 5.86% higher with Claude-4-sonnet than with Claude-3.7-sonnet, and 7.8% lower with GPT-5 than with Claude-4-sonnet.

2. Planning Strategy: The researchers investigated whether showing users the agent’s plan of attack influenced their experience. A small but statistically significant positive difference (3.1%) in user satisfaction was observed when plans were shown. Behavioral features indicated that showing plans led to less misunderstanding and improved user engagement.

3. Memory Management: This study explored how summarizing older interactions to manage context length and cost affected user satisfaction. Decreasing the maximum steps for memory from 120 to 80, which offered cost savings, had no significant negative impact on user experience.
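The memory-management idea in the third case study can be sketched roughly as follows. This is a hypothetical illustration, not OpenHands' actual implementation: `condense_history`, `keep_recent`, and `placeholder_summary` are invented names, and the sketch assumes that "maximum steps" means the history length beyond which older steps are collapsed into a summary. In a real agent the summarizer would typically be an LLM call.

```python
from typing import Callable, List


def condense_history(
    events: List[str],
    max_steps: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Collapse older steps into one summary once history exceeds max_steps.

    All parameter names are hypothetical; they only illustrate the trade-off
    between context length (cost) and how much raw history the agent keeps.
    """
    if not 0 < keep_recent < max_steps:
        raise ValueError("keep_recent must be between 1 and max_steps - 1")
    if len(events) <= max_steps:
        return list(events)  # short enough: keep everything verbatim
    older, recent = events[:-keep_recent], events[-keep_recent:]
    return [summarize(older)] + recent


def placeholder_summary(older: List[str]) -> str:
    # Stand-in for an LLM-generated summary of the older steps.
    return f"[summary of {len(older)} earlier steps]"


history = [f"step {i}" for i in range(130)]
condensed = condense_history(
    history, max_steps=120, keep_recent=40, summarize=placeholder_summary
)
print(len(condensed), condensed[0])
```

Lowering `max_steps` (for example from 120 to 80, as in the case study) makes summarization kick in earlier, which shortens the context sent to the model at the cost of discarding detail from older steps sooner.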

Benchmarks Versus Reality: A Crucial Discrepancy

One of the most striking findings was the substantial discrepancy between in-the-wild results and traditional benchmark performance. GPT-5 outperformed Claude-4-sonnet on six out of seven code-related benchmarks, yet human users preferred Claude-4-sonnet over GPT-5 on four out of seven task subsets. This divergence underscores the limitations of relying solely on benchmark-driven evaluation for collaborative LLM agents and highlights the importance of human-in-the-loop assessment.

Conclusion

The PULSE framework offers a rigorous and efficient method for evaluating human-agent interactions, providing practical insights for software agent design. The research emphasizes that while LLM backbone quality is a primary driver of user satisfaction, real-world human preferences often diverge from static benchmark scores. This work paves the way for optimizing LLMs for interactivity and developing better models of user satisfaction and engagement in collaborative AI systems.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes.
