Evaluating Human-Agent Collaboration: A New Framework for Software Agent Design

TLDR: A new framework called PULSE is proposed for efficiently assessing human-agent interactions in software agent design. It combines user feedback with machine learning to predict satisfaction, deployed on a large scale with the OpenHands agent. Case studies reveal that the choice of LLM backbone significantly impacts user satisfaction, more so than planning or memory strategies. Crucially, the study found a disconnect between traditional benchmark performance and actual user satisfaction, underscoring the need for human-centric evaluation.

Large Language Model (LLM)-powered agents are rapidly emerging as a transformative technology, yet their inherent complexity makes assessing their real-world usefulness a significant challenge. Traditional benchmarks often fall short because they primarily focus on full automation, overlooking the crucial collaborative nature of how humans and agents actually work together.

A recent research paper, titled “How can we assess human-agent interactions? Case studies in software agent design,” by Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, and Graham Neubig, introduces a novel framework to address this gap. The paper, available at arXiv:2510.09801, proposes PULSE (Prediction-powered User Label Synthesis and Evaluation), a three-step framework designed for more efficient human-centric evaluation of agent designs.

Understanding PULSE: A Three-Step Approach

PULSE operates through three main stages:

1. Collecting User Feedback: The framework begins by setting up an interface to gather direct user feedback from human-agent interactions. In the context of software agents, users are prompted to rate the agent’s performance on a five-star scale after each “work segment” – a period where the agent takes actions between user commands.

2. Training a Model to Predict Satisfaction: Because explicit user ratings are sparse, PULSE trains a machine learning model to predict user satisfaction. The model uses features describing the user, the agent’s actions, and the task completion status. Notably, the study found that traditional ML models trained on these features significantly outperformed state-of-the-art LLMs used as judges to predict satisfaction.

3. Computing Results with Enhanced Confidence: Finally, PULSE extends prediction-powered inference to combine human satisfaction ratings with model-generated predictions (pseudo-labels) for unlabeled interactions. This approach leads to more robust conclusions about agent design, reducing confidence intervals by an average of 40% compared to a standard A/B test.
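The third step can be illustrated with the basic prediction-powered estimate of a mean. The sketch below is not the PULSE extension described in the paper; it only shows the underlying idea, assuming a small set of sessions with real star ratings plus model predictions, and a larger set with predictions only. The function names and the normal-approximation interval (z = 1.96) are illustrative assumptions.

```python
import math
import statistics


def ppi_mean_ci(labeled_ratings, labeled_preds, unlabeled_preds, z=1.96):
    """Basic prediction-powered estimate of mean satisfaction.

    Hypothetical helper: combines model predictions on unlabeled sessions
    with a correction term measured on the labeled sessions.
    """
    n = len(labeled_ratings)
    big_n = len(unlabeled_preds)
    # Rectifier: how far predictions are from real ratings where both exist.
    residuals = [y - f for y, f in zip(labeled_ratings, labeled_preds)]
    estimate = statistics.fmean(unlabeled_preds) + statistics.fmean(residuals)
    # Variance has one term per data source (normal approximation).
    variance = (statistics.variance(unlabeled_preds) / big_n
                + statistics.variance(residuals) / n)
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


def classical_mean_ci(labeled_ratings, z=1.96):
    """Baseline interval that uses only the human ratings."""
    n = len(labeled_ratings)
    mean = statistics.fmean(labeled_ratings)
    half_width = z * math.sqrt(statistics.variance(labeled_ratings) / n)
    return mean, (mean - half_width, mean + half_width)


# Toy data, invented purely for illustration.
ratings = [5, 4, 3, 4, 2, 5]                # human star ratings
preds_on_rated = [4.5, 4.0, 3.5, 4.0, 2.5, 4.5]  # model predictions, same sessions
preds_on_unrated = [4.0, 3.5, 4.5, 3.0, 5.0, 4.0] * 10  # predictions only

ppi_estimate, ppi_interval = ppi_mean_ci(ratings, preds_on_rated, preds_on_unrated)
base_estimate, base_interval = classical_mean_ci(ratings)
```

When the predictions track the real ratings closely, the residual variance is small and the prediction-powered interval is narrower than the ratings-only interval, which is the kind of gain the article describes.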

Real-World Deployment with OpenHands

To validate PULSE, the researchers deployed the framework on a large-scale web platform built around OpenHands, an open-source software engineering agent. This extensive study collected “in-the-wild” usage data from over 15,000 users across more than 36,000 sessions. Users engaged in a diverse range of coding tasks, from fixing bugs to creating new programs, using various programming and natural languages.

Key Insights from Case Studies

The study conducted three case studies to understand how different agent design decisions impact developer satisfaction:

1. LLM Model Backbone: This case study compared three state-of-the-art LLMs: Claude-3.7-sonnet, Claude-4-sonnet, and GPT-5. The choice of LLM backbone had the largest impact on user satisfaction of the three design decisions studied, and users consistently preferred agents powered by Claude-4-sonnet over the other two. For instance, satisfaction differed by 5.86% between Claude-3.7-sonnet and Claude-4-sonnet and by -7.8% between Claude-4-sonnet and GPT-5; the negative sign indicates that users preferred Claude-4-sonnet over GPT-5.

2. Planning Strategy: The researchers investigated whether showing users the agent’s plan of attack influenced their experience. A small but statistically significant positive difference (3.1%) in user satisfaction was observed when plans were shown. Behavioral features indicated that showing plans led to less misunderstanding and improved user engagement.

3. Memory Management: This study explored how summarizing older interactions to manage context length and cost affected user satisfaction. Decreasing the maximum steps for memory from 120 to 80, which offered cost savings, had no significant negative impact on user experience.
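For comparison, the standard A/B baseline mentioned earlier can be sketched as a difference in mean ratings between two agent variants with a normal-approximation confidence interval. This is a generic illustration, not the paper's analysis; the function names and toy ratings are assumptions made for the example.

```python
import math
import statistics


def diff_in_means_ci(ratings_a, ratings_b, z=1.96):
    """Difference in mean rating (B minus A) with an unpooled-variance CI."""
    diff = statistics.fmean(ratings_b) - statistics.fmean(ratings_a)
    standard_error = math.sqrt(
        statistics.variance(ratings_a) / len(ratings_a)
        + statistics.variance(ratings_b) / len(ratings_b)
    )
    return diff, (diff - z * standard_error, diff + z * standard_error)


def is_significant(interval):
    """A difference is treated as significant if the interval excludes zero."""
    low, high = interval
    return low > 0 or high < 0


# Toy ratings for two hypothetical agent variants.
variant_a = [3, 4, 3, 4] * 25
variant_b = [4, 5, 4, 5] * 25

difference, interval = diff_in_means_ci(variant_a, variant_b)
```

A result such as the memory case study, where no significant change was found, corresponds to an interval that still contains zero.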

Benchmarks Versus Reality: A Crucial Discrepancy

One of the most striking findings was the substantial discrepancy between in-the-wild results and traditional benchmark performance. GPT-5 outperformed Claude-4-sonnet on six of seven code-related benchmarks, yet human users preferred Claude-4-sonnet over GPT-5 on four of seven task subsets. This anti-correlation underscores the limits of relying solely on benchmark-driven evaluation for collaborative LLM agents and highlights the importance of human-in-the-loop assessment.

Conclusion

The PULSE framework offers a rigorous and efficient method for evaluating human-agent interactions, providing practical insights for software agent design. The research emphasizes that while LLM backbone quality is a primary driver of user satisfaction, real-world human preferences often diverge from static benchmark scores. This work paves the way for optimizing LLMs for interactivity and for developing better models of user satisfaction and engagement in collaborative AI systems.
