TLDR: DEEP RESEARCH COMPARATOR is a new open-source platform for evaluating deep research AI agents. It allows side-by-side comparison of agent-generated reports and collects fine-grained human feedback on both final reports and intermediate steps. The platform also introduces SIMPLE DEEPRESEARCH, a baseline agent scaffold that makes it easy to plug in different LLMs. Experiments show its utility in ranking agents and analyzing their process-level performance, and highlight the importance of clear intermediate steps for human understanding and feedback.
Evaluating the performance of advanced AI systems, particularly those that autonomously search the web, analyze information, and generate comprehensive reports, has been a significant challenge. These “deep research agents” produce long-form content through many intermediate steps, making them difficult to assess with traditional evaluation methods.
To address this, researchers from Carnegie Mellon University and Amazon have introduced DEEP RESEARCH COMPARATOR, a platform designed for detailed human evaluation of these AI agents. It offers a complete framework for hosting deep research agents, comparing them side by side, collecting fine-grained human feedback, and computing performance rankings.
How DEEP RESEARCH COMPARATOR Works
When a user submits a query, the platform displays the final reports from two different agents alongside their intermediate steps. This side-by-side view lets human annotators judge which final report is better overall. Beyond that overall judgment, the platform supports fine-grained feedback: annotators can mark individual intermediate steps, or specific spans of text within the final report, as helpful or unhelpful.
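To make the shape of this feedback concrete, here is a minimal sketch of how such annotations could be represented. The field names and vote encoding are illustrative assumptions, not the platform's actual data schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Illustrative only: field names and the vote encoding are assumptions,
# not the platform's actual schema.

@dataclass
class FineGrainedVote:
    """One up/down judgment on an intermediate step or a span of report text."""
    target: Literal["intermediate_step", "report_span"]
    target_id: str                 # e.g. a step index or character offsets
    vote: Literal["up", "down"]
    comment: Optional[str] = None  # optional free-text justification

@dataclass
class ComparisonAnnotation:
    """One annotator's feedback for a single side-by-side comparison."""
    query: str
    preferred_agent: Literal["A", "B", "tie"]  # overall report preference
    votes_a: list[FineGrainedVote] = field(default_factory=list)
    votes_b: list[FineGrainedVote] = field(default_factory=list)
```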
This detailed feedback is crucial for understanding how agents arrive at their conclusions and for improving their behavior. It helps identify which steps in an agent’s process are effective and which need refinement, information that is particularly valuable for training methods such as reinforcement learning.
Introducing SIMPLE DEEPRESEARCH
Alongside the evaluation platform, the team also developed SIMPLE DEEPRESEARCH, an end-to-end agent scaffold. This scaffold acts as a baseline, making it easy to integrate various large language models (LLMs) and transform them into deep research agents ready for evaluation on the platform. Its prompt-based design allows for flexible customization and testing of different LLM configurations.
The workflow of SIMPLE DEEPRESEARCH is iterative. At each step, the agent takes the user query and its past actions as input and generates a thought followed by an action. Actions include searching the web, planning the research, drafting parts of the report, summarizing previous steps, or generating the complete final report. This structured approach makes the agent’s reasoning easier to follow.
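A rough sketch of that loop, under assumed helper names (`call_llm`, `run_search`) and an assumed action encoding, might look like the following; the released scaffold's actual prompts and parsing will differ in detail.

```python
from enum import Enum

class Action(str, Enum):
    SEARCH = "search"        # query the web
    PLAN = "plan"            # outline the research
    DRAFT = "draft"          # write part of the report
    SUMMARIZE = "summarize"  # condense earlier steps
    REPORT = "report"        # emit the complete final report

def simple_deep_research(query, call_llm, run_search, max_steps=20):
    """Iterative thought/action loop in the spirit of SIMPLE DEEPRESEARCH.

    `call_llm` and `run_search` are assumed callables supplied by the caller;
    the action set and control flow here are illustrative, not the real scaffold.
    """
    history = []  # past thoughts, actions, and observations
    for _ in range(max_steps):
        # The model sees the query plus everything it has done so far.
        thought, action, payload = call_llm(query=query, history=history)
        if action == Action.REPORT:
            return payload                      # payload is the final report text
        if action == Action.SEARCH:
            observation = run_search(payload)   # payload is the search query
        else:
            observation = payload               # plan / draft / summary text
        history.append({"thought": thought, "action": action, "observation": observation})
    raise RuntimeError("No final report produced within the step budget.")
```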
Real-World Application and Findings
To demonstrate the utility of DEEP RESEARCH COMPARATOR, the researchers conducted an experiment with 17 annotators who evaluated three deep research agents: Perplexity Deep Research, GPT Researcher, and SIMPLE DEEPRESEARCH (using Gemini 2.5 Flash). They collected data from 176 real user queries covering a wide range of topics.
The results showed that GPT Researcher ranked highest on overall user preference for final reports. Interestingly, however, it received the lowest “step upvote rate” for its intermediate steps. This suggests that while its final output was preferred, its process as displayed was less clear and less detailed than that of agents like SIMPLE DEEPRESEARCH, whose more granular, understandable intermediate steps earned a higher upvote rate.
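For illustration, the two quantities behind this finding can be aggregated roughly as below; the data shapes and formulas are assumptions made for the sketch, not the paper's exact ranking procedure.

```python
from collections import defaultdict

def preference_win_rates(comparisons):
    """comparisons: (agent_a, agent_b, winner) tuples, winner in {agent_a, agent_b, "tie"}.

    Returns each agent's share of pairwise wins (ties count as half a win).
    This is one simple way to rank agents; the platform's method may differ.
    """
    wins, totals = defaultdict(float), defaultdict(int)
    for a, b, winner in comparisons:
        totals[a] += 1
        totals[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {agent: wins[agent] / totals[agent] for agent in totals}

def step_upvote_rate(step_votes):
    """Fraction of an agent's intermediate-step votes that are upvotes."""
    return sum(v == "up" for v in step_votes) / len(step_votes) if step_votes else 0.0
```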
This highlights an important insight for AI developers: how an agent presents its intermediate steps can significantly affect human understanding and feedback, even when the final output is strong. The study also found close alignment between human rankings and automated LLM-based evaluations of report quality, though human agreement on subtle distinctions was sometimes low, pointing to a need for expert annotators in specialized domains.
Future Impact
DEEP RESEARCH COMPARATOR is an open-source platform that promises to be a valuable resource for the deep research agent community. It facilitates benchmarking, detailed analysis of agent behavior, and the training of agents through techniques like process supervision and reinforcement learning from human feedback. The code base and collected data are planned for release, further supporting advancements in this field. You can find the full research paper here: Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents.


