TLDR: DEEP RESEARCH COMPARATOR is a new open-source platform for evaluating deep research AI agents. It allows side-by-side comparison of agent-generated reports and collects fine-grained human feedback on both final reports and intermediate steps. The platform also introduces SIMPLE DEEPRESEARCH, a baseline agent scaffold that makes it easy to plug in different LLMs. Experiments show its utility in ranking agents and analyzing their process-level performance, and highlight the importance of clear intermediate steps for human understanding and feedback.
Evaluating the performance of advanced AI systems, particularly those that autonomously search the web, analyze information, and generate comprehensive reports, has been a significant challenge. These “deep research agents” produce long-form content through many intermediate steps, making them difficult to assess with traditional evaluation methods.
To address this, researchers from Carnegie Mellon University and Amazon have introduced DEEP RESEARCH COMPARATOR, a platform designed for detailed human evaluation of these AI agents. It offers a complete framework for hosting deep research agents, comparing them side by side, collecting fine-grained human feedback, and computing performance rankings.
How DEEP RESEARCH COMPARATOR Works
When a user submits a query, the platform displays the final reports from two different agents alongside their intermediate steps. This side-by-side view lets human annotators judge which final report is better overall. Beyond that overall judgment, the platform supports fine-grained feedback: annotators can mark individual intermediate steps, or specific spans of text within the final report, as helpful or unhelpful.
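To make the shape of this feedback concrete, here is a minimal sketch of how such annotations could be represented. The field names and vote encoding are illustrative assumptions, not the platform's actual data schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Illustrative only: field names and the vote encoding are assumptions,
# not the platform's actual schema.

@dataclass
class FineGrainedVote:
    """One up/down judgment on an intermediate step or a span of report text."""
    target: Literal["intermediate_step", "report_span"]
    target_id: str                 # e.g. a step index or character offsets
    vote: Literal["up", "down"]
    comment: Optional[str] = None  # optional free-text justification

@dataclass
class ComparisonAnnotation:
    """One annotator's feedback for a single side-by-side comparison."""
    query: str
    preferred_agent: Literal["A", "B", "tie"]  # overall report preference
    votes_a: list[FineGrainedVote] = field(default_factory=list)
    votes_b: list[FineGrainedVote] = field(default_factory=list)
```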
This detailed feedback is crucial for understanding how agents arrive at their conclusions and for improving their behavior. It helps identify which steps in an agent’s process are effective and which need refinement, information that is particularly valuable for training methods such as reinforcement learning.
Introducing SIMPLE DEEPRESEARCH
Alongside the evaluation platform, the team also developed SIMPLE DEEPRESEARCH, an end-to-end agent scaffold. This scaffold acts as a baseline, making it easy to integrate various large language models (LLMs) and transform them into deep research agents ready for evaluation on the platform. Its prompt-based design allows for flexible customization and testing of different LLM configurations.
The workflow of SIMPLE DEEPRESEARCH is iterative. At each step, the agent takes the user query and its past actions as input and generates a thought followed by an action. Actions include searching the web, planning the research, drafting parts of the report, summarizing previous steps, or generating the complete final report. This structured approach makes the agent’s reasoning easier to follow.
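A rough sketch of that loop, under assumed helper names (`call_llm`, `run_search`) and an assumed action encoding, might look like the following; the released scaffold's actual prompts and parsing will differ in detail.

```python
from enum import Enum

class Action(str, Enum):
    SEARCH = "search"        # query the web
    PLAN = "plan"            # outline the research
    DRAFT = "draft"          # write part of the report
    SUMMARIZE = "summarize"  # condense earlier steps
    REPORT = "report"        # emit the complete final report

def simple_deep_research(query, call_llm, run_search, max_steps=20):
    """Iterative thought/action loop in the spirit of SIMPLE DEEPRESEARCH.

    `call_llm` and `run_search` are assumed callables supplied by the caller;
    the action set and control flow here are illustrative, not the real scaffold.
    """
    history = []  # past thoughts, actions, and observations
    for _ in range(max_steps):
        # The model sees the query plus everything it has done so far.
        thought, action, payload = call_llm(query=query, history=history)
        if action == Action.REPORT:
            return payload                      # payload is the final report text
        if action == Action.SEARCH:
            observation = run_search(payload)   # payload is the search query
        else:
            observation = payload               # plan / draft / summary text
        history.append({"thought": thought, "action": action, "observation": observation})
    raise RuntimeError("No final report produced within the step budget.")
```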
Real-World Application and Findings
To demonstrate the utility of DEEP RESEARCH COMPARATOR, the researchers conducted an experiment with 17 annotators who evaluated three deep research agents: Perplexity Deep Research, GPT Researcher, and SIMPLE DEEPRESEARCH (using Gemini 2.5 Flash). They collected data from 176 real user queries covering a wide range of topics.
The results showed that GPT Researcher ranked highest on overall user preference for final reports. Interestingly, however, it received the lowest “step upvote rate” for its intermediate steps. This suggests that while its final output was preferred, its process as displayed was less clear and less detailed than that of agents like SIMPLE DEEPRESEARCH, whose more granular, understandable intermediate steps earned a higher upvote rate.
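For illustration, the two quantities behind this finding can be aggregated roughly as below; the data shapes and formulas are assumptions made for the sketch, not the paper's exact ranking procedure.

```python
from collections import defaultdict

def preference_win_rates(comparisons):
    """comparisons: (agent_a, agent_b, winner) tuples, winner in {agent_a, agent_b, "tie"}.

    Returns each agent's share of pairwise wins (ties count as half a win).
    This is one simple way to rank agents; the platform's method may differ.
    """
    wins, totals = defaultdict(float), defaultdict(int)
    for a, b, winner in comparisons:
        totals[a] += 1
        totals[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {agent: wins[agent] / totals[agent] for agent in totals}

def step_upvote_rate(step_votes):
    """Fraction of an agent's intermediate-step votes that are upvotes."""
    return sum(v == "up" for v in step_votes) / len(step_votes) if step_votes else 0.0
```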
This highlights an important insight for AI developers: how an agent presents its intermediate steps can significantly affect human understanding and feedback, even when the final output is strong. The study also found close alignment between human rankings and automated LLM-based evaluations of report quality, though human agreement on subtle distinctions was sometimes low, pointing to a need for expert annotators in specialized domains.
Future Impact
DEEP RESEARCH COMPARATOR is an open-source platform that promises to be a valuable resource for the deep research agent community. It facilitates benchmarking, detailed analysis of agent behavior, and the training of agents through techniques like process supervision and reinforcement learning from human feedback. The code base and collected data are planned for release, further supporting advancements in this field. You can find the full research paper here: Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents.


