Advancing Open-Source AI for Complex Web Research

TLDR: A new open-source Deep Research Agent, ODR+, significantly improves performance on complex web-based question answering by breaking down queries, iteratively searching for information, and synthesizing structured responses. It outperforms previous open-source models and even some proprietary systems on a challenging new benchmark, making advanced AI research more accessible.

Deep Research Agents (DRAs) are systems that take a natural-language question from a user and autonomously search the internet for content with which to answer it. They combine several difficult tasks: breaking down questions, using search engines effectively, and reasoning over retrieved information to produce comprehensive answers. Much of the recent progress in DRAs, however, has come from proprietary, closed-source systems, which makes it difficult for the broader research community to understand, evaluate, and build on them.

A recent paper titled “Improving and Evaluating Open Deep Research Agents” by Doaa Allabadi, Kyle Bradbury, and Jordan M. Malof addresses this challenge by focusing on open-source DRAs. The researchers highlight that at the time of their work, only one open-source DRA, called Open Deep Research (ODR), was available. A major hurdle for evaluating these systems is the lack of suitable benchmarks that are both challenging and computationally feasible for academic labs.

The BrowseComp-Small Benchmark

To tackle the evaluation problem, the authors adapted BrowseComp, a challenging benchmark of over 1,200 complex questions that require multi-hop reasoning and synthesis of information from multiple web sources. They introduced BrowseComp-Small (BC-Small), a more manageable subset of 120 questions split into a training set and a 60-question test set. The smaller benchmark keeps evaluation affordable for academic labs while preserving the difficulty needed to test advanced DRAs.
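As a rough illustration of how a reduced benchmark of this kind could be built, the sketch below draws a fixed-size subset and splits it in two. The article does not say how the authors chose their 120 questions, so the seeded random sampling, the function name `make_small_benchmark`, and the placeholder question IDs are all assumptions made for the example.

```python
import random


def make_small_benchmark(questions, subset_size=120, train_size=60, seed=0):
    """Sample a fixed-size subset and split it into train and test lists.

    Hypothetical helper: random sampling is an assumption, not the
    selection procedure used by the paper's authors.
    """
    if subset_size > len(questions):
        raise ValueError("subset_size exceeds the number of available questions")
    if train_size > subset_size:
        raise ValueError("train_size exceeds subset_size")
    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    subset = rng.sample(questions, subset_size)
    return subset[:train_size], subset[train_size:]


# Placeholder IDs stand in for the real benchmark questions.
full_benchmark = [f"q{i}" for i in range(1200)]
train, test = make_small_benchmark(full_benchmark)
```

Fixing the seed means every lab that reruns the script gets the same 60 test questions, which is what makes results on a small subset comparable.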

Introducing ODR+

The original ODR system, while a foundational open-source agent, struggled with complex, multi-hop research questions, achieving 0% accuracy on the BC-Small test set. The researchers hypothesized that ODR’s limitations stemmed from its inability to decompose complex queries, its lack of iterative reasoning, and its unstructured output. To overcome these, they developed ODR+, an enhanced version of ODR, by introducing three strategic improvements:

  • Question Decomposition: ODR+ first breaks down the user’s original complex question into a set of simpler, focused sub-questions. This involves extracting specific constraints like names, dates, and locations to narrow down the search.
  • Iterative Sub-solution Search: Instead of a single search, ODR+ iteratively addresses each sub-question. It performs multiple web searches for each sub-question, selects the most frequent and relevant URLs, and extracts precise facts from the content. It then analyzes these findings, determines if a sub-question is answered, and suggests new follow-up questions if needed. This dynamic process allows the agent to adapt its strategy based on the information it gathers.
  • Response Synthesis: Once the iterative search is complete, ODR+ synthesizes all the accumulated evidence into a structured final answer. This answer includes an explanation based on the findings, a concise exact answer, and a confidence score. This structured output is crucial for automated evaluation and ensures transparency.
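The three steps above can be sketched as a small control loop. This is a minimal sketch and not the ODR+ implementation: every function name here (`research`, `top_urls`, and the `demo_*` stubs) is hypothetical, and the stubs stand in for the LLM calls and web searches a real agent would make.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FinalAnswer:
    """Structured output: explanation, concise answer, confidence score."""
    explanation: str
    exact_answer: str
    confidence: float


def top_urls(search_results, k=3):
    """Return the k URLs that appear most often across several searches."""
    counts = Counter(url for result in search_results for url in result)
    return [url for url, _ in counts.most_common(k)]


def research(question, decompose, search, extract, is_answered, synthesize,
             max_rounds=3):
    """Decompose the question, search iteratively, then synthesize."""
    pending = list(decompose(question))
    findings = []
    for _ in range(max_rounds):
        if not pending:
            break
        follow_ups = []
        for sub_q in pending:
            results = search(sub_q)  # one list of URLs per search query
            facts = [extract(url, sub_q) for url in top_urls(results)]
            findings.append((sub_q, facts))
            if not is_answered(sub_q, facts):
                follow_ups.append(f"follow-up: {sub_q}")
        pending = follow_ups  # unanswered sub-questions drive the next round
    return synthesize(question, findings)


# Offline stubs (assumptions) so the sketch runs without network access.
def demo_decompose(question):
    return ["Who wrote the book?", "When was it published?"]


def demo_search(sub_q):
    return [["https://a.example", "https://b.example"], ["https://a.example"]]


def demo_extract(url, sub_q):
    return f"fact from {url}"


def demo_is_answered(sub_q, facts):
    return len(facts) > 0


def demo_synthesize(question, findings):
    return FinalAnswer(
        explanation=f"{len(findings)} sub-questions examined",
        exact_answer="unknown",
        confidence=0.5,
    )


answer = research("Which author wrote the book, and when?", demo_decompose,
                  demo_search, demo_extract, demo_is_answered, demo_synthesize)
```

Passing the components in as functions keeps the loop itself independent of any particular search API or language model, and it makes it easy to disable one component at a time when testing.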

Performance and Impact

On the BC-Small test set, the original ODR system answered none of the questions correctly (0% accuracy). ODR+ reached a 10% success rate, correctly answering 6 of the 60 test questions, which the authors report as the state of the art among open-source systems on this benchmark. More surprisingly, ODR+ also outperformed several proprietary systems, including Claude-DeepResearch from Anthropic and Gemini-DeepResearch from Google, both of which scored 0% on the same test set. The authors noted that proprietary systems often produced lengthy reports rather than concise, structured answers, and BrowseComp’s strict evaluation criteria penalize such reports.
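To make the arithmetic and the penalty on long reports concrete, here is a small scoring sketch. It uses normalized exact match as a stand-in, which is an assumption for illustration and not BrowseComp’s actual grading procedure. The answers in the demo are made up.

```python
def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot score an empty test set")
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up data: 60 reference answers, of which the agent gets 6 right.
references = [f"answer {i}" for i in range(60)]
predictions = [f"Answer {i}" if i < 6 else "no idea" for i in range(60)]
score = accuracy(predictions, references)  # 6 / 60

# A long report that contains the right answer still fails exact match.
report = ["After reviewing many sources, the result is likely answer 0."]
report_score = accuracy(report, ["answer 0"])
```

Under a strict criterion like this, a concise final answer can be scored while a multi-paragraph report cannot, which is consistent with the authors’ explanation for the proprietary systems’ 0% results.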

Ablation studies further confirmed the importance of each new component in ODR+. Disabling any of the three core modules (Question Decomposition, Iterative Planning, or Structured Synthesis, corresponding to the three improvements above) led to a significant drop in accuracy, which underscores their collective contribution to ODR+’s improved performance.

Looking Ahead

By releasing ODR+ and its code publicly, the authors aim to foster continued progress in developing and evaluating Deep Research Agents within the open research community. An analyzable and extensible open-source DRA gives other researchers a starting point for further work on autonomous web-based question answering. More details are available in the full paper, arXiv:2508.10152.

Dev Sundaram
