TLDR: This survey explores the critical role of Reinforcement Learning (RL) in developing deep research systems, which are AI agents capable of solving complex, multi-step tasks using reasoning, web search, and tool use. It highlights the limitations of traditional training methods like SFT and DPO for such long-horizon, interactive tasks. The paper systematizes RL foundations across data synthesis, RL methods (stability, reward design, multimodality), and training frameworks. It also covers agent architecture and evaluation, offering practical guidance for building robust and transparent deep research agents.
Deep research systems, a new generation of AI, are designed to tackle complex, multi-step tasks by combining reasoning, searching the internet, accessing user files, and using various tools. These advanced AI agents are evolving towards a hierarchical structure, typically involving a Planner, a Coordinator, and multiple Executors. The Planner breaks down tasks, the Coordinator manages assignments and aggregates results, and the Executors perform specific actions like searching or browsing.
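To make the division of labor concrete, here is a minimal sketch of that Planner / Coordinator / Executor hierarchy. The class names, the toy in-memory "search" corpus, and the round-robin assignment are illustrative assumptions for this post, not an interface defined by the survey:

```python
# Illustrative sketch of a Planner / Coordinator / Executor hierarchy.
# All names and the toy corpus are assumptions, not the survey's API.
from dataclasses import dataclass, field


@dataclass
class Planner:
    """Decomposes a research question into smaller sub-tasks."""

    def plan(self, question: str) -> list[str]:
        # A real planner would call an LLM; here we split on a delimiter.
        return [part.strip() for part in question.split(" and ")]


@dataclass
class SearchExecutor:
    """Performs one concrete action, e.g. a lookup in a toy corpus."""

    corpus: dict[str, str] = field(default_factory=dict)

    def run(self, sub_task: str) -> str:
        return self.corpus.get(sub_task, "no result")


@dataclass
class Coordinator:
    """Assigns sub-tasks to executors and aggregates their results."""

    planner: Planner
    executors: list[SearchExecutor]

    def solve(self, question: str) -> list[str]:
        sub_tasks = self.planner.plan(question)
        # Round-robin assignment of sub-tasks across available executors.
        return [
            self.executors[i % len(self.executors)].run(task)
            for i, task in enumerate(sub_tasks)
        ]


corpus = {"capital of France": "Paris", "capital of Japan": "Tokyo"}
coordinator = Coordinator(Planner(), [SearchExecutor(corpus), SearchExecutor(corpus)])
print(coordinator.solve("capital of France and capital of Japan"))
```

Because each executor only exposes a `run` method, swapping in a new tool (a browser, a file reader) means adding another executor class without touching the planner.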
While these systems hold immense promise, training them effectively presents significant challenges. Traditional AI training methods like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) have limitations. SFT, which involves training models on predefined examples, can lead to imitation bias, where the AI simply copies patterns without truly understanding, and struggles to learn from real-world feedback like tool failures. DPO, which relies on human preferences, is often too focused on immediate textual choices and struggles with long-term planning and complex decision-making, especially when dealing with multiple objectives like accuracy and cost.
This is where Reinforcement Learning (RL) emerges as a powerful alternative. RL is well-suited for training deep research agents because it allows the AI to learn by interacting with its environment, making decisions, and receiving feedback over an entire sequence of actions. This approach enables the AI to explore different strategies, recover from errors, and assign credit to decisions that lead to long-term success, reducing its reliance on human-defined rules and biases.
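The credit-assignment idea can be shown with a few lines of standard RL arithmetic: a single end-of-episode reward is propagated back through discounted returns, so early decisions (say, a good search query) share credit for eventual success. The episode and discount factor below are illustrative, not taken from the survey:

```python
# Discounted returns: spread a sparse end-of-episode reward back over
# every decision in the trajectory. Episode values are illustrative.

def discounted_returns(rewards: list[float], gamma: float = 0.9) -> list[float]:
    """Return G_t = r_t + gamma * G_{t+1} for each step of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Four intermediate actions earn nothing; only the final answer is rewarded,
# yet every step receives a (discounted) share of the credit.
episode_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print([round(g, 4) for g in discounted_returns(episode_rewards)])
```

Each earlier step gets a smaller but non-zero return, which is exactly the signal SFT and DPO lack for long-horizon, tool-using episodes.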
A recent survey, the first of its kind, delves into the foundational aspects of using RL for deep research systems. It organizes the field into three main areas:
Data Synthesis and Curation
This section explores methods for creating and refining high-quality training data, often generated synthetically, to support the multi-step reasoning and tool usage required by these agents. The goal is to produce tasks that genuinely challenge the AI to perform complex operations, rather than relying on simple lookups or memorized facts. Strategies include creating questions that require integrating information from multiple documents, navigating through web-like structures, and gradually increasing task difficulty.
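A tiny sketch of the multi-document idea: chain two facts through a "bridge" entity so the synthesized question cannot be answered from either document alone. The toy facts and the question template are assumptions for illustration, not the survey's actual synthesis pipeline:

```python
# Illustrative two-hop question synthesis: the answer requires following a
# bridge entity across two "documents". Facts and template are toy examples.

def compose_two_hop(bridge_facts: dict[str, str], final_facts: dict[str, str]):
    """Yield (question, answer) pairs requiring a hop through a bridge entity."""
    for entity, bridge in bridge_facts.items():
        if bridge in final_facts:
            question = (
                f"What is the capital of the country where {entity} "
                f"is headquartered?"
            )
            yield question, final_facts[bridge]


# Document 1: company headquarters. Document 2: national capitals.
headquarters = {"Toyota": "Japan", "Airbus": "France"}
capitals = {"Japan": "Tokyo", "France": "Paris"}

for q, a in compose_two_hop(headquarters, capitals):
    print(q, "->", a)
```

Answering "Tokyo" requires first resolving Toyota to Japan, so a model cannot succeed with a single lookup, which is the property these synthesis strategies aim for.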
RL Methods for Agentic Research
This area focuses on the specific RL techniques used to train these agents. Key advancements include improving training stability and efficiency, handling long sequences of interactions, designing effective reward systems that guide the AI towards desired behaviors, optimizing for multiple objectives, and integrating multimodal information (like images and text). Researchers are also exploring how to make agents learn when to use tools and when to rely on their internal knowledge, as well as how to manage the costs and latency associated with real-time tool use.
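One way to picture the multi-objective trade-off is a scalarized reward that pays for correctness but charges for tool calls and latency, so the agent learns when a tool is worth invoking. The weights below are illustrative assumptions, not values from any cited work:

```python
# Sketch of a multi-objective reward: accuracy minus penalties for tool
# usage and latency. The weights are illustrative assumptions.

def research_reward(
    correct: bool,
    tool_calls: int,
    latency_s: float,
    w_cost: float = 0.05,
    w_latency: float = 0.01,
) -> float:
    """Scalarize accuracy, tool cost, and latency into one training signal."""
    accuracy = 1.0 if correct else 0.0
    return accuracy - w_cost * tool_calls - w_latency * latency_s


# A correct answer after 4 tool calls and 10 s of latency still scores far
# above a fast but wrong answer.
print(round(research_reward(True, tool_calls=4, latency_s=10.0), 4))
print(round(research_reward(False, tool_calls=0, latency_s=1.0), 4))
```

Tuning the penalty weights shifts the learned policy between "always search" and "answer from internal knowledge", the exact balance the survey discusses.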
Agentic RL Training Frameworks
Training deep research agents that interact with tools over extended periods is a complex engineering challenge. This part of the survey examines the open-source infrastructure and frameworks designed to make RL training practical and scalable. It highlights common bottlenecks, such as slow data collection and unstable learning, and discusses how new frameworks are addressing these issues through asynchronous training, better credit assignment, and efficient system orchestration.
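The asynchronous-training idea can be sketched in a few lines: rollout workers with uneven tool latency push finished trajectories onto a queue, and the learner consumes them in completion order instead of waiting for the slowest worker. This only illustrates the decoupling; real frameworks shard the same pattern across processes and GPUs:

```python
# Minimal sketch of asynchronous rollout collection: workers simulate
# slow tool-using episodes; the learner consumes trajectories as they
# finish. Thread counts and sleep times are illustrative.
import queue
import threading
import time

trajectories: queue.Queue = queue.Queue()


def rollout_worker(worker_id: int, episodes: int) -> None:
    """Simulate episodes with worker-dependent tool/API latency."""
    for ep in range(episodes):
        time.sleep(0.01 * worker_id)  # stand-in for tool latency
        trajectories.put((worker_id, ep, [f"step-{i}" for i in range(3)]))


workers = [threading.Thread(target=rollout_worker, args=(i, 2)) for i in (1, 2, 3)]
for w in workers:
    w.start()

# The "learner" pulls trajectories in completion order, never blocking on
# the slowest worker for the whole batch.
collected = [trajectories.get() for _ in range(6)]
for w in workers:
    w.join()
print(f"collected {len(collected)} trajectories")
```

Swapping the threads for separate rollout processes (and the queue for an RPC buffer) gives the shape of the asynchronous frameworks the survey reviews.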
Beyond these core training foundations, the survey also touches upon crucial cross-cutting areas:
Agent Architecture and Coordination
This involves how agents are structured and how different components work together. Hierarchical designs, where a main planner delegates tasks to specialized sub-agents, are becoming common. This modularity allows for better division of labor, parallel execution, and easier integration of new tools.
Evaluations and Benchmarks
Measuring the progress of deep research systems requires robust evaluation methods. The survey covers various benchmarks, from traditional question-answering tasks to more complex scenarios involving visual information, long-form text generation, and domain-specific challenges that mimic real-world professional workflows.
This comprehensive overview provides a roadmap for understanding and advancing the field of deep research systems, emphasizing the critical role of Reinforcement Learning in building intelligent, adaptable, and robust AI agents. For more in-depth information, refer to the full survey paper.


