TLDR: Recon-Act is a self-evolving multi-agent system designed to improve AI agents’ ability to interact with real-world webpages. It uses a dual-team framework: a Reconnaissance Team that learns from task successes and failures to generate new ‘generalized tools’ (hints or code), and an Action Team that uses these tools to execute tasks. This closed-loop learning process allows Recon-Act to adapt to unseen websites and solve complex, multi-step tasks more effectively, achieving state-of-the-art performance on the VisualWebArena dataset.
In the rapidly evolving landscape of artificial intelligence, the development of agents capable of interacting with real-world webpages remains a significant challenge. While multimodal models have made considerable progress, existing browser-use agents often struggle with complex, multi-step tasks, exhibiting disorganized actions and excessive trial-and-error. Addressing these limitations, a new framework called Recon-Act has been introduced, offering a self-evolving multi-agent system designed to enhance web interaction through a unique Reconnaissance–Action behavioral approach.
Recon-Act operates on a dual-team structure: the Reconnaissance Team and the Action Team. The Reconnaissance Team is tasked with a crucial learning role. It conducts comparative analysis, examining both successful and unsuccessful task trajectories. By contrasting these outcomes, it identifies the root causes of failures and devises solutions. These solutions are then abstracted into what the paper calls “generalized tools,” which can take the form of helpful hints or rule-based code. Newly generated tools are registered in real time to a central archive, so the system grows steadily more capable.
The Action Team, on the other hand, is responsible for executing tasks. It breaks down user intents, orchestrates the use of available tools (including the newly generated generalized tools), and performs actions on webpages. Empowered by the insights and tools provided by the Reconnaissance Team, the Action Team can re-evaluate and refine its approach, creating a closed-loop training pipeline of data, tools, actions, and feedback. This iterative process allows Recon-Act to evolve and improve its performance over time.
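The closed loop described above (execute, diagnose the failure, register a new tool, retry) can be sketched in a few lines of Python. All class and method names here (`ActionTeam`, `ReconTeam`, `ToolRegistry`, and so on) are illustrative stand-ins, not the paper's API, and the stub Reconnaissance Team looks only at the failed trajectory for brevity:

```python
from dataclasses import dataclass

# Illustrative stand-ins for Recon-Act's two teams; names are not from the paper.

@dataclass
class Trajectory:
    success: bool
    steps: list

class ToolRegistry:
    """Central archive that tools are registered to at runtime."""
    def __init__(self):
        self.tools = []
    def register(self, tool):
        self.tools.append(tool)

class ActionTeam:
    """Stub: succeeds only once the registry holds a tool for the task."""
    def execute(self, task, registry):
        ok = any(t["task"] == task for t in registry.tools)
        return Trajectory(success=ok, steps=["attempt:" + task])

class ReconTeam:
    """Stub: inspects the failed trajectory and abstracts a fix into a tool."""
    def analyze_and_build_tool(self, task, failure):
        return {"task": task, "hint": f"workaround for {task}"}

def run_closed_loop(task, action_team, recon_team, registry, max_rounds=3):
    """Attempt a task; on failure, synthesize a tool and retry with it."""
    trajectory = action_team.execute(task, registry)
    rounds = 0
    while not trajectory.success and rounds < max_rounds:
        tool = recon_team.analyze_and_build_tool(task, trajectory)
        registry.register(tool)  # live immediately for the retry
        trajectory = action_team.execute(task, registry)
        rounds += 1
    return trajectory, rounds

registry = ToolRegistry()
traj, rounds = run_closed_loop("find cheapest laptop", ActionTeam(), ReconTeam(), registry)
print(traj.success, rounds, len(registry.tools))  # True 1 1
```

The point of the sketch is the feedback edge: the registry mutated by the Reconnaissance Team is the same one the Action Team reads on its next attempt, which is what makes the pipeline closed-loop rather than a one-shot retry.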
The researchers behind Recon-Act have outlined a 6-level implementation roadmap for their system, progressively increasing its autonomy. Currently, the system has reached Level 3, which involves a hybrid human-AI collaboration. At this stage, components like the Master, Execution Agent, and Coder are powered by large language or vision-language models, while the Analyst and Tool Manager still benefit from human intervention. This configuration allows for robust learning and adaptation while leveraging human expertise where current AI capabilities are still developing.
A key aspect of Recon-Act is its ability to perform “reconnaissance operations” within the browser environment. This involves conducting exploratory actions to distill crucial observations from information-rich web pages. This targeted exploration, especially when the agent encounters difficulties, helps in generating specific feedback and creating tools that address particular problems. This mechanism significantly improves the system’s adaptability to unfamiliar websites and its ability to solve long-horizon tasks.
The effectiveness of Recon-Act has been demonstrated through experiments on the challenging VisualWebArena dataset, a benchmark designed for evaluating agents on realistic visual web tasks. Recon-Act achieved a state-of-the-art overall success rate of 36.48%, outperforming previous best methods by a notable margin. For instance, on the Shopping subdomain, it achieved 39.27% success, a substantial improvement over prior results. While there remains a gap to human performance, these results highlight Recon-Act’s significant advancements in autonomous web interaction.
The system’s architecture during training involves a user query and browser context being processed by a Master Agent. If a trajectory is incorrect, the Reconnaissance Team steps in. Its Analyst devises a plan, and the Coder implements a new tool. This tool is then registered and deployed to the Action Team’s Tool Manager, augmenting the system’s capabilities for subsequent tasks. During inference, only the Action Team is active, leveraging the pre-trained and automatically generated tools to efficiently complete tasks.
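A minimal sketch of that Analyst → Coder → Tool Manager handoff, with plain functions standing in for the model- (or human-) driven components; every name below is hypothetical:

```python
# Illustrative sketch of the tool-synthesis path: the Analyst devises a plan
# from a failed trajectory, the Coder turns it into a rule-based tool, and the
# Tool Manager registers it for subsequent tasks. Names are not from the paper.

def analyst(failed_trajectory: list[str]) -> dict:
    """Devise a plan: identify the failing step and propose a remedy."""
    failing_step = failed_trajectory[-1]
    return {"trigger": failing_step, "fix": "scroll_down_then_retry"}

def coder(plan: dict):
    """Implement the plan as a rule-based tool (a plain callable here)."""
    def tool(observation: str):
        # Fire only when the observation matches the failure signature.
        return plan["fix"] if plan["trigger"] in observation else None
    tool.__name__ = f"tool_{plan['fix']}"
    return tool

class ToolManager:
    """Holds registered tools and consults them on new observations."""
    def __init__(self):
        self._tools = {}
    def register(self, tool):
        self._tools[tool.__name__] = tool  # deployed for subsequent tasks
    def suggest(self, observation: str):
        for tool in self._tools.values():
            fix = tool(observation)
            if fix:
                return fix
        return None

plan = analyst(["open search", "click result", "element not visible"])
manager = ToolManager()
manager.register(coder(plan))
print(manager.suggest("error: element not visible"))  # scroll_down_then_retry
```

At inference time only the `ToolManager.suggest` path would run, mirroring the paper's split where the Reconnaissance Team is active during training but the Action Team alone handles deployment.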
The tools created by Recon-Act can operate in two modes: Hint or Decision. Hint-mode tools provide reconnaissance signals to the Execution Agent to guide task completion, suited to less deterministic or context-sensitive situations. Decision-mode tools directly emit an action that the system executes, suited to consistently stable behaviors. This split lets the system apply tools flexibly: guidance where judgment is needed, direct execution where behavior is predictable.
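The two modes can be expressed as a small dispatch interface; `ToolMode`, `GeneralizedTool`, and `apply_tool` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class ToolMode(Enum):
    HINT = "hint"          # output guides the Execution Agent
    DECISION = "decision"  # output is an action executed directly

@dataclass
class GeneralizedTool:
    """A registered generalized tool (illustrative, not the paper's API)."""
    name: str
    mode: ToolMode
    run: Callable[[dict], str]  # observation -> hint text or serialized action

def apply_tool(tool: GeneralizedTool, observation: dict):
    """Return (hint, action); exactly one of the two is set."""
    output = tool.run(observation)
    if tool.mode is ToolMode.HINT:
        return output, None  # injected into the Execution Agent's context
    return None, output      # executed directly, bypassing the agent

# Example: a hint tool flagging pagination, and a decision tool clicking "Next".
hint_tool = GeneralizedTool(
    "pagination_hint", ToolMode.HINT,
    lambda obs: "Results span multiple pages; check the pagination bar.",
)
decision_tool = GeneralizedTool(
    "click_next", ToolMode.DECISION,
    lambda obs: f"click(element_id={obs['next_button_id']})",
)

hint, _ = apply_tool(hint_tool, {})
_, action = apply_tool(decision_tool, {"next_button_id": 42})
print(hint)    # pagination guidance for the agent's prompt
print(action)  # click(element_id=42)
```

The mode field is what lets one registry hold both kinds of tool: the caller inspects which slot of the returned pair is filled, rather than needing separate pipelines for hints and actions.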
Looking ahead, the researchers plan to further increase Recon-Act’s autonomy, aiming for intelligence beyond Level 5. Future work includes enabling random-walk-style self-exploration to generate more training data, strengthening the reasoning and coding skills of the Analyst and Tool Manager components to reduce human reliance, and expanding the reconnaissance capabilities to generalize across a broader range of heterogeneous web environments. For more details, you can read the full research paper here.


