DeepEyesV2: Enabling AI to Actively Use Tools for Complex Multimodal Tasks

TLDR: DeepEyesV2 is a new agentic multimodal AI model that can understand text and images, and actively use external tools like code execution and web search. It employs a two-stage training process (cold-start then reinforcement learning) to teach robust tool-use patterns. Evaluated on a new benchmark, RealX-Bench, designed for integrated perception, search, and reasoning, DeepEyesV2 demonstrated strong performance and adaptive tool invocation, showcasing its ability to tackle complex real-world problems.

In the rapidly evolving landscape of artificial intelligence, the concept of “agentic multimodal models” is gaining significant traction. These advanced AI systems are designed to do more than just understand text and images; they can actively use external tools, such as code execution environments and web search, and integrate these actions into their reasoning processes. This capability is crucial for tackling complex, real-world problems that require dynamic interaction and information gathering.

A new research paper introduces DeepEyesV2, a pioneering agentic multimodal model that explores how to build such a system from the ground up. The researchers, including Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu from Xiaohongshu Inc., delve into data construction, training methods, and model evaluation to create a robust and versatile AI agent.

The Challenge of Tool Use in AI

Existing multimodal models, while impressive in their perception and interpretation abilities, often remain passive. They can process information but lack the autonomy to invoke external tools when needed. For instance, identifying a flower species from an image might require cropping a specific region and then searching for that cropped image online. Current models struggle with this multi-step, tool-augmented reasoning.

The DeepEyesV2 team observed that reinforcement learning (RL) alone wasn’t enough to teach models reliable tool-use behavior. This led them to develop a two-stage training pipeline: a “cold-start” stage to establish fundamental tool-use patterns, followed by a reinforcement learning stage to refine and enhance tool invocation.

DeepEyesV2’s Innovative Approach

DeepEyesV2 stands out by seamlessly integrating programmatic code execution and web retrieval as complementary tools within a single, dynamic reasoning loop. When faced with an image and a user query, DeepEyesV2 first plans its approach. If tools are necessary, it can generate executable Python code or issue web search queries. The results from these tools – whether transformed images, numerical data, or search snippets – are then incorporated back into the model’s context, allowing it to refine its hypotheses and continue reasoning iteratively until a conclusive answer is reached.
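The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `run_agent`, `Step`, `scripted_policy`, and the stub tools are hypothetical names, and the scripted policy stands in for the model itself.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str      # "code", "search", or "answer" (hypothetical action types)
    content: str   # code to run, query to issue, or the final answer


def run_agent(policy, tools, query, max_turns=6):
    """Alternate between the policy and tools until an answer is produced."""
    context = [("user", query)]
    for _ in range(max_turns):
        step = policy(context)
        if step.kind == "answer":
            return step.content, context
        # Run the requested tool and feed its result back into the context.
        observation = tools[step.kind](step.content)
        context.append((step.kind, step.content))
        context.append(("observation", observation))
    return None, context  # turn budget exhausted without an answer


def scripted_policy(context):
    """Stand-in for the model: crop first, then search, then answer."""
    observations = [text for role, text in context if role == "observation"]
    if len(observations) == 0:
        return Step("code", "crop(image, box=(40, 60, 200, 220))")
    if len(observations) == 1:
        return Step("search", "flower with five white petals")
    return Step("answer", f"Best guess based on: {observations[-1]}")


# Stub tools: a real system would execute code in a sandbox and call a search API.
tools = {
    "code": lambda source: f"[stub] executed: {source}",
    "search": lambda q: f"[stub] top result for '{q}'",
}

answer, context = run_agent(scripted_policy, tools, "What species is this flower?")
print(answer)
```

The design point the sketch tries to capture is that tool outputs are appended to the same context the policy reads on its next turn, so code execution and search can be chained in any order.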

This integrated approach offers several advantages:

  • Expanded Analytical Capability: Through executable code, DeepEyesV2 can perform complex operations on visual or numerical data, such as fine-grained image manipulations (cropping, measuring) and quantitative computations.
  • Active Knowledge Seeking: It can proactively access up-to-date external knowledge by retrieving multimodal evidence from the web, reducing reliance on potentially outdated internal knowledge.
  • Iterative, Interleaved Multi-Tool Reasoning: Code execution and search can be dynamically combined within a single problem-solving trajectory, rather than being isolated functions.
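To make the first point concrete, here is the kind of small snippet a model might emit for a cropping and measuring step. It is a toy sketch: the image is a nested list of grayscale values rather than a real image file, and the helper names `crop` and `mean_intensity` are assumptions made for illustration.

```python
def crop(image, left, top, right, bottom):
    """Return the sub-grid covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in image[top:bottom]]


def mean_intensity(region):
    """Average pixel value of a non-empty region."""
    pixels = [p for row in region for p in row]
    return sum(pixels) / len(pixels)


# Toy 4x4 grayscale image (values chosen only for illustration).
image = [
    [0, 0, 0, 0],
    [0, 200, 100, 0],
    [0, 100, 200, 0],
    [0, 0, 0, 0],
]

region = crop(image, left=1, top=1, right=3, bottom=3)
print(region)                  # the 2x2 center of the image
print(mean_intensity(region))  # a simple quantitative measurement on the crop
```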

Introducing RealX-Bench: A New Evaluation Standard

To thoroughly evaluate agentic multimodal models, the researchers introduced RealX-Bench, a comprehensive benchmark designed to assess real-world multimodal reasoning. Unlike existing benchmarks that often focus on isolated capabilities like perception or search, RealX-Bench demands the integration of multiple skills simultaneously. It features challenging questions grounded in real-world scenarios, requiring models to attend to fine-grained visual regions, retrieve external evidence, and reason over multimodal contexts.

On RealX-Bench, current models, even proprietary ones, perform significantly below human levels, highlighting the benchmark’s difficulty and the substantial gap that DeepEyesV2 aims to bridge.

Performance and Insights

DeepEyesV2 was evaluated on RealX-Bench and a wide array of other benchmarks covering real-world understanding, mathematical reasoning, and search-intensive tasks. The results demonstrate its effectiveness, often outperforming both general-purpose multimodal models and prior specialized reasoning approaches. For instance, DeepEyesV2 showed significant gains in mathematical reasoning and achieved superior search capabilities on retrieval-oriented benchmarks.

A key finding from the analysis is DeepEyesV2’s task-adaptive tool invocation. For perception tasks, it primarily uses image operations like cropping. For reasoning tasks, it favors numerical computations. In search tasks, it intelligently combines image manipulation with search tools. Reinforcement learning further refines this behavior, enabling more complex tool combinations and adaptive decision-making, where the model learns to selectively invoke tools based on the problem context, rather than over-relying on them.
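One simple way to inspect this kind of task-adaptive behavior is to tally which tools appear in logged trajectories, grouped by task type. The sketch below assumes a hypothetical log format of `(task_type, [tool names])` pairs; the toy records are made up for illustration and are not results from the paper.

```python
from collections import Counter, defaultdict


def tool_usage_by_task(trajectories):
    """Return, per task type, the share of calls made to each tool."""
    counts = defaultdict(Counter)
    for task_type, tool_calls in trajectories:
        counts[task_type].update(tool_calls)
    return {
        task: {tool: n / sum(c.values()) for tool, n in c.items()}
        for task, c in counts.items()
    }


# Hypothetical trajectories, not real measurements.
toy_trajectories = [
    ("perception", ["crop", "crop"]),
    ("reasoning", ["compute"]),
    ("search", ["crop", "web_search"]),
    ("perception", ["crop", "compute"]),
]

usage = tool_usage_by_task(toy_trajectories)
for task, shares in usage.items():
    print(task, shares)
```

Comparing these shares before and after the reinforcement learning stage would be one way to see whether tool selection becomes more selective, assuming such logs are available.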

Conclusion

DeepEyesV2 represents a significant step toward building truly agentic multimodal models. By combining a two-stage training pipeline, a carefully curated dataset, and a novel evaluation benchmark, the researchers have demonstrated a model that can actively invoke and integrate external tools into its reasoning process. This work provides valuable guidance for the AI community in developing more reliable, flexible, and intelligent multimodal agents. You can read the full research paper here.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
