DeepEyesV2: Enabling AI to Actively Use Tools for Complex Multimodal Tasks

TLDR: DeepEyesV2 is a new agentic multimodal AI model that can understand text and images, and actively use external tools like code execution and web search. It employs a two-stage training process (cold-start then reinforcement learning) to teach robust tool-use patterns. Evaluated on a new benchmark, RealX-Bench, designed for integrated perception, search, and reasoning, DeepEyesV2 demonstrated strong performance and adaptive tool invocation, showcasing its ability to tackle complex real-world problems.

In the rapidly evolving landscape of artificial intelligence, the concept of “agentic multimodal models” is gaining significant traction. These advanced AI systems are designed to do more than just understand text and images; they can actively use external tools, such as code execution environments and web search, and integrate these actions into their reasoning processes. This capability is crucial for tackling complex, real-world problems that require dynamic interaction and information gathering.

A new research paper introduces DeepEyesV2, a pioneering agentic multimodal model that explores how to build such a system from the ground up. The researchers, including Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu from Xiaohongshu Inc., delve into data construction, training methods, and model evaluation to create a robust and versatile AI agent.

The Challenge of Tool Use in AI

Existing multimodal models, while impressive in their perception and interpretation abilities, often remain passive. They can process information but lack the autonomy to invoke external tools when needed. For instance, identifying a flower species from an image might require cropping a specific region and then searching for that cropped image online. Current models struggle with this multi-step, tool-augmented reasoning.

The DeepEyesV2 team observed that reinforcement learning (RL) alone was not enough to teach models reliable tool-use behavior. This led them to develop a two-stage training pipeline: a “cold-start” stage to establish fundamental tool-use patterns, followed by a reinforcement learning stage to refine and enhance tool invocation.

DeepEyesV2’s Innovative Approach

DeepEyesV2 stands out by seamlessly integrating programmatic code execution and web retrieval as complementary tools within a single, dynamic reasoning loop. When faced with an image and a user query, DeepEyesV2 first plans its approach. If tools are necessary, it can generate executable Python code or issue web search queries. The results from these tools – whether transformed images, numerical data, or search snippets – are then incorporated back into the model’s context, allowing it to refine its hypotheses and continue reasoning iteratively until a conclusive answer is reached.

This integrated approach offers several advantages:

Expanded Analytical Capability: Through executable code, DeepEyesV2 can perform complex operations on visual or numerical data, such as fine-grained image manipulations (cropping, measuring) and quantitative computations.
Active Knowledge Seeking: It can proactively access up-to-date external knowledge by retrieving multimodal evidence from the web, reducing reliance on potentially outdated internal knowledge.
Iterative, Interleaved Multi-Tool Reasoning: Code execution and search can be dynamically combined within a single problem-solving trajectory, rather than being isolated functions.
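
The reasoning loop described above (plan, call a tool, fold the result back into the context, repeat until an answer is reached) can be illustrated with a small sketch. This is not the DeepEyesV2 implementation: the `Step` type, the `scripted_policy` function standing in for the model, and the two toy tools are all hypothetical names made up for illustration, and the search tool reads from a hard-coded dictionary rather than the web.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    kind: str     # "code", "search", or "answer"
    payload: str  # Python source, a search query, or the final answer


def run_code(source: str) -> str:
    """Toy code tool: execute Python source and return what it binds to `result`."""
    scope: dict = {}
    # Empty builtins keep the toy small; this is NOT a real sandbox.
    exec(source, {"__builtins__": {}}, scope)
    return str(scope.get("result"))


def web_search(query: str) -> str:
    """Toy search tool backed by a hard-coded dictionary instead of the web."""
    corpus = {"boiling point of water in celsius": "100"}
    return corpus.get(query.lower(), "no results")


TOOLS: Dict[str, Callable[[str], str]] = {"code": run_code, "search": web_search}


def agent_loop(
    policy: Callable[[List[str]], Step], question: str, max_turns: int = 5
) -> Tuple[Optional[str], List[str]]:
    """Alternate between policy decisions and tool calls until an answer appears."""
    context = [f"question: {question}"]
    for _ in range(max_turns):
        step = policy(context)
        if step.kind == "answer":
            return step.payload, context
        observation = TOOLS[step.kind](step.payload)
        # The tool output is appended so the next decision can use it.
        context.append(f"{step.kind} -> {observation}")
    return None, context  # turn budget exhausted without an answer


def scripted_policy(context: List[str]) -> Step:
    """Hypothetical stand-in for the model: search first, then compute, then answer."""
    if len(context) == 1:
        return Step("search", "boiling point of water in Celsius")
    latest = context[-1].split(" -> ")[1]
    if len(context) == 2:
        return Step("code", f"result = {latest} * 9 / 5 + 32")
    return Step("answer", latest)


if __name__ == "__main__":
    answer, trace = agent_loop(
        scripted_policy, "What is the boiling point of water in Fahrenheit?"
    )
    print(answer)
    print(trace)
```

In this toy run the search result feeds the code step, which mirrors the interleaving of retrieval and computation within a single trajectory. In a real system the policy would be the model itself and the tools would handle images as well as text.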

Introducing RealX-Bench: A New Evaluation Standard

To thoroughly evaluate agentic multimodal models, the researchers introduced RealX-Bench, a comprehensive benchmark designed to assess real-world multimodal reasoning. Unlike existing benchmarks that often focus on isolated capabilities like perception or search, RealX-Bench demands the integration of multiple skills simultaneously. It features challenging questions grounded in real-world scenarios, requiring models to attend to fine-grained visual regions, retrieve external evidence, and reason over multimodal contexts.

On RealX-Bench, current models, even proprietary ones, perform significantly below human levels, highlighting the benchmark’s difficulty and the substantial gap that DeepEyesV2 aims to bridge.

Performance and Insights

DeepEyesV2 was evaluated on RealX-Bench and a wide array of other benchmarks covering real-world understanding, mathematical reasoning, and search-intensive tasks. The results demonstrate its effectiveness, often outperforming both general-purpose multimodal models and prior specialized reasoning approaches. For instance, DeepEyesV2 showed significant gains in mathematical reasoning and achieved superior search capabilities on retrieval-oriented benchmarks.

A key finding from the analysis is DeepEyesV2’s task-adaptive tool invocation. For perception tasks, it primarily uses image operations like cropping. For reasoning tasks, it favors numerical computations. In search tasks, it intelligently combines image manipulation with search tools. Reinforcement learning further refines this behavior, enabling more complex tool combinations and adaptive decision-making, where the model learns to selectively invoke tools based on the problem context, rather than over-relying on them.
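
One way to examine task-adaptive tool invocation is to tally which tools appear in a model's trajectories for each task category. The sketch below assumes trajectories are logged as `(task_type, tool_calls)` pairs; that format, the function name `tool_usage_by_task`, and the sample records are made up for illustration and are not data or code from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tool_usage_by_task(
    trajectories: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Dict[str, float]]:
    """Return, per task type, the share of tool calls that went to each tool."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for task_type, tool_calls in trajectories:
        counts[task_type].update(tool_calls)

    shares: Dict[str, Dict[str, float]] = {}
    for task_type, counter in counts.items():
        total = sum(counter.values())
        if total == 0:
            # No tool was ever invoked for this task type.
            shares[task_type] = {}
            continue
        shares[task_type] = {tool: n / total for tool, n in counter.items()}
    return shares


# Made-up log entries, chosen only to exercise the function.
sample_log = [
    ("perception", ["crop"]),
    ("perception", ["crop", "crop"]),
    ("reasoning", ["compute"]),
    ("reasoning", []),  # a trajectory answered without any tool call
    ("search", ["crop", "search"]),
]

if __name__ == "__main__":
    for task, dist in tool_usage_by_task(sample_log).items():
        print(task, dist)
```

Comparing these shares before and after the reinforcement learning stage would be one simple way to check whether tool use becomes more selective, under the assumption that such logs are available.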

Conclusion

DeepEyesV2 represents a significant step toward building truly agentic multimodal models. By combining a two-stage training pipeline, a carefully curated dataset, and a novel evaluation benchmark, the researchers have demonstrated a model that can actively invoke and integrate external tools into its reasoning process. This work provides valuable guidance for the AI community in developing more reliable, flexible, and intelligent multimodal agents.
