TLDR: MMSearch-Plus is a new benchmark for multimodal AI agents, featuring 311 tasks that require deep visual reasoning, iterative search, and provenance verification. It uses “Spatial-Temporal Extrapolation” to create questions needing inference of “out-of-image” facts from subtle visual cues. Evaluations show current MLLMs, even advanced ones like o3, struggle significantly, achieving only up to 36% accuracy, highlighting major gaps in fine-grained multimodal understanding and long-horizon tool use.
In the rapidly evolving world of artificial intelligence, multimodal large language models (MLLMs) are becoming increasingly adept at navigating the web. These advanced AI systems combine language, vision, and tool use to tackle complex tasks. However, a new research paper introduces a benchmark called MMSearch-Plus, designed to push the boundaries of what these AI agents can truly understand and achieve when browsing the internet.
The paper, titled “MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents,” highlights a critical issue with existing multimodal browsing benchmarks: many current tests can be solved by relatively straightforward methods, often relying heavily on simple image searches and nearby text. This approach, while effective for some tasks, doesn’t truly test an AI’s capacity for deep multimodal reasoning, such as understanding subtle visual cues, verifying information across multiple sources, or planning complex, multi-step actions.
What Makes MMSearch-Plus Different?
MMSearch-Plus is a benchmark comprising 311 tasks specifically crafted to demand a high level of multimodal understanding. Unlike previous benchmarks where a single prominent image might hold the answer, MMSearch-Plus tasks require agents to extract multiple weak, localized visual signals. These signals must then be propagated through iterative text and image searches and cross-validated against potential retrieval noise before an answer can be confidently provided.
The researchers introduce a unique curation procedure called “Spatial–Temporal Extrapolation.” This method creates questions that require the AI to extrapolate information from spatial cues (like micro-text, part-level appearance, layouts, or signage) and temporal traces (such as broadcast overlays or seasonal context). The goal is to infer “out-of-image” facts, like events, dates, and venues, that are not explicitly stated in the prompt or directly visible in the image.
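To make these cue types concrete, one way to picture a task record is sketched below. This is a purely illustrative schema, not the paper’s actual data format; the field names and the example values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialTemporalTask:
    """Hypothetical record for one MMSearch-Plus-style task (illustrative only)."""
    image_path: str                                     # the single input image
    question: str                                       # asks for an "out-of-image" fact
    spatial_cues: list = field(default_factory=list)    # e.g. micro-text, signage, part-level appearance
    temporal_cues: list = field(default_factory=list)   # e.g. broadcast overlays, seasonal context
    answer: str = ""                                    # ground truth, verifiable via search

# Example instance mirroring the concert scenario described next
task = SpatialTemporalTask(
    image_path="concert_2025.jpg",
    question="What was the singer's performance time?",
    spatial_cues=["festival signage", "stage layout"],
    temporal_cues=["2025 tour branding"],
    answer="<time listed in the official schedule>",
)
```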
Imagine an AI being shown a concert photo from 2025 and asked, “What was the singer’s performance time?” A simple image search might identify the artist, but MMSearch-Plus would require the AI to go further: extracting lyrics, identifying festival signage, resolving the specific event (festival, city, date), and then retrieving and cross-validating official schedules to find the exact performance time. This process demands fine-grained multimodal reasoning and robust provenance checks.
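To make this workflow concrete, here is a minimal, hypothetical sketch of the kind of search rollout such a task demands. The tool stubs (image_search, text_search, crop, ocr) are placeholders standing in for real web-search and vision tools; this is not the paper’s agent framework.

```python
# Placeholder tool stubs: a real agent would back these with actual image search,
# web search, cropping, and OCR services (all hypothetical, for illustration only).
def image_search(image): return ["<artist candidate>"]
def text_search(query): return [f"<result for: {query}>"]
def crop(image, box): return image
def ocr(region): return "<signage text>"

def answer_performance_time(image):
    """Illustrative rollout: propagate weak, localized visual cues through
    iterative search, then cross-validate before committing to an answer."""
    # 1. Coarse identification from the whole image
    artist = image_search(image)[0]

    # 2. Zoom into localized cues: signage, overlays, micro-text
    signage = ocr(crop(image, box=(0, 0, 100, 100)))

    # 3. Resolve the specific event (festival, city, date) via text search
    events = text_search(f"{artist} {signage} 2025 festival")

    # 4. Retrieve official schedules and cross-validate against retrieval noise
    candidate_times = []
    for event in events:
        schedule = text_search(f"{event} official set times")
        candidate_times.append(schedule[0])  # a real agent would parse the time here

    # Only commit when independently retrieved answers agree (simple provenance check)
    return candidate_times[0] if len(set(candidate_times)) == 1 else None
```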
Key Challenges for AI Agents
The benchmark specifically targets three recurring challenges in difficult multimodal information-seeking tasks. First, noisy or conflicting retrieval: agents must discern authentic sources when search results are unclear or contradictory. Second, exhaustive, part-based visual reasoning: when a whole-image match isn’t possible, agents need to reason over specific parts or subregions of an image, often iteratively cropping and re-searching to piece together fragmented clues. Third, professional multimodal handling with long, tool-augmented chains: agents must extract answers from varied media types (images, documents, videos), sustain long reasoning chains, and programmatically use visual tools such as cropping and OCR (a simple crop-and-OCR helper is sketched below).
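As an illustration of such programmatic visual tools, a crop-and-OCR step might look like the following sketch using Pillow and pytesseract; the coordinates and the specific libraries are assumptions chosen for illustration, not the benchmark’s actual tooling.

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed locally

def crop_and_read(image_path: str, box: tuple) -> str:
    """Crop a subregion (left, upper, right, lower) and run OCR on it,
    e.g. to read micro-text such as signage or a broadcast overlay."""
    with Image.open(image_path) as img:
        region = img.crop(box)
        return pytesseract.image_to_string(region).strip()

# Hypothetical usage: read an overlay in the top-left corner of a broadcast frame
# overlay_text = crop_and_read("broadcast_frame.jpg", (0, 0, 400, 120))
```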
Evaluation and Findings
The researchers evaluated several closed-source and open-source MLLMs using a model-agnostic agent framework. The results revealed significant room for improvement. For instance, the strongest agent tested, o3, achieved only 15.1% accuracy without search and 36.0% with full search rollout. A strong open-source model, Qwen-2.5-VL-72B-Instruct, scored 0.0% without search and only 6.9% after 20 rounds of search.
These findings underscore that current AI systems struggle with reliably reading and localizing micro-signals, deciding when and how to crop and reuse visual evidence, and maintaining verifiable chains of evidence across long-horizon tool use. The benchmark serves as a rigorous stress test, highlighting the need for more sophisticated, tool-augmented multimodal reasoning capabilities in future AI agents.
The authors of the paper are Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong, from The University of Hong Kong (HKU) and Huawei Inc.


