TLDR: MMSearch-Plus is a new benchmark for multimodal AI agents, featuring 311 tasks that require deep visual reasoning, iterative search, and provenance verification. It uses “Spatial-Temporal Extrapolation” to create questions needing inference of “out-of-image” facts from subtle visual cues. Evaluations show current MLLMs, even advanced ones like o3, struggle significantly, achieving only up to 36% accuracy, highlighting major gaps in fine-grained multimodal understanding and long-horizon tool use.
In the rapidly evolving world of artificial intelligence, multimodal large language models (MLLMs) are becoming increasingly adept at navigating the web. These advanced AI systems combine language, vision, and tool use to tackle complex tasks. However, a new research paper introduces a benchmark called MMSearch-Plus, designed to push the boundaries of what these AI agents can truly understand and achieve when browsing the internet.
The paper, titled “MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents,” highlights a critical issue with existing multimodal browsing benchmarks: many current tests can be solved by relatively straightforward methods, often relying heavily on simple image searches and nearby text. This approach, while effective for some tasks, doesn’t truly test an AI’s capacity for deep multimodal reasoning, such as understanding subtle visual cues, verifying information across multiple sources, or planning complex, multi-step actions.
What Makes MMSearch-Plus Different?
MMSearch-Plus is a benchmark comprising 311 tasks specifically crafted to demand a high level of multimodal understanding. Unlike previous benchmarks where a single prominent image might hold the answer, MMSearch-Plus tasks require agents to extract multiple weak, localized visual signals. These signals must then be propagated through iterative text and image searches and cross-validated against potential retrieval noise before an answer can be confidently provided.
The researchers introduce a unique curation procedure called “Spatial–Temporal Extrapolation.” This method creates questions that require the AI to extrapolate information from spatial cues (like micro-text, part-level appearance, layouts, or signage) and temporal traces (such as broadcast overlays or seasonal context). The goal is to infer “out-of-image” facts, like events, dates, and venues, that are not explicitly stated in the prompt or directly visible in the image.
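To make these cue types concrete, one way to picture a task record is sketched below. This is a purely illustrative schema, not the paper’s actual data format; the field names and the example values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialTemporalTask:
    """Hypothetical record for one MMSearch-Plus-style task (illustrative only)."""
    image_path: str                                     # the single input image
    question: str                                       # asks for an "out-of-image" fact
    spatial_cues: list = field(default_factory=list)    # e.g. micro-text, signage, part-level appearance
    temporal_cues: list = field(default_factory=list)   # e.g. broadcast overlays, seasonal context
    answer: str = ""                                    # ground truth, verifiable via search

# Example instance mirroring the concert scenario described next
task = SpatialTemporalTask(
    image_path="concert_2025.jpg",
    question="What was the singer's performance time?",
    spatial_cues=["festival signage", "stage layout"],
    temporal_cues=["2025 tour branding"],
    answer="<time listed in the official schedule>",
)
```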
Imagine an AI being shown a concert photo from 2025 and asked, “What was the singer’s performance time?” A simple image search might identify the artist, but MMSearch-Plus would require the AI to go further: extracting lyrics, identifying festival signage, resolving the specific event (festival, city, date), and then retrieving and cross-validating official schedules to find the exact performance time. This process demands fine-grained multimodal reasoning and robust provenance checks.
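To make this workflow concrete, here is a minimal, hypothetical sketch of the kind of search rollout such a task demands. The tool stubs (image_search, text_search, crop, ocr) are placeholders standing in for real web-search and vision tools; this is not the paper’s agent framework.

```python
# Placeholder tool stubs: a real agent would back these with actual image search,
# web search, cropping, and OCR services (all hypothetical, for illustration only).
def image_search(image): return ["<artist candidate>"]
def text_search(query): return [f"<result for: {query}>"]
def crop(image, box): return image
def ocr(region): return "<signage text>"

def answer_performance_time(image):
    """Illustrative rollout: propagate weak, localized visual cues through
    iterative search, then cross-validate before committing to an answer."""
    # 1. Coarse identification from the whole image
    artist = image_search(image)[0]

    # 2. Zoom into localized cues: signage, overlays, micro-text
    signage = ocr(crop(image, box=(0, 0, 100, 100)))

    # 3. Resolve the specific event (festival, city, date) via text search
    events = text_search(f"{artist} {signage} 2025 festival")

    # 4. Retrieve official schedules and cross-validate against retrieval noise
    candidate_times = []
    for event in events:
        schedule = text_search(f"{event} official set times")
        candidate_times.append(schedule[0])  # a real agent would parse the time here

    # Only commit when independently retrieved answers agree (simple provenance check)
    return candidate_times[0] if len(set(candidate_times)) == 1 else None
```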
Key Challenges for AI Agents
The benchmark specifically targets three recurring challenges in difficult multimodal information-seeking tasks. First, noisy or conflicting retrieval: agents must discern authentic sources when search results are unclear or contradictory. Second, exhaustive, part-based visual reasoning: when a whole-image match isn’t possible, agents need to reason over specific parts or subregions of an image, often iteratively cropping and re-searching to piece together fragmented clues. Third, professional multimodal handling with long, tool-augmented chains: agents must extract answers from varied media types (images, documents, videos), sustain long reasoning chains, and programmatically use visual tools such as cropping and OCR (a simple crop-and-OCR helper is sketched below).
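As an illustration of such programmatic visual tools, a crop-and-OCR step might look like the following sketch using Pillow and pytesseract; the coordinates and the specific libraries are assumptions chosen for illustration, not the benchmark’s actual tooling.

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed locally

def crop_and_read(image_path: str, box: tuple) -> str:
    """Crop a subregion (left, upper, right, lower) and run OCR on it,
    e.g. to read micro-text such as signage or a broadcast overlay."""
    with Image.open(image_path) as img:
        region = img.crop(box)
        return pytesseract.image_to_string(region).strip()

# Hypothetical usage: read an overlay in the top-left corner of a broadcast frame
# overlay_text = crop_and_read("broadcast_frame.jpg", (0, 0, 400, 120))
```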
Evaluation and Findings
The researchers evaluated several closed-source and open-source MLLMs using a model-agnostic agent framework. The results revealed significant room for improvement. For instance, the strongest agent tested, o3, achieved only 15.1% accuracy without search and 36.0% with full search rollout. A strong open-source model, Qwen-2.5-VL-72B-Instruct, scored 0.0% without search and only 6.9% after 20 rounds of search.
These findings underscore that current AI systems struggle with reliably reading and localizing micro-signals, deciding when and how to crop and reuse visual evidence, and maintaining verifiable chains of evidence across long-horizon tool use. The benchmark serves as a rigorous stress test, highlighting the need for more sophisticated, tool-augmented multimodal reasoning capabilities in future AI agents.
The authors of the paper are Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, and Lingpeng Kong, from The University of Hong Kong (HKU) and Huawei Inc.


