TL;DR: The IRIS benchmark evaluates Multimodal Large Language Models (MLLMs) on “think with images” tasks that require active image manipulation and tool integration. Current MLLMs perform poorly (the top model reaches only an 18.68% pass rate), primarily due to visual perception errors. The study highlights the need for better tool-enabled reasoning: some models benefit from tools, while others struggle to integrate them.
Multimodal Large Language Models (MLLMs) are becoming increasingly common in real-world applications, where they process both text and images. Traditionally, these models have operated under a “think about images” approach, treating images as static inputs for understanding and reasoning. However, a new research paper introduces a paradigm shift: “think with images,” where MLLMs actively manipulate and transform visual content to solve complex tasks.
The paper, titled “Evaluating MLLMs on Tool-Enabled Image Perception, Transformation, and Reasoning,” highlights a significant gap in current MLLM evaluation. Most existing benchmarks still focus on static image analysis, failing to assess a model’s ability to interact dynamically with visual information. To address this, researchers from ScaleAI and the University of Illinois at Urbana-Champaign have developed a new benchmark called IRIS (Interactive Reasoning with Images and Systems).
Introducing IRIS: A New Challenge for MLLMs
IRIS is designed to evaluate how well MLLMs can perceive, transform, and reason across complex visual-textual tasks under the “think with images” paradigm. It comprises 1,204 challenging, open-ended vision tasks, split almost evenly between single-turn and multi-turn interactions. These tasks span five diverse domains and come with detailed rubrics for systematic evaluation.
The core design principles of IRIS ensure a rigorous assessment:
- Non-trivial visual perception: Critical visual content is often not easily accessible, requiring models to apply image transformations like cropping, editing, or enhancement to extract key details.
- Realistic task settings: Prompts and images are crafted to mirror practical, real-world scenarios, moving beyond synthetic or overly simplified cases.
- Implicit tool-use requirements: Models must infer when and how to invoke tools based on contextual cues, rather than being explicitly instructed.
- Multi-step, compositional reasoning: Tasks demand combining visual transformations with multi-step reasoning, including applying tool sequences and integrating extracted information.
IRIS equips MLLMs with a standardized API for six tools: a Python image processing tool (for manipulations such as cropping, editing, and brightness adjustment), a Python interpreter, web search, a calculator, a browser tool for reading page text, and a historical weather lookup. The image processing tool is particularly central, as it lets models iteratively refine their visual inputs.
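The paper's exact tool schema isn't reproduced here, but conceptually a single image processing call executes a short Python snippet against the current image. Below is a minimal sketch of that kind of transformation using Pillow; the file names, crop coordinates, and enhancement factors are illustrative assumptions, not values from the benchmark:

```python
from PIL import Image, ImageEnhance

# Load the task image and zoom into a region the model suspects holds the answer.
img = Image.open("task_image.jpg")
region = img.crop((120, 340, 560, 520))  # (left, upper, right, lower) -- hypothetical box

# Upscale and enhance the crop so fine details become legible.
region = region.resize((region.width * 3, region.height * 3), Image.LANCZOS)
region = ImageEnhance.Brightness(region).enhance(1.4)
region = ImageEnhance.Contrast(region).enhance(1.3)

# Save the refined view; the model inspects this output on its next turn.
region.save("task_image_zoom.png")
```

A model operating under the “think with images” paradigm would emit a snippet like this, inspect the resulting crop, and, if the detail is still unreadable, issue a further refinement in the next call.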
Key Findings: MLLMs Face Significant Hurdles
The evaluation of 16 representative MLLMs on IRIS revealed that current models struggle considerably. Even the strongest performer, GPT-5-think, achieved an overall pass rate of only 18.68%. A majority of the evaluated models scored below 10%, underscoring the substantial room for improvement in visual-reasoning tasks that require active image manipulation.
Further analysis showed that OpenAI models (GPT-5, GPT-5-think, and o3) generally outperformed the rest, potentially due to targeted training for “think with images” tasks. Multi-turn tasks proved harder than single-turn ones, as each additional conversational turn introduces more opportunities for error.
Regarding tool use, the study found that proactive, consistent tool use correlated with better performance. The Python image processing tool was the most frequently invoked, underscoring the importance of image manipulation in these tasks. Interestingly, while GPT-5 and GPT-5-think made fewer tool calls than some other models, they often executed multiple operations within a single call, making each call more efficient.
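To make the efficiency point concrete, here is a hedged sketch of that batching pattern, again using Pillow with illustrative operations (none of these specific steps come from the paper). The contrast is between spending one tool call per operation and chaining the whole pipeline inside a single call:

```python
from PIL import Image, ImageEnhance

# Inefficient pattern: crop in call 1, sharpen in call 2, save in call 3 --
# each call costs a full model turn and another round of context.

# Batched pattern (the paper reports GPT-5 and GPT-5-think often run multiple
# operations per call): one snippet executes the whole pipeline at once.
img = Image.open("chart.png")
detail = img.crop((400, 200, 900, 600))               # zoom into the region of interest
detail = ImageEnhance.Sharpness(detail).enhance(2.0)  # sharpen small axis labels
detail = detail.rotate(-3, expand=True)               # straighten a slightly skewed label
detail.save("chart_detail.png")                       # hand the refined view back
```

One batched call means one perception-action loop instead of three, which plausibly matters most on multi-turn tasks, where every extra turn is another chance to err.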
A detailed error analysis pointed to visual perception errors as the most common failure mode, accounting for 70-80% of mistakes across models. Calculation errors were rare, and reasoning errors occurred occasionally.
An ablation study revealed divergent behaviors: GPT-5 benefited significantly from tool access and a strong system prompt, indicating reliance on iterative edits. In contrast, Gemini-2.5-pro surprisingly performed better without tools, suggesting its native vision capabilities may be hindered by unnecessary tool calls. These findings emphasize that effective tool use is crucial, but they also highlight differences in how models integrate and benefit from tools.
Looking Ahead
The introduction of IRIS provides critical insights for advancing visual intelligence in MLLMs. It moves beyond passive visual understanding to emphasize dynamic interaction and manipulation of images. The benchmark aims to catalyze the development of MLLMs that can seamlessly integrate image perception, tool use, and reasoning into a unified competency stack, tackling more challenging real-world scenarios. For more details, you can read the full research paper here.


