TL;DR: The IRIS benchmark evaluates Multimodal Large Language Models (MLLMs) on “think with images” tasks that require active image manipulation and tool integration. Current MLLMs perform poorly (the top model reaches only an 18.68% pass rate), primarily due to visual perception errors. The study highlights the need for better tool-enabled reasoning: some models benefit from tools, while others struggle to integrate them.
Multimodal Large Language Models (MLLMs) are becoming increasingly common in real-world applications, where they process both text and images. Traditionally, these models have operated under a “think about images” approach, treating images as static inputs for understanding and reasoning. However, a new research paper introduces a paradigm shift: “think with images,” where MLLMs actively manipulate and transform visual content to solve complex tasks.
The paper, titled “Evaluating MLLMs on Tool-Enabled Image Perception, Transformation, and Reasoning,” highlights a significant gap in current MLLM evaluation. Most existing benchmarks still focus on static image analysis, failing to assess a model’s ability to interact dynamically with visual information. To address this, researchers from ScaleAI and the University of Illinois at Urbana-Champaign have developed a new benchmark called IRIS (Interactive Reasoning with Images and Systems).
Introducing IRIS: A New Challenge for MLLMs
IRIS is designed to evaluate how well MLLMs can perceive, transform, and reason across complex visual-textual tasks under the “think with images” paradigm. It comprises 1,204 challenging, open-ended vision tasks, split almost evenly between single-turn and multi-turn interactions. These tasks span five diverse domains and come with detailed rubrics for systematic evaluation.
The core design principles of IRIS ensure a rigorous assessment:
- Non-trivial visual perception: Critical visual content is often not easily accessible, requiring models to apply image transformations like cropping, editing, or enhancement to extract key details.
- Realistic task settings: Prompts and images are crafted to mirror practical, real-world scenarios, moving beyond synthetic or overly simplified cases.
- Implicit tool-use requirements: Models must infer when and how to invoke tools based on contextual cues, rather than being explicitly instructed.
- Multi-step, compositional reasoning: Tasks demand combining visual transformations with multi-step reasoning, including applying tool sequences and integrating extracted information.
IRIS equips MLLMs with a standardized API for six tools: a Python image processing tool (for manipulations such as cropping, editing, and brightness adjustment), a Python interpreter, web search, a calculator, a browser tool for reading page text, and a historical weather lookup. The image processing tool is particularly central, as it lets models iteratively refine their visual inputs.
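The paper's exact tool schema isn't reproduced here, but conceptually a single image processing call executes a short Python snippet against the current image. Below is a minimal sketch of that kind of transformation using Pillow; the file names, crop coordinates, and enhancement factors are illustrative assumptions, not values from the benchmark:

```python
from PIL import Image, ImageEnhance

# Load the task image and zoom into a region the model suspects holds the answer.
img = Image.open("task_image.jpg")
region = img.crop((120, 340, 560, 520))  # (left, upper, right, lower) -- hypothetical box

# Upscale and enhance the crop so fine details become legible.
region = region.resize((region.width * 3, region.height * 3), Image.LANCZOS)
region = ImageEnhance.Brightness(region).enhance(1.4)
region = ImageEnhance.Contrast(region).enhance(1.3)

# Save the refined view; the model inspects this output on its next turn.
region.save("task_image_zoom.png")
```

A model operating under the “think with images” paradigm would emit a snippet like this, inspect the resulting crop, and, if the detail is still unreadable, issue a further refinement in the next call.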
Key Findings: MLLMs Face Significant Hurdles
The evaluation of 16 representative MLLMs on IRIS revealed that current models struggle considerably. Even the strongest performer, GPT-5-think, achieved an overall pass rate of only 18.68%. A majority of the evaluated models scored below 10%, underscoring the substantial room for improvement in visual-reasoning tasks that require active image manipulation.
Further analysis showed that OpenAI models (GPT-5, GPT-5-think, and o3) generally outperformed the rest, potentially due to targeted training for “think with images” tasks. Multi-turn tasks proved harder than single-turn ones, as each additional conversational turn introduces more opportunities for error.
Regarding tool use, the study found that proactive, consistent tool use correlated with better performance. The Python image processing tool was the most frequently invoked, underscoring the importance of image manipulation in these tasks. Interestingly, while GPT-5 and GPT-5-think made fewer tool calls than some other models, they often executed multiple operations within a single call, making each call more efficient.
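To make the efficiency point concrete, here is a hedged sketch of that batching pattern, again using Pillow with illustrative operations (none of these specific steps come from the paper). The contrast is between spending one tool call per operation and chaining the whole pipeline inside a single call:

```python
from PIL import Image, ImageEnhance

# Inefficient pattern: crop in call 1, sharpen in call 2, save in call 3 --
# each call costs a full model turn and another round of context.

# Batched pattern (the paper reports GPT-5 and GPT-5-think often run multiple
# operations per call): one snippet executes the whole pipeline at once.
img = Image.open("chart.png")
detail = img.crop((400, 200, 900, 600))               # zoom into the region of interest
detail = ImageEnhance.Sharpness(detail).enhance(2.0)  # sharpen small axis labels
detail = detail.rotate(-3, expand=True)               # straighten a slightly skewed label
detail.save("chart_detail.png")                       # hand the refined view back
```

One batched call means one perception-action loop instead of three, which plausibly matters most on multi-turn tasks, where every extra turn is another chance to err.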
A detailed error analysis pointed to visual perception errors as the most common failure mode, accounting for 70-80% of mistakes across models. Calculation errors were rare, and reasoning errors occurred occasionally.
An ablation study revealed divergent behaviors: GPT-5 benefited significantly from tool access and a strong system prompt, indicating reliance on iterative edits. In contrast, Gemini-2.5-pro surprisingly performed better without tools, suggesting its native vision capabilities may be hindered by unnecessary tool calls. These findings emphasize that effective tool use is crucial, but they also highlight differences in how models integrate and benefit from tools.
Looking Ahead
The introduction of IRIS provides critical insights for advancing visual intelligence in MLLMs. It moves beyond passive visual understanding to emphasize dynamic interaction and manipulation of images. The benchmark aims to catalyze the development of MLLMs that can seamlessly integrate image perception, tool use, and reasoning into a unified competency stack, tackling more challenging real-world scenarios. For more details, you can read the full research paper here.


