TLDR: A new research paper introduces TreeBench, a diagnostic benchmark designed to evaluate AI models’ ability to understand complex visual scenes by requiring precise object localization and multi-step reasoning. The paper also presents TreeVGR, a training paradigm that uses reinforcement learning with a “dual IoU reward” to improve AI’s visual reasoning capabilities and make its decision-making process transparent and traceable.
Large Language Models (LLMs) have made incredible strides in text-based reasoning, but when it comes to understanding and interacting with the visual world, they often hit a wall. While models like OpenAI-o3 have started to “think with images” by focusing on specific visual regions, there hasn’t been a comprehensive way to truly evaluate these capabilities. This is where a new research paper introduces a significant step forward: TreeBench, a diagnostic benchmark, and TreeVGR, a novel training approach.
Introducing TreeBench: A New Standard for Visual Reasoning
TreeBench, or Traceable Evidence Evaluation Benchmark, is designed to fill this evaluation gap. It’s built on three core principles:
- Focused Visual Perception: It challenges models to spot subtle targets in busy, real-world scenes, requiring them to understand complex visual hierarchies and distinguish between very similar objects.
- Traceable Evidence: Unlike other benchmarks, TreeBench doesn’t just check if the final answer is correct. It also evaluates the reasoning process itself, using bounding boxes to show exactly what the model is looking at. This makes the AI’s “thinking” transparent and helps diagnose errors.
- Vision-Centric Second-Order Reasoning: This goes beyond simple object identification. TreeBench includes questions that require understanding object interactions, spatial relationships (like inside/outside, above/below), and even perspective shifts.
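To make the second-order reasoning principle concrete, here is a minimal sketch of how spatial relations between two bounding boxes might be checked programmatically. The helper names and the coarse relation categories are illustrative assumptions, not the paper’s actual evaluation code.

```python
def inside(inner, outer):
    # Boxes are (x1, y1, x2, y2); True if `inner` lies fully within `outer`.
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def relation(a, b):
    """Coarse spatial relation of box a relative to box b,
    in image coordinates (y grows downward)."""
    if inside(a, b):
        return "inside"
    if a[3] <= b[1]:        # a's bottom edge is above b's top edge
        return "above"
    if a[1] >= b[3]:        # a's top edge is below b's bottom edge
        return "below"
    return "overlapping-or-beside"
```

A question like “is the cup inside the cabinet?” then reduces to checking the relation between the two grounded boxes, though TreeBench’s actual questions also require identifying the right objects in cluttered scenes first.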
To create TreeBench, researchers carefully selected 1,000 high-quality images from the large SA-1B dataset, focusing on scenes dense with objects. Eight human experts specializing in Large Multimodal Models (LMMs) then crafted challenging visual question-answering pairs, starting from candidate questions generated by advanced LMMs like OpenAI-o3 and Gemini-2.5-Pro and refining and cross-verifying them through three stages of quality control; 405 pairs survived this process. The result is a benchmark so tough that even the most advanced models struggle, with none reaching 60% accuracy.
TreeVGR: Teaching AI to “Think with Images” More Effectively
Beyond evaluation, the paper also introduces TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a new training method. Many previous approaches only supervise the final answer, making it hard to understand how the model arrived at its conclusion. TreeVGR instead supervises both localization (where the model looks) and the reasoning process simultaneously, via reinforcement learning.
A key innovation in TreeVGR is its “dual IoU reward” system. IoU, or Intersection-over-Union, measures how well a predicted bounding box aligns with the actual object. This dual reward ensures that the model not only identifies all relevant objects (recall) but also avoids generating unnecessary or incorrect bounding boxes (precision). This explicit supervision of bounding box generation leads to more accurate localizations and explainable reasoning pathways.
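The recall-and-precision intuition behind the dual IoU reward can be sketched as follows. This is a simplified reconstruction based on the description above, not the paper’s exact reward formulation; the function names and the equal weighting of the two terms are assumptions.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns Intersection-over-Union in [0, 1].
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dual_iou_reward(pred_boxes, gt_boxes):
    """Recall term: every ground-truth box should be covered by some
    predicted box. Precision term: every predicted box should match
    some ground-truth box, penalizing spurious predictions."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    recall = sum(max(iou(g, p) for p in pred_boxes)
                 for g in gt_boxes) / len(gt_boxes)
    precision = sum(max(iou(p, g) for g in gt_boxes)
                    for p in pred_boxes) / len(pred_boxes)
    return 0.5 * (recall + precision)
```

Under this sketch, predicting exactly the ground-truth boxes yields a reward of 1.0, while adding an extra box that overlaps nothing drags the precision term (and hence the reward) down.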
Initialized from a powerful base model (Qwen2.5-VL-7B), TreeVGR showed significant improvements across various benchmarks, including V* Bench (+16.8%), MME-RealWorld (+12.6%), and TreeBench itself (+13.4%). This demonstrates that making the AI’s visual reasoning traceable is crucial for advancing its capabilities. The research also found a positive correlation between precise localization and overall performance, especially for perception tasks. However, for more complex reasoning tasks, precise localization is just the first step; higher-level cognitive operations are also needed.
The code for TreeVGR is available at https://github.com/Haochen-Wang409/TreeVGR.
Looking Ahead
TreeBench sets a new standard for evaluating how AI models “think with images,” while TreeVGR provides a blueprint for training them to do so more effectively and transparently. While the current TreeVGR implementation uses a 7B parameter model and TreeBench has 405 question-answer pairs, the researchers plan to expand both in the future to further challenge and advance multimodal AI.