TLDR: The Point-It-Out (PIO) benchmark is introduced to evaluate Vision-Language Models (VLMs) on their ability to perform precise visual grounding for embodied reasoning tasks. Unlike previous benchmarks, PIO uses a three-stage hierarchical evaluation (object localization, task-driven pointing, and visual trace prediction) across diverse real-world scenarios like household, kitchen, driving, and robotics. Findings show that models specifically trained for grounding excel in the first two stages, while generalist models perform better in complex multi-step planning, revealing current limitations in VLMs’ embodied intelligence and highlighting the need for targeted data to improve grounding capabilities.
Vision-Language Models (VLMs) are becoming increasingly important for embodied AI applications, allowing robots and autonomous systems to understand and interact with the physical world. These models combine the broad knowledge of large language models with the ability to interpret visual inputs, making them promising for tasks like robot manipulation, navigation, and autonomous driving.
However, a significant challenge in developing these systems has been the lack of adequate benchmarks to truly evaluate their ‘embodied reasoning’ capabilities. Existing evaluation methods often rely on indirect assessments, such as multiple-choice questions or high-level language-based planning. These approaches don’t fully test a VLM’s ability to precisely ground its understanding back into the visual space, a crucial step for real-world action.
Introducing the Point-It-Out (PIO) Benchmark
To address this gap, researchers have introduced the Point-It-Out (PIO) benchmark. This novel benchmark is designed to systematically assess the embodied reasoning abilities of VLMs by requiring them to generate precise visual groundings, such as points, bounding boxes, or trajectories, directly on images. PIO is unique in offering pixel-level grounding for embodied reasoning across diverse real-world scenarios.
A Hierarchical Approach to Evaluation
PIO employs a hierarchical evaluation protocol, breaking down embodied reasoning into three stages of increasing complexity:
- Stage 1 (S1): Referred-Object Localization. This initial stage focuses on identifying and localizing specific objects in a scene based on language instructions. This could involve simple object detection or more complex localization with constraints like spatial cues, color, or material properties. For example, a VLM might be asked to locate ‘the middle pile of paper cups’ or ‘the handle of the left cup.’
- Stage 2 (S2): Task-Driven Grounding. Building on S1, this stage requires the VLM to determine which object or part of an object is relevant for a given task and pinpoint where to interact with it. Unlike S1, the target might not be explicitly mentioned in the instruction, demanding reasoning about object affordances. An example would be ‘open the top drawer,’ where the model must identify the drawer and then locate its handle.
- Stage 3 (S3): Visual Trace Prediction. The most complex stage, S3 assesses a VLM’s ability to plan and generate a coarse 2D visual trace (a sequence of points) that outlines how a task should be completed. This involves integrating object understanding, affordance reasoning, and temporal planning. Tasks here might include generating a trajectory to ‘wipe a table with a sponge’ or ‘open and close a drawer.’
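To make the three output types concrete, here is a minimal Python sketch of how a point, a bounding box, and a visual trace could be represented and checked against an annotated region. It is an illustration only, not PIO's scoring protocol: the `Box` class and the `box_iou` and `trace_hit_fraction` helpers are hypothetical names, and using an axis-aligned box as the ground-truth region is an assumption made for simplicity.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a point is (x, y) in pixel coordinates.
Point = Tuple[float, float]


@dataclass
class Box:
    """Axis-aligned box, assumed here as a stand-in for an annotated region."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        # True if the point lies inside the box (edges included).
        x, y = p
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def area(self) -> float:
        # Degenerate (inverted) boxes count as zero area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (an S1-style box comparison)."""
    inter = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    inter_area = inter.area()
    union = a.area() + b.area() - inter_area
    return inter_area / union if union > 0 else 0.0


def trace_hit_fraction(trace: List[Point], region: Box) -> float:
    """Fraction of trace points that fall inside a region (a crude S3-style check)."""
    if not trace:
        return 0.0
    return sum(region.contains(p) for p in trace) / len(trace)
```

A predicted point for an S2 instruction such as ‘open the top drawer’ could then be checked with `handle_region.contains(predicted_point)`, where `handle_region` is a hypothetical annotated box around the drawer handle.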
The benchmark includes over 600 human-annotated question-answer pairs collected from critical domains for embodied intelligence, including indoor environments, kitchen scenarios, driving scenes, and robotic manipulation tasks.
Key Findings from Extensive Evaluations
The researchers conducted extensive experiments with over ten state-of-the-art VLMs, including models like GPT-4o, Claude-3.7, Gemini 2.0/2.5, MoLMO, and Qwen2.5-VL. Several interesting findings emerged:
- Models specifically fine-tuned with grounding supervision, such as RoboRefer, MoLMO-7B-D, Gemini-2.5-Pro, and Qwen-2.5-VL, consistently achieved the highest scores in S1 and S2 tasks. This highlights the critical importance of grounding data for precise spatial reasoning.
- Strong general-purpose models like GPT-4o and Claude-3.7, while excelling in many other benchmarks, underperformed in precise visual grounding tasks within PIO.
- A clear performance drop was observed across all models from S1 to S2, particularly in tasks requiring localization of ‘object parts’ and understanding ‘affordance’ and ‘contact’ points.
- S3, which demands coherent visual trace generation, proved to be a significant challenge. Models that performed well in S1 and S2 (like MoLMO and Qwen) often struggled with S3, indicating that strong grounding alone isn’t sufficient for multi-step planning.
- Conversely, generalist models like Gemini-2.5-Pro and GPT-o3 showed more promising results in S3, generating more reasonable trajectories, suggesting they excel at integrating grounding with complex planning, even without specific trajectory fine-tuning.
These findings underscore that while some VLMs are adept at isolated grounding tasks, others are better at integrating grounding with planning for more complex, multi-step actions. The PIO benchmark provides valuable insights into these capabilities, guiding future research and development in embodied AI. For more details, you can refer to the full research paper: POINT-IT-OUT: BENCHMARKING EMBODIED REASONING FOR VISION LANGUAGE MODELS IN MULTI-STAGE VISUAL GROUNDING.


