TLDR: A new LLM agent system, SpatialAgent, developed by the UWIPL-ETRI team, secured 1st place in the 9th AI City Challenge Track 3. The system pairs a Gemini 2.5 Flash LLM with specialized tools for spatial reasoning, object retrieval, counting, and distance estimation in complex indoor warehouse environments. It offers a data-efficient alternative to traditional MLLM finetuning, achieving 95.86% accuracy on the Physical AI Spatial Intelligence Warehouse benchmark.
Understanding spatial relationships in complex environments has long been a significant hurdle for Multi-modal Large Language Models (MLLMs). While previous approaches often relied on extensive MLLM finetuning, a new data-efficient method has emerged, demonstrating remarkable capabilities in solving challenging spatial question-answering tasks within indoor warehouse scenarios.
Researchers from the University of Washington, Electronics and Telecommunications Research Institute, and National Center for High-performance Computing have developed an innovative LLM agent system, named SpatialAgent. This system integrates multiple specialized tools, allowing the LLM agent to perform advanced spatial reasoning and interact with various API tools to answer intricate spatial questions. This approach stands in contrast to the MLLM-finetuned paradigm, which typically involves lifting 2D images to pseudo 3D point clouds and generating template-based QA pairs for large-scale MLLM finetuning.
The core of the SpatialAgent system is a reasoning LLM, specifically Gemini 2.5 Flash, which acts as an AI agent capable of spatial reasoning, function calling, and question answering, and is designed to analyze object relationships robustly. When presented with an image, object masks, and a spatial question, the agent first identifies the relevant object masks and registers them with its tool API. It then interacts with the Gemini model through a few-shot prompting template, maintaining a structured message history across multi-turn conversations to guide its reasoning.
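The loop described above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the class name, message format, and `CALL`/`ANSWER` protocol are assumptions, and a scripted stub stands in for the Gemini 2.5 Flash model so the control flow runs locally.

```python
# Hypothetical sketch of an agent loop with registered masks, a few-shot
# prompt prefix, and a multi-turn message history (all names illustrative).

FEW_SHOT_PREFIX = [
    {"role": "user", "content": "Q: Is the pallet left of the shelf?"},
    {"role": "model", "content": "CALL left_of(obj_1, obj_2)"},
]

def parse_call(reply):
    # "CALL left_of(obj_1, obj_2)" -> ("left_of", ["obj_1", "obj_2"])
    body = reply[len("CALL "):]
    name, rest = body.split("(", 1)
    args = [a.strip() for a in rest.rstrip(")").split(",")]
    return name, args

class SpatialAgent:
    def __init__(self, llm, tools):
        self.llm = llm      # callable: list of messages -> reply string
        self.tools = tools  # tool name -> python function
        self.masks = {}     # registered object masks, keyed by object id

    def register_masks(self, masks):
        # Register the relevant object masks so tool calls can refer
        # to them by id.
        self.masks.update(masks)

    def answer(self, question, max_turns=5):
        messages = FEW_SHOT_PREFIX + [{"role": "user", "content": question}]
        for _ in range(max_turns):
            reply = self.llm(messages)
            messages.append({"role": "model", "content": reply})
            if reply.startswith("ANSWER"):
                return reply.split(":", 1)[1].strip()
            # Otherwise the model issued a tool call: execute it and feed
            # the result back into the conversation for the next turn.
            name, args = parse_call(reply)
            result = self.tools[name](*(self.masks[a] for a in args))
            messages.append({"role": "user", "content": f"RESULT: {result}"})
        return None

def scripted_llm(messages):
    # Stub model: first turn issues a tool call, next turn answers
    # with the tool result it was fed back.
    if messages[-1]["content"].startswith("RESULT"):
        return "ANSWER: " + messages[-1]["content"].split(": ", 1)[1]
    return "CALL left_of(obj_1, obj_2)"

agent = SpatialAgent(scripted_llm, {"left_of": lambda a, b: a["cx"] < b["cx"]})
agent.register_masks({"obj_1": {"cx": 10}, "obj_2": {"cx": 40}})
print(agent.answer("Is obj_1 left of obj_2?"))  # → True
```

The key design point is that tool results re-enter the message history as ordinary turns, so the model can chain several tool calls before committing to an answer.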
During its operation, the agent interacts with a predefined set of spatial APIs through specific commands. These APIs include functions for distance estimation, object inclusion, relative positioning (like left/right), and region queries (e.g., most left, middle). The results from these tool executions are fed back to the LLM, allowing it to iteratively refine its reasoning until a final answer is produced.
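An illustrative version of such a tool set is below. The function names and signatures are assumptions for the sketch, and each tool operates on simple object records with a 2D centroid; in the actual system, distance and inclusion are backed by learned models rather than raw pixel geometry.

```python
# Toy spatial tool API: each function answers one primitive spatial query
# over object records of the form {"centroid": (x, y)} (names illustrative).

def left_of(a, b):
    # True if object a's centroid lies left of object b's.
    return a["centroid"][0] < b["centroid"][0]

def distance(a, b):
    # Placeholder Euclidean centroid distance in pixels; the real system
    # uses a learned regressor to estimate metric distance.
    ax, ay = a["centroid"]
    bx, by = b["centroid"]
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def inside(a, region):
    # True if a's centroid falls in the region's bounding box (x0, y0, x1, y1).
    x, y = a["centroid"]
    x0, y0, x1, y1 = region["bbox"]
    return x0 <= x <= x1 and y0 <= y <= y1

def most_left(objects):
    # Region query: the object whose centroid has the smallest x-coordinate.
    return min(objects, key=lambda o: o["centroid"][0])

def middle(objects):
    # Region query: the object closest to the group's mean x-coordinate.
    mean_x = sum(o["centroid"][0] for o in objects) / len(objects)
    return min(objects, key=lambda o: abs(o["centroid"][0] - mean_x))

SPATIAL_TOOLS = {
    "left_of": left_of, "distance": distance, "inside": inside,
    "most_left": most_left, "middle": middle,
}
```

Exposing each primitive as a separate named tool keeps the LLM's job simple: it only has to choose a function and its arguments, while the geometry stays in deterministic code.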
For simpler spatial relationships, such as determining if an object is to the left or right, the system utilizes the object mask centroid coordinates. For more complex tasks like distance estimation and determining if an object is inside a specific region, the researchers trained deep learning models. The Distance Estimation Model uses a ResNet-50 backbone and employs a cascaded approach, where a second model is used for more accurate predictions when distances are less than 3 meters. Similarly, an Inclusion Classification Model, also based on ResNet-50, is trained to determine if one object is spatially included within another, particularly for buffer regions.
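The two mechanisms above can be sketched in a few lines. The centroid computation is standard; the cascade logic shows only the routing between the two models, with stand-in callables where the paper trains ResNet-50 regressors, and the 3-meter threshold taken from the description above.

```python
# Sketch of (1) mask centroids for simple relations and (2) the cascaded
# distance scheme: a coarse model predicts first, and near-range cases
# (< 3 m) are re-estimated by a second, specialized model.

NEAR_RANGE_METERS = 3.0

def mask_centroid(mask):
    # mask: 2D list of 0/1 values; returns the (x, y) centroid of the
    # foreground pixels, used for left/right-style comparisons.
    xs = ys = n = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                xs += x
                ys += y
                n += 1
    return (xs / n, ys / n)

def cascaded_distance(crop, coarse_model, near_model):
    # coarse_model / near_model: callables mapping an image crop to meters
    # (stand-ins here for the trained ResNet-50 regressors).
    d = coarse_model(crop)
    if d < NEAR_RANGE_METERS:
        # Hand near-range cases to the specialist for a finer estimate.
        d = near_model(crop)
    return d
```

The cascade reflects a common regression trick: a model trained only on the near range can resolve small distances more precisely than one trained across the full range.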
The SpatialAgent system was rigorously evaluated on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset, a large-scale synthetic benchmark with rich multimodal inputs: RGB-D image pairs, object masks, and natural-language QA pairs spanning spatial relations, multi-choice selection, distance estimation, and object counting. The system achieved 95.86% accuracy on the test set, taking 1st place among all participating teams in the 9th AI City Challenge Track 3.
This work represents a significant step forward in spatial understanding for AI systems, bridging the gap between perception and high-level reasoning. The SpatialAgent system offers a practical and highly accurate solution for warehouse spatial understanding, paving the way for more intelligent and autonomous systems in complex indoor environments. For more technical details, the code is available at https://github.com/hsiangwei0903/SpatialAgent.


