TLDR: A new LLM agent system, SpatialAgent, developed by the UWIPL-ETRI team, secured 1st place in the 9th AI City Challenge Track 3. The system pairs a Gemini 2.5 Flash LLM with specialized tools for spatial reasoning, object retrieval, counting, and distance estimation in complex indoor warehouse environments. It offers a data-efficient alternative to traditional MLLM finetuning, achieving 95.86% accuracy on the Physical AI Spatial Intelligence Warehouse benchmark.
Understanding spatial relationships in complex environments has long been a significant hurdle for Multi-modal Large Language Models (MLLMs). While previous approaches often relied on extensive MLLM finetuning, a new data-efficient method has emerged, demonstrating remarkable capabilities in solving challenging spatial question-answering tasks within indoor warehouse scenarios.
Researchers from the University of Washington, Electronics and Telecommunications Research Institute, and National Center for High-performance Computing have developed an innovative LLM agent system, named SpatialAgent. This system integrates multiple specialized tools, allowing the LLM agent to perform advanced spatial reasoning and interact with various API tools to answer intricate spatial questions. This approach stands in contrast to the MLLM-finetuned paradigm, which typically involves lifting 2D images to pseudo 3D point clouds and generating template-based QA pairs for large-scale MLLM finetuning.
The core of the SpatialAgent system is a reasoning LLM, specifically Gemini 2.5 Flash, which acts as an AI agent capable of spatial reasoning, function calling, and question answering, and is designed to analyze object relationships robustly. When presented with an image, object masks, and a spatial question, the agent first identifies the relevant object masks and registers them with its tool API. It then interacts with the Gemini model through a few-shot prompting template, maintaining a structured message history across multi-turn conversations to guide its reasoning.
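The loop described above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the class name, message format, and `CALL`/`ANSWER` protocol are assumptions, and a scripted stub stands in for the Gemini 2.5 Flash model so the control flow runs locally.

```python
# Hypothetical sketch of an agent loop with registered masks, a few-shot
# prompt prefix, and a multi-turn message history (all names illustrative).

FEW_SHOT_PREFIX = [
    {"role": "user", "content": "Q: Is the pallet left of the shelf?"},
    {"role": "model", "content": "CALL left_of(obj_1, obj_2)"},
]

def parse_call(reply):
    # "CALL left_of(obj_1, obj_2)" -> ("left_of", ["obj_1", "obj_2"])
    body = reply[len("CALL "):]
    name, rest = body.split("(", 1)
    args = [a.strip() for a in rest.rstrip(")").split(",")]
    return name, args

class SpatialAgent:
    def __init__(self, llm, tools):
        self.llm = llm      # callable: list of messages -> reply string
        self.tools = tools  # tool name -> python function
        self.masks = {}     # registered object masks, keyed by object id

    def register_masks(self, masks):
        # Register the relevant object masks so tool calls can refer
        # to them by id.
        self.masks.update(masks)

    def answer(self, question, max_turns=5):
        messages = FEW_SHOT_PREFIX + [{"role": "user", "content": question}]
        for _ in range(max_turns):
            reply = self.llm(messages)
            messages.append({"role": "model", "content": reply})
            if reply.startswith("ANSWER"):
                return reply.split(":", 1)[1].strip()
            # Otherwise the model issued a tool call: execute it and feed
            # the result back into the conversation for the next turn.
            name, args = parse_call(reply)
            result = self.tools[name](*(self.masks[a] for a in args))
            messages.append({"role": "user", "content": f"RESULT: {result}"})
        return None

def scripted_llm(messages):
    # Stub model: first turn issues a tool call, next turn answers
    # with the tool result it was fed back.
    if messages[-1]["content"].startswith("RESULT"):
        return "ANSWER: " + messages[-1]["content"].split(": ", 1)[1]
    return "CALL left_of(obj_1, obj_2)"

agent = SpatialAgent(scripted_llm, {"left_of": lambda a, b: a["cx"] < b["cx"]})
agent.register_masks({"obj_1": {"cx": 10}, "obj_2": {"cx": 40}})
print(agent.answer("Is obj_1 left of obj_2?"))  # → True
```

The key design point is that tool results re-enter the message history as ordinary turns, so the model can chain several tool calls before committing to an answer.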
During its operation, the agent interacts with a predefined set of spatial APIs through specific commands. These APIs include functions for distance estimation, object inclusion, relative positioning (like left/right), and region queries (e.g., most left, middle). The results from these tool executions are fed back to the LLM, allowing it to iteratively refine its reasoning until a final answer is produced.
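An illustrative version of such a tool set is below. The function names and signatures are assumptions for the sketch, and each tool operates on simple object records with a 2D centroid; in the actual system, distance and inclusion are backed by learned models rather than raw pixel geometry.

```python
# Toy spatial tool API: each function answers one primitive spatial query
# over object records of the form {"centroid": (x, y)} (names illustrative).

def left_of(a, b):
    # True if object a's centroid lies left of object b's.
    return a["centroid"][0] < b["centroid"][0]

def distance(a, b):
    # Placeholder Euclidean centroid distance in pixels; the real system
    # uses a learned regressor to estimate metric distance.
    ax, ay = a["centroid"]
    bx, by = b["centroid"]
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def inside(a, region):
    # True if a's centroid falls in the region's bounding box (x0, y0, x1, y1).
    x, y = a["centroid"]
    x0, y0, x1, y1 = region["bbox"]
    return x0 <= x <= x1 and y0 <= y <= y1

def most_left(objects):
    # Region query: the object whose centroid has the smallest x-coordinate.
    return min(objects, key=lambda o: o["centroid"][0])

def middle(objects):
    # Region query: the object closest to the group's mean x-coordinate.
    mean_x = sum(o["centroid"][0] for o in objects) / len(objects)
    return min(objects, key=lambda o: abs(o["centroid"][0] - mean_x))

SPATIAL_TOOLS = {
    "left_of": left_of, "distance": distance, "inside": inside,
    "most_left": most_left, "middle": middle,
}
```

Exposing each primitive as a separate named tool keeps the LLM's job simple: it only has to choose a function and its arguments, while the geometry stays in deterministic code.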
For simpler spatial relationships, such as determining if an object is to the left or right, the system utilizes the object mask centroid coordinates. For more complex tasks like distance estimation and determining if an object is inside a specific region, the researchers trained deep learning models. The Distance Estimation Model uses a ResNet-50 backbone and employs a cascaded approach, where a second model is used for more accurate predictions when distances are less than 3 meters. Similarly, an Inclusion Classification Model, also based on ResNet-50, is trained to determine if one object is spatially included within another, particularly for buffer regions.
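The two mechanisms above can be sketched in a few lines. The centroid computation is standard; the cascade logic shows only the routing between the two models, with stand-in callables where the paper trains ResNet-50 regressors, and the 3-meter threshold taken from the description above.

```python
# Sketch of (1) mask centroids for simple relations and (2) the cascaded
# distance scheme: a coarse model predicts first, and near-range cases
# (< 3 m) are re-estimated by a second, specialized model.

NEAR_RANGE_METERS = 3.0

def mask_centroid(mask):
    # mask: 2D list of 0/1 values; returns the (x, y) centroid of the
    # foreground pixels, used for left/right-style comparisons.
    xs = ys = n = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                xs += x
                ys += y
                n += 1
    return (xs / n, ys / n)

def cascaded_distance(crop, coarse_model, near_model):
    # coarse_model / near_model: callables mapping an image crop to meters
    # (stand-ins here for the trained ResNet-50 regressors).
    d = coarse_model(crop)
    if d < NEAR_RANGE_METERS:
        # Hand near-range cases to the specialist for a finer estimate.
        d = near_model(crop)
    return d
```

The cascade reflects a common regression trick: a model trained only on the near range can resolve small distances more precisely than one trained across the full range.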
The SpatialAgent system was rigorously evaluated on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset, a large-scale synthetic benchmark with rich multimodal inputs: RGB-D image pairs, object masks, and natural-language QA pairs spanning spatial relations, multi-choice selection, distance estimation, and object counting. The system achieved 95.86% accuracy on the test set, taking 1st place among all participating teams in the 9th AI City Challenge Track 3.
This work represents a significant step forward in spatial understanding for AI systems, bridging the gap between perception and high-level reasoning. The SpatialAgent system offers a practical and highly accurate solution for warehouse spatial understanding, paving the way for more intelligent and autonomous systems in complex indoor environments. For more technical details, the code is available at https://github.com/hsiangwei0903/SpatialAgent.


