TLDR: The NavSpace benchmark introduces six categories of spatial intelligence tasks to systematically evaluate how navigation agents follow human instructions, revealing that current multimodal large language models and lightweight navigation models struggle with dynamic spatial reasoning. The paper proposes SNav, a new spatially intelligent navigation model, which significantly outperforms existing agents on NavSpace and in real-world robot tests, establishing a strong baseline for future advancements in embodied navigation.
In the exciting field of embodied intelligence, where robots learn to interact with the real world, a crucial challenge is enabling them to follow human instructions for navigation. While many existing benchmarks focus on understanding language and visual cues, they often miss a critical component: spatial intelligence. Imagine telling a robot to “walk around the front dining table and find my bag” or “go down to the bottom floor and see what my friends are doing.” These everyday instructions require a robot to perceive and reason about space, scale, object relationships, and environmental conditions, capabilities that haven’t been systematically evaluated until now.
Researchers have introduced a groundbreaking new benchmark called NavSpace to address this gap. NavSpace is designed specifically to test the spatial intelligence of navigation agents. It features six distinct categories of tasks, comprising 1,228 pairs of trajectories and instructions, all crafted to probe how well robots understand and navigate space. These categories include:
Vertical Perception
This tests a robot’s ability to understand and navigate different floor levels, whether explicitly stated (e.g., “Go to the second floor”) or implied (e.g., “Go to a higher floor” or “Go to the topmost floor”). It requires the robot to identify its current floor and the target floor for effective route planning.
Precise Movement
This category evaluates how accurately an agent can interpret detailed distances and angles in instructions, such as “Turn right 180°, go straight 1 m, turn left 90° and go 5 m.” It demands a keen awareness of spatial scale and the ability to translate such instructions into exact navigation actions; a minimal sketch of this translation appears after the category list.
Viewpoint Shifting
This is a fascinating test of spatial imagination. The robot must be able to switch its perspective, for example, by imagining itself as an object in the room and then navigating based on that object’s viewpoint. This requires long-term memory and reasoning over its entire movement history.
Spatial Relationship
This category focuses on understanding the order and relative positions of multiple objects or rooms. Instructions might involve counting (e.g., “turn left at the third door”) or understanding relationships between objects (e.g., “stop between the two brown sofas”).
Environment State
Here, the agent must perceive the current state of the environment and make decisions based on it. This often involves “if…otherwise…” scenarios, like “if you see the keys, stop, otherwise go to the front door and check.”
Space Structure
This assesses the agent’s understanding of spatial layouts and its ability to perform complex navigation behaviors like circling an object, making round trips, or finding extreme locations (e.g., the farthest sofa).
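To make the Precise Movement category concrete, here is a minimal Python sketch of how a metric instruction might be decomposed into the discrete forward/turn actions many navigation agents use. This is purely illustrative, not the paper’s method; the 0.25 m forward step and 15° turn increment are assumed values, not taken from NavSpace.

```python
import re

# Assumed discrete action space (common in VLN setups, not specified by NavSpace):
FORWARD_STEP_M = 0.25   # metres covered per FORWARD action
TURN_STEP_DEG = 15      # degrees rotated per TURN_LEFT / TURN_RIGHT action

def parse_precise_instruction(text):
    """Decompose e.g. 'Turn right 180°, go straight 1 m, turn left 90° and go 5 m'
    into a flat list of discrete actions. Purely illustrative."""
    actions = []
    # Match clauses like 'turn right 180' or 'go straight 1' (units ignored by the pattern).
    for verb, value in re.findall(r"(turn left|turn right|go straight|go)\s+([\d.]+)", text.lower()):
        amount = float(value)
        if verb.startswith("turn"):
            steps = round(amount / TURN_STEP_DEG)
            actions += ["TURN_LEFT" if "left" in verb else "TURN_RIGHT"] * steps
        else:
            steps = round(amount / FORWARD_STEP_M)
            actions += ["FORWARD"] * steps
    return actions

print(parse_precise_instruction("Turn right 180°, go straight 1 m, turn left 90° and go 5 m"))
# 12 right turns, 4 forward steps, 6 left turns, 20 forward steps
```

The hard part for a real agent, of course, is not the parsing but grounding each step in egocentric observations; the sketch only shows the target behavior the instruction specifies.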
To build NavSpace, the team first conducted a questionnaire survey to identify these key spatial intelligence categories. The data pipeline then involved teleoperating agents in a simulated environment to record navigation trajectories, using large language models (such as GPT-5) to help draft the corresponding instructions, and, finally, human cross-validation to ensure the instructions were accurate and executable.
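The paper describes this pipeline at a high level; as a rough mental model, the flow might look like the sketch below. The three callables are hypothetical placeholders for teleoperation, LLM instruction drafting, and human validation, not the authors’ actual tooling.

```python
def build_trajectory_instruction_pairs(episodes, teleoperate, draft_instruction, human_check):
    """Hypothetical outline of the data-collection flow described above.
    `teleoperate`, `draft_instruction`, and `human_check` are placeholder callables."""
    pairs = []
    for episode in episodes:
        # 1. A human teleoperates the agent in simulation; the route is recorded.
        trajectory = teleoperate(episode)
        # 2. An LLM drafts a candidate instruction describing the recorded trajectory.
        instruction = draft_instruction(trajectory)
        # 3. Human cross-validation keeps only instructions that are accurate and executable.
        if human_check(trajectory, instruction):
            pairs.append((trajectory, instruction))
    return pairs
```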
The evaluation of 22 existing navigation agents on NavSpace, including state-of-the-art navigation models and multimodal large language models (MLLMs) like GPT-5 and Gemini Pro 2.5, revealed some critical insights. Most open-source MLLMs performed poorly, with average success rates below 10%, similar to random chance. Even proprietary MLLMs, while better, still had average success rates below 20%. This suggests that current MLLMs, despite their impressive language and visual understanding, struggle significantly with the dynamic spatial reasoning required for embodied navigation.
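For context on the numbers above: success rate is the fraction of episodes in which the agent stops close enough to the goal. The 3 m threshold in this sketch is a common VLN convention assumed here for illustration; the paper may define success differently.

```python
def success_rate(stop_to_goal_distances_m, threshold_m=3.0):
    """Fraction of episodes where the agent stopped within `threshold_m` of the goal.
    The 3 m default is an assumed convention, not taken from the NavSpace paper."""
    if not stop_to_goal_distances_m:
        return 0.0
    successes = sum(1 for d in stop_to_goal_distances_m if d <= threshold_m)
    return successes / len(stop_to_goal_distances_m)

# Example: distances (metres) from each episode's stop position to its goal.
print(success_rate([0.8, 4.2, 2.5, 7.0, 1.1]))  # 0.6
```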
Lightweight navigation models also showed limited capabilities. However, navigation large models like NaVid and StreamVLN demonstrated better performance, hinting at preliminary spatial intelligence. Building on these findings, the researchers proposed a new model called SNav. SNav is designed specifically to enhance spatial intelligence: it is fine-tuned on specially generated navigation data covering cross-floor navigation, precise movement, environment-state inference, and spatial-relationship understanding.
SNav significantly outperformed all other models on the NavSpace benchmark, establishing a strong baseline for future work. Real-world tests conducted with a quadruped robot, AgiBot Lingxi D1, in office, campus, and outdoor environments further validated SNav’s superior performance across various spatial intelligence categories, excluding vertical perception. These real-world results underscore the practical applicability of SNav’s enhanced spatial reasoning.
The research highlights that existing static spatial intelligence benchmarks don’t fully capture the dynamic action-oriented nature of embodied navigation. It also points out that while MLLMs can sometimes answer spatial questions correctly, they often fail to translate this perception into consistent and accurate navigation actions. The work emphasizes the need for substantial improvements in spatial perception and enhanced inferential mechanisms to translate this perception into effective action decisions for navigation agents. For more details, you can read the full research paper here.