TLDR: Spatial Policy (SP) is a new framework for robotic manipulation that addresses the lack of spatial awareness in existing visuomotor models. It uses explicit spatial modeling and reasoning through a ‘spatial plan table’ to guide video generation, predict actions, and refine plans with feedback. SP significantly improves robot success rates in complex tasks by enabling more robust and consistent control, achieving an 86.7% average success rate across 11 diverse tasks.
Robotic manipulation has seen significant advancements, especially with models that combine high-level planning with low-level actions. These ‘visuomotor’ methods use visual information to guide robots. However, a key challenge has been the lack of ‘spatial awareness’ – the ability for robots to truly understand and reason about the physical space around them. This limitation often prevents them from effectively translating visual plans into precise actions in complex real-world environments.
To tackle this, researchers have introduced a new framework called Spatial Policy (SP). SP is designed to give robots explicit spatial modeling and reasoning capabilities, making their visuomotor manipulation more robust and reliable. The core idea behind SP is to use a ‘spatial plan table’ that acts as a guide, ensuring that the robot’s visual predictions and actions are spatially consistent.
How Spatial Policy Works
The Spatial Policy framework is built upon three interconnected modules:
1. Spatial-Conditioned Embodied Video Generation: This module is responsible for creating future video trajectories, essentially imagining how a task will unfold. Unlike previous methods that might generate physically implausible scenarios (like a robot arm passing through a wall), SP conditions its video generation on a structured spatial plan table. This table contains atomic actions, directional vectors, and relative distances, ensuring that the imagined future is spatially coherent and aligned with the task.
2. Spatial-Based Action Prediction: Once the spatially grounded video plan is generated, this module translates it into executable actions for the robot. It uses spatial coordinates to capture the fine-grained motion dynamics between the predicted video frames. This allows the robot to infer actions that are consistent with the visual plan and adapt to spatial variations.
3. Spatial Reasoning Feedback Policy: To maintain spatial consistency throughout the task execution, SP includes a feedback mechanism. This module monitors the robot’s progress and, if it detects issues like positional drift or the robot getting stuck, it triggers a ‘dual-stage replanning’. This involves refining the spatial plan table and regenerating new video and action sequences, ensuring the robot can recover from unexpected events and continue towards its goal.
The spatial plan table itself is generated by taking the relative offset between the robot’s end-effector and the target object, and feeding this information into a powerful vision-language model (like GPT-4o). This model then produces a structured plan of sequential subgoals, each detailing an action type, a direction, and a distance.
Also Read:
- Autonomous Robots Navigate Complex Industrial Spaces with Hybrid AI Control
- HOSt3R: Capturing Detailed 3D Hand-Object Interactions from Standard Images
Impressive Results and Robustness
Extensive experiments were conducted using the Meta-World benchmark, a collection of 11 diverse robotic manipulation tasks (such as pushing, pulling, grasping, and inserting). Spatial Policy demonstrated significant improvements over existing methods. Without needing real-time replanning, SP achieved an 83.4% success rate, outperforming all prior baselines. When the replanning mechanism (SP-R) was activated, the performance further increased to an impressive 86.7% average success rate across all tasks.
On particularly challenging tasks, SP achieved an average success rate of 77.5%, a substantial improvement compared to the best-performing baseline, which only managed 29.2%. This highlights SP’s ability to handle complex scenarios where spatial precision is critical. The framework also showed robustness to partial visual occlusions, meaning it could still generate coherent plans and execute tasks even when parts of its visual input were masked.
The research paper, available here, details how this explicit spatial modeling and reasoning is crucial for bridging the gap between visual plans and actionable control in complex robotic environments, making embodied models more practical for real-world applications.


