TLDR: The ‘Spatial Traces’ method enhances Vision-Language-Action (VLA) models by integrating spatial and temporal understanding. It achieves this by projecting visual traces of key points onto depth maps, providing a unified input that captures both ‘where’ and ‘how’ objects move. Experiments show significant performance improvements in robotic manipulation tasks with minimal training data, making it valuable for real-world applications.
The paper introduces a new method called “Spatial Traces” to improve Vision-Language-Action (VLA) models, which are used in robotics and task planning. These models help robots understand visual information and text instructions to perform actions in both virtual and real-world settings. You can find the full research paper here: Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding.
Current VLA models are good at predicting robot movements based on what they see and what they’re told. However, they often struggle with a comprehensive understanding of space (where things are) and time (the sequence of events or past interactions). Some models, like SpatialVLA, have tried to add spatial understanding using depth images, while others, like TraceVLA, have focused on temporal information by using visual “traces” of movements. The key innovation of Spatial Traces is that it combines both.
The Spatial Traces Approach
The core idea behind Spatial Traces is to project visual traces of important points (like a robot’s gripper) onto depth maps. A depth map provides information about how far away objects are. By overlaying these traces onto the depth map, the model gets a single, unified visual input that contains both spatial (from the depth map) and temporal (from the traces) information. This allows the VLA model to understand not just where things are, but also how they have moved over time.
For example, if a robot needs to pick up a spoon, the depth map shows its position in 3D space, and the traces show the path the gripper has taken leading up to that moment. This combined information helps the robot make more informed decisions. The resulting model that uses this technique is called ST-VLA.
How It Works
The process involves several steps. First, the model takes in current and past visual observations, along with a text instruction. A depth estimation model predicts a depth map from the current observation. Simultaneously, a trace predictor identifies and tracks key points across the sequence of previous observations, creating visual traces. These traces are then applied to the depth map, effectively "drawing" the movement history onto the spatial representation. The combined image is then encoded into visual embeddings and fed into the VLA model, together with the language instruction, to predict the next action.
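The steps above can be sketched as a small pipeline. All function names below are illustrative placeholders, not the authors' actual components; in the paper these would be a learned depth estimator, a learned trace predictor, and the VLA backbone.

```python
import numpy as np

def estimate_depth(rgb):
    """Placeholder monocular depth estimator: returns an HxW depth map (meters)."""
    h, w, _ = rgb.shape
    return np.full((h, w), 2.0)  # dummy: pretend every pixel is 2 m away

def predict_traces(history):
    """Placeholder key-point trace predictor: returns (row, col) pixels
    tracked across the past observations (here, a dummy diagonal path)."""
    return [(t, t) for t in range(len(history))]

def overlay_traces(depth, trace_pixels, marker=0.0):
    """Draw the movement history onto the depth map by stamping a
    distinctive value at each trace pixel."""
    out = depth.copy()
    for r, c in trace_pixels:
        out[r, c] = marker
    return out

# Toy run: current frame plus a short history of past observations.
history = [np.zeros((8, 8, 3)) for _ in range(4)]
current = history[-1]
depth = estimate_depth(current)
combined = overlay_traces(depth, predict_traces(history))
# `combined` now carries both spatial (depth) and temporal (trace) information
# and would be embedded and passed to the VLA model with the instruction.
```

The key design point is that spatial and temporal cues arrive in a single image-shaped tensor, so no architectural change to the VLA's visual encoder is required.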
The researchers found that the way these traces are applied to the depth map matters. They experimented with different strategies and found that assigning each trace pixel the depth of the nearest object in the current frame was most effective. This makes the traces more distinct and helps the model focus on them as important temporal cues.
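One plausible reading of this strategy can be sketched as follows, assuming "nearest object" means the closest-to-camera (smallest) depth value in the current frame; the exact implementation in the paper may differ.

```python
import numpy as np

def stamp_trace_nearest_depth(depth, trace_pixels):
    """Assign each trace pixel the depth of the nearest object in the
    current frame (sketch; assumes 'nearest' = minimum depth value).
    Stamping foreground depth makes the traces visually distinct
    against the background."""
    nearest = depth.min()  # smallest depth = object closest to the camera
    out = depth.copy()
    for r, c in trace_pixels:
        out[r, c] = nearest
    return out

depth = np.array([[1.5, 3.0],
                  [2.0, 4.0]])
out = stamp_trace_nearest_depth(depth, [(1, 1)])
# the trace pixel at (1, 1) now carries the scene's nearest depth, 1.5
```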
Experimental Results and Impact
The Spatial Traces method was tested in a virtual environment called SimplerEnv, using tasks inspired by real-world robot manipulations from the Bridge dataset. The results showed significant improvements. The ST-VLA model increased the mean number of successfully solved tasks by 4% compared to SpatialVLA and a substantial 19% compared to TraceVLA. This demonstrates the benefit of integrating both spatial and temporal information.
A particularly important finding is that this enhancement can be achieved with very little training data. The ST-VLA model was fine-tuned using only 52 training trajectories, which is a minimal amount for complex robotic tasks. This makes the approach highly valuable for real-world applications where collecting large datasets is often difficult and expensive.
The study also explored how the length of the interaction history (how many past observations are used to create traces) affects performance. Longer histories, specifically using 30 previous images, generally led to more stable and better results, especially for tasks requiring a strong understanding of spatial relationships.
In conclusion, Spatial Traces offers a promising advancement for VLA models by providing a unified way to understand both the spatial layout of an environment and the temporal dynamics of interactions. This leads to more capable and efficient robots, even with limited training data, paving the way for more robust real-world robotic applications.


