TLDR: FALCON is a novel vision-language-action (VLA) model that enhances robot capabilities by integrating robust 3D spatial understanding. It addresses the limitations of existing VLA models, which often struggle with 3D environments due to their 2D foundations. FALCON achieves this through an Embodied Spatial Model that extracts rich 3D information from RGB images (and optionally depth/pose) and a Spatial-Enhanced Action Head that directly uses these spatial tokens for precise action generation. This approach leads to state-of-the-art performance in both simulated and real-world tasks, demonstrating superior generalization, adaptability, and robustness in complex manipulation scenarios.
Robots are becoming increasingly capable, understanding natural language instructions and performing complex tasks. This progress is largely thanks to Vision-Language-Action (VLA) models, which combine visual and linguistic understanding to guide robot actions. However, a significant challenge remains: while robots operate in a 3D physical world, many advanced VLA models are built on 2D image processing, leading to a ‘spatial reasoning gap’. This gap limits their ability to generalize to new environments and adapt to changes in object size, height, or clutter.
Existing attempts to integrate 3D information into VLA models often fall short. Some require specialized 3D sensors, which are expensive and difficult to deploy, and the models struggle to transfer their knowledge when these specific inputs aren’t available. Other methods inject weak 3D cues that lack detailed geometric information or disrupt the crucial vision-language alignment, leading to degraded performance, especially in tasks requiring precise spatial reasoning.
Introducing FALCON: A New Paradigm for Spatial Intelligence
A recent research paper, “From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors”, introduces FALCON (From Spatial to Action), a novel approach designed to overcome these limitations. FALCON injects rich 3D spatial tokens directly into the robot’s action decision-making process, providing robust 3D spatial understanding from standard RGB images alone.
FALCON’s design focuses on three key contributions:
1. Strong Geometric Priors: It leverages insights from spatial foundation models to deliver comprehensive 3D spatial information, even when only RGB camera input is available. This ensures robust performance without relying on specialized 3D sensors.
2. Embodied Spatial Model (ESM): This component is highly flexible. It can optionally integrate additional 3D modalities like depth maps or camera poses when they are available, further enhancing accuracy. Crucially, it does so without requiring any retraining or architectural changes, ensuring excellent ‘modality transferability’ – the ability to perform well across different types of input (see the sketch after this list).
3. Spatial-Enhanced Action Head: Instead of forcing spatial information into the vision-language backbone (which can disrupt language reasoning), FALCON’s action head directly consumes these spatial tokens. This approach is inspired by how the brain separates high-level reasoning from fine-grained motor control, allowing the VLM to maintain its semantic understanding while the action head benefits from precise spatial cues.
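To make the ESM’s flexibility concrete, here is a minimal PyTorch sketch of that optional-modality interface. All names and shapes are hypothetical, as is the assumption that the spatial backbone natively accepts depth and pose keyword arguments; the paper’s actual implementation may differ.

```python
import torch
import torch.nn as nn
from typing import Optional


class EmbodiedSpatialModel(nn.Module):
    """Hypothetical sketch of the ESM interface: one set of weights serves
    both RGB-only and RGB+depth+pose deployments, because extra modalities
    change the inputs, not the architecture."""

    def __init__(self, spatial_backbone: nn.Module):
        super().__init__()
        # Pretrained spatial foundation model, assumed (for this sketch)
        # to accept optional geometric inputs natively.
        self.backbone = spatial_backbone

    def forward(
        self,
        rgb: torch.Tensor,
        depth: Optional[torch.Tensor] = None,
        pose: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Returns (B, N, D) spatial tokens. Depth and pose sharpen the
        # geometry estimate when present and are simply omitted otherwise.
        return self.backbone(rgb, depth=depth, pose=pose)


# Same trained module, two deployment settings, no retraining in between:
# tokens = esm(rgb)                          # RGB-only camera
# tokens = esm(rgb, depth=depth, pose=pose)  # extra sensors available
```

The design point is that additional sensors alter the inputs rather than the network, which is what lets a single trained model move between RGB-only and sensor-rich setups.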
How FALCON Works
At its core, FALCON pairs a 2D Vision-Language Model (VLM) for semantic understanding with the Embodied Spatial Model (ESM) for 3D structural features. The VLM processes visual observations and language instructions to understand the task, while in parallel the ESM extracts detailed 3D geometric information from the scene. The two streams – semantic understanding from the VLM and 3D spatial awareness from the ESM – are then fused within the Spatial-Enhanced Action Head, primarily through a simple yet effective element-wise addition, and the fused features guide the robot to generate precise actions.
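A minimal PyTorch sketch of that fusion step follows. The module and parameter names (`spatial_proj`, `decoder`), the 7-DoF action dimension, and the mean-pooling before decoding are illustrative assumptions rather than the paper’s exact architecture; the element-wise addition is the fusion mechanism the paper describes.

```python
import torch
import torch.nn as nn


class SpatialEnhancedActionHead(nn.Module):
    """Minimal sketch (names and shapes hypothetical): fuse the VLM's
    semantic tokens with the ESM's 3D tokens by element-wise addition,
    then decode an action, leaving the VLM backbone untouched."""

    def __init__(self, dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)  # align ESM tokens to the VLM token space
        self.decoder = nn.Sequential(            # stand-in for the actual action decoder
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, action_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor, spatial_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens:     (B, N, dim) semantics from the 2D VLM
        # spatial_tokens: (B, N, dim) geometry from the ESM
        fused = vlm_tokens + self.spatial_proj(spatial_tokens)  # element-wise addition
        return self.decoder(fused.mean(dim=1))                  # pool tokens -> e.g. 7-DoF action
```

Because the spatial tokens enter only here, the VLM’s own token stream and vision-language alignment are left intact – the paper’s rationale for fusing in the action head rather than in the backbone.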
Impressive Performance in Real and Simulated Worlds
FALCON has been rigorously evaluated across simulation benchmarks (CALVIN and SimplerEnv) and eleven real-world tasks. The results consistently show that FALCON achieves state-of-the-art performance, significantly outperforming existing VLA methods. It demonstrates superior spatial understanding in challenging scenarios, such as manipulating objects in cluttered scenes, adapting to unseen objects or backgrounds, and handling variations in object scale and height.
For instance, in the CALVIN benchmark, FALCON achieved the highest success rates in long-horizon, language-conditioned manipulation tasks. In real-world tests, it showed remarkable robustness, correctly placing objects even when other models struggled with varying sizes or heights. The ability of FALCON to effectively utilize additional geometric information (like depth and camera poses) when available, while still performing strongly with only RGB input, highlights its adaptability and practical utility.
In conclusion, FALCON represents a significant step forward in generalist robotics. By effectively bridging the 2D-3D spatial reasoning gap, it empowers robots with more robust 3D spatial understanding, leading to more precise, adaptable, and generalizable manipulation capabilities in the real world.