TLDR: 3DThinker is a novel AI framework that empowers Vision-Language Models (VLMs) to perform 3D spatial reasoning from limited 2D images by intrinsically forming 3D mental representations. Unlike previous methods, it doesn’t require explicit 3D data or external tools. Its two-stage training process first aligns the VLM’s internal 3D latent space with a 3D foundation model and then refines this ‘3D mentaling’ through outcome-based reinforcement learning. This approach significantly improves spatial understanding and interpretability, outperforms existing baselines across various benchmarks, and generalizes well to new models and datasets.
Recent advancements in artificial intelligence, particularly in Vision-Language Models (VLMs), have opened up new possibilities across various multimodal tasks. However, a significant hurdle remains: enabling these AI systems to truly understand and reason about 3D spatial relationships when only presented with limited 2D views. This challenge is crucial for applications like embodied AI and autonomous driving, where machines need to interact with the real 3D world based on what they see.
Current reasoning methods often fall short. They typically rely on pure text descriptions or basic 2D visual cues, neither of which can adequately represent complex spatial layouts. Some approaches try to enhance inputs with auxiliary data like depth maps or 3D coordinates, but these often require extensive manual annotations or external tools, limiting their real-world applicability and introducing additional computational overhead.
Introducing 3DThinker: Thinking with 3D Mental Imagery
To bridge this gap, researchers have proposed a novel framework called 3DThinker. This framework allows VLMs to effectively leverage the rich geometric information embedded within images to perform 3D spatial reasoning, much like humans do. What makes 3DThinker unique is its ability to enable 3D mental imagery during reasoning without any prior 3D input or reliance on explicitly labeled 3D data for training.
The core idea is to allow the VLM to intrinsically form 3D mental representations. Instead of just processing text or 2D images, 3DThinker generates compact latent embeddings, referred to as ‘3D special tokens,’ that closely emulate the mental 3D scenes humans intuitively imagine during spatial reasoning.
How 3DThinker Works: A Two-Stage Training Approach
3DThinker’s training consists of two main stages:
1. Supervised Training (Stage 1): In this initial stage, the VLM is trained to align its internally generated 3D latent representations with the features from a specialized 3D foundation model, such as VGGT. This alignment process teaches the VLM to understand and form coherent 3D mental images from 2D inputs. To ensure the model maintains its ability to generate coherent text while forming these 3D mental images, both a 3D latent alignment loss and a cross-entropy loss for textual coherence are used.
2. Reinforced Spatial Mentaling (Stage 2): After the supervised training, the framework moves to a reinforcement learning stage. Here, the entire reasoning process is optimized solely based on outcome signals. This means the model refines its underlying 3D mental imagery by learning from the success or failure of its final answers, without needing explicit annotations for intermediate steps. Rewards are designed to encourage correct formatting and accurate answers, and to further optimize the 3D visual tokens by comparing them with VGGT features.
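The two training signals described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the article does not say how the alignment loss is computed, so cosine distance is assumed here, and the function names and all weights (align_weight, w_format, w_answer, w_align) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(projected_latents, target_features):
    """ASSUMPTION: alignment is measured as mean (1 - cosine similarity)
    between each projected 3D latent and its paired target feature."""
    sims = [cosine_similarity(p, t) for p, t in zip(projected_latents, target_features)]
    return 1.0 - sum(sims) / len(sims)


def cross_entropy(token_probs):
    """Mean negative log-probability the model gave to each reference text token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def stage1_loss(projected_latents, target_features, token_probs, align_weight=1.0):
    """Stage 1: text cross-entropy plus a weighted 3D latent alignment term.
    `align_weight` is a hypothetical hyperparameter."""
    return cross_entropy(token_probs) + align_weight * alignment_loss(
        projected_latents, target_features
    )


def stage2_reward(format_ok, answer_correct, projected_latents, target_features,
                  w_format=0.5, w_answer=1.0, w_align=0.5):
    """Stage 2: outcome-based reward from formatting, answer correctness, and
    how well the 3D tokens match the target features. Weights are hypothetical."""
    mean_similarity = 1.0 - alignment_loss(projected_latents, target_features)
    return (w_format * float(format_ok)
            + w_answer * float(answer_correct)
            + w_align * mean_similarity)


# Toy usage: two 3D tokens with 2-dimensional features.
latents = [[1.0, 0.0], [0.0, 2.0]]
targets = [[2.0, 0.0], [0.0, 1.0]]
loss = stage1_loss(latents, targets, token_probs=[0.9, 0.8, 0.95])
reward = stage2_reward(True, True, latents, targets)
```

In this sketch, the same alignment measure serves as a penalty in Stage 1 and as a bonus in Stage 2, which mirrors the article's description of comparing the 3D tokens with VGGT features in both stages.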
A crucial component is a ‘projector’ that transforms the VLM-generated 3D latent embeddings into a compatible feature space for alignment with the 3D foundation model. This allows the model to recover 3D representations, like point clouds, from its latent space, significantly enhancing the interpretability of the reasoning process.
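As a rough illustration of the projector's role, the sketch below maps VLM latent vectors into a feature space of a different size. The article does not describe the projector's architecture, so a single affine layer is assumed here, and the dimensions, weights, and function names are all hypothetical. Recovering a point cloud would additionally require the 3D foundation model's own decoder, which is not sketched.

```python
def project(latent, weight, bias):
    """Affine map of one latent vector: one output value per row of `weight`.
    ASSUMPTION: the projector is modeled as a single linear layer."""
    return [
        sum(w * x for w, x in zip(row, latent)) + b
        for row, b in zip(weight, bias)
    ]


def project_tokens(latents, weight, bias):
    """Project every 3D special-token embedding into the target feature space."""
    return [project(vector, weight, bias) for vector in latents]


# Toy usage: map 4-dimensional VLM latents to 2-dimensional target features.
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
bias = [0.5, -1.0]
latents = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
projected = project_tokens(latents, weight, bias)
```

Once the latents live in the same space as the 3D foundation model's features, they can be compared directly for alignment or passed on for decoding.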
Key Contributions and Performance
3DThinker is the first framework to introduce the concept of ‘thinking with 3D mentaling’ without relying on densely labeled training data. Its two-stage training scheme fosters intrinsic geometry awareness without external priors. The ability to recover 3D representations from the latent space also addresses the interpretability challenge often found in large reasoning models.
Extensive experiments across multiple benchmarks, including MindCube-Tiny and Ego3D-Bench, demonstrate that 3DThinker consistently outperforms strong baselines. It shows significant performance gains, sometimes more than doubling the accuracy on certain tasks, and even surpasses advanced closed-source models. Importantly, 3DThinker generalizes well across different base VLMs and datasets, remaining effective even on data it wasn’t specifically trained on.
This innovative approach offers a new perspective towards unifying 3D representations into multimodal reasoning, paving the way for AI systems with a more profound understanding of our 3D world. You can read the full research paper for more technical details and results here: Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views.


