TLDR: DeepThink3D is a new framework that significantly improves large language models’ (LLMs) ability to perform complex reasoning in 3D environments. It uses a two-stage optimization process, Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), to teach LLMs to generate more accurate and executable code for interacting with 3D scenes. The framework also augments training data with more complex questions, leading to superior performance, better interpretability, and higher code reliability compared to previous methods in 3D situated reasoning tasks.
Large Language Models (LLMs) have shown remarkable capabilities in understanding and generating human language. However, when it comes to navigating and reasoning within complex 3D physical environments, these models often face significant hurdles. A new research paper introduces DeepThink3D, a novel framework designed to enhance LLMs’ ability to perform intricate reasoning tasks in 3D scenes by leveraging programmatic reasoning.
Current approaches to 3D situated reasoning, where agents need to understand and interact with 3D environments from a first-person perspective, often fall short. Many rely on end-to-end multimodal training, which can struggle to generalize to new environments, lacks transparency in decision-making, and depends heavily on expensive annotated data. Other methods use LLMs to generate code that interacts with 3D environments through APIs, but they have their own problems: reasoning tends to be weak (reasoning and acting are often mixed together), and the generated code is not always executable or correct.
DeepThink3D addresses these limitations by introducing a structured, two-stage optimization approach to systematically improve how LLMs generate and use code. The core idea is that complex reasoning requires not just selecting the right tools, but also composing them into logical and executable sequences.
How DeepThink3D Works
The framework begins by processing a 3D scene using visual perception modules to identify objects and their categories. This information, along with the task question and API documentation, is fed into the LLM. The LLM then breaks down the complex question into a series of reasoning steps, translating each step into executable Python code that interacts with the 3D scene via specific APIs.
The APIs available to the LLM include:
- Scene Description (SD): Provides a complete overview of all objects in the 3D scene.
- Object Filtering (OF): Filters objects based on their category (e.g., finding all ‘tables’).
- Object Querying by Relation (OQR): Identifies objects based on their spatial relationships to a reference object or the agent (e.g., ‘objects on the table’, ‘table to my left’).
- Object Information Querying (OIQ): Retrieves detailed attributes of an object, such as color, shape, material, size, or distance from the agent.
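To make the four API families more concrete, here is a minimal Python sketch of how they might be composed to answer a question such as "What color is the object on the table to my left?". Everything in it is an assumption made for illustration: the class and method names (`Scene`, `filter_objects`, `query_by_relation`, `query_info`), the toy relation rules, and the sample objects are hypothetical and are not the paper's actual interface.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    category: str
    color: str
    position: tuple  # (x, y, z) in metres; the agent sits at the origin (assumed convention)


class Scene:
    """Toy stand-in for the SD / OF / OQR / OIQ APIs (hypothetical names)."""

    def __init__(self, objects):
        self.objects = list(objects)

    def scene_description(self):
        """SD: return every object in the scene."""
        return list(self.objects)

    def filter_objects(self, category):
        """OF: keep only objects of the given category."""
        return [o for o in self.objects if o.category == category]

    def query_by_relation(self, relation, reference=None):
        """OQR: two toy relations, relative to the agent or to a reference object."""
        if relation == "left_of_agent":
            return [o for o in self.objects if o.position[0] < 0]
        if relation == "on" and reference is not None:
            rx, ry, rz = reference.position
            return [
                o for o in self.objects
                if o is not reference
                and o.position[2] > rz
                and math.hypot(o.position[0] - rx, o.position[1] - ry) < 0.5
            ]
        raise ValueError(f"unsupported relation: {relation}")

    def query_info(self, obj, attribute):
        """OIQ: return an attribute of an object, or its distance from the agent."""
        if attribute == "distance":
            return math.dist((0.0, 0.0, 0.0), obj.position)
        return getattr(obj, attribute)


scene = Scene([
    SceneObject("table_1", "table", "brown", (-1.0, 2.0, 0.4)),
    SceneObject("cup_1", "cup", "red", (-1.1, 2.1, 0.8)),
    SceneObject("chair_1", "chair", "black", (1.5, 1.0, 0.45)),
])


def answer(scene):
    # Step 1: find tables, then keep the ones to the agent's left.
    tables = scene.filter_objects("table")
    left = scene.query_by_relation("left_of_agent")
    left_tables = [t for t in tables if t in left]
    # Step 2: find what is on that table and read its color.
    on_table = scene.query_by_relation("on", reference=left_tables[0])
    return scene.query_info(on_table[0], "color")


print(answer(scene))
```

The point of the sketch is the composition: each reasoning step maps to one API call, so a failure can be traced to a specific step rather than to an opaque end-to-end prediction.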
A crucial part of DeepThink3D is its two-layer loop for code refinement during training. If the generated code fails or produces an incorrect answer, the LLM receives feedback and iteratively refines its reasoning and code until a correct solution is found or a maximum number of attempts is reached. This iterative process is key to generating high-quality training data.
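The refinement idea can be sketched as a generate, execute, and feed-back loop. This is a simplified, single-loop illustration under stated assumptions: the summary above does not spell out how the two layers are divided, and the function names (`run_program`, `refine`), the feedback wording, the `answer` variable convention, and the stub generator standing in for the LLM are all hypothetical.

```python
import traceback


def run_program(code, env):
    """Execute generated code and return (result, error_message)."""
    namespace = dict(env)
    try:
        exec(code, namespace)  # the program is assumed to store its result in `answer`
        return namespace.get("answer"), None
    except Exception:
        return None, traceback.format_exc(limit=1)


def refine(generate, question, gold_answer, env, max_attempts=3):
    """Retry until the program runs and matches the gold answer, or attempts run out."""
    feedback = None
    trajectory = []  # (code, outcome) pairs, reusable later as training data
    for _ in range(max_attempts):
        code = generate(question, feedback)
        result, error = run_program(code, env)
        if error is not None:
            feedback = f"Execution failed: {error}"
        elif result != gold_answer:
            feedback = f"Program ran but returned {result!r}, which is incorrect."
        else:
            trajectory.append((code, "correct"))
            return code, trajectory
        trajectory.append((code, feedback))
    return None, trajectory


def stub_generate(question, feedback):
    """Stand-in for the LLM: fails to execute, then answers wrongly, then succeeds."""
    if feedback is None:
        return "answer = colour_of(cup)"  # undefined names, raises NameError
    if feedback.startswith("Execution failed"):
        return "answer = 'blue'"
    return "answer = 'red'"


code, trajectory = refine(stub_generate, "What color is the cup?", "red", env={})
print(code)
print(len(trajectory))
```

Keeping the failed attempts in `trajectory` matters: the successful path is what a fine-tuning stage would learn from, while the failures are natural candidates for the rejected side of a preference pair.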
Two-Stage Optimization
DeepThink3D employs two main optimization stages:
1. Reasoning-Oriented Supervised Fine-Tuning (SFT): This stage trains the LLM on successful reasoning paths and code generations. It teaches the model how to reason step-by-step, analyze feedback, and correct itself iteratively. This helps the model decompose complex tasks and develop a more structured problem-solving approach.
2. Execution-Oriented Direct Preference Optimization (DPO): Building on SFT, DPO refines the model’s ability to generate executable and correct code. It learns from pairs of programs: a ‘chosen’ program that executes and produces the correct answer, and a ‘rejected’ program that either fails to execute or produces an incorrect result. This direct comparison pushes the model toward solutions that are not only logically sound but also practically executable.
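As a rough illustration of the second stage, the sketch below builds chosen/rejected pairs from a list of attempts and evaluates the standard DPO loss for one pair. The loss shown is the generic DPO objective; whether DeepThink3D uses exactly this form, and with which `beta`, is not stated here, so treat the default value, the helper names (`build_pairs`, `dpo_loss`), and the example log-probabilities as assumptions.

```python
import math


def build_pairs(question, attempts):
    """Pair the first correct program with every failed or incorrect one.

    `attempts` is a list of (code, is_correct) tuples (hypothetical format).
    """
    chosen = next((code for code, ok in attempts if ok), None)
    if chosen is None:
        return []
    return [
        {"prompt": question, "chosen": chosen, "rejected": code}
        for code, ok in attempts
        if not ok
    ]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((chosen - ref_chosen) - (rejected - ref_rejected))))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pairs = build_pairs(
    "What color is the cup?",
    [("answer = colour_of(cup)", False), ("answer = 'red'", True), ("answer = 'blue'", False)],
)
print(len(pairs))

# The loss is lower when the policy favours the chosen program more than the reference does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2, which is a convenient sanity check for any implementation.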
Data Augmentation for Enhanced Reasoning
Recognizing that many existing 3D reasoning datasets, like SQA3D, contain relatively simple questions, DeepThink3D also incorporates an LLM-based data augmentation strategy. It combines simpler questions from the SQA3D dataset into more complex, multi-step questions, thereby creating more challenging training data. This augmentation significantly enhances the model’s ability to handle deeper reasoning tasks.
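One way to picture this augmentation step is as a prompt-building routine that pairs up simple questions from the same scene and asks an LLM to merge them. The sketch below is only an assumption about how such a step could look: the prompt wording, the record fields (`scene_id`, `question`, `answer`), the sample questions, and the `llm` callable are hypothetical and are not taken from the paper or from the SQA3D data format.

```python
from itertools import combinations


def build_prompt(qa_a, qa_b):
    """Ask the LLM to merge two simple questions into one multi-step question."""
    return (
        "Combine the two questions below, which refer to the same 3D scene and "
        "agent situation, into ONE question that needs both reasoning steps.\n"
        f"Q1: {qa_a['question']} (answer: {qa_a['answer']})\n"
        f"Q2: {qa_b['question']} (answer: {qa_b['answer']})\n"
        "Return the new question and its answer."
    )


def augment(simple_qas, llm):
    """Group questions by scene, then merge every pair within a scene."""
    by_scene = {}
    for qa in simple_qas:
        by_scene.setdefault(qa["scene_id"], []).append(qa)

    augmented = []
    for scene_id, qas in by_scene.items():
        for qa_a, qa_b in combinations(qas, 2):
            augmented.append({
                "scene_id": scene_id,
                "sources": (qa_a["question"], qa_b["question"]),
                "generated": llm(build_prompt(qa_a, qa_b)),
            })
    return augmented


simple_qas = [
    {"scene_id": "scene_a", "question": "What is to my left?", "answer": "table"},
    {"scene_id": "scene_a", "question": "What color is the cup on the table?", "answer": "red"},
    {"scene_id": "scene_b", "question": "How many chairs are there?", "answer": "two"},
]

# A stub stands in for the real LLM call so the sketch runs offline.
augmented = augment(simple_qas, llm=lambda prompt: "<LLM output goes here>")
print(len(augmented))
```

Grouping by scene before pairing keeps the merged question answerable from a single scene, which is the property that makes the combined question a valid multi-step task.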
Performance and Impact
Evaluated on the SQA3D dataset, DeepThink3D achieved the highest accuracy among the compared methods, which included both end-to-end multimodal models and other LLM-based approaches. The framework also shows strong code reliability: over 75% of correctly answered questions were solved on the first execution. This suggests that the SFT and DPO stages make the reasoning and code generation pipeline both more robust and more efficient.
Ablation studies confirmed that both SFT and DPO are vital for the model’s performance, and the data augmentation significantly boosts its ability to tackle complex scenarios. While the framework shows promising results, the authors acknowledge limitations, such as reliance on the quality of underlying visual perception models and occasional deviations in LLM reasoning from human intuition.
DeepThink3D represents a significant step forward in enabling LLMs to understand and interact with the 3D world more effectively, offering a scalable and interpretable solution for embodied reasoning.


