TLDR: 3Dify is a novel framework that enables users to generate complex 3D computer graphics (3D-CG) solely through natural language instructions. Built on the open-source Dify platform, it integrates Large Language Models (LLMs) with advanced technologies like Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG). The system automates Digital Content Creation (DCC) tools such as Blender and Unreal Engine, incorporates an interactive feedback loop for refining generated images, and supports the use of local LLMs to enhance security and reduce costs. This allows for efficient, flexible, and accessible 3D content creation without manual modeling.
Creating intricate 3D computer graphics (3D-CG) has traditionally been complex and time-consuming, often requiring specialized skills and extensive manual effort. A new framework called 3Dify aims to change this by letting users generate detailed 3D content from simple natural language instructions, powered by Large Language Models (LLMs).
Developed by researchers from Nagoya University and Kyushu University, 3Dify aims to democratize 3D-CG production, making it accessible even to non-experts. The framework is built upon Dify, an open-source platform for AI application development, and integrates cutting-edge LLM technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG).
How 3Dify Works: A Seamless Workflow
The process of generating 3D content with 3Dify is designed to be intuitive and iterative:
First, users simply input their desired 3D image description in natural language. For example, “Create a desktop gaming PC model with side panel removed, keeping all internal components fully visible.”
Next, an LLM presents multiple 2D image candidates as pre-visualizations. Users can select the images closest to their vision, and the LLM learns from these selections to generate new, refined candidates. This feedback loop continues until the user is satisfied with a pre-visualization that closely matches their intent.
Finally, based on the refined pre-visualization, a Digital Content Creation (DCC) tool, such as Blender or Unreal Engine, automatically creates the corresponding 3D image. Users do not need to perform any manual 3D modeling operations within the DCC tool; 3Dify automates the entire process.
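The select-and-refine loop in the second step can be sketched in a few lines of Python. This is a toy simulation, not 3Dify's implementation: `Candidate`, `propose`, and `refine` are hypothetical names, a pre-visualization is reduced to two made-up numeric parameters, and the user's choice is simulated by picking the candidate closest to a hidden target.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical pre-visualization, reduced to two numeric style parameters."""
    brightness: float
    detail: float


def distance(a: Candidate, b: Candidate) -> float:
    """How far apart two candidates are (smaller means more similar)."""
    return abs(a.brightness - b.brightness) + abs(a.detail - b.detail)


def propose(center: Candidate, spread: float, n: int, rng: random.Random) -> list[Candidate]:
    """Stand-in for the image-generating LLM: sample n variations around `center`."""
    return [
        Candidate(
            brightness=center.brightness + rng.uniform(-spread, spread),
            detail=center.detail + rng.uniform(-spread, spread),
        )
        for _ in range(n)
    ]


def refine(target: Candidate, rounds: int = 6, n: int = 4, seed: int = 0) -> Candidate:
    """Simulate the feedback loop; the 'user' always picks the candidate nearest `target`."""
    rng = random.Random(seed)
    current = Candidate(0.5, 0.5)  # assumed neutral starting point
    spread = 0.4
    for _ in range(rounds):
        # Keep the previous pick in the pool so a round can never make things worse.
        candidates = propose(current, spread, n, rng) + [current]
        current = min(candidates, key=lambda c: distance(c, target))
        spread *= 0.7  # narrow the variations as the selections converge
    return current


if __name__ == "__main__":
    wanted = Candidate(brightness=0.9, detail=0.2)
    print(refine(wanted))
```

In the real system a person makes the selection and the candidates are images, but the shape is the same: generate variations, take the selection as the new reference point, and repeat until the user accepts a result.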
Key Innovations and Features
3Dify stands out with several distinctive features:
Dify-based Implementation: By extending Dify, an open-source platform, 3Dify can rapidly adopt the latest AI technologies and easily switch between various LLM models from providers like OpenAI, Anthropic, and Google. Its open-source nature also ensures long-term maintainability and extensibility.
Automated DCC Tool Operation: 3Dify employs two primary methods to control DCC tools. The Model Context Protocol (MCP) provides a simple and secure way for LLM agents to interact with applications. For tools or functions not supporting MCP, 3Dify utilizes the Computer-Using Agent (CUA) method, which allows LLMs to directly operate graphical user interfaces (GUIs) through screenshots, using specialized models like UI-TARS.
Retrieval-Augmented Generation (RAG): To enhance its generation capabilities and maintainability, 3Dify uses RAG. This allows LLMs to reference external information, such as DCC tool manuals and documentation, improving functional coverage and adaptability to software updates.
Image-Selection Feedback Loop: This interactive mechanism allows users to iteratively refine the generated images. When the user selects preferred candidates, the LLM automatically recognizes the patterns that vary among them and applies those patterns to subsequent generations, so the final output aligns closely with user preferences.
Support for Local LLMs: Users can integrate locally deployed LLMs, leveraging their own computational resources. This reduces the cost of external API calls, allows the use of custom-developed models, and keeps sensitive data from leaking to external services.
Extensibility: Beyond 3D-CG production, 3Dify’s use of CUA enables it to access and automate a wide range of features within DCC tools, including game development and animation creation, making it a versatile framework for broader applications.
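To make the RAG feature above more concrete, here is a minimal retrieval sketch. It is illustrative only: the manual snippets are invented, `retrieve` and `build_prompt` are hypothetical helpers, and simple word overlap stands in for the embedding-based search a real RAG pipeline would typically use.

```python
import re
from collections import Counter

# Hypothetical manual snippets; real ones would come from the DCC tool's documentation.
MANUAL = [
    "Add a cube mesh to the scene with the mesh add menu.",
    "Set an emission shader on a material to make an object glow.",
    "Export the scene to FBX from the file export menu.",
]


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k snippets sharing the most words with the query."""
    q = tokenize(query)
    # Counter intersection keeps the minimum count of each shared word.
    return sorted(docs, key=lambda d: sum((q & tokenize(d)).values()), reverse=True)[:k]


def build_prompt(instruction: str, docs: list[str]) -> str:
    """Prepend the retrieved reference text to the user's instruction."""
    context = "\n".join(retrieve(instruction, docs))
    return f"Reference:\n{context}\n\nInstruction: {instruction}"


if __name__ == "__main__":
    print(build_prompt("make the case fans glow", MANUAL))
```

Because the reference text is looked up at request time, updating the documentation store is enough to expose the LLM to new or changed tool features, which is the maintainability benefit described above.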
Under the Hood: Multiple LLM Agents and Smart Interactions
The framework’s sophisticated architecture involves three distinct LLM agents:
Visualizer LLM: Responsible for generating the initial 2D pre-visualization images and refining them based on user feedback.
Planner LLM: Analyzes the refined pre-visualization, predicts the necessary variations for the 3D model, extracts procedural parameters, and communicates the procedure to the Manager LLM.
Manager LLM: Receives instructions from the Planner LLM and directly operates the DCC tool to create the 3D-CG. It can also interact with the user for clarification if needed.
The system also uses Dify’s Chatflow feature to manage complex, multi-turn interactions and dynamic workflows, ensuring smooth communication between agents and the user.
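The division of labor among the three agents can be sketched as a small pipeline. Everything here is a simplified assumption: the class and method names are hypothetical, the LLM calls are replaced by stubs, the action names are invented, and the MCP-first, CUA-fallback routing in `Manager` reflects the two control methods described earlier rather than any documented 3Dify interface.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One procedural instruction passed from the Planner to the Manager."""
    action: str
    params: dict


class Visualizer:
    """Stub for the agent that produces the 2D pre-visualization."""

    def previsualize(self, prompt: str) -> str:
        return f"previz({prompt})"


class Planner:
    """Stub for the agent that turns a pre-visualization into procedural steps."""

    def plan(self, previz: str) -> list[Step]:
        return [
            Step("create_object", {"source": previz}),
            Step("render_preview", {}),
        ]


@dataclass
class Manager:
    """Stub for the agent that drives the DCC tool."""
    mcp_actions: set[str]  # actions assumed to be exposed through MCP
    log: list[str] = field(default_factory=list)

    def execute(self, steps: list[Step]) -> list[str]:
        for step in steps:
            # Prefer MCP when the action is exposed; otherwise fall back to GUI control.
            route = "MCP" if step.action in self.mcp_actions else "CUA"
            self.log.append(f"{route}:{step.action}")
        return self.log


def run(prompt: str) -> list[str]:
    """Chain the three agents: Visualizer, then Planner, then Manager."""
    previz = Visualizer().previsualize(prompt)
    steps = Planner().plan(previz)
    return Manager(mcp_actions={"create_object"}).execute(steps)


if __name__ == "__main__":
    print(run("desktop gaming PC with side panel removed"))
```

Keeping the agents behind narrow interfaces like these means each one can be backed by a different model, which fits the framework's emphasis on switching LLM providers.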
Demonstration and Future Outlook
In a demonstration, 3Dify successfully generated a 3D model of a desktop PC in Blender from a single natural language prompt. Further instructions, such as making case fans glow, were also successfully executed. Challenges remain, such as maintaining spatial coherence when a scene contains many objects or the instructions are complex, but the framework shows considerable promise. The current demonstration relied primarily on MCP for automation, which suggests that integrating visual information through CUA could further improve accuracy and versatility.
3Dify represents a significant step forward in procedural 3D-CG generation, offering an efficient and flexible approach to creating complex 3D content. By combining the power of LLMs with automated DCC tool operations and interactive feedback, it paves the way for a future where 3D design is as simple as describing your vision in words. You can find more details about this innovative framework in the full research paper: 3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG.


