TLDR: URDF-Anything is a new framework that uses a 3D Multimodal Large Language Model (MLLM) to automatically create functional digital twins of articulated objects from visual observations. It jointly predicts geometric part segmentation and kinematic parameters, leveraging a special ‘[SEG]’ token mechanism for precise results. The method significantly outperforms existing approaches in segmentation, parameter prediction, and physical executability, demonstrating strong generalization capabilities for robotic simulation and embodied AI.
Creating accurate digital versions of real-world objects, especially those with moving parts like doors or drawers, is crucial for training robots and building smart AI systems. Traditionally, this process has been very time-consuming, often requiring manual modeling or complex multi-step procedures. However, a new framework called URDF-Anything is changing this by offering an automated, end-to-end solution.
Developed by researchers including Zhe Li, Xiang Bai, and Shanghang Zhang, URDF-Anything introduces an innovative approach to automatically reconstructing articulated objects. It takes visual information, such as images, and transforms it into a functional digital twin in URDF (Unified Robot Description Format), a format widely used in robotics for defining the structure and movement of objects in simulations.
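For readers unfamiliar with the format, here is a minimal hand-written URDF for a cabinet with a single hinged door (an illustrative example of the target format, not output from the paper), embedded in a short Python snippet that checks it is well-formed XML:

```python
import xml.etree.ElementTree as ET

# Minimal URDF: two rigid links connected by one revolute (hinge) joint.
CABINET_URDF = """
<robot name="cabinet">
  <link name="base"/>
  <link name="door"/>
  <joint name="door_hinge" type="revolute">
    <parent link="base"/>
    <child link="door"/>
    <origin xyz="0.3 0 0.45" rpy="0 0 0"/>
    <axis xyz="0 0 1"/>
    <limit lower="0.0" upper="1.57" effort="10" velocity="1"/>
  </joint>
</robot>
"""

robot = ET.fromstring(CABINET_URDF)  # raises ParseError if the XML is malformed
print(robot.get("name"), "->", [j.get("type") for j in robot.iter("joint")])
```

Everything URDF-Anything must predict is visible here: the links, the joint type, its origin and axis, and its motion limits.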
The core of URDF-Anything is a sophisticated 3D Multimodal Large Language Model (MLLM) that can understand both visual data (such as 3D point clouds generated from images) and text instructions. Unlike previous methods that separate the tasks of identifying object parts and estimating how they move, URDF-Anything tackles both simultaneously: it uses an autoregressive prediction framework, generating its output step by step, to jointly optimize the segmentation of the object’s geometry and the prediction of its kinematic parameters.
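This blog summary does not spell out the paper’s exact loss functions, but a joint objective of this kind is typically a sum of a next-token loss over the symbolic output and a mask loss over the point cloud. A minimal PyTorch sketch (the names, tensor shapes, and the weighting term `lambda_seg` are our assumptions):

```python
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, pred_masks, gt_masks, lambda_seg=1.0):
    # Autoregressive next-token loss over the structured URDF output
    # (joint types, axes, limits, link names, and [SEG] placeholders).
    l_text = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),  # (B*T, vocab)
        text_targets.view(-1),                       # (B*T,)
        ignore_index=-100,                           # skip prompt/pad positions
    )
    # Per-point segmentation loss for masks decoded from [SEG] tokens;
    # gt_masks holds float {0, 1} labels with the same shape as pred_masks.
    l_seg = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    return l_text + lambda_seg * l_seg
```

Optimizing both terms through the same model is what lets the segmentation and the kinematic parameters inform each other during training.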
A key innovation in this framework is a special mechanism involving a ‘[SEG]’ token. This token allows the MLLM to interact directly with the 3D point cloud features. As the model predicts the symbolic structure of an object (such as link names and joint types), it also emits these ‘[SEG]’ tokens, which act as markers guiding the system to segment the point cloud into individual parts. This tight coupling keeps the predicted motion parameters consistent with the reconstructed geometry of the object’s parts.
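Mechanically, this resembles the ‘[SEG]’-token design used in prior segmentation-capable MLLMs (e.g., LISA): the decoder’s hidden state at each ‘[SEG]’ position is projected into the point-feature space and matched against per-point features to produce a mask. A hedged PyTorch sketch of that decoding step (the tensor shapes and projection layer are assumptions, not the paper’s exact architecture):

```python
import torch

def decode_part_masks(hidden_states, output_ids, point_feats, proj, seg_token_id):
    """Turn each generated [SEG] token into a per-point mask over the cloud.

    hidden_states: (T, D)  decoder hidden state for each generated token
    output_ids:    (T,)    generated token ids
    point_feats:   (N, D') features for N points from the 3D encoder
    proj:          nn.Linear(D, D') mapping [SEG] embeddings into point space
    """
    seg_positions = (output_ids == seg_token_id).nonzero(as_tuple=True)[0]
    queries = proj(hidden_states[seg_positions])   # (K, D'), one query per part
    mask_logits = queries @ point_feats.T          # (K, N) point-query similarity
    return mask_logits.sigmoid() > 0.5             # boolean mask per part
```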
The process begins by converting visual observations (single or multiple images) into a dense 3D point cloud. This point cloud, which represents the entire object, is then fed into the 3D MLLM along with text instructions. The MLLM then generates a structured output that includes all the necessary URDF components: the type of joints (e.g., revolute for rotation, prismatic for sliding), their positions and orientations, and how different parts are connected. Simultaneously, the ‘[SEG]’ tokens enable the geometric segmentation of the object into its distinct links.
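To make that coupling concrete, the generated sequence might look something like the following, where each link’s geometry slot is a ‘[SEG]’ token that is later decoded into a point-cloud mask (a hypothetical schema for illustration; the paper’s exact output grammar may differ):

```python
# Hypothetical generated sequence for a two-part cabinet (illustrative only).
prediction = (
    "link name=base geometry=[SEG]\n"
    "link name=door geometry=[SEG]\n"
    "joint name=door_hinge type=revolute parent=base child=door "
    "origin_xyz=0.3 0 0.45 axis_xyz=0 0 1 limit=0.0 1.57"
)
```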
Finally, the segmented point clouds for each part are converted into 3D mesh models, and all the predicted kinematic information is assembled into a complete URDF XML file. This file can then be directly used in physics simulators, allowing for realistic robotic training and embodied AI world building.
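As a sketch of that final assembly step, the snippet below builds a URDF XML tree from predicted parameters using Python’s standard library (the dictionary schema and file names are our own; the mesh files are assumed to come from a separate surface-reconstruction step over the segmented points):

```python
import xml.etree.ElementTree as ET

def build_urdf(robot_name, links, joints):
    """Assemble a URDF tree from predicted parts and kinematic parameters."""
    robot = ET.Element("robot", name=robot_name)
    for link in links:
        l = ET.SubElement(robot, "link", name=link["name"])
        geom = ET.SubElement(ET.SubElement(l, "visual"), "geometry")
        ET.SubElement(geom, "mesh", filename=link["mesh"])  # mesh per segmented part
    for j in joints:
        joint = ET.SubElement(robot, "joint", name=j["name"], type=j["type"])
        ET.SubElement(joint, "parent", link=j["parent"])
        ET.SubElement(joint, "child", link=j["child"])
        ET.SubElement(joint, "origin", xyz=j["origin"], rpy="0 0 0")
        if j["type"] in ("revolute", "prismatic"):
            ET.SubElement(joint, "axis", xyz=j["axis"])
            ET.SubElement(joint, "limit", lower=str(j["lower"]),
                          upper=str(j["upper"]),
                          effort="10", velocity="1")  # placeholder dynamics limits
    return ET.ElementTree(robot)

tree = build_urdf(
    "cabinet",
    links=[{"name": "base", "mesh": "meshes/base.obj"},
           {"name": "door", "mesh": "meshes/door.obj"}],
    joints=[{"name": "door_hinge", "type": "revolute", "parent": "base",
             "child": "door", "origin": "0.3 0 0.45", "axis": "0 0 1",
             "lower": 0.0, "upper": 1.57}],
)
ET.indent(tree)          # pretty-print (Python 3.9+)
tree.write("cabinet.urdf")
```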
Experiments conducted on both simulated and real-world datasets show that URDF-Anything significantly outperforms existing methods. It achieved a 17% improvement in geometric segmentation accuracy (mIoU) and reduced kinematic parameter prediction errors by an average of 29%. Crucially, the digital twins it generates were 50% more physically executable in simulators than those from baselines, meaning they could be loaded and actuated as intended far more often. The framework also generalized well, performing strongly even on objects it had not seen during training.
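Physical executability can be sanity-checked by loading a generated file in a physics engine and driving its joints, for example like this (PyBullet is our choice of simulator for illustration; the paper’s evaluation setup may differ):

```python
import pybullet as p

# Quick executability check: a generated URDF is usable only if it loads
# and its joints can actually be driven.
client = p.connect(p.DIRECT)                 # headless physics server
body = p.loadURDF("cabinet.urdf", useFixedBase=True)
for idx in range(p.getNumJoints(body)):
    name, jtype = p.getJointInfo(body, idx)[1:3]
    print(name.decode(), "type:", jtype)     # e.g. door_hinge type: 0 (revolute)
    p.setJointMotorControl2(body, idx, p.POSITION_CONTROL, targetPosition=0.5)
for _ in range(240):                         # simulate one second at 240 Hz
    p.stepSimulation()
p.disconnect(client)
```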
This work represents a significant step forward in automating the creation of digital twins for articulated objects. By providing an efficient and robust end-to-end solution, URDF-Anything makes it easier to transfer insights from simulation to real-world robotic applications, paving the way for more advanced and capable embodied AI systems. You can find the full research paper here.