TLDR: TIRE (Track, Inpaint, Resplat) is a novel method for subject-driven 3D and 4D content generation. It addresses the challenge of maintaining a subject’s identity across different viewpoints by first tracking regions needing modification, then progressively inpainting these areas using a personalized 2D model, and finally reconstructing a consistent 3D asset. This three-stage pipeline significantly improves identity preservation and geometry quality compared to existing methods, offering a more personalized and realistic generation experience.
Creating realistic 3D and 4D (3D over time) digital content that truly captures the unique identity of a subject from just a few images or a video has been a significant hurdle in the world of generative AI. While current methods excel at photorealism and efficiency, they often struggle to maintain the specific look and feel of a subject when viewed from different angles or over time. Imagine generating a 3D model of your pet from a single photo, only for its side or back views to appear distorted or with incorrect colors. This challenge, known as ‘subject-driven’ or ‘personalized’ generation, is crucial for enhancing user experience and enabling impactful applications.
Addressing the Identity Preservation Gap
Existing 3D/4D generation techniques, often guided by text prompts or single images, tend to hallucinate the appearance of unobserved viewpoints. This can lead to inconsistencies, such as a cat taking on a bluish tone in regions that were hidden in the input, as seen in the output of some state-of-the-art models. These methods either rely on time-consuming optimization or suffer from systematic errors in color and appearance caused by biases in their training data.
To tackle these issues, researchers Shuhong Zheng, Ashkan Mirzaei, and Igor Gilitschenski have introduced a novel method called TIRE, which stands for Track, Inpaint, Resplat. TIRE is designed to significantly improve identity preservation in 3D/4D generation by progressively infilling textures.
TIRE: A Three-Stage Approach to Subject-Driven Generation
TIRE takes an initial 3D asset, typically generated by an existing model, and refines it to ensure the subject’s identity is maintained across all views. The method is broken down into three coordinated stages:
1. Track: Identifying Areas for Infilling
The first step identifies which regions in the unobserved viewpoints need to be modified or ‘infilled’. TIRE does this by treating a sequence of rendered multi-view observations as a video. It then uses a video tracking model, CoTracker, to establish correspondences between the original ‘source view’ (the input image/video) and the ‘target views’ (other angles). Notably, TIRE relies on ‘backward tracking’: instead of tracking from the source view outwards, it tracks from the target views back to the source. This produces more accurate and better-shaped infilling masks, avoiding the grainy or suboptimal masks that forward tracking can yield. In effect, this stage uses powerful 2D video tracking tools to solve a 3D problem.
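To make the idea of backward tracking concrete, here is a minimal sketch of how tracking results could be turned into an infilling mask. It assumes a tracker has already been run from a target view back to the source view and has returned one visibility flag per queried pixel. The function name, the inputs, and the rule "not visible in the source means infill" are illustrative assumptions, not the authors' implementation or CoTracker's API.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col) pixel position in a target view


def infill_mask_from_backward_tracks(
    height: int,
    width: int,
    query_points: List[Point],
    visible_in_source: List[bool],
) -> List[List[bool]]:
    """Mark target-view pixels whose backward track does not reach the source view.

    query_points: pixels sampled in the target view (hypothetical input).
    visible_in_source: one flag per point, assumed to come from a tracker
        run from the target view back to the source view.
    """
    if len(query_points) != len(visible_in_source):
        raise ValueError("one visibility flag is required per query point")

    # Start with an empty mask: nothing needs infilling yet.
    mask = [[False] * width for _ in range(height)]

    for (row, col), visible in zip(query_points, visible_in_source):
        if not visible:
            # Assumption: the source never observed this pixel, so infill it.
            mask[row][col] = True
    return mask


# Toy 2x2 target view: the right column was hidden in the source view.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
flags = [True, False, True, False]
mask = infill_mask_from_backward_tracks(2, 2, points, flags)
print(mask)
```

Because the queries are defined on the target view's own pixel grid, every target pixel receives a decision, which is one intuition for why masks built this way can be denser than masks built by pushing source points forward.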
2. Inpaint: Filling Gaps While Preserving Identity
Once the masks for infilling are identified, the Inpaint stage takes over. This stage faces two main challenges: faithfully preserving the subject’s identity and effectively inpainting views far from the original source. TIRE addresses these by:
- Personalizing a pre-trained 2D inpainting model (like Stable Diffusion) to be ‘subject-driven’ using LoRA weights, ensuring the new content matches the subject’s identity.
- Employing a ‘progressive’ inpainting strategy. It starts by inpainting viewpoints close to the source view, using these as ‘anchor viewpoints’ to guide the inpainting of progressively farther views. This reduces the difficulty for the model when dealing with significantly different perspectives. For instance, it might first inpaint views at ±20 degrees, then use those refined views to help with ±90 degrees, and so on.
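The progressive strategy can be sketched as a simple scheduling problem. The snippet below orders views by angular distance from the source and pairs each one with the closest view that has already been processed. The function name and the nearest-anchor rule are assumptions made for illustration; the actual anchor selection in TIRE may differ, and the inpainting call itself is omitted.

```python
from typing import List, Tuple


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def progressive_schedule(
    view_angles: List[float], source_angle: float = 0.0
) -> List[Tuple[float, float]]:
    """Return (view, anchor) pairs, ordered from nearest to farthest from the source."""
    processed = [source_angle]  # the source view needs no inpainting
    schedule = []
    ordered = sorted(view_angles, key=lambda a: angular_distance(a, source_angle))
    for angle in ordered:
        # Assumption: the anchor is the closest view that is already finished.
        anchor = min(processed, key=lambda p: angular_distance(angle, p))
        schedule.append((angle, anchor))
        processed.append(angle)
    return schedule


schedule = progressive_schedule([90.0, -20.0, 20.0, -90.0, 180.0])
for view, anchor in schedule:
    print(f"inpaint view {view:+.0f} deg using anchor {anchor:+.0f} deg")
```

In this toy run the ±20 degree views are anchored on the source, and the ±90 degree views are anchored on the ±20 degree views, which mirrors the example given above.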
3. Resplat: Reconstructing Consistent 3D
The final stage, Resplat, is responsible for taking the refined 2D observations and projecting them back into a consistent 3D representation. Since the inpainting process happens on individual 2D frames, there’s a risk of introducing inconsistencies across views. TIRE mitigates this by using a multi-view diffusion model to refine the consistency of these observations before ‘resplatting’ the pixels into 3D Gaussians. This mask-aware refinement process ensures that the final 3D/4D asset is not only identity-preserving but also geometrically sound, with fewer artifacts.
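One plausible reading of ‘mask-aware’ is that only the infilled regions are allowed to change, while pixels the tracker matched to the source are left alone. The sketch below shows that compositing step on toy grayscale images. It is an assumption about what mask-awareness could mean here, with hypothetical names; it does not reproduce the multi-view diffusion model or the Gaussian resplatting step.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image as rows of pixel values


def mask_aware_composite(
    original: Image, refined: Image, mask: List[List[bool]]
) -> Image:
    """Take refined pixels where mask is True and keep original pixels elsewhere."""
    return [
        [new if use_new else old for old, new, use_new in zip(o_row, r_row, m_row)]
        for o_row, r_row, m_row in zip(original, refined, mask)
    ]


original = [[10, 10], [10, 10]]       # rendered view before refinement
refined = [[99, 98], [97, 96]]        # hypothetical output of a refinement model
mask = [[False, True], [False, True]]  # True marks pixels selected for infilling

result = mask_aware_composite(original, refined, mask)
print(result)
```

Restricting updates to the masked region keeps already-trusted pixels fixed, so any 3D reconstruction built from these views only has to reconcile the newly generated content.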
Demonstrated Superiority
Extensive experiments show that TIRE significantly outperforms state-of-the-art methods in identity preservation for both 3D and 4D generation. Qualitative comparisons reveal that TIRE-generated assets maintain a more faithful appearance of the subject across different viewpoints and also exhibit enhanced geometry quality with fewer ghosting artifacts. The method’s general applicability means it can be integrated with various existing 3D/4D generation pipelines, acting as a valuable ‘plug-in’ solution to improve personalization.
A user study with 18 volunteers also showed a subjective preference for TIRE’s results in terms of overall quality, even though participants were not told that the focus was subject-driven generation. In addition, evaluations using vision-language models (VLMs) rated TIRE higher on subject consistency across aspects such as shape, color, texture, and facial features.
For a deeper dive into the technical specifics, see the full research paper.
TIRE represents an important step forward in making 3D and 4D content creation more personalized and accurate, allowing for greater creative expression and more realistic digital subjects.


