TLDR: ReconViaGen is a novel framework that combines the strengths of 3D reconstruction and diffusion-based generation to produce accurate, complete 3D object models from multiple input images. It addresses the limitations of existing methods by injecting reconstruction priors into the generative process: global and local conditioning keeps the generated model both plausibly complete and highly consistent with the input views, and a rendering-aware refinement mechanism enforces pixel-level alignment.
Creating accurate and complete 3D models of objects from multiple images has long been a fundamental challenge in computer vision. Traditional methods often struggle when images have limited overlap, occlusions, or sparse coverage, leading to 3D reconstructions with missing parts, holes, or blurred details. While recent advancements in generative AI, particularly diffusion-based 3D models, can ‘hallucinate’ invisible parts to create plausible complete 3D structures, they often suffer from inconsistency with the actual input images due to their stochastic nature.
A new research paper titled “ReconViaGen: Towards Accurate Multi-View 3D Object Reconstruction via Generation” by Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, and Xiaoguang Han introduces ReconViaGen, a novel framework designed to overcome these limitations. ReconViaGen innovatively integrates the strengths of both 3D reconstruction and diffusion-based generation, aiming for both completeness and high accuracy consistent with input views.
The Core Problem and ReconViaGen’s Solution
The authors identify two key reasons why existing diffusion-based 3D generative methods fail to achieve high consistency: first, they inadequately build and leverage cross-view connections when extracting image features; second, they offer poor control over the iterative denoising process during local detail generation, which can yield fine geometric and texture details that look plausible but are inconsistent with the inputs.
ReconViaGen addresses these issues through a sophisticated three-stage pipeline:
1. Reconstruction-based Conditioning: The framework starts by using a powerful, pre-trained 3D reconstructor (VGGT) to extract rich reconstruction priors from the multi-view input images. These priors are aggregated into two types of conditions: a ‘global geometry condition’ that captures the overall shape, and a set of ‘local per-view conditions’ that capture detailed appearance from each individual view. These conditions guide the subsequent generative process (a conceptual sketch of this aggregation follows this list).
2. Coarse-to-Fine Generation: ReconViaGen employs a state-of-the-art 3D generative model (TRELLIS) that operates in a coarse-to-fine manner. The global geometry condition guides the generation of the object’s coarse structure, ensuring overall accuracy. Subsequently, the local per-view conditions are used to generate fine-grained geometric and textural details, making sure they align with what’s visible in each input image.
3. Rendering-aware Velocity Compensation: To further ensure pixel-level alignment, ReconViaGen introduces a unique refinement mechanism at inference time. This ‘rendering-aware velocity compensation’ actively corrects the diffusion model’s predictions by comparing renderings of the generated 3D model with the actual input images, using similarity metrics to steer the denoising process so that the final 3D model matches the original views in fine detail (a conceptual sketch also appears below).
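To make the first stage more concrete, here is a minimal PyTorch sketch of how per-view features from a frozen reconstructor could be fused across views and aggregated into the two kinds of conditions described above. All names here (`ConditionBuilder`, the attention/pooling choices, dimensions) are illustrative assumptions, not the authors’ actual architecture.

```python
import torch
import torch.nn as nn

class ConditionBuilder(nn.Module):
    """Hypothetical aggregator: turns per-view reconstructor features into
    a global geometry condition and local per-view conditions. A sketch,
    not the paper's implementation."""

    def __init__(self, feat_dim: int = 768, cond_dim: int = 1024):
        super().__init__()
        # Cross-view attention lets tokens from each view attend to all
        # other views, addressing the "connections across views" issue.
        self.cross_view_attn = nn.MultiheadAttention(
            feat_dim, num_heads=8, batch_first=True
        )
        self.global_proj = nn.Linear(feat_dim, cond_dim)  # global shape code
        self.local_proj = nn.Linear(feat_dim, cond_dim)   # per-view tokens

    def forward(self, view_feats: torch.Tensor):
        # view_feats: (num_views, num_tokens, feat_dim), e.g. features from
        # a frozen pre-trained multi-view reconstructor such as VGGT.
        v, t, d = view_feats.shape
        # Flatten all views into one sequence so attention spans views.
        tokens = view_feats.reshape(1, v * t, d)
        fused, _ = self.cross_view_attn(tokens, tokens, tokens)
        fused = fused.reshape(v, t, d)
        # Global condition: pool over all views and tokens into one code.
        global_cond = self.global_proj(fused.mean(dim=(0, 1)))  # (cond_dim,)
        # Local conditions: one fused token sequence per input view.
        local_conds = self.local_proj(fused)                    # (v, t, cond_dim)
        return global_cond, local_conds
```

In this sketch, `global_cond` would condition the coarse structure-generation stage and `local_conds` the detail stage, mirroring the two-level conditioning described above.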
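The velocity-compensation idea can likewise be sketched as a guidance step inside the denoising loop: render the current estimate of the clean 3D latent, compare it to the input views, and nudge the predicted velocity by the gradient of that mismatch. The sketch below assumes a rectified-flow convention (x_t = (1−t)·x0 + t·noise, so x0 ≈ x_t − t·v) and a differentiable `render_fn`; the paper’s exact formulation and similarity metrics (e.g. LPIPS, SSIM) may differ.

```python
import torch
import torch.nn.functional as F

def compensated_velocity(v_pred, x_t, t, render_fn, input_views, scale=0.1):
    """Hypothetical sketch of rendering-aware velocity compensation.

    v_pred:      velocity predicted by the diffusion model at step t
    x_t:         current noisy 3D latent
    render_fn:   assumed differentiable renderer mapping a clean latent to
                 images at the input camera poses
    input_views: the actual input images
    """
    x_t = x_t.detach().requires_grad_(True)
    # Clean-latent estimate implied by the current velocity
    # (rectified-flow convention; the paper may use a different one).
    x0_hat = x_t - t * v_pred
    rendered = render_fn(x0_hat)
    # Photometric consistency loss; perceptual terms could be added here.
    loss = F.mse_loss(rendered, input_views)
    grad = torch.autograd.grad(loss, x_t)[0]
    # Shift the velocity so denoising moves toward renderings that
    # match the input views (larger v lowers x0_hat, hence the + sign).
    return v_pred + scale * grad
```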
Experimental Validation and Impact
Extensive experiments on the challenging Dora-bench and OmniObject3D datasets show that ReconViaGen achieves state-of-the-art performance. It consistently outperforms existing methods in image-reconstruction consistency (PSNR, SSIM, LPIPS), geometric accuracy (Chamfer Distance), and shape completeness (F-score); the geometry metrics are illustrated in the sketch below. Ablation studies confirm the individual contribution of each proposed component: the global geometry condition, the per-view conditions, and the rendering-aware velocity compensation.
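For readers unfamiliar with the geometry metrics, here are minimal reference implementations of Chamfer Distance and F-score over sampled point clouds. The evaluation protocol in the paper (sampling density, normalization, threshold `tau`) may differ; this is just to show what the numbers measure.

```python
import torch

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.01):
    """pred_pts, gt_pts: (N, 3) point clouds sampled from each surface."""
    # Pairwise distances between predicted and ground-truth points.
    d = torch.cdist(pred_pts, gt_pts)       # (N_pred, N_gt)
    d_pred_to_gt = d.min(dim=1).values      # nearest GT point per prediction
    d_gt_to_pred = d.min(dim=0).values      # nearest prediction per GT point
    # Chamfer Distance: symmetric mean nearest-neighbor distance (lower is better).
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    # F-score at threshold tau: harmonic mean of precision (accuracy) and
    # recall (completeness); higher is better.
    precision = (d_pred_to_gt < tau).float().mean()
    recall = (d_gt_to_pred < tau).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, fscore
```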
The ability of ReconViaGen to process an arbitrary number of input images from any viewpoint, including in-the-wild captures and generated multi-view images, highlights its robustness and practical applicability. This work represents a significant step forward in 3D computer vision, offering a reliable route to complete, accurate 3D models from multi-view images, with wide-ranging applications in VR, AR, and 3D modeling. For more technical details, refer to the full research paper: ReconViaGen: Towards Accurate Multi-View 3D Object Reconstruction via Generation.


