TLDR: SceneGen is a new AI model that can create complete 3D scenes, including multiple objects with their geometry, textures, and spatial positions, from just one input image and object masks. It achieves this in a single, efficient feedforward pass without needing complex optimization or asset retrieval. The model also demonstrates improved generation quality when provided with multiple input images, despite being trained solely on single-image inputs, making it a significant advancement for 3D content generation in VR/AR and embodied AI.
The creation of immersive digital environments for applications such as virtual reality (VR), augmented reality (AR), and embodied AI has driven strong interest in 3D content generation. While previous efforts have largely focused on generating individual 3D objects, the harder task of synthesizing entire 3D scenes, complete with multiple objects, accurate geometry, textures, and spatial relationships, has remained a significant challenge.
Existing methods typically fall into two categories: retrieval-based approaches, which use large language models to plan layouts and then pull matching 3D assets from libraries, and two-stage approaches, which first generate individual assets and then refine the scene structure through optimization. Both have clear drawbacks: retrieval-based methods are constrained by the coverage of the asset library, while two-stage pipelines are inefficient and prone to error accumulation during iterative optimization.
Introducing SceneGen: A Novel Approach to 3D Scene Generation
Researchers Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie from Shanghai Jiao Tong University have introduced SceneGen, a groundbreaking framework designed to overcome these challenges. SceneGen is a novel model that takes a single scene image and its corresponding object masks as input and efficiently generates multiple 3D assets with coherent geometry, texture, and spatial arrangement in a single feedforward pass. This means it doesn’t require complex optimization steps or asset retrieval from existing libraries, making it remarkably efficient.
SceneGen’s contributions are significant:
- It simultaneously produces multiple 3D assets with geometry and texture from a single image and object masks, without needing optimization or asset retrieval.
- It features a novel aggregation module that integrates local and global scene information from visual and geometric encoders. Coupled with a position head, this allows for the generation of 3D assets and their relative spatial positions in one pass.
- The framework is directly extensible to multi-image input scenarios, surprisingly improving generation performance even though it’s trained solely on single-image inputs.
- Extensive evaluations confirm its efficiency and robust generation capabilities.
How SceneGen Works
The SceneGen framework operates in three key stages:
First, a **feature extraction module** uses off-the-shelf visual and geometric encoders to extract both asset-level and scene-level features from the input image and masks. This provides a comprehensive understanding of individual objects and the overall scene context.
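To make this stage concrete, here is a minimal PyTorch-style sketch. The encoder modules, tensor shapes, and mask-pooling scheme are illustrative assumptions, not SceneGen's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Extracts scene-level tokens and per-asset pooled features.
    `visual_encoder` / `geometric_encoder` are hypothetical stand-ins for
    the off-the-shelf encoders; both are assumed to return (B, T, C) tokens
    over the same square patch grid."""

    def __init__(self, visual_encoder: nn.Module, geometric_encoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.geometric_encoder = geometric_encoder

    def forward(self, image: torch.Tensor, masks: torch.Tensor):
        # image: (B, 3, H, W); masks: (B, N, H, W), one binary mask per asset
        vis = self.visual_encoder(image)              # (B, T, C)
        geo = self.geometric_encoder(image)           # (B, T, C)
        scene_tokens = torch.cat([vis, geo], dim=-1)  # (B, T, 2C)

        # Downsample each mask to the token grid, then average the tokens
        # inside each mask to get one asset-level feature vector per object.
        side = int(scene_tokens.shape[1] ** 0.5)      # assume a square grid
        m = F.interpolate(masks.float(), size=(side, side)).flatten(2)  # (B, N, T)
        weights = m / m.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        asset_feats = torch.einsum("bnt,btc->bnc", weights, scene_tokens)
        return asset_feats, scene_tokens
```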
Next, a **feature aggregation module** integrates these extracted features. This module includes local attention blocks to refine individual asset details and global attention blocks to incorporate scene context and facilitate interactions between assets. This ensures that the generated objects have plausible geometric topologies and spatial arrangements.
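A hedged sketch of one such aggregation block, using standard multi-head attention; the interleaving order, normalization placement, and context construction are assumptions:

```python
import torch
import torch.nn as nn

class AggregationBlock(nn.Module):
    """Local attention refines each asset's latent tokens in isolation;
    global attention then lets every asset attend to all other assets and
    to the scene tokens, capturing inter-object spatial relations."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, asset_tokens: torch.Tensor, scene_tokens: torch.Tensor):
        # asset_tokens: (B, N, L, D) -- L latent tokens per asset
        # scene_tokens: (B, T, D)
        B, N, L, D = asset_tokens.shape

        # Local attention: each asset attends only to its own tokens.
        x = asset_tokens.reshape(B * N, L, D)
        h = self.norm_local(x)
        x = x + self.local_attn(h, h, h)[0]

        # Global attention: all asset tokens attend to every asset plus
        # the scene context in one joint sequence.
        x = x.reshape(B, N * L, D)
        ctx = torch.cat([self.norm_global(x), self.norm_global(scene_tokens)], dim=1)
        x = x + self.global_attn(self.norm_global(x), ctx, ctx)[0]
        return x.reshape(B, N, L, D)
```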
Finally, an **output module** decodes the aggregated features. It uses a dedicated position head to predict the spatial locations (translation, rotation, and scale) of assets relative to a query asset. Additionally, off-the-shelf sparse-structure and structured-latents decoders are used to generate the geometry and texture of each 3D asset.
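One plausible shape for the position head is a small MLP over pooled asset tokens. The quaternion rotation and isotropic scale parameterization below are our assumptions; the paper only states that translation, rotation, and scale are predicted relative to a query asset:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionHead(nn.Module):
    """Predicts each asset's pose relative to a query asset:
    3-DoF translation + rotation (as a unit quaternion) + isotropic scale."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, 3 + 4 + 1),  # translation (3) + quaternion (4) + scale (1)
        )

    def forward(self, asset_tokens: torch.Tensor):
        # asset_tokens: (B, N, L, D) -- pool latents into one vector per asset
        pooled = asset_tokens.mean(dim=2)        # (B, N, D)
        t, q, s = self.mlp(pooled).split([3, 4, 1], dim=-1)
        q = F.normalize(q, dim=-1)               # keep rotation a unit quaternion
        s = F.softplus(s)                        # keep scale strictly positive
        return t, q, s
```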
SceneGen is trained on the 3D-FUTURE dataset, which contains photorealistic scene renderings with instance masks and asset annotations. The training process uses a composite loss function that ensures accurate asset generation, correct relative spatial arrangements, and physically plausible object placements by minimizing collisions.
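The exact loss terms aren't spelled out here, but a composite objective of roughly this shape, pose regression plus a collision penalty on overlapping bounding boxes, illustrates the idea. The asset-generation loss on geometry and texture is omitted, and the weights and box-based collision formulation are placeholders, not the paper's exact terms:

```python
import torch
import torch.nn.functional as F

def pairwise_overlap_volume(boxes: torch.Tensor) -> torch.Tensor:
    """Total overlap volume across asset pairs.
    boxes: (N, 6) axis-aligned boxes as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = torch.maximum(boxes[:, None, :3], boxes[None, :, :3])  # (N, N, 3)
    hi = torch.minimum(boxes[:, None, 3:], boxes[None, :, 3:])  # (N, N, 3)
    overlap = (hi - lo).clamp(min=0).prod(dim=-1)               # (N, N)
    overlap = overlap - torch.diag_embed(overlap.diagonal())    # drop self-overlap
    return overlap.sum() / 2                                    # count each pair once

def composite_loss(pred_t, gt_t, pred_q, gt_q, pred_s, gt_s, pred_boxes,
                   w_pose: float = 1.0, w_collision: float = 0.1):
    # Pose regression on translation, rotation, and scale...
    pose = (F.mse_loss(pred_t, gt_t)
            + F.mse_loss(pred_q, gt_q)
            + F.mse_loss(pred_s, gt_s))
    # ...plus a penalty that discourages interpenetrating assets.
    collision = pairwise_overlap_volume(pred_boxes)
    return w_pose * pose + w_collision * collision
```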
Performance and Scalability
Quantitative and qualitative evaluations demonstrate that SceneGen significantly outperforms previous methods in terms of both generation quality and efficiency. It can generate textured scenes with four assets in approximately two minutes on a single A100 GPU, offering a strong balance between quality and speed.
Remarkably, despite being trained exclusively on single-image samples, SceneGen exhibits inherent multi-view compatibility. When provided with multiple images of the same scene from different viewpoints, the model can integrate this complementary information to produce 3D assets with more complete geometry and finer texture details, further validating its practicality and scalability.
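One simple way such multi-view compatibility can arise, assuming the global attention context is just a token sequence, is sketched below. This fusion scheme is our guess at the mechanism, not the paper's verified design:

```python
import torch

def multi_view_scene_tokens(extractor, images, masks):
    """Run the single-image feature extractor (see the earlier sketch) on
    each view independently, then concatenate the per-view scene tokens so
    the global attention context simply grows with the number of views."""
    per_view = [extractor(img, msk)[1] for img, msk in zip(images, masks)]
    return torch.cat(per_view, dim=1)  # (B, T_total, D)
```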
While SceneGen represents a significant leap forward, the researchers acknowledge limitations, such as limited generalization to non-indoor scenes and occasional challenges with precise contact relationships between objects. Future work aims to address these by constructing larger, more diverse datasets and incorporating explicit physical priors.
SceneGen offers a novel and efficient solution for high-quality 3D content generation, paving the way for advancements in practical applications across various downstream tasks.