TLDR: SceneGen is a new AI model that can create complete 3D scenes, including multiple objects with their geometry, textures, and spatial positions, from just one input image and object masks. It achieves this in a single, efficient feedforward pass without needing complex optimization or asset retrieval. The model also demonstrates improved generation quality when provided with multiple input images, despite being trained solely on single-image inputs, making it a significant advancement for 3D content generation in VR/AR and embodied AI.
The creation of immersive digital environments for applications such as virtual reality (VR), augmented reality (AR), and embodied AI has driven strong interest in 3D content generation. While previous efforts have largely focused on generating individual 3D objects, the harder task of synthesizing entire 3D scenes, complete with multiple objects, accurate geometry, textures, and spatial relationships, has remained a significant challenge.
Existing methods typically fall into two categories: retrieval-based approaches, which use large language models to plan layouts and then pull matching 3D assets from libraries, and two-stage approaches, which first generate individual assets and then refine the scene structure through optimization. Both have clear drawbacks: retrieval-based methods are constrained by the coverage of the asset library, while two-stage pipelines are inefficient and prone to error accumulation during iterative optimization.
Introducing SceneGen: A Novel Approach to 3D Scene Generation
Researchers Yanxu Meng, Haoning Wu, Ya Zhang, and Weidi Xie from Shanghai Jiao Tong University have introduced SceneGen, a groundbreaking framework designed to overcome these challenges. SceneGen is a novel model that takes a single scene image and its corresponding object masks as input and efficiently generates multiple 3D assets with coherent geometry, texture, and spatial arrangement in a single feedforward pass. This means it doesn’t require complex optimization steps or asset retrieval from existing libraries, making it remarkably efficient.
SceneGen’s contributions are significant:
- It simultaneously produces multiple 3D assets with geometry and texture from a single image and object masks, without needing optimization or asset retrieval.
- It features a novel aggregation module that integrates local and global scene information from visual and geometric encoders. Coupled with a position head, this allows for the generation of 3D assets and their relative spatial positions in one pass.
- The framework is directly extensible to multi-image input scenarios, surprisingly improving generation performance even though it’s trained solely on single-image inputs.
- Extensive evaluations confirm its efficiency and robust generation capabilities.
How SceneGen Works
The SceneGen framework operates in three key stages:
First, a **feature extraction module** uses off-the-shelf visual and geometric encoders to extract both asset-level and scene-level features from the input image and masks. This provides a comprehensive understanding of individual objects and the overall scene context.
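To make this stage concrete, here is a minimal PyTorch-style sketch. The encoder modules, tensor shapes, and mask-pooling scheme are illustrative assumptions, not SceneGen's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Extracts scene-level tokens and per-asset pooled features.
    `visual_encoder` / `geometric_encoder` are hypothetical stand-ins for
    the off-the-shelf encoders; both are assumed to return (B, T, C) tokens
    over the same square patch grid."""

    def __init__(self, visual_encoder: nn.Module, geometric_encoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.geometric_encoder = geometric_encoder

    def forward(self, image: torch.Tensor, masks: torch.Tensor):
        # image: (B, 3, H, W); masks: (B, N, H, W), one binary mask per asset
        vis = self.visual_encoder(image)              # (B, T, C)
        geo = self.geometric_encoder(image)           # (B, T, C)
        scene_tokens = torch.cat([vis, geo], dim=-1)  # (B, T, 2C)

        # Downsample each mask to the token grid, then average the tokens
        # inside each mask to get one asset-level feature vector per object.
        side = int(scene_tokens.shape[1] ** 0.5)      # assume a square grid
        m = F.interpolate(masks.float(), size=(side, side)).flatten(2)  # (B, N, T)
        weights = m / m.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        asset_feats = torch.einsum("bnt,btc->bnc", weights, scene_tokens)
        return asset_feats, scene_tokens
```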
Next, a **feature aggregation module** integrates these extracted features. This module includes local attention blocks to refine individual asset details and global attention blocks to incorporate scene context and facilitate interactions between assets. This ensures that the generated objects have plausible geometric topologies and spatial arrangements.
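A hedged sketch of one such aggregation block, using standard multi-head attention; the interleaving order, normalization placement, and context construction are assumptions:

```python
import torch
import torch.nn as nn

class AggregationBlock(nn.Module):
    """Local attention refines each asset's latent tokens in isolation;
    global attention then lets every asset attend to all other assets and
    to the scene tokens, capturing inter-object spatial relations."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, asset_tokens: torch.Tensor, scene_tokens: torch.Tensor):
        # asset_tokens: (B, N, L, D) -- L latent tokens per asset
        # scene_tokens: (B, T, D)
        B, N, L, D = asset_tokens.shape

        # Local attention: each asset attends only to its own tokens.
        x = asset_tokens.reshape(B * N, L, D)
        h = self.norm_local(x)
        x = x + self.local_attn(h, h, h)[0]

        # Global attention: all asset tokens attend to every asset plus
        # the scene context in one joint sequence.
        x = x.reshape(B, N * L, D)
        ctx = torch.cat([self.norm_global(x), self.norm_global(scene_tokens)], dim=1)
        x = x + self.global_attn(self.norm_global(x), ctx, ctx)[0]
        return x.reshape(B, N, L, D)
```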
Finally, an **output module** decodes the aggregated features. It uses a dedicated position head to predict the spatial locations (translation, rotation, and scale) of assets relative to a query asset. Additionally, off-the-shelf sparse-structure and structured-latents decoders are used to generate the geometry and texture of each 3D asset.
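One plausible shape for the position head is a small MLP over pooled asset tokens. The quaternion rotation and isotropic scale parameterization below are our assumptions; the paper only states that translation, rotation, and scale are predicted relative to a query asset:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionHead(nn.Module):
    """Predicts each asset's pose relative to a query asset:
    3-DoF translation + rotation (as a unit quaternion) + isotropic scale."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, 3 + 4 + 1),  # translation (3) + quaternion (4) + scale (1)
        )

    def forward(self, asset_tokens: torch.Tensor):
        # asset_tokens: (B, N, L, D) -- pool latents into one vector per asset
        pooled = asset_tokens.mean(dim=2)        # (B, N, D)
        t, q, s = self.mlp(pooled).split([3, 4, 1], dim=-1)
        q = F.normalize(q, dim=-1)               # keep rotation a unit quaternion
        s = F.softplus(s)                        # keep scale strictly positive
        return t, q, s
```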
SceneGen is trained on the 3D-FUTURE dataset, which contains photorealistic scene renderings with instance masks and asset annotations. The training process uses a composite loss function that ensures accurate asset generation, correct relative spatial arrangements, and physically plausible object placements by minimizing collisions.
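The exact loss terms aren't spelled out here, but a composite objective of roughly this shape, pose regression plus a collision penalty on overlapping bounding boxes, illustrates the idea. The asset-generation loss on geometry and texture is omitted, and the weights and box-based collision formulation are placeholders, not the paper's exact terms:

```python
import torch
import torch.nn.functional as F

def pairwise_overlap_volume(boxes: torch.Tensor) -> torch.Tensor:
    """Total overlap volume across asset pairs.
    boxes: (N, 6) axis-aligned boxes as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = torch.maximum(boxes[:, None, :3], boxes[None, :, :3])  # (N, N, 3)
    hi = torch.minimum(boxes[:, None, 3:], boxes[None, :, 3:])  # (N, N, 3)
    overlap = (hi - lo).clamp(min=0).prod(dim=-1)               # (N, N)
    overlap = overlap - torch.diag_embed(overlap.diagonal())    # drop self-overlap
    return overlap.sum() / 2                                    # count each pair once

def composite_loss(pred_t, gt_t, pred_q, gt_q, pred_s, gt_s, pred_boxes,
                   w_pose: float = 1.0, w_collision: float = 0.1):
    # Pose regression on translation, rotation, and scale...
    pose = (F.mse_loss(pred_t, gt_t)
            + F.mse_loss(pred_q, gt_q)
            + F.mse_loss(pred_s, gt_s))
    # ...plus a penalty that discourages interpenetrating assets.
    collision = pairwise_overlap_volume(pred_boxes)
    return w_pose * pose + w_collision * collision
```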
Performance and Scalability
Quantitative and qualitative evaluations demonstrate that SceneGen significantly outperforms previous methods in terms of both generation quality and efficiency. It can generate textured scenes with four assets in approximately two minutes on a single A100 GPU, offering a strong balance between quality and speed.
Remarkably, despite being trained exclusively on single-image samples, SceneGen exhibits inherent multi-view compatibility. When provided with multiple images of the same scene from different viewpoints, the model can integrate this complementary information to produce 3D assets with more complete geometry and finer texture details, further validating its practicality and scalability.
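One simple way such multi-view compatibility can arise, assuming the global attention context is just a token sequence, is sketched below. This fusion scheme is our guess at the mechanism, not the paper's verified design:

```python
import torch

def multi_view_scene_tokens(extractor, images, masks):
    """Run the single-image feature extractor (see the earlier sketch) on
    each view independently, then concatenate the per-view scene tokens so
    the global attention context simply grows with the number of views."""
    per_view = [extractor(img, msk)[1] for img, msk in zip(images, masks)]
    return torch.cat(per_view, dim=1)  # (B, T_total, D)
```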
While SceneGen represents a significant leap forward, the researchers acknowledge limitations, such as limited generalization to non-indoor scenes and occasional challenges with precise contact relationships between objects. Future work aims to address these by constructing larger, more diverse datasets and incorporating explicit physical priors.
SceneGen offers a novel and efficient solution for high-quality 3D content generation, paving the way for advancements in practical applications across various downstream tasks.