Crafting Immersive Experiences: Generating Viewpoint-Specific Audio-Visuals from Full 360-Degree Environments

TLDR: Con360-AV is a novel framework for controllable audio-visual generation at a specific viewpoint within a 360-degree environment. Existing models generate a field of view with no knowledge of what lies outside it. Con360-AV addresses this by conditioning generation on panoramic saliency maps, bounding-box-aware signed distance maps, and a global scene caption. This allows the model to generate video and audio that are spatially aware and coherently influenced by off-camera events, enhancing realism and immersion in applications like VR and interactive media.

The world of immersive media, particularly 360-degree videos, is constantly evolving, and with it, the demand for more realistic and controllable audio-visual experiences. While generative AI models have made significant strides in creating sounding videos, a key challenge has remained: how to generate content for a specific viewpoint that is still aware of events happening outside that view, within the broader 360-degree environment.

Imagine watching a 360-degree video of a band playing in a circle. If your viewpoint focuses on the guitarist, current models might struggle to accurately spatialize the sound of the drummer, who is off-screen. This limitation restricts the creation of truly immersive experiences in which off-camera events coherently influence what you see and hear.

Introducing Con360-AV: A New Approach to Immersive Audio-Visual Generation

Researchers from Sapienza University of Rome have introduced a novel framework called Con360-AV (Controllable 360° context Audio-Visual generation) to address this gap. The work is presented as the first framework for controllable audio-visual generation that leverages the full 360-degree spatial information of a scene. The core idea is to provide a diffusion model with conditioning signals derived from the entire panoramic space, enabling it to generate viewpoint-specific video and audio that are influenced by the unseen environmental context.

How Con360-AV Works: Three Key Conditioning Signals

Con360-AV integrates three distinct types of information to achieve its spatial awareness and controllability:

1. 360° Saliency Maps: These maps identify visually important regions across the entire 360-degree scene. By highlighting significant objects and actions, whether they are currently in the target viewpoint or off-screen, the model gains an understanding of where attention should be directed.

2. Bounding Box-Aware Signed Distance (BASD) Maps: This signal precisely defines the target viewpoint. By identifying centroids of prominent regions from the saliency maps and projecting their bounding boxes onto the 360-degree frame, a BASD map is computed. This map provides a geometric guide, telling the model the exact location and boundaries of the content it needs to generate for the chosen viewpoint.

3. 360° Scene Caption: To provide high-level semantic context, a detailed description of the entire 360-degree scene is generated. This involves processing multiple perspective views from the 360-degree video and using a large language model to synthesize a single, coherent caption that summarizes the complete environment and its dynamics over time.
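To make the second signal more concrete, here is a minimal sketch of a signed distance map for a single bounding box on an equirectangular frame. It is an illustration under stated assumptions, not the authors' implementation: the function names `box_around_peak` and `box_signed_distance_map` are hypothetical, the saliency peak stands in for the centroid of a prominent region, the box is treated as an axis-aligned rectangle in pixel coordinates, and the sign convention (negative inside, positive outside) is assumed. The one detail specific to 360-degree frames is that horizontal distances wrap around, because the left and right edges of the panorama are adjacent on the sphere.

```python
import math


def box_around_peak(saliency, half_w, half_h):
    """Return (x0, y0, x1, y1) centred on the most salient pixel.

    Assumption: the peak is a stand-in for the centroid of a prominent
    region. The box may extend past the horizontal seam of the frame.
    """
    best_value, best_x, best_y = float("-inf"), 0, 0
    for y, row in enumerate(saliency):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_x, best_y = value, x, y
    return (best_x - half_w, best_y - half_h, best_x + half_w, best_y + half_h)


def box_signed_distance_map(width, height, box):
    """Signed distance in pixels from every pixel to an axis-aligned box.

    Assumed convention: negative inside the box, zero on its border,
    positive outside. Horizontal offsets are wrapped modulo the width.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0  # box centre
    hx, hy = (x1 - x0) / 2.0, (y1 - y0) / 2.0  # half extents
    sdf = []
    for y in range(height):
        row = []
        for x in range(width):
            dx = abs(x - cx) % width
            dx = min(dx, width - dx)  # shortest way around the panorama
            dy = abs(y - cy)          # latitude does not wrap
            qx, qy = dx - hx, dy - hy
            outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
            inside = min(max(qx, qy), 0.0)
            row.append(outside + inside)
        sdf.append(row)
    return sdf


if __name__ == "__main__":
    # Toy 16x8 saliency map with one salient pixel next to the right edge.
    saliency = [[0.0] * 16 for _ in range(8)]
    saliency[3][15] = 1.0
    box = box_around_peak(saliency, half_w=2, half_h=1)
    sdf = box_signed_distance_map(16, 8, box)
    for row in sdf:
        print(" ".join(f"{d:5.1f}" for d in row))
```

In the toy run, the box straddles the seam, so pixels at the far left of the frame receive small distances even though their column indices are far from the box centre. A real perspective viewport projects to a curved region on the panorama rather than a rectangle, so this sketch only conveys the idea of the geometric guide.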

These three conditioning signals are then integrated into a dual U-Net architecture, which consists of pre-trained audio and video diffusion models. A new, trainable control module processes these signals and injects them into the generation process, ensuring that the generated video and audio for a specific field-of-view remain consistent with the broader, off-screen environment. Temporal synchronization between audio and video is also maintained throughout this process.
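The article does not describe how the control module injects its output into the pre-trained networks. One common pattern, used by ControlNet-style adapters, is to add a projected control feature to a hidden activation through a projection whose weights start at zero, so the pre-trained model behaves exactly as before until training moves those weights. The sketch below illustrates that pattern only. Whether Con360-AV does this is an assumption, and the names `zero_init_projection`, `project`, and `inject_control` are hypothetical.

```python
def zero_init_projection(in_dim, out_dim):
    """Weight matrix (out_dim x in_dim) initialised to zeros."""
    return [[0.0] * in_dim for _ in range(out_dim)]


def project(weights, vec):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def inject_control(hidden, control, weights, scale=1.0):
    """Add the projected control feature to a hidden activation.

    With zero weights the residual is zero and `hidden` passes through
    unchanged, which preserves the pre-trained behaviour at the start.
    """
    residual = project(weights, control)
    return [h + scale * r for h, r in zip(hidden, residual)]


if __name__ == "__main__":
    hidden = [0.5, -1.0, 2.0]   # toy activation from a pre-trained layer
    control = [1.0, 0.0]        # toy feature from the control module
    weights = zero_init_projection(in_dim=2, out_dim=3)

    # Before any training step, the conditioning has no effect.
    print(inject_control(hidden, control, weights))

    # Simulate a training update to one weight.
    weights[0][0] = 0.25
    print(inject_control(hidden, control, weights))
```

The design choice this illustrates is that a new conditioning path can be attached to an existing diffusion model without disturbing its outputs at initialization, which is one reason such adapters are popular for adding control to pre-trained generators.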

Demonstrated Effectiveness and Future Potential

The effectiveness of Con360-AV was demonstrated through experiments conducted on the Sphere360 dataset, a large collection of 360-degree videos with corresponding audio. The results showed significant improvements over a baseline model that lacked spatial control. Con360-AV achieved better spatial alignment between generated content and the target viewpoint, as well as improved audio fidelity and overall video quality.

Crucially, the model proved capable of consistently taking off-screen spatial information into account during generation, producing semantically aligned results for different viewpoints of the same scene. This means that if a sound source is off-screen, the generated audio will still accurately reflect its presence and spatial characteristics.

The implications of Con360-AV are far-reaching. It could pave the way for enriching virtual reality (VR) experiences with realistic off-camera sounds, simulating ambient audio for 360-degree videos more accurately, and enabling interactive storytelling where the soundscape dynamically adapts to the user’s viewpoint. Furthermore, by conditioning generation on the complete 360-degree context, the method gains explicit control over events at the viewpoint boundary, such as characters entering or exiting the scene, while preserving spatio-temporal coherence across dynamic camera movements.

For more technical details, refer to the full research paper.
