Crafting Immersive Experiences: Generating Viewpoint-Specific Audio-Visuals from Full 360-Degree Environments

TLDR: Con360-AV is a novel framework that enables controllable audio-visual generation for specific viewpoints within a 360-degree environment. It addresses a limitation of existing models, which generate a viewpoint without awareness of what happens outside it, by conditioning on panoramic saliency maps, bounding-box-aware signed distance maps, and global scene captions. This allows the model to generate video and audio that are spatially aware and coherently influenced by off-camera events, enhancing realism and immersion in applications like VR and interactive media.

The world of immersive media, particularly 360-degree video, is constantly evolving, and with it the demand for more realistic and controllable audio-visual experiences. While generative AI models have made significant strides in creating videos with sound, a key challenge has remained: how to generate content for a specific viewpoint that is still aware of events happening outside that view, within the broader 360-degree environment.

Imagine watching a 360-degree video of a band playing in a circle. If your viewpoint focuses on the guitarist, current models might struggle to accurately spatialize the sound of the drummer, who is off-screen. This limitation restricts the creation of truly immersive experiences where off-camera events coherently influence what you see and hear.

Introducing Con360-AV: A New Approach to Immersive Audio-Visual Generation

Researchers from Sapienza University of Rome have introduced a novel framework called Con360-AV (Controllable 360° context Audio-Visual generation) to address this gap. This work is the first of its kind to offer a framework for controllable audio-visual generation that leverages the full 360-degree spatial information of a scene. The core idea is to provide a diffusion model with conditioning signals derived from the entire panoramic space, enabling it to generate viewpoint-specific video and audio that are influenced by the unseen environmental context.

How Con360-AV Works: Three Key Conditioning Signals

Con360-AV integrates three distinct types of information to achieve its spatial awareness and controllability:

1. 360° Saliency Maps: These maps identify visually important regions across the entire 360-degree scene. By highlighting significant objects and actions, whether they are currently in the target viewpoint or off-screen, the model gains an understanding of where attention should be directed.

2. Bounding Box-Aware Signed Distance (BASD) Maps: This signal precisely defines the target viewpoint. The centroids of prominent regions are identified from the saliency maps, their bounding boxes are projected onto the 360-degree frame, and a BASD map is computed from the result. This map provides a geometric guide, telling the model the exact location and boundaries of the content it needs to generate for the chosen viewpoint.

3. 360° Scene Caption: To provide high-level semantic context, a detailed description of the entire 360-degree scene is generated. This involves processing multiple perspective views from the 360-degree video and using a large language model to synthesize a single, coherent caption that summarizes the complete environment and its dynamics over time.
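To make the second signal more concrete, here is a minimal sketch of a signed distance map for a single bounding box on a panoramic frame. It is an illustration under stated assumptions, not the authors' implementation: the function name `box_signed_distance_map`, the axis-aligned box in equirectangular pixel coordinates, the pixel-unit distances, and the negative-inside sign convention are all choices made here for clarity. The article does not specify how Con360-AV computes or normalizes its BASD maps.

```python
import numpy as np

def box_signed_distance_map(height, width, box, wrap=True):
    """Signed distance (in pixels) from every pixel to the edge of a box.

    box = (x_min, y_min, x_max, y_max) in panorama pixel coordinates.
    Assumed convention: negative inside the box, zero on its edge,
    positive outside. With wrap=True the left and right image borders
    are treated as adjacent, as they are in a 360-degree panorama.
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w, half_h = (x_max - x_min) / 2.0, (y_max - y_min) / 2.0

    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)

    # Horizontal and vertical offsets from the box centre.
    dx = np.abs(xs - cx)
    if wrap:
        dx = np.mod(dx, width)
        dx = np.minimum(dx, width - dx)  # take the shorter way around
    dy = np.abs(ys - cy)

    # Standard signed distance to an axis-aligned rectangle.
    qx, qy = dx - half_w, dy - half_h
    outside = np.hypot(np.maximum(qx, 0.0), np.maximum(qy, 0.0))
    inside = np.minimum(np.maximum(qx, qy), 0.0)
    return outside + inside

# Example: a box that crosses the left/right seam of a 16x8 panorama.
basd = box_signed_distance_map(8, 16, (14, 2, 18, 6))
print(np.round(basd[4], 1))  # one row through the middle of the box
```

The wrap-around step is the part specific to panoramas: a viewpoint that straddles the seam of the unrolled 360-degree frame should still produce one contiguous region of negative values rather than two disconnected ones.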

These three conditioning signals are then integrated into a dual U-Net architecture, which consists of pre-trained audio and video diffusion models. A new, trainable control module processes these signals and injects them into the generation process, ensuring that the generated video and audio for a specific field-of-view remain consistent with the broader, off-screen environment. Temporal synchronization between audio and video is also maintained throughout this process.
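The article does not describe the internals of the control module. One widely used pattern for adding new conditioning to a frozen, pre-trained diffusion backbone is a zero-initialized residual branch, the idea behind ControlNet's zero convolutions: before training, the branch outputs zero, so the pre-trained model's behavior is unchanged, and training gradually lets the conditioning take effect. The sketch below illustrates that pattern only. The class `ZeroInitControl`, the helper `build_condition`, and the choice to flatten and concatenate the three signals are hypothetical and are not taken from Con360-AV.

```python
import numpy as np

def build_condition(saliency, basd, caption_emb):
    """Hypothetical helper: flatten the three signals into one vector."""
    return np.concatenate(
        [saliency.ravel(), basd.ravel(), caption_emb.ravel()]
    )

class ZeroInitControl:
    """Hypothetical control branch added on top of a frozen backbone feature.

    The projection starts at zero, so at initialization the output equals
    the backbone feature exactly. This is an illustration of the
    zero-initialization idea, not the Con360-AV control module.
    """

    def __init__(self, cond_dim, feat_dim):
        self.weight = np.zeros((feat_dim, cond_dim))  # trainable in practice
        self.bias = np.zeros(feat_dim)                # trainable in practice

    def __call__(self, backbone_feat, cond):
        # Residual injection: backbone feature plus projected conditioning.
        return backbone_feat + self.weight @ cond + self.bias

# Toy example with made-up shapes.
rng = np.random.default_rng(0)
feat = rng.normal(size=4)
cond = build_condition(np.ones((2, 3)), np.zeros((2, 3)), np.ones(5))
control = ZeroInitControl(cond_dim=cond.size, feat_dim=feat.size)
out = control(feat, cond)  # identical to feat until the weights are trained
```

In a dual U-Net setup like the one described above, a branch of this kind would typically be applied to both the audio and the video backbones so that each receives the same 360-degree context.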

Demonstrated Effectiveness and Future Potential

The effectiveness of Con360-AV was demonstrated through experiments conducted on the Sphere360 dataset, a large collection of 360-degree videos with corresponding audio. The results showed significant improvements over a baseline model that lacked spatial control. Con360-AV achieved better spatial alignment between generated content and the target viewpoint, as well as improved audio fidelity and overall video quality.

Crucially, the model proved capable of consistently taking off-screen spatial information into account during generation, producing semantically aligned results for different viewpoints of the same scene. This means that if a sound source is off-screen, the generated audio will still accurately reflect its presence and spatial characteristics.

The implications of Con360-AV are far-reaching. It could pave the way for enriching virtual reality (VR) experiences with realistic off-camera sounds, simulating ambient audio for 360-degree videos with unprecedented accuracy, and enabling interactive storytelling where the soundscape dynamically adapts to the user’s viewpoint. Furthermore, by conditioning generation on the complete 360-degree context, the method gains explicit control over events at the viewpoint boundary, such as characters entering or exiting the scene, while guaranteeing spatio-temporal coherence across dynamic camera movements.

For more technical details, you can refer to the full research paper here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
