TLDR: Con360-AV is a novel framework for controllable audio-visual generation at specific viewpoints within a 360-degree environment. It addresses a limitation of existing models, which are unaware of events outside the generated view, by leveraging panoramic saliency maps, bounding-box-aware signed distance maps, and global scene captions. This allows the model to generate video and audio that are spatially aware and coherently influenced by off-camera events, enhancing realism and immersion in applications like VR and interactive media.
The world of immersive media, particularly 360-degree videos, is constantly evolving, and with it, the demand for more realistic and controllable audio-visual experiences. While generative AI models have made significant strides in creating sounding videos, a key challenge has remained: how to generate content for a specific viewpoint that is still aware of events happening outside that view, within the broader 360-degree environment.
Imagine watching a 360-degree video of a band playing in a circle. If your viewpoint focuses on the guitarist, current models might struggle to accurately spatialise the sound of the drummer who is off-screen. This limitation restricts the creation of truly immersive experiences where off-camera events coherently influence what you see and hear.
Introducing Con360-AV: A New Approach to Immersive Audio-Visual Generation
Researchers from Sapienza University of Rome have introduced a novel framework called Con360-AV (Controllable 360° context Audio-Visual generation) to address this gap. The authors present it as the first framework for controllable audio-visual generation that leverages the full 360-degree spatial information of a scene. The core idea is to provide a diffusion model with conditioning signals derived from the entire panoramic space, enabling it to generate viewpoint-specific video and audio that are influenced by the unseen environmental context.
How Con360-AV Works: Three Key Conditioning Signals
Con360-AV integrates three distinct types of information to achieve its spatial awareness and controllability:
1. 360° Saliency Maps: These maps identify visually important regions across the entire 360-degree scene. By highlighting significant objects and actions, whether they are currently in the target viewpoint or off-screen, the model gains an understanding of where attention should be directed.
2. Bounding Box-Aware Signed Distance (BASD) Maps: This signal precisely defines the target viewpoint. By identifying centroids of prominent regions from the saliency maps and projecting their bounding boxes onto the 360-degree frame, a BASD map is computed. This map provides a geometric guide, telling the model the exact location and boundaries of the content it needs to generate for the chosen viewpoint.
3. 360° Scene Caption: To provide high-level semantic context, a detailed description of the entire 360-degree scene is generated. This involves processing multiple perspective views from the 360-degree video and using a large language model to synthesize a single, coherent caption that summarizes the complete environment and its dynamics over time.
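As a rough illustration of the second signal, the sketch below computes a signed distance map for a single bounding box on an equirectangular frame. It is not the authors' implementation: the function name, the (cx, cy, half_w, half_h) box format, the negative-inside sign convention, and the horizontal wrap-around are all assumptions made for this example.

```python
import math


def box_signed_distance_map(width, height, box, wrap=True):
    """Signed distance (in pixels) from every pixel to an axis-aligned box.

    Hypothetical helper for illustration only.
    box = (cx, cy, half_w, half_h) in pixel units.
    Values are negative inside the box, zero on its edge, positive outside.
    """
    cx, cy, half_w, half_h = box
    sdf = []
    for y in range(height):
        row = []
        for x in range(width):
            dx = abs(x - cx)
            if wrap:
                # Assumption: the left and right edges of a 360-degree frame
                # are adjacent, so take the shorter way around.
                dx = min(dx, width - dx)
            dy = abs(y - cy)
            # Offsets relative to the box edges (negative means inside).
            qx, qy = dx - half_w, dy - half_h
            outside = math.hypot(max(qx, 0.0), max(qy, 0.0))
            inside = min(max(qx, qy), 0.0)
            row.append(outside + inside)
        sdf.append(row)
    return sdf


# A box centred near the left edge of a 16x8 frame.
sdf_wrapped = box_signed_distance_map(16, 8, (1, 4, 2, 1))
sdf_flat = box_signed_distance_map(16, 8, (1, 4, 2, 1), wrap=False)
```

With wrap-around enabled, a pixel at the far right of the frame is treated as close to a box near the left edge, which matches the geometry of a panorama. Without it, the same pixel would be reported as far away.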
These three conditioning signals are then integrated into a dual U-Net architecture, which consists of pre-trained audio and video diffusion models. A new, trainable control module processes these signals and injects them into the generation process, ensuring that the generated video and audio for a specific field-of-view remain consistent with the broader, off-screen environment. Temporal synchronization between audio and video is also maintained throughout this process.
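The article does not describe how the control module injects its signals. One common pattern for adding a trainable branch to a frozen pre-trained model is a zero-initialised residual, as used in ControlNet-style designs. The sketch below shows that pattern only as an assumption about what such a module could look like; the class name, the per-channel weights, and the list-based features are hypothetical and are not taken from Con360-AV.

```python
class ZeroInitControl:
    """Hypothetical control branch: scales a conditioning vector and adds it
    to features produced by a frozen backbone."""

    def __init__(self, dim):
        # Zero initialisation: before any training the branch has no effect,
        # so the frozen model's behaviour is preserved exactly.
        self.weights = [0.0] * dim

    def inject(self, frozen_features, condition):
        # Residual injection: backbone features plus a learned, per-channel
        # contribution from the conditioning signal.
        return [
            f + w * c
            for f, w, c in zip(frozen_features, self.weights, condition)
        ]


control = ZeroInitControl(dim=3)
untrained = control.inject([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])

# Simulate a training update to one weight (illustrative value).
control.weights[0] = 0.5
trained = control.inject([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

The design choice here is that only the control branch needs training, while the pre-trained audio and video models stay fixed.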
Demonstrated Effectiveness and Future Potential
The effectiveness of Con360-AV was demonstrated through experiments conducted on the Sphere360 dataset, a large collection of 360-degree videos with corresponding audio. The results showed significant improvements over a baseline model that lacked spatial control. Con360-AV achieved better spatial alignment between generated content and the target viewpoint, as well as improved audio fidelity and overall video quality.
Crucially, the model proved capable of consistently taking off-screen spatial information into account during generation, producing semantically aligned results for different viewpoints of the same scene. This means that if a sound source is off-screen, the generated audio will still accurately reflect its presence and spatial characteristics.
The implications of Con360-AV are far-reaching. It could pave the way for enriching virtual reality (VR) experiences with realistic off-camera sounds, simulating ambient audio for 360-degree videos more accurately, and enabling interactive storytelling where the soundscape dynamically adapts to the user’s viewpoint. Furthermore, by conditioning generation on the complete 360-degree context, the method gains explicit control over events at the viewpoint boundary, such as characters entering or exiting the scene, while maintaining spatio-temporal coherence across dynamic camera movements.
For more technical details, refer to the full research paper.


