TLDR: The research introduces a novel approach to 3D Sound Event Localization and Detection (SELD) in videos, a complex task involving identifying sound events, tracking their activity, and estimating their 3D positions. The authors enhance a standard SELD architecture by integrating pre-trained, language-aligned models—CLAP for audio and OWL-ViT for visual inputs—into a custom Cross-Modal Conformer. This method leverages semantic information to overcome data limitations of traditional SELD. Through extensive pre-training on synthetic datasets and engineering refinements, their approach achieved second place in the DCASE 2025 Challenge Task 3 (Track B), demonstrating significant performance gains in localizing and classifying sound events in stereo video content.
In the rapidly evolving field of artificial intelligence, understanding and interpreting our surroundings goes beyond just seeing. Imagine a system that not only sees what’s happening in a video but also accurately hears and pinpoints where sounds are coming from, even estimating their distance. This complex task is known as 3D Sound Event Localization and Detection (3D SELD), and it’s crucial for applications ranging from human-robot interaction to security monitoring and immersive media production.
Traditionally, SELD systems rely on specialized multi-channel audio inputs, for which large training corpora are scarce; this limits their ability to benefit from the vast amounts of data used in large-scale pre-training, and often makes it challenging for them to grasp the semantic meaning behind sounds and their visual context. A recent research paper, Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos, introduces an innovative approach to overcoming these limitations, significantly advancing the capabilities of 3D SELD on regular video content.
A New Multimodal Approach
The researchers, Davide Berghi and Philip J. B. Jackson of the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, propose enhancing a standard SELD architecture by integrating powerful, pre-trained, language-aligned models. For audio inputs they leverage CLAP (Contrastive Language-Audio Pre-training), and for visual inputs OWL-ViT, a model designed for visual grounding tasks such as open-vocabulary object detection. These models provide rich semantic information, helping the SELD system understand not just ‘where’ a sound is, but also ‘what’ it is and how it relates to the visual scene.
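As a concrete illustration, the sketch below extracts language-aligned embeddings with the Hugging Face Transformers library. The checkpoint names (laion/clap-htsat-unfused, google/owlvit-base-patch32) and the 512-dimensional outputs are assumptions for illustration, not necessarily the variants the authors used:

```python
# Minimal sketch (not the authors' exact pipeline): language-aligned
# audio/visual embeddings from public CLAP and OWL-ViT checkpoints.
import torch
from transformers import ClapModel, ClapProcessor, OwlViTModel, OwlViTProcessor

clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
owl = OwlViTModel.from_pretrained("google/owlvit-base-patch32")
owl_proc = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")

@torch.no_grad()
def semantic_embeddings(mono_audio, sr, frame_rgb):
    # CLAP expects 48 kHz mono audio; returns a text-aligned audio embedding.
    a_in = clap_proc(audios=mono_audio, sampling_rate=sr, return_tensors="pt")
    audio_emb = clap.get_audio_features(**a_in)   # (1, 512)
    # OWL-ViT's image tower yields a text-aligned visual embedding.
    v_in = owl_proc(images=frame_rgb, return_tensors="pt")
    visual_emb = owl.get_image_features(**v_in)   # (1, 512)
    return audio_emb, visual_emb
```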
The core of their innovation is a modified Conformer module they call the Cross-Modal Conformer (CMC). Designed for multimodal fusion, it combines the extracted audio and visual embeddings with the SELD system’s own embeddings, enabling the deeper integration of spatial, temporal, and semantic reasoning needed to localize sound events accurately in 3D space.
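This summary does not reproduce the CMC’s exact layout, so the following is only a minimal cross-attention fusion sketch in PyTorch, showing how SELD embeddings might attend to projected semantic embeddings; the dimensions and block structure are illustrative assumptions:

```python
# Illustrative cross-modal fusion via cross-attention, in the spirit of the
# paper's Cross-Modal Conformer; not the authors' exact architecture.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, ff_mult=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_mult * d_model), nn.GELU(),
            nn.Linear(ff_mult * d_model, d_model),
        )

    def forward(self, seld_seq, semantic_seq):
        # seld_seq: (B, T, D) spatio-temporal embeddings from the SELD branch.
        # semantic_seq: (B, S, D) projected CLAP / OWL-ViT embeddings.
        attn_out, _ = self.cross_attn(self.norm1(seld_seq), semantic_seq, semantic_seq)
        x = seld_seq + attn_out            # residual fusion with semantic context
        return x + self.ff(self.norm2(x))  # position-wise feed-forward, residual
```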
Overcoming Data Challenges with Synthetic Pre-training
One of the significant hurdles in developing robust SELD systems is the availability of large, diverse datasets. The authors tackled this by curating extensive synthetic audio and audio-visual datasets for model pre-training. They used tools like SpatialScaper to generate realistic soundscapes and SELDVisualSynth to create corresponding videos, enriching the data with class-relevant images and diverse indoor environments. To further boost the training data size and model robustness, they employed data augmentation techniques such as left-right channel swapping for audio and corresponding video frame flipping.
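A minimal sketch of that channel-swap augmentation is shown below, assuming azimuth labels in degrees whose sign flips under left-right mirroring; the dataset’s exact label format is an assumption:

```python
# Sketch of left-right swap augmentation for stereo audio-visual SELD data.
import torch

def lr_swap(stereo_wav, frames, azimuths):
    # stereo_wav: (2, N) waveform; frames: (T, H, W, C) video; azimuths in degrees.
    wav = stereo_wav.flip(0)   # swap left/right audio channels
    vid = frames.flip(2)       # mirror frames horizontally (width axis)
    azi = -azimuths            # mirroring negates the azimuth angle
    return wav, vid, azi
```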
The model’s training involved several stages: initial pre-training on a large synthetic audio dataset, followed by training on a synthetic audio-visual dataset, and finally fine-tuning on real-world data from the DCASE 2025 Task 3 Stereo SELD Dataset. This multi-stage approach, combined with specialized acoustic features such as the Inter-channel Level Difference (ILD) and the short-term power of the autocorrelation (stpACC) for distance estimation, proved highly effective.
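For illustration, here is a minimal sketch of how an ILD feature can be computed from a stereo signal; the STFT window and hop sizes are assumptions rather than the paper’s exact settings:

```python
# Sketch: Inter-channel Level Difference (ILD) from a stereo STFT.
import torch

def ild_feature(stereo_wav, n_fft=1024, hop=480, eps=1e-8):
    # stereo_wav: (2, N). STFT per channel -> (2, F, T) complex spectrograms.
    win = torch.hann_window(n_fft)
    spec = torch.stft(stereo_wav, n_fft, hop, window=win, return_complex=True)
    power = spec.abs().pow(2)
    # ILD: log power ratio between left (index 0) and right (index 1), per bin.
    return 10.0 * torch.log10((power[0] + eps) / (power[1] + eps))  # (F, T)
```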
Achieving Top Performance
The effectiveness of their method was rigorously tested in the DCASE 2025 Challenge Task 3 (Track B), a prestigious competition in acoustic scene and event analysis. Their approach, which also included engineering refinements like a weighted loss function for on/off-screen predictions, visual post-processing based on human keypoint detection, and model ensembling, achieved an impressive second-place ranking. This outstanding result underscores the power of integrating language-aligned models and extensive pre-training for complex multimodal tasks like 3D SELD.
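As an illustration of one such refinement, the sketch below implements a weighted binary cross-entropy for a per-frame on/off-screen head; the weighting scheme and the 0.8 value are hypothetical, not taken from the paper:

```python
# Hypothetical weighted binary loss for on/off-screen prediction.
import torch
import torch.nn.functional as F

def onscreen_loss(logits, targets, on_weight=0.8):
    # logits, targets: (B, T); targets are 1.0 (on-screen) or 0.0 (off-screen).
    # Per-element weights favor the on-screen class when on_weight > 0.5.
    w = targets * on_weight + (1.0 - targets) * (1.0 - on_weight)
    return F.binary_cross_entropy_with_logits(logits, targets, weight=w)
```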
This research marks a significant step forward in enabling AI systems to perceive and understand our world more holistically, bridging the gap between what they see and what they hear. Future work will delve deeper into understanding the specific contributions of each modality and refining the architectural design for even greater performance.


