TLDR: Researchers have developed DASM, a novel AI model for open-vocabulary sound event detection. Unlike traditional systems, DASM can identify sounds it hasn’t been explicitly trained on by using multi-modal queries (text or audio). It employs a dual-stream decoder for precise event recognition and temporal localization, demonstrating superior generalization and accuracy across various datasets.
Imagine a world where artificial intelligence can identify any sound, not just those it has been specifically trained to recognize. This is the ambitious goal tackled by a new research paper titled “Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries.” Traditional sound event detection (SED) systems are often limited to a predefined set of sounds, meaning they struggle to identify novel or unexpected audio events in real-world scenarios.
The researchers, Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, and Ian McLoughlin, introduce a groundbreaking framework called the Detect Any Sound Model (DASM). DASM is designed to overcome the limitations of closed-set SED by enabling open-vocabulary detection, allowing it to identify sounds it has never encountered during its training phase. This is achieved by formulating SED as a frame-level retrieval task, where the system matches audio features against query vectors.
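To make the retrieval framing concrete, here is a minimal sketch of the core matching step. The shapes, temperature, and threshold below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: T audio frames, D-dim embeddings, K queried sound classes.
T, D, K = 250, 512, 4
frame_features = torch.randn(T, D)  # per-frame embeddings from the audio encoder
query_vectors = torch.randn(K, D)   # one query vector per target sound

# Frame-level retrieval: score every frame against every query by cosine
# similarity, then squash scores into per-frame presence probabilities.
frame_features = F.normalize(frame_features, dim=-1)
query_vectors = F.normalize(query_vectors, dim=-1)
logits = frame_features @ query_vectors.T   # (T, K) similarity matrix
frame_probs = torch.sigmoid(logits / 0.07)  # temperature 0.07 is an assumed value

# Thresholding each column yields onset/offset regions for each queried event.
active_frames = frame_probs > 0.5           # (T, K) boolean activity map
```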
How DASM Works
DASM’s innovative approach is guided by multi-modal queries, meaning users can prompt the system using either text descriptions (e.g., “sound of a barking dog”) or even audio clips containing the target sound. This flexibility is a significant leap forward, as it makes the system highly adaptable to various applications.
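To see how a single interface can accept either modality, the sketch below produces query vectors with the open-source laion_clap package, used here as a stand-in for the paper’s pre-trained CLAP encoder (the authors’ exact checkpoint and API may differ):

```python
import laion_clap

# Load a general-purpose pre-trained CLAP model (stand-in for DASM's query module).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # fetches a default public checkpoint

# Text query: a natural-language description of the target sound.
text_query = model.get_text_embedding(["sound of a barking dog"])

# Audio query: an example recording that contains the target sound.
audio_query = model.get_audio_embedding_from_filelist(x=["barking_dog.wav"])

# Either vector can drive the same frame-level retrieval shown earlier.
```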
The model comprises three main components (a simplified sketch of the decoder follows the list):
- Audio Encoder: processes the incoming audio into a detailed sequence of features, capturing the nuances of the sound.
- Query Generation Module: powered by a pre-trained CLAP (Contrastive Language-Audio Pretraining) model, this module transforms the text or audio query into a query vector. This vector acts as the ‘fingerprint’ of the sound DASM needs to detect.
- Dual-Stream Decoder: the brain of DASM, explicitly decoupling two critical aspects of sound detection: event recognition and temporal localization. A ‘cross-modality event decoder’ determines whether a sound event is present in an audio clip by fusing the query and audio features, while a ‘context network’ models the temporal dependencies, pinpointing exactly when the sound occurs within the audio stream.
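Putting the pieces together, here is a simplified PyTorch sketch of the dual-stream idea. The module choices (multi-head attention, a BiGRU context network) and all dimensions are assumptions for illustration, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class DualStreamDecoderSketch(nn.Module):
    """Illustrative only: one stream decides WHETHER a queried event occurs
    in the clip, the other models WHEN it occurs along the time axis."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Cross-modality event decoder: query vectors attend over audio frames.
        self.event_decoder = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Context network: temporal modeling over the frame sequence.
        self.context = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.clip_head = nn.Linear(dim, 1)  # clip-level presence score per query

    def forward(self, frames, queries):
        # frames: (B, T, dim) audio features; queries: (B, K, dim) query vectors
        # Stream 1 -- event recognition: fuse each query with the audio clip.
        fused, _ = self.event_decoder(queries, frames, frames)   # (B, K, dim)
        clip_logits = self.clip_head(fused).squeeze(-1)          # (B, K)
        # Stream 2 -- temporal localization: context-aware frame features
        # scored against the fused queries.
        ctx, _ = self.context(frames)                            # (B, T, dim)
        frame_logits = torch.einsum("btd,bkd->bkt", ctx, fused)  # (B, K, T)
        # Gate localization by recognition: frame activity only counts for
        # events the clip-level stream judges to be present.
        frame_probs = torch.sigmoid(frame_logits) * \
                      torch.sigmoid(clip_logits).unsqueeze(-1)   # (B, K, T)
        return clip_logits, frame_probs
```

The final gating step reflects the point, noted later, that clip-level predictions are crucial to the model’s performance.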
A clever inference-time attention masking strategy is also employed. This strategy allows DASM to leverage semantic relationships between known (base) and unknown (novel) sound classes, significantly improving its ability to generalize to new sounds. For instance, if it knows what a “gunshot” is, it can better infer what a “fusillade” might sound like, even if it hasn’t been explicitly trained on the latter.
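The summary does not detail the masking mechanics, so the following is a hypothetical reading: at inference, each novel-class query is only allowed to attend to the base-class queries it is semantically closest to in the CLAP embedding space. The function name, top-k rule, and shapes are all assumptions:

```python
import torch
import torch.nn.functional as F

def novel_to_base_attention_mask(novel_emb, base_emb, top_k=3):
    """Hypothetical inference-time mask: each novel-class embedding (N, D) may
    attend only to its top_k most similar base-class embeddings (M, D).
    Returns a boolean (N, M) mask where True marks BLOCKED positions,
    following PyTorch's attn_mask convention."""
    sim = F.normalize(novel_emb, dim=-1) @ F.normalize(base_emb, dim=-1).T  # (N, M)
    nearest = sim.topk(top_k, dim=-1).indices
    allowed = torch.zeros_like(sim, dtype=torch.bool).scatter_(1, nearest, True)
    return ~allowed

# e.g. a "fusillade" query keeps attention on close base classes like "gunshot"
novel = torch.randn(2, 512)   # 2 novel-class embeddings (illustrative)
base = torch.randn(10, 512)   # 10 base-class embeddings (illustrative)
mask = novel_to_base_attention_mask(novel, base)  # pass as attn_mask to attention
```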
Impressive Performance
The researchers conducted extensive experiments on benchmark datasets such as AudioSet Strong and DESED, with DASM performing particularly well in open-vocabulary scenarios. On AudioSet Strong, it significantly outperformed existing CLAP-based methods at detecting novel classes, and even in closed-set scenarios (detecting sounds it was trained on) it surpassed traditional baselines.
Perhaps most impressively, in cross-dataset zero-shot evaluation on DESED, DASM achieved a high score that even exceeded a supervised baseline model. This underscores DASM’s strong generalization ability: it can transfer its knowledge to new datasets without any additional training.
The study also revealed that even a few minutes of audio are sufficient to construct effective audio queries, making the system practical for scenarios with limited audio resources. The dual-stream decoder and the clip-level predictions were also shown to be crucial for the model’s high performance.
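On the audio-query finding, a plausible recipe (again sketched with laion_clap; the clip list and normalization step are assumptions) is to average the embeddings of a handful of short recordings into one query prototype:

```python
import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# A few short recordings of the target sound, totalling only minutes of audio.
clips = ["siren_01.wav", "siren_02.wav", "siren_03.wav"]   # illustrative paths
embeds = model.get_audio_embedding_from_filelist(x=clips)  # (3, D) numpy array

# Average and unit-normalize into a single reusable query vector.
query = embeds.mean(axis=0)
query /= np.linalg.norm(query)
```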
This research marks a significant step towards creating more versatile and intelligent sound event detection systems that can adapt to the vast and ever-changing soundscapes of the real world. For more details, you can read the full research paper here: Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries.