TLDR: Researchers have developed DASM, a novel AI model for open-vocabulary sound event detection. Unlike traditional systems, DASM can identify sounds it hasn’t been explicitly trained on by using multi-modal queries (text or audio). It employs a dual-stream decoder for precise event recognition and temporal localization, demonstrating superior generalization and accuracy across various datasets.
Imagine a world where artificial intelligence can identify any sound, not just those it has been specifically trained to recognize. This is the ambitious goal tackled by a new research paper titled “Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries.” Traditional sound event detection (SED) systems are often limited to a predefined set of sounds, meaning they struggle to identify novel or unexpected audio events in real-world scenarios.
The researchers, Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, and Ian McLoughlin, introduce a groundbreaking framework called the Detect Any Sound Model (DASM). DASM is designed to overcome the limitations of closed-set SED by enabling open-vocabulary detection, allowing it to identify sounds it has never encountered during its training phase. This is achieved by formulating SED as a frame-level retrieval task, where the system matches audio features against query vectors.
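To make the retrieval framing concrete, here is a minimal sketch of the core matching step. The shapes, temperature, and threshold below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: T audio frames, D-dim embeddings, K queried sound classes.
T, D, K = 250, 512, 4
frame_features = torch.randn(T, D)  # per-frame embeddings from the audio encoder
query_vectors = torch.randn(K, D)   # one query vector per target sound

# Frame-level retrieval: score every frame against every query by cosine
# similarity, then squash scores into per-frame presence probabilities.
frame_features = F.normalize(frame_features, dim=-1)
query_vectors = F.normalize(query_vectors, dim=-1)
logits = frame_features @ query_vectors.T   # (T, K) similarity matrix
frame_probs = torch.sigmoid(logits / 0.07)  # temperature 0.07 is an assumed value

# Thresholding each column yields onset/offset regions for each queried event.
active_frames = frame_probs > 0.5           # (T, K) boolean activity map
```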
How DASM Works
DASM’s innovative approach is guided by multi-modal queries, meaning users can prompt the system using either text descriptions (e.g., “sound of a barking dog”) or even audio clips containing the target sound. This flexibility is a significant leap forward, as it makes the system highly adaptable to various applications.
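To see how a single interface can accept either modality, the sketch below produces query vectors with the open-source laion_clap package, used here as a stand-in for the paper’s pre-trained CLAP encoder (the authors’ exact checkpoint and API may differ):

```python
import laion_clap

# Load a general-purpose pre-trained CLAP model (stand-in for DASM's query module).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # fetches a default public checkpoint

# Text query: a natural-language description of the target sound.
text_query = model.get_text_embedding(["sound of a barking dog"])

# Audio query: an example recording that contains the target sound.
audio_query = model.get_audio_embedding_from_filelist(x=["barking_dog.wav"])

# Either vector can drive the same frame-level retrieval shown earlier.
```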
The model comprises three main components (a simplified sketch of the decoder follows the list):
- Audio Encoder: processes the incoming audio into a detailed sequence of features, capturing the nuances of the sound.
- Query Generation Module: powered by a pre-trained CLAP (Contrastive Language-Audio Pretraining) model, this module transforms the text or audio query into a query vector. This vector acts as the ‘fingerprint’ of the sound DASM needs to detect.
- Dual-Stream Decoder: the brain of DASM, explicitly decoupling two critical aspects of sound detection: event recognition and temporal localization. A ‘cross-modality event decoder’ determines whether a sound event is present in an audio clip by fusing the query and audio features, while a ‘context network’ models the temporal dependencies, pinpointing exactly when the sound occurs within the audio stream.
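Putting the pieces together, here is a simplified PyTorch sketch of the dual-stream idea. The module choices (multi-head attention, a BiGRU context network) and all dimensions are assumptions for illustration, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class DualStreamDecoderSketch(nn.Module):
    """Illustrative only: one stream decides WHETHER a queried event occurs
    in the clip, the other models WHEN it occurs along the time axis."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Cross-modality event decoder: query vectors attend over audio frames.
        self.event_decoder = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Context network: temporal modeling over the frame sequence.
        self.context = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.clip_head = nn.Linear(dim, 1)  # clip-level presence score per query

    def forward(self, frames, queries):
        # frames: (B, T, dim) audio features; queries: (B, K, dim) query vectors
        # Stream 1 -- event recognition: fuse each query with the audio clip.
        fused, _ = self.event_decoder(queries, frames, frames)   # (B, K, dim)
        clip_logits = self.clip_head(fused).squeeze(-1)          # (B, K)
        # Stream 2 -- temporal localization: context-aware frame features
        # scored against the fused queries.
        ctx, _ = self.context(frames)                            # (B, T, dim)
        frame_logits = torch.einsum("btd,bkd->bkt", ctx, fused)  # (B, K, T)
        # Gate localization by recognition: frame activity only counts for
        # events the clip-level stream judges to be present.
        frame_probs = torch.sigmoid(frame_logits) * \
                      torch.sigmoid(clip_logits).unsqueeze(-1)   # (B, K, T)
        return clip_logits, frame_probs
```

The final gating step reflects the point, noted later, that clip-level predictions are crucial to the model’s performance.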
A clever inference-time attention masking strategy is also employed. This strategy allows DASM to leverage semantic relationships between known (base) and unknown (novel) sound classes, significantly improving its ability to generalize to new sounds. For instance, if it knows what a “gunshot” is, it can better infer what a “fusillade” might sound like, even if it hasn’t been explicitly trained on the latter.
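The summary does not detail the masking mechanics, so the following is a hypothetical reading: at inference, each novel-class query is only allowed to attend to the base-class queries it is semantically closest to in the CLAP embedding space. The function name, top-k rule, and shapes are all assumptions:

```python
import torch
import torch.nn.functional as F

def novel_to_base_attention_mask(novel_emb, base_emb, top_k=3):
    """Hypothetical inference-time mask: each novel-class embedding (N, D) may
    attend only to its top_k most similar base-class embeddings (M, D).
    Returns a boolean (N, M) mask where True marks BLOCKED positions,
    following PyTorch's attn_mask convention."""
    sim = F.normalize(novel_emb, dim=-1) @ F.normalize(base_emb, dim=-1).T  # (N, M)
    nearest = sim.topk(top_k, dim=-1).indices
    allowed = torch.zeros_like(sim, dtype=torch.bool).scatter_(1, nearest, True)
    return ~allowed

# e.g. a "fusillade" query keeps attention on close base classes like "gunshot"
novel = torch.randn(2, 512)   # 2 novel-class embeddings (illustrative)
base = torch.randn(10, 512)   # 10 base-class embeddings (illustrative)
mask = novel_to_base_attention_mask(novel, base)  # pass as attn_mask to attention
```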
Impressive Performance
The researchers conducted extensive experiments on benchmark datasets such as AudioSet Strong and DESED, with DASM performing particularly well in open-vocabulary scenarios. On AudioSet Strong, it significantly outperformed existing CLAP-based methods at detecting novel classes, and even in closed-set scenarios (detecting sounds it was trained on) it surpassed traditional baselines.
Perhaps most impressively, in cross-dataset zero-shot evaluation on DESED, DASM achieved a high score that even exceeded a supervised baseline model. This underscores DASM’s strong generalization ability: it can transfer its knowledge to new datasets without any additional training.
The study also revealed that even a few minutes of audio are sufficient to construct effective audio queries, making the system practical for scenarios with limited audio resources. The dual-stream decoder and the clip-level predictions were also shown to be crucial for the model’s high performance.
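On the audio-query finding, a plausible recipe (again sketched with laion_clap; the clip list and normalization step are assumptions) is to average the embeddings of a handful of short recordings into one query prototype:

```python
import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# A few short recordings of the target sound, totalling only minutes of audio.
clips = ["siren_01.wav", "siren_02.wav", "siren_03.wav"]   # illustrative paths
embeds = model.get_audio_embedding_from_filelist(x=clips)  # (3, D) numpy array

# Average and unit-normalize into a single reusable query vector.
query = embeds.mean(axis=0)
query /= np.linalg.norm(query)
```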
This research marks a significant step towards creating more versatile and intelligent sound event detection systems that can adapt to the vast and ever-changing soundscapes of the real world. For more details, you can read the full research paper here: Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries.