TLDR: The AISTAT Lab’s DCASE 2025 Task 6 system for language-based audio retrieval uses a dual encoder architecture enhanced with contrastive learning, distillation loss, and LLM-based data augmentation (back-translation, LLM mix). A novel cluster-based auxiliary classification task further refines the model. The system achieved a mAP@16 of 46.62 with its best single model and 48.83 with an ensemble, demonstrating significant improvements in aligning audio and text modalities.
In the rapidly evolving field of artificial intelligence, the ability to search and retrieve audio content using natural language descriptions is becoming increasingly vital. This capability, known as language-based audio retrieval, underpins applications ranging from content-based multimedia search to advanced audio annotation. The DCASE 2025 Task 6 challenge specifically focuses on pushing the boundaries of this technology, requiring models to understand the nuanced semantic relationships between free-form text and complex audio signals.
A team from AISTAT Lab, comprising Hyun Jun Kim, Hyeong Yong Choi, and Changwon Lim, has submitted an innovative system to this challenge. Their approach builds upon a dual encoder architecture, a common framework where audio and text are processed by separate encoders and then aligned in a shared semantic space. This allows the system to effectively match textual queries with relevant audio recordings and vice versa.
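To make the dual-encoder idea concrete, here is a minimal PyTorch-style sketch of the general pattern, not the team's exact implementation: the backbone modules, their `out_dim` attribute, and the projection dimension are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Generic dual encoder: separate audio and text backbones projected into a
    shared embedding space and compared with cosine similarity."""

    def __init__(self, audio_encoder, text_encoder, embed_dim=1024):
        super().__init__()
        self.audio_encoder = audio_encoder            # e.g. a PaSST-style audio backbone (placeholder)
        self.text_encoder = text_encoder              # e.g. a RoBERTa-style text backbone (placeholder)
        self.audio_proj = nn.Linear(audio_encoder.out_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.out_dim, embed_dim)

    def forward(self, audio, text_tokens):
        # L2-normalized embeddings so that the dot product equals cosine similarity
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        return a @ t.T                                # (num_audio, num_text) similarity matrix
```

Retrieval then amounts to ranking audio clips by their similarity to the query text embedding (or vice versa).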
Core Methodologies for Enhanced Retrieval
The AISTAT Lab system incorporates several advanced techniques to boost its performance and generalization capabilities:
Contrastive Learning: At its foundation, the system uses contrastive learning. This technique trains the model to bring corresponding audio-text pairs closer together in the embedding space while pushing non-corresponding pairs further apart. Imagine teaching the system to recognize that a description of “birds chirping” belongs with an audio clip of birds, but not with an audio clip of a car engine.
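A common way to implement this objective is a symmetric InfoNCE-style loss over the in-batch similarity matrix. The sketch below assumes a square matrix of matched audio-text pairs (as produced by the dual-encoder sketch above); the temperature value is a placeholder, not the team's setting.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(sim, temperature=0.05):
    """Symmetric cross-entropy over an audio-text similarity matrix.
    Matched pairs sit on the diagonal; all other entries in the batch act as
    negatives and are pushed apart."""
    logits = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)  # i-th audio matches i-th caption
    loss_a2t = F.cross_entropy(logits, targets)              # audio-to-text direction
    loss_t2a = F.cross_entropy(logits.T, targets)             # text-to-audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```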
Distillation Loss: Traditional audio retrieval datasets often assume a simple binary match between an audio clip and its caption. However, real-world audio can be complex, with captions potentially describing multiple aspects or overlapping concepts. To address this, the team adopted a distillation loss approach. This method leverages “soft correspondence probabilities” from an ensemble of pre-trained models. Instead of a hard “yes/no” match, it provides a nuanced probability, helping the model learn more subtle audio-text relationships and improve its ability to generalize to new data.
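One standard way to use such soft targets is a KL-divergence loss between the student's matching distribution and the teacher's soft correspondence probabilities. The sketch below assumes the teacher ensemble provides a row-normalized probability distribution over captions for each audio clip; it illustrates the general technique rather than the report's exact formulation.

```python
import torch.nn.functional as F

def distillation_loss(student_sim, teacher_probs, temperature=0.05):
    """KL divergence between the student's audio-to-text matching distribution
    and soft targets from a teacher ensemble. Unlike a hard diagonal target,
    teacher_probs can spread mass over several captions that partially
    describe the same audio clip."""
    log_student = F.log_softmax(student_sim / temperature, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")
```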
Cluster-Based Classification: A novel addition to their system is an auxiliary classification task based on clustering. The researchers clustered all captions in the dataset into semantically similar groups, essentially identifying latent topics or patterns within the text. They then added classification heads to both the audio and text encoders, training them to predict these cluster labels. This encourages the audio encoder to learn representations that are inherently aligned with the semantic categories of the captions, leading to a more fine-grained understanding between audio and text.
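A minimal sketch of how such a cluster-guided auxiliary task can be wired up is shown below. It assumes pre-computed caption embeddings (random stand-ins here) and pooled encoder outputs; the use of k-means and the number of clusters are illustrative choices, not confirmed details of the submission.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

embed_dim, num_clusters = 1024, 64   # placeholder sizes, not the report's settings

# 1) Offline: cluster caption embeddings (random stand-ins here) into pseudo-topics.
caption_embeddings = np.random.randn(5000, embed_dim).astype("float32")
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(caption_embeddings)
cluster_labels = torch.as_tensor(kmeans.labels_, dtype=torch.long)  # one pseudo-label per caption

# 2) Lightweight classification heads on both encoders predict the caption's
#    cluster label as an auxiliary task alongside the retrieval losses.
audio_cls_head = nn.Linear(embed_dim, num_clusters)
text_cls_head = nn.Linear(embed_dim, num_clusters)

def auxiliary_cls_loss(audio_emb, text_emb, labels):
    """Cross-entropy on both modalities against the shared cluster labels."""
    return (F.cross_entropy(audio_cls_head(audio_emb), labels)
            + F.cross_entropy(text_cls_head(text_emb), labels))
```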
Data Augmentation with Large Language Models (LLMs): To overcome data scarcity and enhance caption diversity, the team employed GPT-4o for two augmentation techniques. The first was back-translation, in which original English captions were translated into another language and then back into English, producing new captions that preserve the original meaning but vary the phrasing. The second was LLM Mix, in which two audio signals were combined and GPT-4o merged their corresponding captions, creating entirely new audio-caption pairs. Together, these methods substantially expanded the dataset with diverse examples.
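The snippet below illustrates both augmentations at the caption level; `translate` and `query_llm` are hypothetical wrappers around a GPT-4o (or other LLM) call, and the prompt wording is only an example, not the team's prompt.

```python
def back_translate(caption, translate, pivot_lang="de"):
    """Back-translation: English -> pivot language -> English.
    `translate(text, target_lang)` is a hypothetical translation wrapper."""
    pivot = translate(caption, target_lang=pivot_lang)
    return translate(pivot, target_lang="en")

def llm_mix(caption_a, caption_b, query_llm):
    """LLM Mix: the two source audio clips are mixed in the signal domain, and
    `query_llm` (a hypothetical GPT-4o wrapper) merges their captions into a
    single description of the combined scene."""
    prompt = (
        "Combine the following two audio captions into one fluent caption "
        "describing both sounds occurring together:\n"
        f"1. {caption_a}\n2. {caption_b}"
    )
    return query_llm(prompt)
```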
Experimental Setup and Results
The system was trained on a combination of datasets, including ClothoV2.1, AudioCaps, and WavCaps. Several audio encoders, including PaSST, EAT, and BEATs, were explored, alongside the RoBERTa language model for text embeddings. Training proceeded in three stages: initial pretraining, finetuning with distillation and augmentation, and a final re-finetuning with the cluster-guided classification task.
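The staged schedule can be summarized as a simple configuration outline. The grouping of losses and augmentations per stage below is an interpretation of the description above; weights, epoch counts, and per-stage dataset splits are deliberately omitted.

```python
# Illustrative outline of the three-stage training schedule (interpretation, not the exact recipe).
training_stages = [
    {"name": "pretraining",   "losses": ["contrastive"]},
    {"name": "finetuning",    "losses": ["contrastive", "distillation"],
     "augmentation": ["back-translation", "LLM Mix"]},
    {"name": "re-finetuning", "losses": ["contrastive", "distillation", "cluster_classification"]},
]
```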
The results demonstrated the effectiveness of their combined strategies. Among individual systems, PaSST consistently showed strong performance. However, the most significant gains were achieved through an ensemble of four systems, which reached an impressive mAP@16 of 48.83 on the Clotho development test split. This highlights the power of combining different model configurations and techniques to achieve superior results.
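For illustration, the snippet below shows one common way to ensemble dual-encoder systems (averaging their similarity matrices) together with a straightforward mAP@k computation, under the assumption that each query has a single relevant audio item; the team's exact fusion and evaluation code may differ.

```python
import numpy as np

def ensemble_similarity(sim_matrices):
    """Score-level ensembling: average the audio-text similarity matrices
    produced by the individual systems (one common fusion strategy)."""
    return np.mean(np.stack(sim_matrices), axis=0)

def map_at_k(sim, ground_truth, k=16):
    """mAP@k for text-to-audio retrieval, assuming one relevant audio per query.
    sim: (num_queries, num_audios); ground_truth[i] is the index of the
    correct audio for query i."""
    aps = []
    for i, gt in enumerate(ground_truth):
        top_k = np.argsort(-sim[i])[:k]          # indices of the k highest-scoring audios
        hits = np.where(top_k == gt)[0]
        aps.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(aps))
```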
Conclusion
The AISTAT Lab’s submission to the DCASE 2025 Task 6 challenge showcases a robust and innovative system for language-based audio retrieval. By integrating a dual encoder architecture with contrastive learning, distillation loss, LLM-based data augmentation, and a novel cluster-based auxiliary classification task, the team achieved notable performance improvements. This work contributes significantly to advancing the field of cross-modal understanding between audio and text. For more details, you can refer to the full technical report here.


