TLDR: The AISTAT Lab’s DCASE 2025 Task 6 system for language-based audio retrieval uses a dual encoder architecture enhanced with contrastive learning, distillation loss, and LLM-based data augmentation (back-translation, LLM mix). A novel cluster-based auxiliary classification task further refines the model. The system achieved a mAP@16 of 46.62 with its best single model and 48.83 with an ensemble, demonstrating significant improvements in aligning audio and text modalities.
In the rapidly evolving field of artificial intelligence, the ability to search and retrieve audio content using natural language descriptions is becoming increasingly vital. This capability, known as language-based audio retrieval, underpins applications ranging from content-based multimedia search to advanced audio annotation. The DCASE 2025 Task 6 challenge specifically focuses on pushing the boundaries of this technology, requiring models to understand the nuanced semantic relationships between free-form text and complex audio signals.
A team from AISTAT Lab, comprising Hyun Jun Kim, Hyeong Yong Choi, and Changwon Lim, has submitted an innovative system to this challenge. Their approach builds upon a dual encoder architecture, a common framework where audio and text are processed by separate encoders and then aligned in a shared semantic space. This allows the system to effectively match textual queries with relevant audio recordings and vice versa.
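To make the dual-encoder idea concrete, here is a minimal PyTorch-style sketch of the general pattern, not the team's exact implementation: the backbone modules, their `out_dim` attribute, and the projection dimension are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Generic dual encoder: separate audio and text backbones projected into a
    shared embedding space and compared with cosine similarity."""

    def __init__(self, audio_encoder, text_encoder, embed_dim=1024):
        super().__init__()
        self.audio_encoder = audio_encoder            # e.g. a PaSST-style audio backbone (placeholder)
        self.text_encoder = text_encoder              # e.g. a RoBERTa-style text backbone (placeholder)
        self.audio_proj = nn.Linear(audio_encoder.out_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.out_dim, embed_dim)

    def forward(self, audio, text_tokens):
        # L2-normalized embeddings so that the dot product equals cosine similarity
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        return a @ t.T                                # (num_audio, num_text) similarity matrix
```

Retrieval then amounts to ranking audio clips by their similarity to the query text embedding (or vice versa).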
Core Methodologies for Enhanced Retrieval
The AISTAT Lab system incorporates several advanced techniques to boost its performance and generalization capabilities:
Contrastive Learning: At its foundation, the system uses contrastive learning. This technique trains the model to bring corresponding audio-text pairs closer together in the embedding space while pushing non-corresponding pairs further apart. Imagine teaching the system to recognize that a description of “birds chirping” belongs with an audio clip of birds, but not with an audio clip of a car engine.
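A common way to implement this objective is a symmetric InfoNCE-style loss over the in-batch similarity matrix. The sketch below assumes a square matrix of matched audio-text pairs (as produced by the dual-encoder sketch above); the temperature value is a placeholder, not the team's setting.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(sim, temperature=0.05):
    """Symmetric cross-entropy over an audio-text similarity matrix.
    Matched pairs sit on the diagonal; all other entries in the batch act as
    negatives and are pushed apart."""
    logits = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)  # i-th audio matches i-th caption
    loss_a2t = F.cross_entropy(logits, targets)              # audio-to-text direction
    loss_t2a = F.cross_entropy(logits.T, targets)             # text-to-audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```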
Distillation Loss: Traditional audio retrieval datasets often assume a simple binary match between an audio clip and its caption. However, real-world audio can be complex, with captions potentially describing multiple aspects or overlapping concepts. To address this, the team adopted a distillation loss approach. This method leverages “soft correspondence probabilities” from an ensemble of pre-trained models. Instead of a hard “yes/no” match, it provides a nuanced probability, helping the model learn more subtle audio-text relationships and improve its ability to generalize to new data.
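One standard way to use such soft targets is a KL-divergence loss between the student's matching distribution and the teacher's soft correspondence probabilities. The sketch below assumes the teacher ensemble provides a row-normalized probability distribution over captions for each audio clip; it illustrates the general technique rather than the report's exact formulation.

```python
import torch.nn.functional as F

def distillation_loss(student_sim, teacher_probs, temperature=0.05):
    """KL divergence between the student's audio-to-text matching distribution
    and soft targets from a teacher ensemble. Unlike a hard diagonal target,
    teacher_probs can spread mass over several captions that partially
    describe the same audio clip."""
    log_student = F.log_softmax(student_sim / temperature, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")
```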
Cluster-Based Classification: A novel addition to their system is an auxiliary classification task based on clustering. The researchers clustered all captions in the dataset into semantically similar groups, essentially identifying latent topics or patterns within the text. They then added classification heads to both the audio and text encoders, training them to predict these cluster labels. This encourages the audio encoder to learn representations that are inherently aligned with the semantic categories of the captions, leading to a more fine-grained understanding between audio and text.
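A minimal sketch of how such a cluster-guided auxiliary task can be wired up is shown below. It assumes pre-computed caption embeddings (random stand-ins here) and pooled encoder outputs; the use of k-means and the number of clusters are illustrative choices, not confirmed details of the submission.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

embed_dim, num_clusters = 1024, 64   # placeholder sizes, not the report's settings

# 1) Offline: cluster caption embeddings (random stand-ins here) into pseudo-topics.
caption_embeddings = np.random.randn(5000, embed_dim).astype("float32")
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(caption_embeddings)
cluster_labels = torch.as_tensor(kmeans.labels_, dtype=torch.long)  # one pseudo-label per caption

# 2) Lightweight classification heads on both encoders predict the caption's
#    cluster label as an auxiliary task alongside the retrieval losses.
audio_cls_head = nn.Linear(embed_dim, num_clusters)
text_cls_head = nn.Linear(embed_dim, num_clusters)

def auxiliary_cls_loss(audio_emb, text_emb, labels):
    """Cross-entropy on both modalities against the shared cluster labels."""
    return (F.cross_entropy(audio_cls_head(audio_emb), labels)
            + F.cross_entropy(text_cls_head(text_emb), labels))
```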
Data Augmentation with Large Language Models (LLMs): To overcome data scarcity and enhance caption diversity, the team employed GPT-4o for two augmentation techniques. The first was back-translation, in which original English captions were translated into another language and then back into English, producing new captions that preserve the original meaning but vary the phrasing. The second was LLM Mix, in which two audio signals were combined and GPT-4o merged their corresponding captions, creating entirely new audio-caption pairs. Together, these methods substantially expanded the dataset with diverse examples.
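The snippet below illustrates both augmentations at the caption level; `translate` and `query_llm` are hypothetical wrappers around a GPT-4o (or other LLM) call, and the prompt wording is only an example, not the team's prompt.

```python
def back_translate(caption, translate, pivot_lang="de"):
    """Back-translation: English -> pivot language -> English.
    `translate(text, target_lang)` is a hypothetical translation wrapper."""
    pivot = translate(caption, target_lang=pivot_lang)
    return translate(pivot, target_lang="en")

def llm_mix(caption_a, caption_b, query_llm):
    """LLM Mix: the two source audio clips are mixed in the signal domain, and
    `query_llm` (a hypothetical GPT-4o wrapper) merges their captions into a
    single description of the combined scene."""
    prompt = (
        "Combine the following two audio captions into one fluent caption "
        "describing both sounds occurring together:\n"
        f"1. {caption_a}\n2. {caption_b}"
    )
    return query_llm(prompt)
```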
Experimental Setup and Results
The system was trained on a combination of datasets, including ClothoV2.1, AudioCaps, and WavCaps. Several audio encoders, including PaSST, EAT, and BEATs, were explored, alongside the RoBERTa language model for text embeddings. Training proceeded in three stages: initial pretraining, finetuning with distillation and augmentation, and a final re-finetuning with the cluster-guided classification task.
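The staged schedule can be summarized as a simple configuration outline. The grouping of losses and augmentations per stage below is an interpretation of the description above; weights, epoch counts, and per-stage dataset splits are deliberately omitted.

```python
# Illustrative outline of the three-stage training schedule (interpretation, not the exact recipe).
training_stages = [
    {"name": "pretraining",   "losses": ["contrastive"]},
    {"name": "finetuning",    "losses": ["contrastive", "distillation"],
     "augmentation": ["back-translation", "LLM Mix"]},
    {"name": "re-finetuning", "losses": ["contrastive", "distillation", "cluster_classification"]},
]
```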
The results demonstrated the effectiveness of their combined strategies. Among individual systems, PaSST consistently showed strong performance. However, the most significant gains were achieved through an ensemble of four systems, which reached an impressive mAP@16 of 48.83 on the Clotho development test split. This highlights the power of combining different model configurations and techniques to achieve superior results.
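For illustration, the snippet below shows one common way to ensemble dual-encoder systems (averaging their similarity matrices) together with a straightforward mAP@k computation, under the assumption that each query has a single relevant audio item; the team's exact fusion and evaluation code may differ.

```python
import numpy as np

def ensemble_similarity(sim_matrices):
    """Score-level ensembling: average the audio-text similarity matrices
    produced by the individual systems (one common fusion strategy)."""
    return np.mean(np.stack(sim_matrices), axis=0)

def map_at_k(sim, ground_truth, k=16):
    """mAP@k for text-to-audio retrieval, assuming one relevant audio per query.
    sim: (num_queries, num_audios); ground_truth[i] is the index of the
    correct audio for query i."""
    aps = []
    for i, gt in enumerate(ground_truth):
        top_k = np.argsort(-sim[i])[:k]          # indices of the k highest-scoring audios
        hits = np.where(top_k == gt)[0]
        aps.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(aps))
```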
Conclusion
The AISTAT Lab’s submission to the DCASE 2025 Task 6 challenge showcases a robust and innovative system for language-based audio retrieval. By integrating a dual encoder architecture with contrastive learning, distillation loss, LLM-based data augmentation, and a novel cluster-based auxiliary classification task, the team achieved notable performance improvements. This work contributes significantly to advancing the field of cross-modal understanding between audio and text. For more details, you can refer to the full technical report here.


