TLDR: WEALY is a new, fully reproducible pipeline for audio-based lyrics matching. It uses Whisper decoder embeddings to extract “lyrics-aware” representations directly from raw audio, eliminating the need for text transcriptions. Trained with contrastive learning on musical version identification tasks, WEALY outperforms traditional transcription-based methods and offers a robust, scalable, and multilingual solution for music information retrieval. The research emphasizes the importance of specific loss functions, pooling strategies, and the benefits of Whisper’s multilingual capabilities, also showing improvements when fused with audio content-based models.
Audio-based lyrics matching is an active area of music information retrieval, with applications ranging from copyright protection to music discovery and creative assistance. Imagine finding songs with similar lyrical themes without a written transcription, or flagging potential copyright infringements from the audio alone. However, existing methods are often hard to reproduce and are evaluated against inconsistent baselines, making it difficult for researchers to build on previous work.
A new research paper introduces WEALY (Whisper Embeddings for Audio-based LYrics matching), a fully reproducible pipeline designed to tackle these challenges. WEALY uses the decoder embeddings of the Whisper model to perform lyrics matching directly from audio, removing the dependency on text data or pre-existing transcriptions. This approach establishes robust and transparent baselines for the field, and the authors have made both the code and the model checkpoints publicly available. The full paper is titled "Leveraging Whisper Embeddings for Audio-based Lyrics Matching".
How WEALY Works
WEALY operates through a two-stage pipeline: feature extraction and feature adaptation. The first stage involves processing raw audio to extract what the researchers call “lyrics-aware Whisper latents.” Unlike some prior methods, WEALY works directly on the raw audio mixture, avoiding the need for vocal source separation. The audio is converted to mono, resampled, and then split into 30-second overlapping chunks. For each chunk, Whisper extracts log-mel spectrograms, and its decoder generates hidden representations (latents) that capture the semantic content of the lyrics. These latents are then concatenated to form a comprehensive representation of the song’s lyrical content.
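The preprocessing steps described above (mono conversion and overlapping 30-second chunks) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the 16 kHz sample rate is Whisper's usual input rate, while the 15-second hop and the names `to_mono` and `chunk_bounds` are assumptions made for the example, since the article does not state how much the chunks overlap.

```python
def to_mono(channels):
    """Average per-channel sample lists into a single mono signal."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]


def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0, hop_seconds=15.0):
    """Return (start, end) sample indices of overlapping chunks.

    hop_seconds is a hypothetical value: the overlap is not given in the text.
    """
    chunk = int(chunk_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    if num_samples <= 0:
        return []
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last chunk reaches the end of the track
            break
        start += hop
    return bounds


# A 70-second track at 16 kHz yields four chunks; the last one is shorter.
bounds = chunk_bounds(70 * 16000)
mono = to_mono([[1.0, 0.0], [3.0, 2.0]])
```

In this sketch the final chunk may be shorter than 30 seconds; a real pipeline would typically pad it to the full window before computing the log-mel spectrogram.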
The second stage, feature adaptation, takes these lyrical latents and feeds them into a transformer-based architecture. To keep computation manageable and expose the model to diverse parts of a track, random subsequences of a fixed length (1500 tokens) are sampled from the latents. These subsequences are then processed by a stack of transformer encoder blocks, which learn contextualized representations. Generalized Mean (GeM) pooling condenses these representations into a single vector suitable for similarity computation. Finally, a linear projection maps this vector into a compact semantic embedding space. The model is trained with the NT-Xent contrastive loss, which encourages embeddings of different versions of the same song to cluster together while pushing apart those of different songs.
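To make the pooling and loss concrete, here is a small pure-Python sketch of GeM pooling and an NT-Xent-style loss. It follows the standard textbook definitions rather than the paper's implementation: the exponent `p`, the clamping constant `eps`, the temperature, and the choice to average over every same-song pair in the batch are all assumptions for illustration.

```python
import math


def gem_pool(frames, p=3.0, eps=1e-6):
    """Generalized-mean pool a list of equal-length vectors into one vector.

    p=1 reduces to plain averaging; larger p moves toward max pooling.
    """
    dim = len(frames[0])
    pooled = []
    for d in range(dim):
        # Clamp to eps so fractional powers stay defined (an assumption here).
        mean_pow = sum(max(f[d], eps) ** p for f in frames) / len(frames)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent(embeddings, labels, temperature=0.1):
    """Mean NT-Xent loss over all (anchor, positive) pairs sharing a label."""
    n = len(embeddings)
    losses = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[k]) / temperature for k in range(n)]
        # The denominator covers every other item in the batch except the anchor.
        denom = sum(math.exp(sims[k]) for k in range(n) if k != i)
        for j in range(n):
            if j != i and labels[j] == labels[i]:
                losses.append(-math.log(math.exp(sims[j]) / denom))
    return sum(losses) / len(losses) if losses else 0.0


emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = nt_xent(emb, [0, 0, 1, 1])  # versions of a song sit close together
bad = nt_xent(emb, [0, 1, 0, 1])   # versions of a song sit far apart
```

The loss is lower when items that share a label already lie close together, which is the behavior the training objective rewards.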
Key Findings and Advantages
Extensive experiments on standard datasets like DiscogsVI-YT, SHS100k-v2, and LyricCovers2.0 demonstrate that WEALY consistently outperforms traditional transcription-based methods. For instance, methods relying on TF-IDF or Sentence-BERT with ASR transcriptions showed limited retrieval quality compared to WEALY. The research also highlighted several critical design choices through ablation studies:
- The NT-Xent loss function proved significantly more effective than other loss functions like triplet loss or CLEWS loss.
- GeM pooling consistently outperformed simpler strategies such as plain averaging or a CLS token, emphasizing its role in capturing informative temporal regions.
- The multilingual capabilities of Whisper are a significant strength. Restricting the decoding to English only led to a noticeable drop in performance, indicating that multilingual cues within the latents are valuable for retrieval.
Furthermore, the study explored multimodal fusion, combining WEALY (lyrics-aware) with CLEWS (audio-content) models. A simple late-fusion approach, combining distances from both models, led to clear improvements in Musical Version Identification (MVI) tasks. This suggests that lyrical and audio cues provide complementary information, and their combination can lead to more robust music information retrieval systems.
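One possible way to combine distances from two models is sketched below. The article only says that distances are combined, so the specifics here are assumptions: the per-query z-score normalization, the `weight` parameter, and the function names are hypothetical choices for illustration, not the fusion rule reported in the paper.

```python
import statistics


def zscore(values):
    """Standardize one query's candidate distances to zero mean, unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def fuse_distances(lyrics_dist, audio_dist, weight=0.5):
    """Weighted sum of normalized distances (weight applies to the lyrics model)."""
    a = zscore(lyrics_dist)
    b = zscore(audio_dist)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def rank(distances):
    """Candidate indices sorted from closest to farthest."""
    return sorted(range(len(distances)), key=distances.__getitem__)


# Distances from one query to three candidates under each model (made-up values).
lyrics = [0.2, 0.9, 0.5]
audio = [0.4, 0.3, 0.8]
fused_rank = rank(fuse_distances(lyrics, audio))
lyrics_only_rank = rank(fuse_distances(lyrics, audio, weight=1.0))
```

Normalizing before summing keeps one model's distance scale from dominating the other; with `weight=1.0` the ranking falls back to the lyrics model alone.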
Conclusion
WEALY represents a significant step forward in audio-based lyrics matching. By providing a fully reproducible, end-to-end pipeline that leverages Whisper decoder embeddings, it offers a reliable benchmark for future research. Its ability to extract lyrics-aware representations directly from raw audio, without relying on intermediate transcriptions, makes it robust, scalable, and effective across multiple languages. This work underscores the immense potential of speech technologies for various music information retrieval tasks, including version identification, copyright detection, and music discovery.


