WEALY: A Reproducible Pipeline for Lyrics Matching from Audio

TLDR: WEALY is a new, fully reproducible pipeline for audio-based lyrics matching. It uses Whisper decoder embeddings to extract “lyrics-aware” representations directly from raw audio, eliminating the need for text transcriptions. Trained with contrastive learning on musical version identification tasks, WEALY outperforms traditional transcription-based methods and offers a robust, scalable, and multilingual solution for music information retrieval. The research emphasizes the importance of specific loss functions, pooling strategies, and the benefits of Whisper’s multilingual capabilities, also showing improvements when fused with audio content-based models.

Audio-based lyrics matching is an active area of music information retrieval, with applications ranging from copyright protection to music discovery and creative assistance. Imagine finding songs with similar lyrical themes without a written transcription, or flagging potential copyright infringements from the audio alone. However, existing methods are often hard to reproduce and lack consistent benchmarks, which makes it difficult for researchers to build on previous work.

A new research paper, “Leveraging Whisper Embeddings for Audio-Based Lyrics Matching”, introduces WEALY (Whisper Embeddings for Audio-based LYrics matching), a fully reproducible pipeline designed to tackle these challenges. WEALY uses the decoder embeddings of the Whisper model to perform lyrics matching directly from audio, removing the dependency on text data or pre-existing transcriptions. The approach establishes robust and transparent baselines for the field, and the authors have made both the code and the model checkpoints publicly available.

How WEALY Works

WEALY operates through a two-stage pipeline: feature extraction and feature adaptation. The first stage involves processing raw audio to extract what the researchers call “lyrics-aware Whisper latents.” Unlike some prior methods, WEALY works directly on the raw audio mixture, avoiding the need for vocal source separation. The audio is converted to mono, resampled, and then split into 30-second overlapping chunks. For each chunk, Whisper extracts log-mel spectrograms, and its decoder generates hidden representations (latents) that capture the semantic content of the lyrics. These latents are then concatenated to form a comprehensive representation of the song’s lyrical content.
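The chunking step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the 15-second hop is a guess (the article says only that the chunks overlap), and resampling is left out. The 16 kHz rate is the sample rate Whisper operates on.

```python
def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def chunk_bounds(num_samples, sample_rate=16_000, chunk_seconds=30.0, hop_seconds=15.0):
    """Return (start, end) sample indices of fixed-length overlapping chunks.

    hop_seconds is an assumed value. The final chunk may be shorter than
    chunk_seconds and would need padding before feature extraction.
    """
    chunk = int(chunk_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    if chunk <= 0 or hop <= 0:
        raise ValueError("chunk and hop lengths must be positive")
    if num_samples <= 0:
        return []
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:  # reached the end of the track
            break
        start += hop
    return bounds


# A 70-second track at 16 kHz yields four chunks, the last one shorter.
for start, end in chunk_bounds(70 * 16_000):
    print(start / 16_000, end / 16_000)
```

Each chunk would then be passed through Whisper, and the per-chunk decoder latents concatenated along the time axis as described above.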

The second stage, feature adaptation, feeds these lyrical latents into a transformer-based architecture. To keep computation manageable and to expose the model to different parts of a track, random subsequences of a fixed length (1500 tokens) are sampled from the latents. These subsequences are processed by a stack of transformer encoder blocks, which learn contextualized representations. Generalized Mean (GeM) pooling then condenses these representations into a single vector suitable for similarity computation, and a final linear projection maps this vector into a compact semantic embedding space. The model is trained with the NT-Xent contrastive loss, which encourages embeddings of different versions of the same song to cluster together while pushing apart those of different songs.
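To make the training objective concrete, here is a minimal sketch of an NT-Xent-style loss in plain Python. It assumes that every pair of items sharing a label (different versions of the same song) counts as a positive pair and that similarity is cosine similarity. The temperature value and the function names are illustrative assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, labels, temperature=0.1):
    """Average NT-Xent loss over all (anchor, positive) pairs in a batch.

    For anchor i and positive j the loss is
        -log( exp(s_ij / t) / sum_{k != i} exp(s_ik / t) ),
    where s is cosine similarity and t is the (assumed) temperature.
    """
    n = len(embeddings)
    sims = [[cosine(embeddings[i], embeddings[j]) / temperature for j in range(n)]
            for i in range(n)]
    total, pairs = 0.0, 0
    for i in range(n):
        others = [sims[i][k] for k in range(n) if k != i]
        top = max(others)  # subtract the max for a numerically stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(s - top) for s in others))
        for j in range(n):
            if j != i and labels[j] == labels[i]:
                total += log_denom - sims[i][j]
                pairs += 1
    if pairs == 0:
        raise ValueError("batch contains no positive pairs")
    return total / pairs


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(nt_xent(embeddings, [0, 0, 1, 1]))  # positives are close: lower loss
print(nt_xent(embeddings, [0, 1, 0, 1]))  # positives are far apart: higher loss
```

The loss is low when versions of the same song sit close together relative to everything else in the batch, which is the clustering behavior described above.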

Key Findings and Advantages

Extensive experiments on standard datasets like DiscogsVI-YT, SHS100k-v2, and LyricCovers2.0 demonstrate that WEALY consistently outperforms traditional transcription-based methods. For instance, methods relying on TF-IDF or Sentence-BERT with ASR transcriptions showed limited retrieval quality compared to WEALY. The research also highlighted several critical design choices through ablation studies:

The NT-Xent loss function proved significantly more effective than other loss functions like triplet loss or CLEWS loss.
GeM pooling consistently outperformed simpler strategies such as mean pooling or a CLS token, underlining its role in capturing informative temporal regions.
The multilingual capabilities of Whisper are a significant strength. Restricting the decoding to English only led to a noticeable drop in performance, indicating that multilingual cues within the latents are valuable for retrieval.
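The difference between GeM pooling and plain averaging is easy to see in a small example. The sketch below is a plain-Python illustration, not the authors' implementation: clamping inputs to a small positive `eps` is a common convention assumed here, and the exponent `p` (often a learned parameter) is fixed by hand.

```python
def gem_pool(frames, p=3.0, eps=1e-6):
    """Generalized Mean pooling over time.

    frames is a list of equal-length feature vectors, one per time step.
    Each output dimension is (mean_t(x_t ** p)) ** (1 / p). Values are
    clamped to eps first (an assumed convention) so the power is defined.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    pooled = []
    for d in range(dim):
        mean_pow = sum(max(frame[d], eps) ** p for frame in frames) / len(frames)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


frames = [[1.0, 2.0], [3.0, 4.0]]
print(gem_pool(frames, p=1.0))   # identical to mean pooling
print(gem_pool(frames, p=3.0))   # weighted toward the larger activations
print(gem_pool(frames, p=50.0))  # close to max pooling
```

With `p = 1` the result is the ordinary mean, and as `p` grows the output moves toward the per-dimension maximum, so an intermediate `p` lets strongly activated time steps dominate without discarding the rest.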

Furthermore, the study explored multimodal fusion, combining WEALY (lyrics-aware) with CLEWS (audio-content) models. A simple late-fusion approach, combining distances from both models, led to clear improvements in Musical Version Identification (MVI) tasks. This suggests that lyrical and audio cues provide complementary information, and their combination can lead to more robust music information retrieval systems.
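A late-fusion step of this kind might look like the following sketch. The article says only that distances from both models are combined, so the z-score normalization, the equal weighting, and all function names here are assumptions made for illustration.

```python
import math


def zscore(values):
    """Standardize a list of distances so both models are on a comparable scale."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def fuse_distances(lyrics_dist, audio_dist, weight=0.5):
    """Weighted sum of normalized query-to-candidate distances.

    weight is the share given to the lyrics-aware model (assumed value).
    """
    if len(lyrics_dist) != len(audio_dist):
        raise ValueError("both models must score the same candidates")
    lyr, aud = zscore(lyrics_dist), zscore(audio_dist)
    return [weight * a + (1.0 - weight) * b for a, b in zip(lyr, aud)]


def rank(distances):
    """Candidate indices ordered from closest to farthest."""
    return sorted(range(len(distances)), key=distances.__getitem__)


lyrics_dist = [0.2, 0.1, 0.9]  # hypothetical distances from a lyrics-aware model
audio_dist = [0.1, 0.5, 0.8]   # hypothetical distances from an audio-content model
print(rank(lyrics_dist))
print(rank(fuse_distances(lyrics_dist, audio_dist)))
```

In this toy example the audio model's strong vote for candidate 0 overturns the lyrics-only ranking, which illustrates how the two cues can complement each other.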

Conclusion

WEALY represents a significant step forward in audio-based lyrics matching. By providing a fully reproducible, end-to-end pipeline that leverages Whisper decoder embeddings, it offers a reliable benchmark for future research. Its ability to extract lyrics-aware representations directly from raw audio, without relying on intermediate transcriptions, makes it robust, scalable, and effective across multiple languages. This work underscores the immense potential of speech technologies for various music information retrieval tasks, including version identification, copyright detection, and music discovery.
