Unlocking Vietnamese Music: New Dataset and Models Boost Lyrics Transcription Accuracy

TLDR: Researchers have created VietLyrics, the first large-scale dataset for Vietnamese Automatic Lyrics Transcription (ALT), comprising 647 hours of Vietnamese songs with aligned lyrics. Whisper models fine-tuned on this dataset reach a 20.52% WER on lowercased lyrics, well ahead of existing multilingual ALT systems such as LyricWhiz, which struggle with Vietnamese’s tonal complexity and dialects. The VietLyrics dataset and the fine-tuned models are publicly released to foster further research in Vietnamese music computing.

Automatic Lyrics Transcription (ALT), which uses deep learning models to convert sung vocals into written text, is becoming increasingly important in the modern music industry. The technology improves the user experience on streaming platforms, assists in music analysis, and supports subtitle creation, making music more accessible to hearing-impaired listeners and non-native speakers.

Despite significant advances in Western music computing, Vietnamese music research, particularly in lyrics transcription, has lagged behind. This is primarily due to the unique challenges posed by Vietnamese, such as its tonal complexity and diverse regional dialects, coupled with a critical lack of large-scale, high-quality datasets. Existing solutions often suffer from frequent transcription errors, hallucinations in non-vocal segments, and misidentification of the language.

Introducing VietLyrics: A New Foundation for Vietnamese ALT

To address these challenges, a team of researchers from the National University of Singapore has curated and released VietLyrics, the first large-scale Vietnamese ALT dataset. This comprehensive dataset comprises 647 hours of Vietnamese songs, complete with line-level aligned lyrics and rich metadata, including AI-predicted gender and genre information.

To build VietLyrics, the researchers scraped the audio and accompanying lyrics (approximately 647.1 hours in total) from zingmp3.vn, a popular Vietnamese music streaming site. The collected data was then filtered to retain predominantly Vietnamese songs, deduplicated, and resampled to a standard audio sample rate. The dataset and its scraping code are publicly released with author attributions, in keeping with Vietnamese Intellectual Property Law, so that future researchers can recreate the dataset and benchmark against it.
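The filtering and deduplication steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' released pipeline: the `lyrics` field name, the diacritic-based language heuristic, the 0.5 threshold, and deduplication by a hash of normalized lyrics are all hypothetical choices. Sample-rate standardization is left out because it requires an audio library.

```python
import hashlib
import unicodedata

# Letters that are characteristic of Vietnamese orthography (hypothetical heuristic).
VIETNAMESE_LETTERS = set("ăâđêôơư")
# Combining tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def looks_vietnamese(word: str) -> bool:
    """True if the word contains a Vietnamese-specific letter or a tone mark."""
    if any(ch in VIETNAMESE_LETTERS for ch in word):
        return True
    decomposed = unicodedata.normalize("NFD", word)
    return any(ch in TONE_MARKS for ch in decomposed)


def vietnamese_ratio(text: str) -> float:
    """Fraction of whitespace-separated words that look Vietnamese."""
    words = unicodedata.normalize("NFC", text).lower().split()
    if not words:
        return 0.0
    return sum(looks_vietnamese(w) for w in words) / len(words)


def lyrics_fingerprint(lyrics: str) -> str:
    """Hash of the lyrics after Unicode, case, and whitespace normalization."""
    normalized = " ".join(unicodedata.normalize("NFC", lyrics).lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_songs(songs, min_ratio=0.5):
    """Keep songs that look predominantly Vietnamese and drop repeated lyrics."""
    seen = set()
    kept = []
    for song in songs:
        if vietnamese_ratio(song["lyrics"]) < min_ratio:
            continue  # assumed threshold for "predominantly Vietnamese"
        fingerprint = lyrics_fingerprint(song["lyrics"])
        if fingerprint in seen:
            continue  # same lyrics already kept under another entry
        seen.add(fingerprint)
        kept.append(song)
    return kept
```

A heuristic like this misclassifies some text (French words with accents, for example, also carry acute and grave marks), so a real pipeline would more likely rely on a dedicated language-identification model.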

Analysis of VietLyrics reveals its diversity across genres, instruments, and regional dialects, with songs from over 4,000 unique artists. It also highlights open challenges, such as the lack of standardized genre classification tools for Vietnamese music and a distribution skewed towards Northern dialects and male-sung songs, both of which point to areas for future research.

Advancing Transcription with Fine-Tuned Whisper Models

Recognizing the limitations of current state-of-the-art multilingual ALT systems like LyricWhiz, which performed poorly on Vietnamese audio, the researchers focused on fine-tuning the Whisper model architecture. Whisper, known for its strong multilingual performance, was chosen as the base for developing a dedicated ALT system for Vietnamese.

Three variants of the Whisper model (small, medium, and large-v2) were fine-tuned using the VietLyrics dataset. The training involved segmenting audio and lyrics into 30-second chunks and carefully pre-processing the data to enhance training stability. The fine-tuned models were evaluated against baselines like LyricWhiz and PhoWhisper-large, an ASR model previously fine-tuned on Vietnamese audio.
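The 30-second segmentation can be sketched as a greedy grouping of line-level aligned lyrics. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the `LyricLine` and `Chunk` types, the timestamps in seconds, and the greedy packing rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LyricLine:
    start: float  # line start time in seconds
    end: float    # line end time in seconds
    text: str


@dataclass
class Chunk:
    start: float
    end: float
    text: str


def _merge(lines: List[LyricLine]) -> Chunk:
    """Combine consecutive lyric lines into one training chunk."""
    return Chunk(lines[0].start, lines[-1].end, " ".join(l.text for l in lines))


def chunk_lines(lines: List[LyricLine], max_seconds: float = 30.0) -> List[Chunk]:
    """Greedily pack consecutive lines so each chunk spans at most max_seconds.

    A single line longer than max_seconds still becomes its own chunk;
    handling that case (for example by dropping it) is left to the caller.
    """
    chunks: List[Chunk] = []
    current: List[LyricLine] = []
    for line in lines:
        # Close the current chunk if adding this line would exceed the limit.
        if current and line.end - current[0].start > max_seconds:
            chunks.append(_merge(current))
            current = []
        current.append(line)
    if current:
        chunks.append(_merge(current))
    return chunks
```

Each chunk's `start` and `end` can then be used to cut the matching audio span, so that every training example fits the 30-second input window that Whisper operates on.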

The results demonstrated a significant improvement in performance. While LyricWhiz achieved a Word Error Rate (WER) of 49.22% and PhoWhisper-large reached 38.3%, the fine-tuned Whisper-large-v2 model achieved a substantially lower WER of 24.61% (case-sensitive) and an even better 20.52% for lowercase lyrics. The Character Error Rate (CER) also indicated high accuracy, suggesting the model can predict partial words effectively, with errors mainly stemming from minor character or diacritic issues.
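To make the metrics concrete, here is a minimal sketch of WER and CER computed with a standard Levenshtein edit distance. The exact text normalization used in the paper is not described here, so the `lowercase` flag and the NFC normalization are assumptions, and the example strings are invented for illustration.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text: str, lowercase: bool) -> str:
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def wer(reference: str, hypothesis: str, lowercase: bool = False) -> float:
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = _normalize(reference, lowercase).split()
    hyp_words = _normalize(hypothesis, lowercase).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference: str, hypothesis: str, lowercase: bool = False) -> float:
    """Character Error Rate: character-level edit distance divided by reference length."""
    ref = _normalize(reference, lowercase)
    hyp = _normalize(hypothesis, lowercase)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "Em ơi Hà Nội phố"
hypothesis = "em ơi hà nội phô"  # wrong casing on three words, one wrong tone mark
print(wer(reference, hypothesis))                  # 0.8
print(wer(reference, hypothesis, lowercase=True))  # 0.2
print(cer(reference, hypothesis, lowercase=True))  # 0.0625
```

In this toy example, lowercasing removes the three casing mismatches, which is why a lowercase WER comes out below the case-sensitive one. The remaining error is a single diacritic, which costs a whole word in WER but only one character in CER.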

Qualitative analysis confirmed the Whisper-large-v2 model’s robust performance across various instruments, music genres, and styles. While it performed well across most gender and dialect categories, a slight underperformance was noted for female voices with Southern dialects, pointing to potential areas for further refinement.

Interestingly, experiments with source separation (isolating vocals) showed minimal improvement, leading to the decision to exclude these complex pre-processing steps from the final approach. Furthermore, the fine-tuned Whisper models effectively suppressed hallucinations during non-vocal segments, a common issue in ASR models.

Future Implications

The public release of the VietLyrics dataset and the fine-tuned Whisper models marks a significant step forward for Vietnamese music computing research. This work not only provides a crucial resource for the community but also demonstrates a highly effective approach for Automatic Lyrics Transcription in low-resource languages and music. It is hoped that these contributions will inspire further innovation and improvements in the field.

For more details, you can refer to the full research paper: VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription.
