Unlocking Vietnamese Music: New Dataset and Models Boost Lyrics Transcription Accuracy

TLDR: Researchers have created VietLyrics, the first large-scale dataset of 647 hours of Vietnamese songs with aligned lyrics, to address the challenges of Automatic Lyrics Transcription (ALT) for the Vietnamese language. By fine-tuning Whisper models on this new dataset, they achieved significantly better performance (20.52% WER for lowercase lyrics) compared to existing multilingual ALT systems like LyricWhiz, which struggled with Vietnamese’s tonal complexity and dialects. The VietLyrics dataset and the fine-tuned models are publicly released to foster further research in Vietnamese music computing.

Automatic Lyrics Transcription (ALT), in which deep learning models convert vocal recordings into written text, is becoming increasingly important in the modern music industry. The technology improves the user experience on streaming platforms, assists in music analysis, and supports subtitle creation, which makes music more accessible to hearing-impaired listeners and non-native speakers.

Despite significant advances in Western music computing, Vietnamese music research, particularly in lyrics transcription, has lagged behind. This is primarily due to the challenges specific to Vietnamese, such as its tonal complexity and diverse regional dialects, coupled with a critical lack of large-scale, high-quality datasets. Existing solutions often suffer from frequent transcription errors, hallucinated text in non-vocal segments, and misidentification of the language.

Introducing VietLyrics: A New Foundation for Vietnamese ALT

To address these challenges, a team of researchers from the National University of Singapore has curated and released VietLyrics, the first large-scale Vietnamese ALT dataset. This comprehensive dataset comprises 647 hours of Vietnamese songs, complete with line-level aligned lyrics and rich metadata, including AI-predicted gender and genre information.

The dataset was built by scraping approximately 647.1 hours of audio and accompanying lyrics from zingmp3.vn, a popular Vietnamese music streaming site. Data collection included rigorous filtering to retain predominantly Vietnamese songs, removal of duplicates, and standardization of audio sample rates. The dataset and its scraping code are publicly released with author attributions, in keeping with Vietnamese Intellectual Property Law, so that future researchers can recreate the dataset and benchmark against it.
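The article does not describe how the filtering and de-duplication were implemented, so the following is only a minimal sketch of one plausible approach. The record fields (`title`, `artist`, `lyrics`), the diacritic-based language heuristic, and the `min_ratio` threshold are all assumptions made for illustration, not the researchers' actual pipeline. Sample-rate standardization is not covered here.

```python
import unicodedata

# Combining marks that are fairly distinctive of Vietnamese spelling
# (hook above, dot below, horn, breve). This choice is an assumption.
DISTINCTIVE_MARKS = {"\u0309", "\u0323", "\u031b", "\u0306"}


def looks_vietnamese(word):
    """Return True if the word contains a Vietnamese-specific letter or mark."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "đ" in decomposed or any(ch in DISTINCTIVE_MARKS for ch in decomposed)


def vietnamese_ratio(lyrics):
    """Fraction of whitespace-separated words that look Vietnamese."""
    words = lyrics.split()
    if not words:
        return 0.0
    return sum(looks_vietnamese(w) for w in words) / len(words)


def filter_and_dedupe(songs, min_ratio=0.3):
    """Keep the first copy of each (title, artist) pair whose lyrics look Vietnamese."""
    seen = set()
    kept = []
    for song in songs:
        key = (song["title"].strip().lower(), song["artist"].strip().lower())
        if key in seen:
            continue  # duplicate of a song that was already kept
        if vietnamese_ratio(song["lyrics"]) < min_ratio:
            continue  # lyrics are probably not predominantly Vietnamese
        seen.add(key)
        kept.append(song)
    return kept


# Hypothetical records, for illustration only.
songs = [
    {"title": "Bài hát", "artist": "A", "lyrics": "Em ơi Hà Nội phố ta còn lại những đêm dài"},
    {"title": "bài hát ", "artist": "a", "lyrics": "Em ơi Hà Nội phố ta còn lại những đêm dài"},
    {"title": "Hello", "artist": "B", "lyrics": "hello from the other side"},
]
kept = filter_and_dedupe(songs)
```

A character-level heuristic like this is cheap but crude; a real pipeline might instead rely on a language-identification model or on metadata from the streaming site.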

Analysis of VietLyrics reveals its diversity across genres, instruments, and regional dialects, with songs from over 4,000 unique artists. It also highlights challenges, such as the lack of standardized genre classification tools for Vietnamese music and a distribution skewed towards Northern dialects and male-sung songs, which indicate areas for future research.

Advancing Transcription with Fine-Tuned Whisper Models

Recognizing the limitations of current state-of-the-art multilingual ALT systems like LyricWhiz, which performed poorly on Vietnamese audio, the researchers focused on fine-tuning the Whisper model architecture. Whisper, known for its strong multilingual performance, was chosen as the base for developing a dedicated ALT system for Vietnamese.

Three variants of the Whisper model (small, medium, and large-v2) were fine-tuned using the VietLyrics dataset. The training involved segmenting audio and lyrics into 30-second chunks and carefully pre-processing the data to enhance training stability. The fine-tuned models were evaluated against baselines like LyricWhiz and PhoWhisper-large, an ASR model previously fine-tuned on Vietnamese audio.
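As a rough illustration of the segmentation step, the sketch below groups line-level aligned lyrics into chunks that span at most 30 seconds. The `LyricLine` structure, the greedy grouping rule, and the example timestamps are assumptions; the article does not specify how the researchers cut their chunks.

```python
from dataclasses import dataclass


@dataclass
class LyricLine:
    start: float  # line start time in seconds
    end: float    # line end time in seconds
    text: str


def chunk_lines(lines, max_seconds=30.0):
    """Greedily group consecutive lyric lines into chunks of at most max_seconds.

    Returns a list of (chunk_start, chunk_end, chunk_text) tuples.
    A single line longer than max_seconds still becomes its own chunk.
    """
    chunks = []
    current = []
    for line in lines:
        # Close the current chunk if adding this line would exceed the limit.
        if current and line.end - current[0].start > max_seconds:
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    return [(c[0].start, c[-1].end, " ".join(l.text for l in c)) for c in chunks]


# Hypothetical aligned lines, for illustration only.
lines = [
    LyricLine(0.0, 10.0, "a"),
    LyricLine(10.0, 20.0, "b"),
    LyricLine(20.0, 28.0, "c"),
    LyricLine(28.0, 35.0, "d"),
    LyricLine(35.0, 50.0, "e"),
]
chunks = chunk_lines(lines)
# chunks == [(0.0, 28.0, "a b c"), (28.0, 50.0, "d e")]
```

The 30-second limit matches the fixed input window that Whisper models operate on, which is presumably why the training data was cut this way.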

The results demonstrated a significant improvement in performance. While LyricWhiz achieved a Word Error Rate (WER) of 49.22% and PhoWhisper-large reached 38.3%, the fine-tuned Whisper-large-v2 model achieved a substantially lower WER of 24.61% (case-sensitive) and an even better 20.52% for lowercase lyrics. The Character Error Rate (CER) also indicated high accuracy, suggesting the model can predict partial words effectively, with errors mainly stemming from minor character or diacritic issues.
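To make the metrics concrete, here is a minimal sketch of how WER and CER are commonly computed from edit distance, with an optional lowercase mode that mirrors the case-sensitive versus lowercase distinction above. The function names and example strings are illustrative assumptions, and the researchers' actual evaluation script may normalize text differently.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def _prepare(text, lowercase):
    # NFC keeps each accented Vietnamese letter as a single code point.
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def wer(reference, hypothesis, lowercase=False):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref_words = _prepare(reference, lowercase).split()
    hyp_words = _prepare(hypothesis, lowercase).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis, lowercase=False):
    """Character Error Rate: character-level edit distance over reference length."""
    ref_chars = list(_prepare(reference, lowercase))
    hyp_chars = list(_prepare(hypothesis, lowercase))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Hypothetical example: one casing error and one missing diacritic.
reference = "Em ơi Hà Nội phố"
hypothesis = "em ơi Hà Nội pho"
wer_cased = wer(reference, hypothesis)                  # 2 of 5 words wrong
wer_lower = wer(reference, hypothesis, lowercase=True)  # 1 of 5 words wrong
cer_lower = cer(reference, hypothesis, lowercase=True)  # 1 of 16 characters wrong
```

In this example a single missing diacritic counts as a whole wrong word for WER but only one wrong character for CER, which is the kind of gap between the two metrics that the article attributes to minor character or diacritic errors.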

Qualitative analysis confirmed the Whisper-large-v2 model’s robust performance across various instruments, music genres, and styles. While it performed well across most gender and dialect categories, a slight underperformance was noted for female voices with Southern dialects, pointing to potential areas for further refinement.

Interestingly, experiments with source separation (isolating vocals) showed minimal improvement, leading to the decision to exclude these complex pre-processing steps from the final approach. Furthermore, the fine-tuned Whisper models effectively suppressed hallucinations during non-vocal segments, a common issue in ASR models.

Future Implications

The public release of the VietLyrics dataset and the fine-tuned Whisper models marks a significant step forward for Vietnamese music computing research. This work not only provides a crucial resource for the community but also demonstrates an effective approach to Automatic Lyrics Transcription for low-resource languages and their music. It is hoped that these contributions will inspire further innovation and improvements in the field.

For more details, you can refer to the full research paper: VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
