Automating Song Data Preparation for AI Generation

TLDR: SongPrep is a new framework that automates the preprocessing of song data for AI song generation models. It handles source separation, structure analysis, and lyric recognition, converting raw songs into structured, training-ready datasets. The paper also introduces SongPrepE2E, an end-to-end model that improves upon SongPrep’s multi-stage pipeline by leveraging large language models for more accurate and efficient structured lyric recognition, leading to higher quality AI-generated songs.

The field of Artificial Intelligence Generated Content (AIGC) is rapidly advancing, with song generation emerging as a particularly exciting area. However, a significant hurdle in developing high-quality AI song models has been the laborious and costly process of preparing vast amounts of song data. Traditionally, this requires extensive manual labeling of lyrics and structural information, which is both time-consuming and expensive.

To tackle this challenge, researchers have introduced SongPrep, an innovative automated preprocessing framework designed to streamline the preparation of song data. This framework automates crucial steps such as separating different audio sources (like vocals and instruments), analyzing the song’s structure, and recognizing lyrics. The output is structured data that can be directly used to train AI models for generating songs.

Beyond the preprocessing framework, the team also developed SongPrepE2E, an end-to-end model for structured lyrics recognition. Unlike multi-stage pipelines, SongPrepE2E analyzes an entire song’s structure and lyrics directly and produces precise timestamps without a separate source separation step. It does this by drawing on the full song’s context and the pretrained semantic knowledge of large language models, which results in lower error rates for both diarization (here, labeling which structural section occurs when) and word recognition.

How SongPrep Works

The SongPrep framework processes raw song data through a sequence of modules. First, it uses a model like Demucs for source separation, breaking down a song into its core components: vocals, drums, bass, and other instruments. This separation is crucial because different analysis modules require specific tracks. For instance, lyric recognition primarily focuses on the vocal track.

Next, the structure analysis module employs an enhanced All-In-One model. This model is responsible for identifying musical segments such as intros, verses, choruses, bridges, and outros. The researchers improved this module by retraining it on a bilingual dataset, refining the label set to seven clear categories, and integrating a Dual-Path RNN (DPRNN) block to better capture global song structure. These modifications significantly reduced the Diarization Error Rate (DER), which measures the accuracy of structural labels and their timings.
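
To make the DER idea concrete, here is a minimal sketch of a simplified, frame-based version of the metric. It assumes that reference and predicted structures are lists of (start, end, label) tuples and counts only the share of reference time that carries the wrong label; the function names, the 0.1 second step, and the example labels are illustrative assumptions, not the paper's evaluation code. A full DER also accounts for false alarms outside the reference regions.

```python
from typing import List, Tuple

# (start_sec, end_sec, label); this layout is an assumption for illustration.
Segment = Tuple[float, float, str]


def label_at(segments: List[Segment], t: float) -> str:
    """Return the label of the segment covering time t, or 'none'."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return "none"


def simplified_der(reference: List[Segment],
                   hypothesis: List[Segment],
                   step: float = 0.1) -> float:
    """Fraction of reference time whose predicted label is wrong.

    Time is sampled at the midpoint of fixed-size frames. This is a
    simplification of DER, not the exact metric used in the paper.
    """
    total = 0
    wrong = 0
    for start, end, ref_label in reference:
        n_frames = round((end - start) / step)
        for i in range(n_frames):
            t = start + (i + 0.5) * step  # frame midpoint
            total += 1
            if label_at(hypothesis, t) != ref_label:
                wrong += 1
    return wrong / total if total else 0.0


reference = [(0.0, 10.0, "intro"), (10.0, 40.0, "verse"), (40.0, 60.0, "chorus")]
# The prediction places the intro/verse boundary two seconds too late.
hypothesis = [(0.0, 12.0, "intro"), (12.0, 40.0, "verse"), (40.0, 60.0, "chorus")]

der = simplified_der(reference, hypothesis)
print(f"simplified DER: {der:.4f}")
```

In this example a two-second boundary error in a 60-second song gives a value of about 0.033, which shows how the metric penalizes timing mistakes as well as wrong labels.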

Following structure analysis, the lyric recognition module transcribes lyrics from the vocal track, specifically focusing on vocal segments within verses, choruses, and bridges. This involves using an ASR (Automatic Speech Recognition) system, like a fine-tuned Zipformer-based model, and an improved WER-FIX algorithm to ensure high-quality lyric texts. A word alignment module then calibrates the results, preventing instrumental sections from being mislabeled as vocal parts.
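
The interaction between structure labels and word alignment can be sketched as follows. This is a hedged illustration rather than the authors' algorithm: it assumes word-level timestamps are already available from an aligner, attaches each word to the verse, chorus, or bridge that contains its midpoint, and relabels a vocal section as "instrumental" when no words fall inside it. The class names, the `attach_lyrics` helper, and the "instrumental" label are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Section types that the article says are passed to lyric recognition.
VOCAL_LABELS = {"verse", "chorus", "bridge"}


@dataclass
class Word:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class StructuredSegment:
    start: float
    end: float
    label: str
    words: List[str] = field(default_factory=list)


def attach_lyrics(structure: List[Tuple[float, float, str]],
                  words: List[Word]) -> List[StructuredSegment]:
    """Attach aligned words to vocal sections (illustrative rule only)."""
    result = []
    for start, end, label in structure:
        seg = StructuredSegment(start, end, label)
        if label in VOCAL_LABELS:
            # A word belongs to the section that contains its midpoint.
            seg.words = [w.text for w in words
                         if start <= (w.start + w.end) / 2 < end]
            if not seg.words:
                # Assumed calibration rule: no aligned words means the
                # section is treated as non-vocal.
                seg.label = "instrumental"
        result.append(seg)
    return result


structure = [(0.0, 10.0, "intro"), (10.0, 30.0, "verse"),
             (30.0, 40.0, "bridge"), (40.0, 60.0, "chorus")]
words = [Word(11.0, 11.5, "hello"), Word(12.0, 12.6, "world"),
         Word(41.0, 41.4, "sing")]

segments = attach_lyrics(structure, words)
for seg in segments:
    print(f"[{seg.label}] {seg.start:.1f}-{seg.end:.1f}: {' '.join(seg.words)}")
```

Here the "bridge" contains no aligned words, so the sketch relabels it instead of emitting an empty lyric segment.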

Introducing SongPrepE2E: The End-to-End Solution

While SongPrep offers significant improvements, its multi-stage nature can lead to lower inference efficiency and a loss of contextual information when audio is split into short chunks. To overcome these limitations, SongPrepE2E was developed. This end-to-end system integrates MuCodec, which discretizes audio into tokens, with a large language model (LLM) such as Qwen2-7B. Trained on data curated by SongPrep, SongPrepE2E can directly extract structured lyric transcriptions from full-length songs, offering better accuracy and deployment efficiency.

Performance and Impact

The effectiveness of SongPrep and SongPrepE2E was evaluated using a new dataset called SSLD-200, comprising 200 manually annotated Chinese and English songs. Experiments showed that the improved structure analysis module reduced DER, and the enhanced lyric recognition module achieved a lower Word Error Rate (WER). SongPrepE2E consistently outperformed the multi-stage SongPrep pipeline in terms of both text recognition accuracy and Real Time Factor (RTF), indicating faster processing.
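
For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally computed. WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, and RTF is processing time divided by audio duration, so lower is better for both. The function names and sample values are illustrative, and the whitespace tokenization is an assumption that would not suit Chinese lyrics without a segmentation step.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


# One substitution ("rock" -> "walk") and one deletion ("tonight").
wer = word_error_rate("we will rock you tonight", "we will walk you")
# Illustrative numbers: 12 seconds of compute for a 240-second song.
rtf = real_time_factor(12.0, 240.0)
print(f"WER = {wer:.2f}, RTF = {rtf:.2f}")
```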

Crucially, the impact of SongPrepE2E was also assessed in downstream song generation tasks. When the Levo model, a song generation AI, was trained with data processed by SongPrepE2E, the generated songs showed marked improvements. Subjective evaluations revealed higher scores for Musicality Structure, Lyric Matching Degree, and overall Subjective Bias compared to songs generated using data from a baseline pipeline. This demonstrates that the high-quality, structured data produced by SongPrepE2E enables AI models to generate songs that more closely resemble human-produced music.

In conclusion, SongPrep and SongPrepE2E represent a significant step forward in automating the complex process of preparing song data for AI. By providing accurate structural information and lyrics, these frameworks pave the way for more advanced and human-like AI-generated music. You can find more details about this research in the original paper: SONGPREP: A PREPROCESSING FRAMEWORK AND END-TO-END MODEL FOR FULL-SONG STRUCTURE PARSING AND LYRICS TRANSCRIPTION.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
