Automating Song Data Preparation for AI Generation

TLDR: SongPrep is a new framework that automates the preprocessing of song data for AI song generation models. It handles source separation, structure analysis, and lyric recognition, converting raw songs into structured, training-ready datasets. The paper also introduces SongPrepE2E, an end-to-end model that improves upon SongPrep’s multi-stage pipeline by leveraging large language models for more accurate and efficient structured lyric recognition, leading to higher quality AI-generated songs.

The field of Artificial Intelligence Generated Content (AIGC) is rapidly advancing, with song generation emerging as a particularly exciting area. However, a significant hurdle in developing high-quality AI song models has been preparing vast amounts of song data: lyrics and structural information have traditionally been labeled by hand, which is both time-consuming and expensive.

To tackle this challenge, researchers have introduced SongPrep, an innovative automated preprocessing framework designed to streamline the preparation of song data. This framework automates crucial steps such as separating different audio sources (like vocals and instruments), analyzing the song’s structure, and recognizing lyrics. The output is structured data that can be directly used to train AI models for generating songs.

Beyond the preprocessing framework, the team also developed SongPrepE2E, an end-to-end model for structured lyrics recognition. Unlike multi-stage pipelines, SongPrepE2E can analyze the entire song’s structure and lyrics, providing precise timestamps without needing separate source separation. It achieves this by leveraging the full song’s context and pretrained semantic knowledge from large language models, resulting in lower error rates for both diarization (assigning structural labels to the correct time spans) and word recognition.

How SongPrep Works

The SongPrep framework processes raw song data through a sequence of modules. First, it uses a model like Demucs for source separation, breaking down a song into its core components: vocals, drums, bass, and other instruments. This separation is crucial because different analysis modules require specific tracks. For instance, lyric recognition primarily focuses on the vocal track.

Next, the structure analysis module employs an enhanced All-In-One model. This model is responsible for identifying musical segments such as intros, verses, choruses, bridges, and outros. The researchers improved this module by retraining it on a bilingual dataset, refining the label set to seven clear categories, and integrating a Dual-Path RNN (DPRNN) block to better capture global song structure. These modifications significantly reduced the Diarization Error Rate (DER), which measures the accuracy of structural labels and their timings.
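To make the DER idea concrete, here is a minimal sketch that scores a predicted segmentation against a reference by the share of song time carrying the wrong label. It is a simplified illustration, not the paper's scoring code: the function names, the (start, end, label) tuple layout, and the example timings are all assumptions, and the actual evaluation protocol (for example, any tolerance around boundaries) may differ.

```python
from typing import List, Tuple

# Hypothetical layout: (start_seconds, end_seconds, structural_label)
Segment = Tuple[float, float, str]


def label_at(segments: List[Segment], t: float) -> str:
    """Return the label covering time t, or "none" if no segment covers it."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return "none"


def simple_der(reference: List[Segment], hypothesis: List[Segment]) -> float:
    """Mislabeled duration divided by total reference duration."""
    total = sum(end - start for start, end, _ in reference)
    if total <= 0:
        raise ValueError("reference must cover a positive duration")

    # Every segment boundary from either side splits the timeline into
    # intervals on which both labelings are constant.
    points = sorted({t for s, e, _ in reference + hypothesis for t in (s, e)})

    error = 0.0
    for left, right in zip(points, points[1:]):
        midpoint = (left + right) / 2
        if label_at(reference, midpoint) != label_at(hypothesis, midpoint):
            error += right - left
    return error / total


if __name__ == "__main__":
    ref = [(0.0, 10.0, "intro"), (10.0, 40.0, "verse"), (40.0, 60.0, "chorus")]
    hyp = [(0.0, 12.0, "intro"), (12.0, 40.0, "verse"), (40.0, 60.0, "chorus")]
    print(simple_der(ref, hyp))
```

In the example, the prediction places the intro/verse boundary two seconds late, so two of the sixty reference seconds carry the wrong label.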

Following structure analysis, the lyric recognition module transcribes lyrics from the vocal track, specifically focusing on vocal segments within verses, choruses, and bridges. This involves using an ASR (Automatic Speech Recognition) system, like a fine-tuned Zipformer-based model, and an improved WER-FIX algorithm to ensure high-quality lyric texts. A word alignment module then calibrates the results, preventing instrumental sections from being mislabeled as vocal parts.
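The hand-off from structure analysis to lyric recognition can be sketched as follows. This is a hypothetical illustration of the control flow only: `transcribe` stands in for whatever ASR system is used and is passed in as a plain callable, and the names, label strings, and record layout are assumptions rather than SongPrep's actual interfaces.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: (start_seconds, end_seconds, structural_label)
Segment = Tuple[float, float, str]

# Sections the article says are transcribed; other sections get no lyrics.
VOCAL_LABELS = {"verse", "chorus", "bridge"}


def transcribe_structured(
    segments: List[Segment],
    transcribe: Callable[[float, float], str],
) -> List[Dict[str, object]]:
    """Attach lyrics to vocal sections and leave other sections empty."""
    records = []
    for start, end, label in segments:
        # Only vocal sections are sent to the (assumed) ASR callable.
        lyrics = transcribe(start, end) if label in VOCAL_LABELS else ""
        records.append(
            {"label": label, "start": start, "end": end, "lyrics": lyrics}
        )
    return records


if __name__ == "__main__":
    def fake_asr(start: float, end: float) -> str:
        # Placeholder for a real ASR call on the vocal track.
        return f"<lyrics for {start:.1f}-{end:.1f}s>"

    song = [(0.0, 10.0, "intro"), (10.0, 40.0, "verse"), (40.0, 60.0, "chorus")]
    for record in transcribe_structured(song, fake_asr):
        print(record)
```

Skipping the ASR call on non-vocal sections mirrors the goal described above of keeping instrumental passages from being labeled as vocal parts, although the real framework does this with a word alignment step rather than a simple label check.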

Introducing SongPrepE2E: The End-to-End Solution

While SongPrep offers significant improvements, its multi-stage nature can lead to lower inference efficiency and a loss of contextual information when audio is split into short chunks. To overcome these limitations, SongPrepE2E was developed. This end-to-end system integrates MuCodec, which discretizes audio into tokens, with a large language model (LLM) such as Qwen2-7B. Trained on data curated by SongPrep, SongPrepE2E can directly extract structured lyric transcriptions from full-length songs, offering better accuracy and deployment efficiency.

Performance and Impact

The effectiveness of SongPrep and SongPrepE2E was evaluated using a new dataset called SSLD-200, comprising 200 manually annotated Chinese and English songs. Experiments showed that the improved structure analysis module reduced DER, and the enhanced lyric recognition module achieved a lower Word Error Rate (WER). SongPrepE2E consistently outperformed the multi-stage SongPrep pipeline in terms of both text recognition accuracy and Real Time Factor (RTF), indicating faster processing.
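For readers unfamiliar with the two metrics, the sketch below shows their standard definitions: WER is the word-level edit distance between the recognized and reference lyrics divided by the number of reference words, and RTF is processing time divided by audio duration, so lower is better for both. The function names and the whitespace tokenization are assumptions made for illustration; Chinese lyrics, for instance, are usually scored per character, and the paper's text normalization is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref_words)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    print(word_error_rate("hold me close tonight", "hold me closer tonight"))
    print(real_time_factor(30.0, 180.0))
```

In the usage example, one substituted word out of four reference words gives a WER of 0.25, and 30 seconds of processing for a 180-second song gives an RTF of about 0.17.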

Crucially, the impact of SongPrepE2E was also assessed in downstream song generation tasks. When the Levo model, a song generation AI, was trained with data processed by SongPrepE2E, the generated songs showed marked improvements. Subjective evaluations revealed higher scores for Musicality Structure, Lyric Matching Degree, and overall Subjective Bias compared to songs generated using data from a baseline pipeline. This demonstrates that the high-quality, structured data produced by SongPrepE2E enables AI models to generate songs that more closely resemble human-produced music.

In conclusion, SongPrep and SongPrepE2E represent a significant step forward in automating the complex process of preparing song data for AI. By providing accurate structural information and lyrics, these frameworks pave the way for more advanced and human-like AI-generated music. You can find more details about this research in the original paper: SongPrep: A Preprocessing Framework and End-to-End Model for Full-Song Structure Parsing and Lyrics Transcription.
