TLDR: LAPS-Diff is a new diffusion-based model for Singing Voice Synthesis (SVS) designed for low-resource Bollywood Hindi. It improves SVS quality by integrating language-aware embeddings, a style encoder, and a novel pitch loss function, along with musical and contextual priors. In objective and subjective evaluations it outperforms the state-of-the-art DiffSinger baseline, generating more natural and expressive singing voices even with limited data.
The field of Singing Voice Synthesis (SVS) has seen remarkable progress, especially with the advent of diffusion-based models. However, a significant challenge remains in accurately capturing the nuances of vocal style, genre-specific pitch variations, and language-dependent characteristics, particularly in situations where limited data is available. Addressing this critical gap, researchers Sandipan Dhar, Mayank Gupta, and Preeti Rao from the Indian Institute of Technology, Bombay, have introduced a novel framework called LAPS-Diff.
LAPS-Diff is a sophisticated diffusion model that integrates language-aware embeddings and a unique vocal-style guided learning mechanism. It is specifically tailored for the Bollywood Hindi singing style, a genre known for its rich melodic and cultural influences. To develop and test this model, the team curated a dedicated Hindi SVS dataset, which is a significant contribution given the scarcity of such resources for Indian languages.
A core innovation of LAPS-Diff lies in its enriched lyrics representation. The model leverages pre-trained language models like IndicBERT and XPhoneBERT to extract detailed word and phone-level embeddings. These embeddings are then combined with traditional music score embeddings, providing a more comprehensive understanding of the lyrical content.
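To make the idea of combining these representations concrete, here is a minimal sketch in plain Python. It assumes that word-level vectors are repeated for every phone in the word and then concatenated with the phone-level and music score vectors. The function names, the toy vectors, and the choice of concatenation are illustrative assumptions; the article does not say how LAPS-Diff actually fuses the embeddings.

```python
from typing import List

Vector = List[float]


def expand_words_to_phones(word_embs: List[Vector],
                           phones_per_word: List[int]) -> List[Vector]:
    """Repeat each word embedding once per phone in that word (hypothetical helper)."""
    if len(word_embs) != len(phones_per_word):
        raise ValueError("need one phone count per word")
    expanded: List[Vector] = []
    for emb, count in zip(word_embs, phones_per_word):
        expanded.extend(list(emb) for _ in range(count))
    return expanded


def fuse_lyrics_and_score(word_embs: List[Vector],
                          phone_embs: List[Vector],
                          score_embs: List[Vector],
                          phones_per_word: List[int]) -> List[Vector]:
    """Concatenate word, phone, and score vectors for every phone position.

    Concatenation is an assumption made for illustration only.
    """
    word_at_phone = expand_words_to_phones(word_embs, phones_per_word)
    if not (len(word_at_phone) == len(phone_embs) == len(score_embs)):
        raise ValueError("all sequences must align at the phone level")
    return [w + p + s for w, p, s in zip(word_at_phone, phone_embs, score_embs)]


# Toy example: two words with 2 and 1 phones respectively.
fused = fuse_lyrics_and_score(
    word_embs=[[1.0, 2.0], [3.0, 4.0]],   # stand-in for IndicBERT outputs
    phone_embs=[[0.1], [0.2], [0.3]],     # stand-in for XPhoneBERT outputs
    score_embs=[[60.0], [62.0], [64.0]],  # stand-in for music score embeddings
    phones_per_word=[2, 1],
)
```

In a real system the vectors would have hundreds of dimensions and would typically be projected to a common size first; the sketch only shows the alignment step that lets word-level context reach every phone.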
Beyond linguistic understanding, LAPS-Diff places a strong emphasis on capturing the expressive elements of singing. It incorporates a style encoder and a pitch extraction model to compute specific style and pitch losses during training. This ensures that the synthesized singing not only sounds natural but also accurately reflects the vocal style and intricate pitch variations crucial to the Bollywood genre. Notably, the researchers introduced a novel pitch loss function that considers the linear correlation between predicted and ground-truth pitch contours, aiming to better preserve the melodic shape.
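A correlation-based pitch loss can be illustrated with a short sketch. The version below assumes the loss is one minus the Pearson correlation between the predicted and ground-truth pitch contours; this is one plausible reading of "linear correlation", not the paper's confirmed formula, and the function names and the small `eps` stabilizer are hypothetical.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float],
                        eps: float = 1e-8) -> float:
    """Pearson correlation of two equal-length contours; eps avoids division by zero."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("contours must be non-empty and the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (math.sqrt(var_x * var_y) + eps)


def correlation_pitch_loss(pred_f0: Sequence[float],
                           true_f0: Sequence[float]) -> float:
    """Assumed form: 1 - correlation, so a matching melodic shape gives a loss near 0."""
    return 1.0 - pearson_correlation(pred_f0, true_f0)


# Toy contours in Hz (illustrative values only).
true_f0 = [220.0, 230.0, 240.0, 235.0]
shifted = [f + 12.0 for f in true_f0]   # same shape, constant offset
inverted = [-f for f in true_f0]        # mirrored shape

loss_shifted = correlation_pitch_loss(shifted, true_f0)
loss_inverted = correlation_pitch_loss(inverted, true_f0)
```

The shifted contour yields a loss close to zero because correlation ignores constant offsets, while the inverted contour yields a loss close to two. This is why such a term rewards preserving the melodic shape and would usually be paired with another loss that penalizes absolute pitch error.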
To further refine acoustic feature generation, LAPS-Diff uses pre-trained MERT and IndicWav2Vec models. These models extract musical and contextual embeddings that serve as conditional priors guiding the diffusion process, leading to more precise and detailed mel-spectrograms, the time-frequency representation of audio that the model generates.
LAPS-Diff was evaluated against the state-of-the-art DiffSinger model using both objective and subjective measures on the team's limited Bollywood Hindi dataset. The results show that LAPS-Diff significantly improves the quality of the generated singing samples. On objective metrics it performed better at capturing speaker characteristics, spectral alignment, and voiced/unvoiced region accuracy. Subjective evaluations, including Mean Opinion Score (MOS) tests conducted with human listeners, confirmed that LAPS-Diff produces more natural and expressive Hindi singing voices, even with limited training data.
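For readers unfamiliar with MOS, the sketch below shows how such a score is commonly computed: listeners rate each sample on a 1 to 5 scale, and the ratings are averaged, often with a 95% confidence interval. The ratings here are made-up toy values, not results from the paper, and the normal-approximation interval (1.96 standard errors) is an assumption about the reporting convention.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return float(mos), 1.96 * std_err


# Toy listener ratings for a single system (illustrative only).
mos, half_width = mean_opinion_score([4, 5, 3, 4, 4])
```

With only a handful of listeners the interval is wide, so differences between systems are usually judged on the intervals as well as the means.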
An ablation study was also conducted to highlight the effectiveness of each integrated component, showing how language-aware features, musical priors, and style/pitch-guided losses collectively contribute to the model’s enhanced performance. Visual analyses of content embeddings and pitch contours further illustrated LAPS-Diff’s ability to closely match ground truth data, especially in handling complex pitch transitions and preserving harmonic structures.
This research marks a significant step forward for Singing Voice Synthesis, particularly for low-resource languages and specific musical genres like Bollywood Hindi. The LAPS-Diff framework offers a robust solution for generating high-quality, expressive singing voices, paving the way for future advancements in multilingual SVS and broader generalization across diverse vocal styles. For more in-depth details, you can refer to the full research paper available at arXiv:2507.04966.


