CoMelSinger: Advancing Zero-Shot Singing Synthesis with Precise Melody Control

TLDR: CoMelSinger is a new zero-shot singing voice synthesis (SVS) framework that uses discrete tokens to generate expressive singing for unseen voices. It addresses the problem of “prosody leakage” from acoustic prompts by employing a coarse-to-fine contrastive learning strategy, which disentangles pitch information from timbre. Additionally, a Singing Voice Transcription (SVT) module provides frame-level pitch guidance. This approach leads to significant improvements in pitch accuracy, timbre consistency, and the ability to synthesize singing for new speakers without prior training.

Singing Voice Synthesis (SVS) has become a cornerstone of modern creative audio technologies, powering everything from virtual idols to game soundtracks. The goal of SVS is to transform musical inputs like lyrics and pitch sequences into expressive, high-quality vocal performances. While significant strides have been made in generating singing voices for singers seen during training, zero-shot SVS, which means synthesizing singing for a voice the model was never trained on, remains a hard problem.

Recent advancements in text-to-speech (TTS) systems, particularly those using discrete acoustic tokens and large language model-style architectures, have shown impressive zero-shot capabilities. These systems can synthesize speech for new speakers by using a short audio snippet as a ‘prompt’ to capture the speaker’s unique voice characteristics. However, directly applying these techniques to singing synthesis introduces a unique problem: prosody leakage.

Prosody leakage occurs when the acoustic prompt, intended to convey only timbre (the unique quality of a voice), inadvertently encodes pitch and rhythmic information. This entanglement means the system struggles to precisely control the melody based on the musical score, as the prompt’s inherent prosody interferes. This issue is particularly critical in singing, where exact pitch and timing are paramount for a faithful musical performance.

Introducing CoMelSinger

To overcome these challenges, researchers have developed CoMelSinger, a novel zero-shot SVS framework. CoMelSinger is designed to provide structured and disentangled melody control within a discrete token-based modeling approach. It builds upon the non-autoregressive MaskGCT architecture, adapting it to handle musical inputs like lyrics and pitch tokens, thereby enhancing melody conditioning while preserving the in-context learning capabilities for zero-shot generation.

The CoMelSinger framework operates in two main stages: a Text-to-Semantic (T2S) module that converts lyrics into semantic tokens, and a Semantic-to-Acoustic (S2A) module that generates acoustic tokens based on these semantic tokens, a regulated pitch sequence, and an acoustic prompt. These acoustic tokens are then used to reconstruct the final singing waveform.
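The article does not define what a "regulated" pitch sequence is. One common reading in SVS systems is that note-level pitches from the score are expanded to one value per acoustic frame, so that the pitch input lines up with the token sequence. The sketch below illustrates that reading only; the function name `regulate_pitch`, the MIDI note representation, and the frame rate are all assumptions, not details from the paper.

```python
def regulate_pitch(notes, frames_per_second=50):
    """Expand (midi_pitch, duration_seconds) notes into one pitch value per frame.

    Hypothetical helper: assumes "regulated" means frame-aligned.
    """
    frames = []
    for midi_pitch, duration in notes:
        # Every note occupies at least one frame, even if it is very short.
        n_frames = max(1, round(duration * frames_per_second))
        frames.extend([midi_pitch] * n_frames)
    return frames


# Two notes (C4 for 0.5 s, D4 for 0.25 s) at a toy rate of 8 frames per second.
frame_pitches = regulate_pitch([(60, 0.5), (62, 0.25)], frames_per_second=8)
print(frame_pitches)
```

Under this reading, the S2A module would receive a pitch value for every frame it has to generate, which is what makes frame-level melody control possible.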

Disentangling Melody and Timbre

A core innovation in CoMelSinger is its coarse-to-fine contrastive learning strategy. This mechanism is specifically designed to suppress prosody leakage by explicitly regularizing pitch redundancy between the acoustic prompt and the melody input. In simpler terms, it teaches the model to separate the speaker’s voice quality (timbre) from the melodic information, ensuring that the melody is dictated by the musical score, not by the prompt’s inherent pitch.

This strategy works on two levels. At the sequence level, it encourages the synthesized singing to keep a consistent overall pitch contour across different prompts from the same singer. At the frame level, it aligns fine-grained acoustic features with local pitch variations, reinforcing precise pitch fidelity. Together, the two levels preserve both the global melodic structure and local pitch accuracy.
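The article does not give the exact loss, so the following is only a sketch of what a contrastive objective of this kind can look like, assuming an InfoNCE-style loss over cosine similarities. Here the anchor stands for a pitch-contour embedding of the output generated with one prompt, the positive for the same melody generated with another prompt from the same singer, and the negatives for other melodies. The embeddings, the temperature, and the function names are hypothetical; a frame-level variant would apply the same idea per frame.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: small when the anchor is closer to the positive than to any negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D "pitch contour embeddings" (made-up numbers).
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# Same melody, different prompt, contour preserved: low loss.
loss_consistent = info_nce(anchor, [0.9, 0.1], negatives)
# Same melody, but the prompt's prosody has bent the contour: higher loss.
loss_leaky = info_nce(anchor, [0.1, 0.9], negatives)
print(loss_consistent < loss_leaky)
```

Minimizing a loss of this shape pushes outputs for the same melody together regardless of which prompt was used, which is the behavior the sequence-level objective is described as encouraging.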

Pitch Guidance with SVT

CoMelSinger further incorporates a lightweight, encoder-only Singing Voice Transcription (SVT) module. This module acts as an auxiliary supervisor during training, aligning acoustic tokens with pitch and duration sequences at the frame level. By predicting pitch sequences directly from acoustic representations and comparing them to the ground truth, the SVT module provides explicit guidance, encouraging the generated singing to accurately follow the intended melody and rhythm.
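As a rough illustration of frame-level pitch supervision, the sketch below assumes the SVT module emits one vector of logits per frame over a small set of pitch classes, and that the guidance signal is the mean cross-entropy against the ground-truth pitch index. The article does not state the loss or the pitch vocabulary, so both are assumptions, and `frame_pitch_cross_entropy` is a hypothetical name.

```python
import math


def frame_pitch_cross_entropy(frame_logits, target_pitches):
    """Mean cross-entropy over frames.

    frame_logits[t][k] is the score for pitch class k at frame t;
    target_pitches[t] is the ground-truth pitch class index at frame t.
    """
    if len(frame_logits) != len(target_pitches):
        raise ValueError("need exactly one target pitch per frame")
    total = 0.0
    for logits, target in zip(frame_logits, target_pitches):
        # Stable log-softmax: log Z minus the logit of the correct class.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(target_pitches)


targets = [0, 2]  # ground-truth pitch class for each of two frames
on_melody = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]   # predictions follow the score
off_melody = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]  # predictions drift from the score

loss_on = frame_pitch_cross_entropy(on_melody, targets)
loss_off = frame_pitch_cross_entropy(off_melody, targets)
print(loss_on < loss_off)
```

Because the loss is computed per frame, a pitch error in a single short note is penalized directly rather than being averaged away over the whole phrase.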

Efficient Adaptation and Performance

The model is fine-tuned using Low-Rank Adaptation (LoRA), an efficient technique that updates only a small subset of parameters while keeping most of the original model frozen. This allows CoMelSinger to adapt effectively to the nuances of singing voice synthesis with limited data, leveraging prior knowledge from large-scale speech pretraining without extensive computational cost.
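To make the parameter saving concrete, here is a dependency-free sketch of the standard LoRA formulation: a frozen weight matrix W is augmented with a trainable low-rank product B·A, scaled by alpha / r. The matrix sizes, the rank, and the helper names are illustrative assumptions; the article does not say which layers CoMelSinger adapts or what rank it uses.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(A)  # rank of the update = number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_out, d_in, r):
    """A is r x d_in and B is d_out x r, so only r * (d_in + d_out) values train."""
    return r * (d_in + d_out)


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen pretrained weights (2 x 3)
A = [[0.5, -0.5, 1.0]]                  # trainable, rank 1 (1 x 3)
B = [[0.0], [0.0]]                      # trainable, zero-initialised (2 x 1)
x = [1.0, 1.0, 1.0]

# With B at zero, the adapted layer reproduces the pretrained layer exactly.
y_start = lora_forward(W, A, B, x)

# Illustrative size comparison for a 1024 x 1024 layer at rank 8.
full = 1024 * 1024
low_rank = lora_trainable_params(1024, 1024, 8)
print(y_start, full, low_rank)
```

Zero-initialising B is the usual LoRA choice: fine-tuning starts from the pretrained model's behavior and only gradually moves away from it, which fits the goal of reusing knowledge from speech pretraining.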

Extensive experiments on both seen and unseen singers demonstrate CoMelSinger’s superior performance. It achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability compared to other state-of-the-art SVS baselines. For instance, in zero-shot scenarios, CoMelSinger maintains high speaker similarity and accurate pitch trajectories, a balance often difficult to achieve in existing systems.

This research marks a significant step forward in zero-shot singing voice synthesis, offering a framework that provides precise melody control and robust generalization to unseen voices. For more technical details, you can refer to the full research paper: CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance.
