TLDR: CoMelSinger is a new zero-shot singing voice synthesis (SVS) framework that uses discrete tokens to generate expressive singing for unseen voices. It addresses the problem of “prosody leakage” from acoustic prompts by employing a coarse-to-fine contrastive learning strategy, which disentangles pitch information from timbre. Additionally, a Singing Voice Transcription (SVT) module provides frame-level pitch guidance. Together, these components yield notable improvements in pitch accuracy, timbre consistency, and the ability to synthesize singing for singers who were not seen during training.
Singing Voice Synthesis (SVS) has become a cornerstone of modern creative audio technologies, powering everything from virtual idols to game soundtracks. The goal of SVS is to transform musical inputs like lyrics and pitch sequences into expressive, high-quality vocal performances. While significant strides have been made in generating singing voices for known singers, zero-shot SVS, which means generating singing for voices never seen during training, remains difficult.
Recent advancements in text-to-speech (TTS) systems, particularly those using discrete acoustic tokens and large language model-style architectures, have shown impressive zero-shot capabilities. These systems can synthesize speech for new speakers by using a short audio snippet as a ‘prompt’ to capture the speaker’s unique voice characteristics. However, directly applying these techniques to singing synthesis introduces a unique problem: prosody leakage.
Prosody leakage occurs when the acoustic prompt, intended to convey only timbre (the unique quality of a voice), inadvertently encodes pitch and rhythmic information. This entanglement means the system struggles to precisely control the melody based on the musical score, as the prompt’s inherent prosody interferes. This issue is particularly critical in singing, where exact pitch and timing are paramount for a faithful musical performance.
Introducing CoMelSinger
To overcome these challenges, researchers have developed CoMelSinger, a novel zero-shot SVS framework. CoMelSinger is designed to provide structured and disentangled melody control within a discrete token-based modeling approach. It builds upon the non-autoregressive MaskGCT architecture, adapting it to handle musical inputs like lyrics and pitch tokens, thereby enhancing melody conditioning while preserving the in-context learning capabilities for zero-shot generation.
The CoMelSinger framework operates in two main stages: a Text-to-Semantic (T2S) module that converts lyrics into semantic tokens, and a Semantic-to-Acoustic (S2A) module that generates acoustic tokens based on these semantic tokens, a regulated pitch sequence, and an acoustic prompt. These acoustic tokens are then used to reconstruct the final singing waveform.
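To make the idea of a "regulated pitch sequence" more concrete, here is a minimal sketch of one common way to turn note-level score information into one pitch token per frame. This is an illustration under stated assumptions, not the paper's implementation: the function name `regulate_pitch`, the 50 frames-per-second rate, the use of MIDI note numbers, and the convention that 0 marks a rest are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def regulate_pitch(notes: List[Tuple[int, float]], frame_rate: float = 50.0) -> List[int]:
    """Expand (midi_pitch, duration_seconds) notes into one pitch token per frame.

    Hypothetical helper: the frame rate and token convention are assumptions.
    """
    frames: List[int] = []
    elapsed = 0.0
    for pitch, duration in notes:
        elapsed += duration
        # Round the cumulative time so rounding errors do not build up across notes.
        target_len = round(elapsed * frame_rate)
        frames.extend([pitch] * (target_len - len(frames)))
    return frames


# Two sung notes followed by a short rest (0 marks a rest in this sketch).
notes = [(60, 0.10), (62, 0.06), (0, 0.04)]
pitch_frames = regulate_pitch(notes)
```

The resulting frame-level sequence has the same time resolution as the acoustic tokens, which is what lets a model condition each generated frame on the pitch the score asks for at that moment.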
Disentangling Melody and Timbre
A core innovation in CoMelSinger is its coarse-to-fine contrastive learning strategy. This mechanism is specifically designed to suppress prosody leakage by explicitly regularizing pitch redundancy between the acoustic prompt and the melody input. In simpler terms, it teaches the model to separate the speaker’s voice quality (timbre) from the melodic information, ensuring that the melody is dictated by the musical score, not by the prompt’s inherent pitch.
This strategy works on two levels: at the sequence level, it encourages the synthesized singing to maintain overall pitch contour consistency across different prompts from the same singer. At the frame level, it aligns fine-grained acoustic features with local pitch variations, reinforcing precise pitch fidelity. This hierarchical approach ensures both global melodic structure and local pitch accuracy are preserved.
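The two levels can be illustrated with a generic contrastive objective. The sketch below uses the widely known InfoNCE loss as a stand-in; the article does not give the paper's exact loss, so the choice of InfoNCE, the temperature value, and the helper names `info_nce`, `sequence_level_loss`, and `frame_level_loss` are assumptions for illustration only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def info_nce(anchor: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """Generic InfoNCE: low when the anchor is closer to the positive than to negatives."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - pos


def sequence_level_loss(contour_a: Vector, contour_b: Vector,
                        other_melodies: List[Vector]) -> float:
    """Coarse level (assumed setup): two renderings of the same melody with
    different prompts should have similar pitch-contour embeddings."""
    return info_nce(contour_a, contour_b, other_melodies)


def frame_level_loss(acoustic: List[Vector], pitch: List[Vector]) -> float:
    """Fine level (assumed setup): each acoustic frame should match the pitch
    embedding of its own frame rather than those of other frames."""
    total = 0.0
    for t, frame in enumerate(acoustic):
        negatives = [p for s, p in enumerate(pitch) if s != t]
        total += info_nce(frame, pitch[t], negatives)
    return total / len(acoustic)


# Toy 2-D embeddings: the matched pair is nearly parallel, the others are not.
matched = sequence_level_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
mismatched = sequence_level_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
aligned_frames = frame_level_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
```

In this toy setup `matched` comes out smaller than `mismatched`, which is the behavior a contrastive objective is meant to reward: representations of the same melody stay close regardless of which prompt supplied the timbre.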
Pitch Guidance with SVT
CoMelSinger further incorporates a lightweight, encoder-only Singing Voice Transcription (SVT) module. This module acts as an auxiliary supervisor during training, aligning acoustic tokens with pitch and duration sequences at a fine-grained, frame-level resolution. By predicting pitch sequences directly from acoustic representations and comparing them to the ground truth, the SVT module provides explicit guidance, encouraging the generated singing to accurately follow the intended melody and rhythm.
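The "predict, then compare to the ground truth" step can be sketched as a per-frame classification loss. The example below assumes the SVT module outputs one score per pitch class for every frame and that the comparison is a standard cross-entropy; both are assumptions for illustration, and `frame_pitch_loss` is a hypothetical name, not something taken from the paper.

```python
import math
from typing import List


def frame_pitch_loss(logits: List[List[float]], targets: List[int]) -> float:
    """Mean cross-entropy between per-frame pitch scores and target pitch tokens.

    logits[t][k] is the (assumed) score for pitch class k at frame t;
    targets[t] is the ground-truth pitch class index at frame t.
    """
    total = 0.0
    for frame_logits, target in zip(logits, targets):
        peak = max(frame_logits)  # stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in frame_logits))
        total += log_z - frame_logits[target]
    return total / len(targets)


targets = [0, 0, 2]  # three frames, three pitch classes in this toy example
on_pitch = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
off_pitch = [[0.0, 4.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
loss_on = frame_pitch_loss(on_pitch, targets)
loss_off = frame_pitch_loss(off_pitch, targets)
```

Because the loss is computed for every frame, a note that starts late or drifts off pitch is penalized exactly where it happens, which is the sense in which this kind of supervision covers both melody and rhythm.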
Efficient Adaptation and Performance
The model is fine-tuned using Low-Rank Adaptation (LoRA), an efficient technique that trains a small number of added low-rank parameters while keeping the original pretrained weights frozen. This allows CoMelSinger to adapt effectively to the nuances of singing voice synthesis with limited data, leveraging prior knowledge from large-scale speech pretraining without extensive computational cost.
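For readers unfamiliar with LoRA, the core idea fits in a few lines: a frozen weight matrix W is left untouched, and a trainable low-rank product B·A, scaled by alpha / r, is added to its output. The sketch below shows this for a single linear layer in plain Python. The dimensions, the `alpha` value, and the function names are illustrative assumptions and say nothing about which layers or ranks CoMelSinger actually uses.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(x: List[float], W: Matrix, A: Matrix, B: Matrix,
                 alpha: float) -> List[float]:
    """Compute W x + (alpha / r) * B (A x).

    W is d_out x d_in and stays frozen; A is r x d_in and B is d_out x r,
    and only A and B would be trained.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


d_in, d_out, rank = 4, 3, 1
W = [[0.5, -0.2, 0.1, 0.0], [0.3, 0.8, -0.5, 0.2], [0.0, 0.1, 0.4, -0.3]]
A = [[0.1, 0.2, -0.1, 0.3]]
B_zero = [[0.0], [0.0], [0.0]]   # LoRA conventionally starts with B = 0
B_trained = [[0.5], [-0.5], [1.0]]
x = [1.0, 2.0, 3.0, 4.0]

before = lora_forward(x, W, A, B_zero, alpha=2.0)
after = lora_forward(x, W, A, B_trained, alpha=2.0)

frozen_params = d_out * d_in
trainable_params = rank * (d_in + d_out)
```

With B initialized to zero, the adapted layer reproduces the pretrained layer exactly, so fine-tuning starts from the original model's behavior. The parameter counts show where the savings come from: the trainable part grows with r × (d_in + d_out) rather than d_in × d_out, a gap that becomes large at realistic layer sizes.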
Extensive experiments on both seen and unseen singers demonstrate CoMelSinger’s superior performance. It achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability compared to other state-of-the-art SVS baselines. For instance, in zero-shot scenarios, CoMelSinger maintains high speaker similarity and accurate pitch trajectories, a balance often difficult to achieve in existing systems.
This research marks a significant step forward in zero-shot singing voice synthesis, offering a framework that provides precise melody control and robust generalization to unseen voices. For more technical details, you can refer to the full research paper: CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance.


