TLDR: Vevo2 is a new framework that unifies controllable speech and singing voice generation. It uses two novel audio tokenizers—a notation-free prosody tokenizer and a low-frame-rate content-style tokenizer—along with unified pre-training and multi-objective post-training. This enables versatile control over text, prosody, style, and timbre, leading to mutual benefits for both modalities and unique applications like converting humming or instrumental melodies into singing.
Generating human voices that are both natural and controllable, especially for expressive forms like singing, has long been a complex challenge in audio generation. Researchers have made significant strides in speech generation, particularly with zero-shot text-to-speech (TTS) systems, largely due to the availability of vast speech datasets. However, singing voice generation, which demands precise control over elements like melody, has remained a more difficult area.
A new research paper introduces Vevo2, a unified framework designed to bridge the gap between controllable speech and singing voice generation. The core idea behind Vevo2 is that speech and singing voice learning can mutually benefit from a single, integrated model. This approach allows the abundance of speech data to enhance singing voice generation, while the inherent expressiveness of singing can improve expressive speech generation and prosody-following capabilities.
Addressing Key Challenges
Building such a unified system presents several hurdles. Traditional singing voice datasets often rely on extensive, expert annotations like detailed music notation, which are scarce and not ideal for unified modeling. Furthermore, achieving precise control over various voice attributes—such as text (lyrics), prosody (melody), style (accent, emotion), and timbre (speaker identity)—within a single system is crucial.
Vevo2 tackles these challenges by introducing two innovative audio tokenizers:
- Prosody Tokenizer: This tokenizer operates at a low frame rate and is trained to reconstruct the chromagram of raw audio. Crucially, it is “music-notation-free”: it can extract prosody and melody from speech, singing, and even instrumental or other non-human sounds without expert annotations, which significantly improves scalability and flexibility.
- Content-Style Tokenizer: Also operating at a low frame rate (12.5 Hz), this tokenizer encodes linguistic content, prosody, and style for both speech and singing. It achieves robust timbre disentanglement by reconstructing both chromagram features and hidden features from ASR models like Whisper. This tokenizer plays a role similar to “semantic tokens” in speech generation, but is adapted for both speech and singing (a minimal sketch of both tokenizers follows this list).
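Conceptually, both tokenizers can be read as vector-quantized encoders whose training signal is reconstruction: chromagram only for the prosody tokenizer, chromagram plus ASR hidden features for the content-style tokenizer. The sketch below is a minimal illustration under that assumption; the module names, dimensions, and straight-through trick are placeholder choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VQTokenizer(nn.Module):
    """Hypothetical VQ tokenizer: quantizes downsampled audio features.

    The prosody-tokenizer variant reconstructs only a 12-bin chromagram; the
    content-style variant adds a head that regresses ASR (e.g. Whisper) features.
    """
    def __init__(self, in_dim=128, latent_dim=256, codebook_size=1024,
                 downsample=4, with_asr_head=False, asr_dim=1024):
        super().__init__()
        # Strided conv lowers the frame rate (e.g. 50 Hz inputs -> 12.5 Hz tokens).
        self.encoder = nn.Conv1d(in_dim, latent_dim, kernel_size=2 * downsample,
                                 stride=downsample, padding=downsample // 2)
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.chroma_head = nn.Linear(latent_dim, 12)   # 12-bin chromagram target
        self.asr_head = nn.Linear(latent_dim, asr_dim) if with_asr_head else None

    def forward(self, feats):                          # feats: (B, T, in_dim)
        z = self.encoder(feats.transpose(1, 2)).transpose(1, 2)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        ids = torch.cdist(z, codes).argmin(dim=-1)     # nearest codebook entry
        zq = self.codebook(ids)
        zq = z + (zq - z).detach()                     # straight-through gradient
        out = {"ids": ids, "chroma_recon": self.chroma_head(zq)}
        if self.asr_head is not None:
            out["asr_recon"] = self.asr_head(zq)       # content-style variant only
        return out
```

The key design point is that the reconstruction target, not the architecture, is what separates the two tokenizers: targeting the chromagram preserves melody without needing notation, while adding ASR features preserves linguistic content while leaving timbre out.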
The Vevo2 Architecture
The Vevo2 framework employs a two-stage architecture, similar to advanced zero-shot speech generation systems. It consists of:
- Auto-Regressive (AR) Content-Style Modeling Stage: This stage is responsible for enabling control over text, prosody, and style. During pre-training, Vevo2 uses both Explicit Prosody Learning (EPL), where prosody tokens are explicit inputs, and Implicit Prosody Learning (IPL), where prosody is learned from text alone. By randomly applying both strategies, the model learns prosody in a more unified way, effectively bridging speech and singing characteristics.
- Flow-Matching (FM) Acoustic Modeling Stage: This stage enables fine-grained timbre control, converting the content-style tokens into Mel spectrograms, which a vocoder then turns into the final audio waveform (illustrative sketches of both stages follow this list).
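To make the two stages concrete, here is a minimal hypothetical sketch: the first function mixes EPL and IPL by a coin flip during pre-training, and the second is a standard conditional flow-matching objective of the kind the acoustic stage describes. All names and signatures are assumptions for illustration, not Vevo2's actual code.

```python
import random
import torch
import torch.nn.functional as F

def build_ar_prompt(text_tokens, prosody_tokens, p_explicit=0.5):
    """Randomly mix Explicit (EPL) and Implicit (IPL) Prosody Learning.

    EPL conditions the AR model on prosody tokens; IPL makes it infer prosody
    from text alone. Sampling between them unifies the two regimes.
    """
    if random.random() < p_explicit:
        return list(text_tokens) + list(prosody_tokens)  # EPL: explicit prosody input
    return list(text_tokens)                             # IPL: prosody inferred from text

def flow_matching_loss(model, mel, cond):
    """Conditional flow-matching loss for the acoustic stage.

    mel: (B, T, 80) target Mel spectrogram; cond: content-style token embeddings.
    """
    x0 = torch.randn_like(mel)                        # Gaussian noise endpoint
    t = torch.rand(mel.size(0), 1, 1, device=mel.device)
    xt = (1 - t) * x0 + t * mel                       # linear probability path
    target_v = mel - x0                               # constant-velocity target
    pred_v = model(xt, t.view(-1), cond)              # predict velocity at time t
    return F.mse_loss(pred_v, target_v)
```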
Enhancing Performance with Post-Training
While the pre-trained AR model shows good versatility, its stability, especially in following text and prosody, can be further improved. Vevo2 introduces a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. This process significantly enhances the model’s controllability and improves its generalization to out-of-distribution data, such as instrumental sounds used for melody control in singing voice synthesis. An illustrative scoring sketch follows.
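The paper's exact post-training procedure is not reproduced here, but a multi-objective signal of this kind can be illustrated with two simple scoring helpers: word error rate for intelligibility and chromagram cosine similarity for prosody. Everything below, including the weighted combination, is a hypothetical sketch.

```python
import numpy as np

def wer(hyp: str, ref: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    h, r = hyp.split(), ref.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[-1, -1] / max(len(r), 1)

def prosody_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between flattened chromagrams of equal shape."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def multi_objective_score(hyp_text, ref_text, hyp_chroma, ref_chroma,
                          w_intel=1.0, w_pros=1.0):
    """Weighted alignment score: higher means more intelligible and on-melody."""
    return (w_intel * (1.0 - wer(hyp_text, ref_text))
            + w_pros * prosody_sim(hyp_chroma, ref_chroma))
```

A score like this could rank candidate generations, so that post-training pushes the AR model toward outputs that both match the transcript and track the reference melody.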
Versatile Control at Inference Time
Vevo2 offers flexible control mechanisms during inference, enabling a wide array of synthesis, conversion, and editing tasks for both speech and singing. These include text-to-speech, singing voice synthesis, voice conversion, speech editing, and singing lyric editing. The framework also uniquely supports applications like humming-to-singing and instrument-to-singing, where a hummed or instrumental melody is transformed into a singing voice with specified lyrics and a target singer’s timbre, as sketched below.
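As an illustration of how the stages compose for humming-to-singing, the following hypothetical wiring mirrors the architecture described above; every component is a placeholder callable, not Vevo2's actual API.

```python
from typing import Callable

def humming_to_singing(
    hum_wav,                      # source humming or instrumental waveform
    lyrics: str,                  # target lyrics
    timbre_ref,                   # reference waveform of the target singer
    prosody_tokenizer: Callable,  # waveform -> prosody tokens (notation-free)
    ar_model: Callable,           # (lyrics, prosody tokens) -> content-style tokens
    fm_model: Callable,           # (content-style tokens, timbre ref) -> Mel
    vocoder: Callable,            # Mel spectrogram -> waveform
):
    prosody = prosody_tokenizer(hum_wav)       # melody extracted from the hum
    content_style = ar_model(lyrics, prosody)  # lyrics sung to that melody
    mel = fm_model(content_style, timbre_ref)  # timbre from the reference singer
    return vocoder(mel)
```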
Furthermore, Vevo2 introduces two additional inference-time controls:
- Duration Control: By manipulating the length of prosody tokens, Vevo2 can effectively control the total duration of the generated output, a feature often challenging for auto-regressive models.
- Pitch Region Control: Users can adjust the pitch region of the generated voice by shifting the pitch of the source waveform before prosody token extraction. This is particularly useful in voice and singing voice conversion, where matching the target's pitch region yields higher speaker similarity (both controls are sketched below).
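A minimal sketch of these two controls, assuming prosody tokens are a plain sequence and using librosa for the pitch shift; the function names are illustrative, not part of Vevo2's interface.

```python
import librosa

def shift_pitch_region(wav, sr, n_semitones):
    """Pitch-shift the source before prosody-token extraction to move the pitch region."""
    return librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_semitones)

def stretch_prosody_tokens(tokens, target_len):
    """Resample the prosody-token sequence to control total output duration."""
    idx = [round(i * (len(tokens) - 1) / max(target_len - 1, 1))
           for i in range(target_len)]
    return [tokens[i] for i in idx]
```

For instance, shifting a low hum up a few semitones before token extraction can move the converted voice into the target singer's natural range, the adjustment described above for improving speaker similarity.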
Experimental results consistently demonstrate that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. The framework’s effectiveness across diverse tasks highlights its strong generalization ability and versatility. For more technical details, refer to the full research paper.