TLDR: Vevo2 is a new framework that unifies controllable speech and singing voice generation. It uses two novel audio tokenizers—a notation-free prosody tokenizer and a low-frame-rate content-style tokenizer—along with unified pre-training and multi-objective post-training. This enables versatile control over text, prosody, style, and timbre, leading to mutual benefits for both modalities and unique applications like converting humming or instrumental melodies into singing.
Generating human voices that are both natural and controllable, especially for expressive forms like singing, has long been a complex challenge in audio generation. Researchers have made significant strides in speech generation, particularly with zero-shot text-to-speech (TTS) systems, largely due to the availability of vast speech datasets. However, singing voice generation, which demands precise control over elements like melody, has remained a more difficult area.
A new research paper introduces Vevo2, a unified framework designed to bridge the gap between controllable speech and singing voice generation. The core idea behind Vevo2 is that speech and singing voice learning can mutually benefit from a single, integrated model. This approach allows the abundance of speech data to enhance singing voice generation, while the inherent expressiveness of singing can improve expressive speech generation and prosody-following capabilities.
Addressing Key Challenges
Building such a unified system presents several hurdles. Traditional singing voice datasets often rely on extensive, expert annotations like detailed music notation, which are scarce and not ideal for unified modeling. Furthermore, achieving precise control over various voice attributes—such as text (lyrics), prosody (melody), style (accent, emotion), and timbre (speaker identity)—within a single system is crucial.
Vevo2 tackles these challenges by introducing two innovative audio tokenizers:
- Prosody Tokenizer: This tokenizer operates at a low frame rate and is trained to reconstruct the chromagram of raw audio. Crucially, it is “music-notation-free”: it can extract prosody and melody from speech, singing, and even instrumental or other non-human sounds without expert annotations, which significantly improves scalability and flexibility.
- Content-Style Tokenizer: Also operating at a low frame rate (12.5 Hz), this tokenizer encodes linguistic content, prosody, and style for both speech and singing. It achieves robust timbre disentanglement by reconstructing both chromagram features and hidden features from ASR models like Whisper. This tokenizer plays a role similar to “semantic tokens” in speech generation, but is adapted for both speech and singing (a minimal sketch of both tokenizers follows this list).
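Conceptually, both tokenizers can be read as vector-quantized encoders whose training signal is reconstruction: chromagram only for the prosody tokenizer, chromagram plus ASR hidden features for the content-style tokenizer. The sketch below is a minimal illustration under that assumption; the module names, dimensions, and straight-through trick are placeholder choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VQTokenizer(nn.Module):
    """Hypothetical VQ tokenizer: quantizes downsampled audio features.

    The prosody-tokenizer variant reconstructs only a 12-bin chromagram; the
    content-style variant adds a head that regresses ASR (e.g. Whisper) features.
    """
    def __init__(self, in_dim=128, latent_dim=256, codebook_size=1024,
                 downsample=4, with_asr_head=False, asr_dim=1024):
        super().__init__()
        # Strided conv lowers the frame rate (e.g. 50 Hz inputs -> 12.5 Hz tokens).
        self.encoder = nn.Conv1d(in_dim, latent_dim, kernel_size=2 * downsample,
                                 stride=downsample, padding=downsample // 2)
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.chroma_head = nn.Linear(latent_dim, 12)   # 12-bin chromagram target
        self.asr_head = nn.Linear(latent_dim, asr_dim) if with_asr_head else None

    def forward(self, feats):                          # feats: (B, T, in_dim)
        z = self.encoder(feats.transpose(1, 2)).transpose(1, 2)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        ids = torch.cdist(z, codes).argmin(dim=-1)     # nearest codebook entry
        zq = self.codebook(ids)
        zq = z + (zq - z).detach()                     # straight-through gradient
        out = {"ids": ids, "chroma_recon": self.chroma_head(zq)}
        if self.asr_head is not None:
            out["asr_recon"] = self.asr_head(zq)       # content-style variant only
        return out
```

The key design point is that the reconstruction target, not the architecture, is what separates the two tokenizers: targeting the chromagram preserves melody without needing notation, while adding ASR features preserves linguistic content while leaving timbre out.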
The Vevo2 Architecture
The Vevo2 framework employs a two-stage architecture, similar to advanced zero-shot speech generation systems. It consists of:
- Auto-Regressive (AR) Content-Style Modeling Stage: This stage is responsible for enabling control over text, prosody, and style. During pre-training, Vevo2 uses both Explicit Prosody Learning (EPL), where prosody tokens are explicit inputs, and Implicit Prosody Learning (IPL), where prosody is learned from text alone. By randomly applying both strategies, the model learns prosody in a more unified way, effectively bridging speech and singing characteristics.
- Flow-Matching (FM) Acoustic Modeling Stage: This stage enables fine-grained timbre control, converting the content-style tokens into Mel spectrograms, which a vocoder then turns into the final audio waveform (illustrative sketches of both stages follow this list).
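To make the two stages concrete, here is a minimal hypothetical sketch: the first function mixes EPL and IPL by a coin flip during pre-training, and the second is a standard conditional flow-matching objective of the kind the acoustic stage describes. All names and signatures are assumptions for illustration, not Vevo2's actual code.

```python
import random
import torch
import torch.nn.functional as F

def build_ar_prompt(text_tokens, prosody_tokens, p_explicit=0.5):
    """Randomly mix Explicit (EPL) and Implicit (IPL) Prosody Learning.

    EPL conditions the AR model on prosody tokens; IPL makes it infer prosody
    from text alone. Sampling between them unifies the two regimes.
    """
    if random.random() < p_explicit:
        return list(text_tokens) + list(prosody_tokens)  # EPL: explicit prosody input
    return list(text_tokens)                             # IPL: prosody inferred from text

def flow_matching_loss(model, mel, cond):
    """Conditional flow-matching loss for the acoustic stage.

    mel: (B, T, 80) target Mel spectrogram; cond: content-style token embeddings.
    """
    x0 = torch.randn_like(mel)                        # Gaussian noise endpoint
    t = torch.rand(mel.size(0), 1, 1, device=mel.device)
    xt = (1 - t) * x0 + t * mel                       # linear probability path
    target_v = mel - x0                               # constant-velocity target
    pred_v = model(xt, t.view(-1), cond)              # predict velocity at time t
    return F.mse_loss(pred_v, target_v)
```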
Enhancing Performance with Post-Training
While the pre-trained AR model shows good versatility, its stability, especially in following text and prosody, can be further improved. Vevo2 introduces a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. This process significantly enhances the model’s controllability and improves its generalization to out-of-distribution data, such as instrumental sounds used for melody control in singing voice synthesis. An illustrative scoring sketch follows.
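The paper's exact post-training procedure is not reproduced here, but a multi-objective signal of this kind can be illustrated with two simple scoring helpers: word error rate for intelligibility and chromagram cosine similarity for prosody. Everything below, including the weighted combination, is a hypothetical sketch.

```python
import numpy as np

def wer(hyp: str, ref: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    h, r = hyp.split(), ref.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[-1, -1] / max(len(r), 1)

def prosody_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between flattened chromagrams of equal shape."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def multi_objective_score(hyp_text, ref_text, hyp_chroma, ref_chroma,
                          w_intel=1.0, w_pros=1.0):
    """Weighted alignment score: higher means more intelligible and on-melody."""
    return (w_intel * (1.0 - wer(hyp_text, ref_text))
            + w_pros * prosody_sim(hyp_chroma, ref_chroma))
```

A score like this could rank candidate generations, so that post-training pushes the AR model toward outputs that both match the transcript and track the reference melody.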
Versatile Control at Inference Time
Vevo2 offers flexible control mechanisms during inference, enabling a wide array of synthesis, conversion, and editing tasks for both speech and singing. These include text-to-speech, singing voice synthesis, voice conversion, speech editing, and singing lyric editing. The framework also uniquely supports applications like humming-to-singing and instrument-to-singing, where a hummed or instrumental melody is transformed into a singing voice with specified lyrics and a target singer’s timbre, as sketched below.
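As an illustration of how the stages compose for humming-to-singing, the following hypothetical wiring mirrors the architecture described above; every component is a placeholder callable, not Vevo2's actual API.

```python
from typing import Callable

def humming_to_singing(
    hum_wav,                      # source humming or instrumental waveform
    lyrics: str,                  # target lyrics
    timbre_ref,                   # reference waveform of the target singer
    prosody_tokenizer: Callable,  # waveform -> prosody tokens (notation-free)
    ar_model: Callable,           # (lyrics, prosody tokens) -> content-style tokens
    fm_model: Callable,           # (content-style tokens, timbre ref) -> Mel
    vocoder: Callable,            # Mel spectrogram -> waveform
):
    prosody = prosody_tokenizer(hum_wav)       # melody extracted from the hum
    content_style = ar_model(lyrics, prosody)  # lyrics sung to that melody
    mel = fm_model(content_style, timbre_ref)  # timbre from the reference singer
    return vocoder(mel)
```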
Furthermore, Vevo2 introduces two additional inference-time controls:
- Duration Control: By manipulating the length of prosody tokens, Vevo2 can effectively control the total duration of the generated output, a feature often challenging for auto-regressive models.
- Pitch Region Control: Users can adjust the pitch region of the generated voice by shifting the pitch of the source waveform before prosody token extraction. This is particularly useful in voice and singing voice conversion, where matching the target's pitch region yields higher speaker similarity (both controls are sketched below).
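A minimal sketch of these two controls, assuming prosody tokens are a plain sequence and using librosa for the pitch shift; the function names are illustrative, not part of Vevo2's interface.

```python
import librosa

def shift_pitch_region(wav, sr, n_semitones):
    """Pitch-shift the source before prosody-token extraction to move the pitch region."""
    return librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_semitones)

def stretch_prosody_tokens(tokens, target_len):
    """Resample the prosody-token sequence to control total output duration."""
    idx = [round(i * (len(tokens) - 1) / max(target_len - 1, 1))
           for i in range(target_len)]
    return [tokens[i] for i in idx]
```

For instance, shifting a low hum up a few semitones before token extraction can move the converted voice into the target singer's natural range, the adjustment described above for improving speaker similarity.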
Experimental results consistently demonstrate that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. The framework’s effectiveness across diverse tasks highlights its strong generalization ability and versatility. For more technical details, refer to the full research paper.