Advancing Voice Editing and Text-to-Speech with Cross-Attentive Mamba

TLDR: MAVE is a new AI model for high-fidelity voice editing and zero-shot text-to-speech (TTS). It combines Mamba state-space models for efficient audio processing with cross-attention for precise text alignment. MAVE achieves state-of-the-art performance in speech editing and competitive zero-shot TTS, outperforming VoiceCraft and FluentSpeech in quality while using roughly six times less memory than VoiceCraft during inference.

A new research paper introduces MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), an AI model for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis. Developed by Baher Mohammad, Magauiya Zhussip, and Stamatios Lefkimmiatis, MAVE combines Mamba state-space models with a cross-attention mechanism.

Traditional approaches to speech synthesis and editing often face trade-offs. Autoregressive models built on Transformers offer high fidelity but can be computationally expensive, because the cost of self-attention grows quadratically with sequence length. Non-autoregressive models, on the other hand, prioritize speed but sometimes struggle to maintain temporal coherence and precise prosodic control, especially in real-world audio environments. MAVE aims to overcome these limitations by keeping autoregressive generation while removing the quadratic cost of modeling the audio sequence.

The core innovation of MAVE lies in its architecture. It replaces the self-attention mechanisms found in Transformer-based decoders with structured state-space models (SSMs) from the Mamba family. This change allows linear-complexity modeling of dependencies between acoustic tokens, which makes long audio sequences much cheaper to process. Crucially, MAVE also incorporates a cross-attention module that dynamically aligns augmented text inputs with acoustic tokens. This lets the model “edit” speech precisely, guided by the textual information provided.
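
To make the division of labor concrete, here is a minimal sketch of one decoder block in plain Python. It is not the authors' implementation: the recurrence uses fixed scalars `a` and `b` (real Mamba layers make these parameters input-dependent and learned), the cross-attention has a single head with no learned projections, and all function names are hypothetical. It only illustrates the idea of a causal, linear-time scan over acoustic frames followed by attention over text embeddings.

```python
import math

def ssm_scan(xs, a=0.9, b=0.1):
    """Causal linear recurrence h_t = a * h_{t-1} + b * x_t, per channel.

    Assumption: fixed scalars a and b stand in for Mamba's learned,
    input-dependent parameters. One pass, so cost is linear in len(xs).
    """
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each acoustic position (query) attends over all text positions."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out

def decoder_block(acoustic, text):
    """Sequence mixing over audio via the scan, then text conditioning."""
    mixed = ssm_scan(acoustic)
    context = cross_attention(mixed, text, text)
    # Residual connection: add the text context to the scanned audio features.
    return [[m + c for m, c in zip(mrow, crow)]
            for mrow, crow in zip(mixed, context)]

# Toy inputs: three acoustic frames and two text embeddings, both 2-dimensional.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, -0.5], [1.0, 2.0]]
print(decoder_block(acoustic, text))
```

Because the scan only looks backward, the output at each frame depends on earlier audio and on the full text, which is the property an autoregressive decoder needs.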

According to the authors, MAVE is the first successful application of a structured state-space model to text-conditional speech generation, specifically for speech editing and zero-shot TTS. In evaluations on the challenging RealEdit benchmark, MAVE achieved human-parity naturalness in speech editing. It also surpassed leading state-of-the-art models like VoiceCraft and FluentSpeech in both speaker similarity and naturalness, all without requiring any post-processing steps.

Beyond its quality improvements, MAVE also offers significant efficiency gains. During inference, it requires approximately six times less memory than the Transformer-based VoiceCraft. This memory reduction, combined with its ability to perform single-pass generation, makes MAVE a more scalable and practical solution for high-quality speech generation tasks.
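
One plausible reason for a gap of this kind is how inference state scales. A self-attention decoder caches keys and values for every generated token, while a state-space decoder carries a fixed-size state per layer; the cross-attention cache depends only on the text length. The back-of-envelope sketch below counts cached elements under purely hypothetical model sizes. It ignores weights, activations, and Mamba's small convolution state, and it does not reproduce the six-fold figure reported in the paper.

```python
def kv_cache_elements(num_layers, seq_len, d_model):
    """Self-attention decoder: keys and values cached for every past token."""
    return 2 * num_layers * seq_len * d_model

def ssm_state_elements(num_layers, d_inner, d_state):
    """State-space decoder: one fixed-size recurrent state per layer."""
    return num_layers * d_inner * d_state

def cross_attn_cache_elements(num_layers, text_len, d_model):
    """Cross-attention: keys and values over the text, fixed once text is known."""
    return 2 * num_layers * text_len * d_model

# Hypothetical sizes, chosen only for illustration.
LAYERS, D_MODEL, D_INNER, D_STATE, TEXT_LEN = 16, 1024, 2048, 16, 200

for seq_len in (500, 2000, 8000):
    transformer = kv_cache_elements(LAYERS, seq_len, D_MODEL)
    hybrid = (ssm_state_elements(LAYERS, D_INNER, D_STATE)
              + cross_attn_cache_elements(LAYERS, TEXT_LEN, D_MODEL))
    print(f"audio tokens={seq_len}: self-attention cache={transformer}, "
          f"SSM state + cross-attention cache={hybrid}")
```

In this toy accounting the self-attention cache grows with every generated audio token, while the hybrid's inference state stays the same size regardless of audio length.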

For zero-shot TTS, where the model synthesizes speech in a target speaker’s voice from a brief audio reference, MAVE also performed well. It exceeded VoiceCraft in both speaker similarity and naturalness, again without needing multiple inference runs or complex post-processing. The model achieves this through in-context learning: a short reference utterance guides the autoregressive decoder to synthesize speech with the target speaker’s acoustic, prosodic, and timbral characteristics.
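
The in-context setup can be pictured as prefix conditioning: the reference transcript is joined with the target text, the reference audio tokens are placed first in the acoustic sequence, and the decoder continues from there. The sketch below is an assumption-laden illustration, not the paper's pipeline. `build_prompt`, `generate`, and `toy_next_token` are hypothetical names, the "decoder" is a stub that just counts upward, and real neural codecs emit several codebooks per frame rather than a single integer.

```python
def build_prompt(ref_transcript, target_text, ref_tokens):
    """Join the transcripts and use the reference audio tokens as the prefix."""
    text = ref_transcript.strip() + " " + target_text.strip()
    return text, list(ref_tokens)

def generate(text, prefix_tokens, next_token, max_new_tokens, eos_token):
    """Autoregressively extend the prefix; return only the newly generated tokens."""
    tokens = list(prefix_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(text, tokens)
        if tok == eos_token:
            break
        tokens.append(tok)
    return tokens[len(prefix_tokens):]

def toy_next_token(text, tokens):
    """Stub standing in for a trained decoder: previous token plus one, modulo 8."""
    return (tokens[-1] + 1) % 8

EOS = 0  # hypothetical end-of-sequence token id
text, prefix = build_prompt("hello there", "nice to meet you", [3, 4, 5])
new_tokens = generate(text, prefix, toy_next_token, max_new_tokens=10, eos_token=EOS)
print(text)
print(new_tokens)
```

Only the continuation is returned, since the prefix is the reference speaker's own audio; a real system would then pass the new tokens through the codec decoder to obtain a waveform.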

The researchers conducted extensive experiments to validate MAVE’s capabilities. In human evaluations on speech editing, 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, indicating that the edits are often indistinguishable from the source audio. For zero-shot TTS, MAVE consistently outperformed VoiceCraft in naturalness and intelligibility, especially for medium-length utterances.

An ablation study further highlighted the importance of MAVE’s hybrid design. Neither a pure Transformer-based encoder-decoder nor a standalone Mamba decoder achieved optimal performance, which suggests that combining Mamba for efficient audio modeling with cross-attention for precise text-audio alignment is key to MAVE’s results.

This architecture sets a new reference point for flexible, high-fidelity voice editing and synthesis. Full details are available in the research paper.

Ananya Rao
