TLDR: FasterVoiceGrad is a voice conversion model that dramatically accelerates conversion by simultaneously distilling both the main diffusion model and the content encoder. It introduces Adversarial Diffusion Conversion Distillation (ADCD), a training method that performs distillation during the conversion process itself, making conversion 6.6 to 6.9 times faster on a GPU while keeping speech quality and speaker similarity competitive with its predecessor, FastVoiceGrad. This innovation makes high-quality, one-step diffusion-based voice conversion significantly more efficient and practical.
Voice conversion (VC) is a fascinating technology that allows us to transform one person’s voice into another’s while keeping the original message intact. Imagine being able to speak in any voice you choose for applications ranging from personalized digital assistants to entertainment. While this technology holds immense promise, achieving high-quality and natural-sounding voice conversions has been a significant challenge, especially when it comes to speed.
Recent advancements in deep generative models, particularly diffusion models, have brought remarkable improvements in speech quality and speaker similarity for voice conversion. Models like VoiceGrad have demonstrated impressive results. However, a major drawback of these diffusion-based models is their slow conversion process, which often requires many iterative steps to generate the final output. This makes them less practical for real-time applications compared to other methods that can convert voices in a single step.
To address this speed limitation, FastVoiceGrad was introduced. This model successfully distilled the multi-step VoiceGrad into a one-step diffusion model, significantly reducing the number of sampling steps. While a great leap forward, FastVoiceGrad still relied on a computationally intensive component called the content encoder. This encoder is crucial for separating the speaker’s unique identity from the linguistic content of their speech. The high computational cost of this content encoder meant that FastVoiceGrad’s overall conversion speed was still limited, preventing it from reaching its full potential.
Introducing FasterVoiceGrad: A New Era of Speed and Efficiency
Researchers from NTT, Inc., Japan, have now proposed an innovative solution called FasterVoiceGrad. This new model takes the concept of speed a step further by simultaneously distilling not only the main reverse diffusion process but also the content encoder itself. The key to this breakthrough is a novel training approach called Adversarial Diffusion Conversion Distillation (ADCD).
Unlike previous methods that performed distillation during a ‘reconstruction’ process (where the model tries to recreate the original input), FasterVoiceGrad’s ADCD performs distillation directly during the ‘conversion’ process. This crucial change prevents the model from simply learning to copy its input and instead forces it to genuinely learn the voice conversion task. By doing so, FasterVoiceGrad can replace the heavy, frozen content encoder with a much faster, trainable convolutional neural network (CNN) based encoder.
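The distinction can be pictured with a toy, stdlib-only Python sketch. Everything here (a scalar "target speaker", a linear multi-step teacher, a one-parameter student) is an illustrative stand-in for the actual VoiceGrad/FasterVoiceGrad networks, chosen only to show that the distillation loss is computed on a conversion toward a different speaker rather than on a reconstruction of the input:

```python
# Illustrative stand-ins only: the real teacher is the multi-step VoiceGrad
# diffusion model and the student is the one-step FasterVoiceGrad generator.

def teacher_convert(source_feats, target_spk, steps=30):
    """Multi-step 'teacher': iteratively nudges features toward the target."""
    x = list(source_feats)
    for _ in range(steps):
        x = [xi + (target_spk - xi) / steps for xi in x]
    return x

def student_convert(source_feats, target_spk, scale):
    """One-step 'student': a single learned jump (scale is its parameter)."""
    return [xi + scale * (target_spk - xi) for xi in source_feats]

def conversion_distillation_loss(source_feats, target_spk, scale):
    """Distill on the CONVERSION (source -> target), not a reconstruction
    (source -> source), so the student cannot succeed by copying its input."""
    s = student_convert(source_feats, target_spk, scale)
    t = teacher_convert(source_feats, target_spk)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
```

In this toy setting, a student with `scale = 1 - (29/30) ** 30` matches the 30-step teacher to floating-point precision; training would instead fit the student's parameters by minimizing this loss over many source/target pairs.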
How FasterVoiceGrad Achieves Its Speed and Quality
FasterVoiceGrad incorporates several clever techniques within its ADCD framework:
- Conversion Adversarial Loss: This helps the model generate converted speech that sounds natural and indistinguishable from real speech.
- Conversion Score Distillation Loss: This ensures that the student model (FasterVoiceGrad) closely mimics the high-quality output of the teacher model (VoiceGrad) during conversion.
- Reconversion Score Distillation: To reinforce preservation of the linguistic content, the model converts the output a second time and applies distillation again, strengthening content accuracy.
- Inverse Score Distillation: This unique component enhances speaker similarity by not only bringing the converted voice closer to the target speaker but also actively pushing it away from other speakers, creating a stronger, more distinct speaker identity.
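Taken together, these terms could be combined into a single training objective along the following lines. This is a toy, stdlib-only sketch: every function, weight, and the pull/push formulation of the inverse term are hypothetical stand-ins, not the paper's actual losses.

```python
# Toy sketch of combining the four ADCD terms. All names, weights, and the
# embedding-distance stand-ins are hypothetical, not the paper's formulation.

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def adcd_objective(student_out, teacher_out,
                   reconv_student, reconv_teacher,
                   disc_score, target_emb, source_emb,
                   w_adv=1.0, w_conv=1.0, w_reconv=1.0, w_inv=0.1):
    # Conversion adversarial loss: the discriminator should rate the
    # converted speech as real (score close to 1).
    l_adv = (1.0 - disc_score) ** 2
    # Conversion score distillation: match the teacher's converted output.
    l_conv = mse(student_out, teacher_out)
    # Reconversion score distillation: match the teacher again on a
    # second conversion pass to reinforce content preservation.
    l_reconv = mse(reconv_student, reconv_teacher)
    # Inverse score distillation, sketched as pull-toward-target minus
    # push-away-from-source on speaker embeddings.
    l_inv = mse(student_out, target_emb) - mse(student_out, source_emb)
    return (w_adv * l_adv + w_conv * l_conv
            + w_reconv * l_reconv + w_inv * l_inv)
```

The point of the sketch is the structure: the adversarial and distillation terms pull the one-step output toward realism and the teacher's conversions, while the inverse term rewards moving away from the source speaker, not just toward the target.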
Remarkable Performance Gains
Experimental evaluations show that FasterVoiceGrad delivers impressive results. Compared to FastVoiceGrad, it achieves a speedup of 6.6 to 6.9 times on a GPU and 1.8 times on a CPU, making real-time voice conversion much more feasible.
In terms of quality, FasterVoiceGrad remains competitive. Objective metrics for speech quality and speaker similarity are on par with or slightly better than FastVoiceGrad's, and subjective listening tests indicate that FasterVoiceGrad outperforms FastVoiceGrad in speech quality. Subjective speaker-similarity ratings, however, lag slightly behind what the objective scores suggest; the researchers note that this may be due to subtle remnants of the source speech that are perceptible to humans but not easily detected by current neural speaker encoders. This opens up avenues for future research on improved speaker encoders.
The versatility of FasterVoiceGrad was also demonstrated across different datasets, including VCTK and LibriTTS, showing consistent performance and speed improvements. This indicates its robustness and potential for broad application.
The Future of Voice Conversion
FasterVoiceGrad represents a significant step forward in making high-quality, diffusion-based voice conversion faster and more efficient. By intelligently distilling both the core diffusion model and the content encoder during the conversion process, it addresses a critical bottleneck in previous models. This research paves the way for advanced and practical voice conversion applications, including accent conversion and truly real-time voice transformation. You can read the full research paper here.


