TLDR: A new method called the Vocoder-Projected Feature Discriminator (VPFD) significantly improves the efficiency of training high-quality voice conversion and text-to-speech models. By discriminating on intermediate features from a pretrained, frozen vocoder with a single upsampling step, VPFD matches the performance of the previous approach while cutting training time by a factor of 9.6 and memory consumption by a factor of 11.4. This makes advanced speech generation more accessible.
Generating high-quality, natural-sounding speech from text (text-to-speech, TTS) and transforming one voice into another (voice conversion, VC) are complex tasks. These systems traditionally work in two stages: first they generate acoustic features such as mel spectrograms, and then a “vocoder” converts those features into actual audio waveforms. This two-stage approach is effective, but the quality of the final speech hinges on how realistic the generated acoustic features are.
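As a rough sketch of this two-stage flow (the module names here are hypothetical stand-ins, not the paper's implementation):

```python
import torch

def synthesize(inputs: torch.Tensor,
               acoustic_model: torch.nn.Module,
               vocoder: torch.nn.Module) -> torch.Tensor:
    # Stage 1: text or source speech -> acoustic features, e.g. an
    # 80-bin mel spectrogram of shape [batch, 80, frames].
    mel = acoustic_model(inputs)
    # Stage 2: acoustic features -> waveform. The vocoder upsamples
    # each mel frame to hop-length audio samples ([batch, 1, samples]).
    return vocoder(mel)
```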
A common technique for improving the realism of these acoustic features is adversarial training with a generative adversarial network (GAN): a generator creates acoustic features, and a discriminator tries to tell real features from generated ones. A previous method, the Vocoder Waveform Discriminator (VWD), took this a step further by converting the acoustic features into full waveforms with a vocoder and then applying the discriminator in the time domain. This proved very effective, improving both speech quality and training stability.
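In code, VWD-style discrimination might look like the following PyTorch sketch, assuming hypothetical `vocoder` and `discriminator` modules and a least-squares GAN objective (the paper's exact loss formulation may differ):

```python
import torch
import torch.nn.functional as F

def vwd_discriminator_loss(mel_real: torch.Tensor,
                           mel_fake: torch.Tensor,
                           vocoder: torch.nn.Module,
                           discriminator: torch.nn.Module) -> torch.Tensor:
    # Project real and generated features all the way to waveforms
    # (the expensive ~256x upsampling), then discriminate in the
    # time domain.
    wav_real = vocoder(mel_real)
    wav_fake = vocoder(mel_fake.detach())  # detach: this loss updates D only
    logits_real = discriminator(wav_real)
    logits_fake = discriminator(wav_fake)
    return (F.mse_loss(logits_real, torch.ones_like(logits_real))
            + F.mse_loss(logits_fake, torch.zeros_like(logits_fake)))
```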
However, the VWD method has a significant drawback: converting acoustic features to full waveforms involves a large “upsampling” step, often by a factor of 256. This is computationally intensive, adding substantial time and memory overhead during training. The paper notes that VWD training could take 47.0 hours and consume 66.3 GB of memory.
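To see where the factor of 256 comes from, consider a HiFi-GAN V1-style upsampling schedule (assumed here for illustration; the paper's vocoder may differ), whose stage rates multiply out to the mel hop length:

```python
# Four transposed-convolution stages; their rates multiply to the
# mel hop length, so every mel frame becomes 256 waveform samples.
upsample_rates = [8, 8, 2, 2]

factor = 1
for rate in upsample_rates:
    factor *= rate

print(factor)  # 256
```

Keeping only the first stage would leave an 8x upsampling, which is exactly the reduction VPFD exploits, as described next.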
Introducing the Vocoder-Projected Feature Discriminator (VPFD)
To overcome these limitations while keeping the benefits of time-domain adversarial training, the researchers proposed the Vocoder-Projected Feature Discriminator (VPFD). The core idea is to discriminate on intermediate features from the vocoder rather than waiting for the full waveform output, so the upsampling step can be cut dramatically, for example from a factor of 256 down to 8 or even less.
The key questions were how far upsampling can be reduced and how the vocoder feature extractor should be handled during training. Through extensive experiments, the researchers found that a single upsampling step (L=1) is both necessary and sufficient: upsampling the features by a small factor, such as 8 instead of the full 256, already captures the periodic structures essential for representing waveforms.
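A minimal sketch of this truncated feature extraction, assuming the vocoder exposes its input convolution and upsampling blocks under the hypothetical names `conv_pre` and `ups`:

```python
import torch

def vpfd_features(mel: torch.Tensor,
                  vocoder: torch.nn.Module,
                  num_up_blocks: int = 1) -> torch.Tensor:
    # `conv_pre` and `ups` are hypothetical attribute names for the
    # vocoder's input convolution and its list of upsampling blocks.
    h = vocoder.conv_pre(mel)
    # Stop after the first L blocks (L=1 by default): with rates
    # [8, 8, 2, 2] this is an 8x upsampling instead of the full 256x.
    for block in vocoder.ups[:num_up_blocks]:
        h = block(h)
    return h  # intermediate features handed to the discriminator
```

The discriminator loss then mirrors the VWD sketch above, with `vpfd_features(mel, vocoder)` replacing the full `vocoder(mel)` call.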
Furthermore, the study highlighted the importance of using a pretrained and frozen vocoder feature extractor: the vocoder component that extracts the intermediate features is trained once beforehand, and its parameters are kept fixed during the main training run. This strategy proved crucial for achieving the best performance across metrics.
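In PyTorch, freezing the pretrained extractor is a one-time setup step, sketched below:

```python
import torch

def freeze_vocoder_extractor(vocoder: torch.nn.Module) -> torch.nn.Module:
    # Fix the pretrained weights: the extractor's own parameters receive
    # no updates, though gradients can still flow *through* it so the
    # upstream generator can learn from the adversarial signal.
    vocoder.eval()  # also fixes normalization/dropout behavior
    for param in vocoder.parameters():
        param.requires_grad_(False)
    return vocoder
```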
Significant Efficiency Gains
The results of implementing VPFD were remarkable. When applied to diffusion-based voice conversion distillation, VPFD achieved comparable voice conversion performance to the original VWD method. Crucially, it reduced the training time by a factor of 9.6 and memory consumption by a factor of 11.4. For example, training time dropped from 47.0 hours to just 4.9 hours, and memory usage from 66.3 GB to 5.8 GB. These efficiency gains make advanced voice generation models more accessible and practical for researchers and developers with limited computational resources.
The effectiveness of VPFD was validated across different datasets, including VCTK and LibriTTS, demonstrating its generalizability. Subjective evaluations also confirmed that the speech quality and speaker similarity achieved with VPFD were on par with the more resource-intensive VWD.
This innovative approach offers a promising direction for future research in text-to-speech and voice conversion, enabling the development of high-quality speech generation systems with significantly reduced computational demands. You can read the full research paper here.