TLDR: QTTS is a new text-to-speech system built on a multi-codebook audio codec (QDAC) that captures more detailed speech information than single-codebook methods. It offers two generation strategies: Hierarchical Parallel, which preserves the dependencies between codebooks for the highest synthesis quality, and Delay Multihead, which trades some of that context for faster inference. Experiments show QTTS delivers higher quality, better expressiveness, and efficient performance, especially for challenging audio like singing.
Text-to-speech (TTS) technology has made significant strides, allowing computers to generate human-like speech from written text. However, many existing systems, especially those that generate speech step-by-step (autoregressive models), often rely on simplified audio representations. These ‘single-codebook’ methods can lose important details, making it difficult to capture subtle nuances like natural speaking rhythms, unique speaker voices, or the complexities of singing.
Imagine trying to recreate a detailed painting with only a few basic colors; you’d miss out on the fine shading and textures. Similarly, single-codebook TTS models struggle with ‘fine-grained details,’ leading to less expressive or natural-sounding speech, particularly in challenging scenarios like generating singing voices or music.
To address these limitations, researchers have introduced a new framework called QTTS. This innovative system is built upon a novel audio codec named QDAC, which stands for Quantization-Decoupled Audio Codec. The core idea behind QDAC is to improve how audio information is compressed and represented. Unlike previous methods that might mix different types of audio features, QDAC is designed to separate them effectively. It uses an advanced training approach that combines an autoregressive speech recognition network with a generative adversarial network (GAN). This allows QDAC to disentangle semantic features (what is being said) from other acoustic attributes (like how it’s said, or the speaker’s unique voice).
Specifically, QDAC ensures that the most crucial linguistic information, such as phonemes (the basic units of sound in speech), is captured in the first ‘codebook’ or layer of its representation. This leaves the subsequent codebooks free to focus on modeling the finer, more subtle acoustic details, leading to a much richer and more efficient audio representation.
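To make the idea of stacked codebooks concrete, here is a minimal sketch of residual-style multi-codebook quantization in Python. It illustrates the general technique rather than QDAC's actual training recipe: the function name, codebook sizes, and nearest-neighbour lookup are assumptions made for the example.

```python
import numpy as np

def quantize_multi_codebook(frame, codebooks):
    """Encode one feature frame as a stack of codebook indices.

    Each codebook quantizes the residual left by the previous one, so
    codebook 0 carries the coarsest content and later codebooks refine
    the remaining detail. Generic residual quantization, not QDAC itself.
    """
    residual = np.asarray(frame, dtype=np.float32)
    indices = []
    for codebook in codebooks:                    # codebook shape: (entries, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword in this layer
        indices.append(idx)
        residual = residual - codebook[idx]       # pass what is left to the next layer
    return indices

# Toy usage: 4 codebooks of 256 entries over 8-dimensional frames.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)).astype(np.float32) for _ in range(4)]
print(quantize_multi_codebook(rng.normal(size=8), codebooks))  # four indices, one per codebook
```

In QDAC, the first codebook is additionally supervised so that it aligns with phonetic content; the sketch above only shows how a stacked structure spreads information across layers.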
Two Paths to Better Speech
QTTS models these detailed audio representations using two clever strategies, offering a trade-off between synthesis quality and inference speed:
The first strategy is the Hierarchical Parallel architecture. This approach is designed for achieving the highest possible audio quality. It uses a ‘dual-autoregressive’ structure, meaning it generates audio tokens (discrete units of sound) both sequentially over time and hierarchically across different codebooks. This ensures that when the model generates a sound, it has full contextual awareness of all previously generated sounds and all preceding layers of detail. This meticulous process helps capture intricate dependencies between different layers of audio information, resulting in highly natural and faithful speech synthesis. While this method prioritizes quality, its sequential nature means it can be slower for real-time applications.
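The nested generation order can be sketched in a few lines of Python. The `predict_token` callable below is a hypothetical stand-in for the model's sampling step; the point is only to show that every token is conditioned on all earlier time steps and on the codebooks already produced at the current step.

```python
def generate_hierarchical_parallel(predict_token, num_steps, num_codebooks):
    """Dual-autoregressive decoding sketch (not QTTS's actual API).

    Tokens are produced time step by time step (outer loop) and, within
    each step, codebook by codebook (inner loop), so every prediction
    sees the full history generated so far.
    """
    history = []                                  # [(t, k, token), ...]
    for t in range(num_steps):                    # autoregressive over time
        for k in range(num_codebooks):            # autoregressive over codebook depth
            token = predict_token(history, t, k)  # conditioned on everything so far
            history.append((t, k, token))
    return history

# Toy stand-in for the model so the loop can be run end to end.
tokens = generate_hierarchical_parallel(lambda history, t, k: (t * 10 + k) % 256,
                                        num_steps=3, num_codebooks=4)
print(len(tokens))  # 3 time steps x 4 codebooks = 12 tokens
```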
The second strategy is the Delay Multihead approach, which focuses on accelerating inference speed. Instead of waiting for an entire layer of audio tokens to be generated, this method uses a parallel prediction mechanism with a fixed delay. This means that the model can start predicting tokens for the next layer of detail after only a small, fixed number of steps from the current layer. This parallelization significantly speeds up the generation process, making it more suitable for real-time applications. Although it sacrifices some of the exhaustive contextual modeling found in the Hierarchical Parallel approach, it still maintains essential local context, providing a good balance between speed and quality.
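The delay idea is easiest to see as a layout problem: each codebook's token stream is shifted by a fixed offset, so at any decoding step all codebooks can be predicted in parallel while deeper codebooks still lag slightly behind the ones above them. The sketch below builds such a layout; the one-step offset and the padding value are assumptions for illustration, not the paper's exact configuration.

```python
def build_delay_pattern(num_steps, num_codebooks, delay=1, pad=-1):
    """Lay out which time-step token each codebook emits at each decoding step.

    Codebook k is shifted right by k * delay positions, so all heads fire
    in parallel at every decoding step while codebook k still trails the
    codebook above it by `delay` steps of already-generated context.
    """
    total = num_steps + (num_codebooks - 1) * delay
    grid = [[pad] * total for _ in range(num_codebooks)]
    for k in range(num_codebooks):
        for t in range(num_steps):
            grid[k][t + k * delay] = t        # codebook k emits time step t here
    return grid

for row in build_delay_pattern(num_steps=5, num_codebooks=3, delay=1):
    print(row)
# [0, 1, 2, 3, 4, -1, -1]
# [-1, 0, 1, 2, 3, 4, -1]
# [-1, -1, 0, 1, 2, 3, 4]
```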
Impressive Results
Experiments have shown that QTTS significantly outperforms previous text-to-speech models, especially those relying on single-codebook representations. It achieves higher synthesis quality and does a better job of preserving expressive content, such as the naturalness of speech and the unique characteristics of a speaker’s voice, even when generating speech for voices it hasn’t encountered before (zero-shot speaker transfer).
The research demonstrates that increasing the number of codebooks in QDAC leads to substantially better audio reconstruction quality, with a 16-codebook setup achieving near-lossless audio. This highlights the critical role of high-fidelity audio compression in producing high-quality speech synthesis. Furthermore, QTTS shows strong performance in metrics like word error rate (WER), speaker similarity, and mean opinion score (MOS), which measures perceived audio quality.
The team also focused on optimizing the inference speed of QTTS. By adapting existing large language model (LLM) inference frameworks, they managed to achieve a very low ‘Time-To-First-Token’ (TTFT) and high throughput, meaning the system can start generating speech quickly and process a large amount of audio efficiently.
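As a point of reference, Time-To-First-Token simply measures how long a caller waits before the first audio token arrives from a streaming generator. The helper below is a generic way to measure it; the streaming interface is hypothetical, not QTTS's actual serving API.

```python
import time

def measure_ttft(token_stream):
    """Return the first token and the wall-clock delay before it arrived.

    `token_stream` is any iterator that yields tokens as the model emits
    them; the interface here is illustrative only.
    """
    start = time.perf_counter()
    first = next(token_stream)                  # blocks until the first token is ready
    return first, time.perf_counter() - start
```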
Looking Ahead
The development of QTTS marks a significant step forward in text-to-speech technology. By explicitly modeling compression through multi-codebook architectures and employing sophisticated decoding strategies, QTTS delivers richer, more accurate, and more expressive speech. The researchers believe that scaling up compression via multi-codebook modeling is a promising direction for future high-fidelity, general-purpose speech and audio generation.
Future work may explore extending QTTS to more complex scenarios, such as generating singing voices, music, or speech in multiple languages, and investigating alternative non-autoregressive methods to further improve inference speed without compromising quality. You can read the full research paper here.


