
Boosting Speech Generation Efficiency with Frame-Stacked Local Transformers

TLDR: This research paper introduces Frame-Stacked Local Transformers (LTs) to enhance the efficiency and quality of multi-codebook speech generation in large language models (LLMs). It investigates two LT architectures, autoregressive and MaskGIT-based, which capture intra-timestep dependencies better than traditional parallel prediction. By combining these LTs with frame stacking, where the primary transformer predicts multiple frames jointly, the models achieve significant speedups (up to 5.5x) while maintaining or improving audio fidelity, speaker similarity, and naturalness. The paper provides practical guidelines for selecting decoding strategies based on desired trade-offs between computational efficiency and synthesis quality.

Recent advancements in large language models (LLMs) have brought remarkable improvements to speech generation, particularly in text-to-speech (TTS) systems. These models create highly natural-sounding speech by predicting sequences of discrete acoustic codes. However, unlike text, which is a simple one-dimensional sequence, acoustic representations are more complex, structured as a matrix where each timestep requires predicting multiple codebook entries. This multi-codebook structure introduces unique challenges, as acoustic tokens exhibit dependencies not just over time but also within each timestep.
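
To make that structure concrete, the short sketch below (with made-up shapes; nothing here comes from the paper) compares a text token sequence to a multi-codebook acoustic token matrix.

```python
import numpy as np

# Text: a 1-D sequence of token ids, one id per step.
text_tokens = np.array([17, 942, 3, 88])           # shape (T,)

# Multi-codebook audio: each timestep is a vector of N codebook ids,
# so a whole utterance is a (T, N) matrix. Values below are made up.
T, N = 4, 8                                        # 4 frames, 8 codebooks per frame
rng = np.random.default_rng(0)
audio_tokens = rng.integers(0, 1024, size=(T, N))  # shape (T, N)

# "Parallel prediction" samples all N entries of a row independently;
# a Local Transformer instead models dependencies within each row.
print(text_tokens.shape, audio_tokens.shape)       # (4,) (4, 8)
```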

Traditionally, a common approach called “parallel prediction” predicts all codebooks for a given timestep simultaneously, assuming they are independent. While efficient, this method overlooks the intricate dependencies among codebooks, reducing speech quality and introducing artifacts. To overcome this limitation, researchers have explored hierarchical strategies that incorporate an auxiliary component known as a Local Transformer (LT). The LT is specifically designed to capture these crucial intra-timestep dependencies, refining the predictions made by the primary transformer.
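
For contrast, here is a minimal sketch of what parallel prediction does at a single timestep; the shapes, vocabulary size, and dummy logits below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, VOCAB = 8, 1024  # codebooks per frame, codebook size (assumed for illustration)

# Parallel prediction: one softmax head per codebook, all sampled at once
# from the same hidden state, with no conditioning on each other.
logits = rng.normal(size=(N, VOCAB))  # dummy per-codebook logits
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
frame = np.array([rng.choice(VOCAB, p=p) for p in probs])  # N independent draws

print(frame.shape)  # (8,) — one fast step, but inter-codebook dependencies are ignored
```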

A new research paper, “Frame-Stacked Local Transformers for Efficient Multi-Codebook Speech Generation,” by Roy Fejgin, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Ryan Langman, Jaehyeon Kim, Subhankar Ghosh, Shehzeen Hussain, and Jason Li from NVIDIA, systematically investigates two primary LT architectures: an autoregressive (AR) transformer and a MaskGIT-based transformer. Both designs are further enhanced by a technique called frame stacking, where the main transformer predicts multiple frames at once and the LT then decodes their corresponding codebooks. This approach aims to significantly boost generation speed without sacrificing perceived speech quality.

Exploring Local Transformer Architectures

The study delves into two distinct ways the Local Transformer can operate:

  • Autoregressive Local Transformer (AR LT): This variant generates codebook entries sequentially within each frame. It predicts each new codebook by considering the previously generated ones and the hidden state from the primary decoder. This method effectively captures causal dependencies, which are inherent in some audio quantization techniques, and consistently yields better results than parallel prediction, though it introduces a latency proportional to the number of codebooks.

  • MaskGIT Local Transformer: Inspired by MaskGIT, this LT begins with a fully masked sequence and iteratively unmasks it through prediction. It uses non-causal self-attention, allowing bidirectional dependency modeling between codebooks within a frame. A key advantage is that it can predict multiple tokens in parallel across iterations, offering a flexible trade-off between speed and quality. Because the number of iterations can be smaller than the number of codebooks, inference is faster. A sketch contrasting the two decoding loops follows this list.
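
The sketch below contrasts the two decoding loops for a single frame. It is a simplified illustration: the stand-in `lt_logits` function, the confidence-based unmasking schedule, and all shapes are assumptions made for clarity, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, VOCAB = 8, 1024  # codebooks per frame, codebook size (assumed)

def lt_logits(hidden, codes):
    """Stand-in for the Local Transformer: returns per-codebook logits.
    `codes` uses -1 for positions not yet generated (masked)."""
    return rng.normal(size=(N, VOCAB))  # dummy logits for illustration

def ar_lt_decode(hidden):
    """AR LT: fill the N codebooks of one frame left to right,
    conditioning each prediction on the codes generated so far."""
    codes = np.full(N, -1)
    for i in range(N):                   # one LT pass per codebook
        logits = lt_logits(hidden, codes)
        codes[i] = logits[i].argmax()
    return codes

def maskgit_lt_decode(hidden, iterations=3):
    """MaskGIT LT: start fully masked, unmask the most confident
    positions each iteration; needs only `iterations` (< N) LT passes."""
    codes = np.full(N, -1)
    per_iter = int(np.ceil(N / iterations))
    for _ in range(iterations):
        masked = np.flatnonzero(codes == -1)
        if masked.size == 0:
            break
        logits = lt_logits(hidden, codes)
        conf = logits[masked].max(axis=-1)           # confidence of each masked slot
        keep = masked[np.argsort(conf)[-per_iter:]]  # unmask the most confident slots
        codes[keep] = logits[keep].argmax(axis=-1)
    return codes

hidden = rng.normal(size=128)        # dummy hidden state from the primary decoder
print(ar_lt_decode(hidden))          # N sequential LT passes
print(maskgit_lt_decode(hidden, 3))  # only 3 LT passes for all N codebooks
```

The trade-off is visible in the loop structure: the AR variant always pays N sequential passes per frame, while the MaskGIT variant lets you dial the iteration count down below N at some cost in modeling fidelity.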

The Power of Frame Stacking

Frame stacking is a crucial innovation introduced in this work. It leverages the hierarchical structure of the primary decoder and the LT to improve efficiency. Instead of predicting a single frame, the primary decoder is trained to predict a stack of S frames (S × N codebooks, where N is the number of codebooks per frame) in one step. The LT then takes the resulting hidden state and decodes all S × N codebooks. This significantly increases generation speed: the LT is much smaller than the primary decoder and operates on a short sequence, while the main model, which must attend to the entire generation history, runs far fewer steps.
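
The efficiency argument reduces to simple step counting, as the sketch below shows (illustrative numbers and hypothetical names, not measurements from the paper): with a stacking factor S, the expensive primary decoder runs T/S times instead of T, while the lightweight LT absorbs the extra per-step work.

```python
# Step-count arithmetic behind frame stacking (illustrative, not measured).
T, N, S = 600, 8, 2   # frames to generate, codebooks per frame, stacking factor

primary_steps_unstacked = T      # without stacking: one expensive decoder step per frame
primary_steps_stacked = T // S   # with stacking: one step per stack of S frames

# Each primary step now hands the LT one hidden state from which it must
# decode S * N codebooks — cheap, since the LT is small and its sequence short.
lt_codes_per_primary_step = S * N

print(primary_steps_unstacked, primary_steps_stacked, lt_codes_per_primary_step)
# 600 300 16 — the costly model that attends to the full history runs half as often
```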

Key Findings and Practical Guidelines

The researchers conducted extensive experiments, evaluating models based on metrics like Word Error Rate (WER) for text adherence, Speaker Similarity (SSIM), Fréchet Distance (FD) for distribution matching, and UTMOSv2 for speech quality. Their findings highlight several important conclusions:

  • Improved Fidelity: LT-based models consistently achieved lower (better) Fréchet Distances than parallel-sampled models, indicating that iterative sampling generates a distribution closer to the ground truth. This supports the hypothesis that iterative sampling is vital for capturing inter-codebook dependencies and improving audio fidelity.

  • Speed and Quality Balance: At a frame stacking factor of 1 (no stacking), LT models outperformed the baseline in SSIM, MOS, and FD at similar speeds. With a stacking factor of 2, the AR LT achieved a 2.1x speedup, and the MaskGIT LT achieved a 3.1x speedup over the unstacked parallel baseline, while maintaining or improving quality metrics. Parallel sampling at higher stacking factors showed substantial degradation.

  • Aggressive Speedups: At a stacking factor of 4, speedups reached 2.9x for AR LT and 5.5x for MaskGIT LT. While this came with some trade-offs in speaker similarity for unseen speakers and MOS for MaskGIT, the FDs remained better than the baseline, demonstrating significant throughput gains.

Based on these results, the paper offers practical guidelines for selecting decoding strategies:

  • For scenarios where audio quality is paramount, the non-frame-stacked configuration with an autoregressive local transformer is recommended.

  • To achieve a good balance between quality and computational complexity, a frame stacking factor of 2 with an autoregressive LT is ideal, offering a 2.1x speedup without compromising quality. MaskGIT also performs well with a 3.1x speedup.

  • For maximum speedup, especially in applications not requiring zero-shot functionality, a high stacking factor (e.g., 4) with either an AR or MaskGIT LT is suggested.

In conclusion, this research demonstrates that iterative multi-codebook prediction with Local Transformers is crucial for capturing inter-codebook dependencies, leading to improved audio fidelity in LLM-based speech generation. Combined with frame stacking, LTs also deliver substantial throughput gains, offering an efficient, high-quality approach to speech synthesis that does not require retraining codec models at lower frame rates.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
