TLDR: A new method called Differentiable Reward Optimization (DiffRO) improves the training of LLM-based text-to-speech (TTS) systems. Unlike conventional RLHF pipelines, which score synthesized audio, DiffRO computes rewards directly from speech tokens, making training faster and cheaper. It also introduces a Multi-Task Reward (MTR) model, which improves pronunciation accuracy and enables zero-shot control over emotional and quality attributes in synthesized speech, as shown by lower WER and better emotional expression in experiments.
Text-to-Speech (TTS) systems, which convert written text into spoken audio, have seen significant advancements with the rise of large language models (LLMs). These systems aim to generate speech that is not only clear and natural but also capable of conveying various emotions and adhering to specific instructions. However, a major hurdle in training these advanced TTS models has been the complexity and computational cost of incorporating human feedback, a process known as Reinforcement Learning from Human Feedback (RLHF).
A new research paper introduces an approach called Differentiable Reward Optimization (DiffRO) that aims to streamline and improve the training of LLM-based TTS systems. Traditionally, RLHF for TTS involves converting discrete speech tokens into full audio waveforms before a reward can be calculated, which is computationally intensive and slows down training. DiffRO bypasses this step by computing rewards directly from the neural codec tokens, significantly reducing the computational burden.
Simplifying the Training Process
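One of the key innovations of DiffRO is its use of the Gumbel-Softmax technique. Sampling a discrete token from the language model is normally a non-differentiable step, so gradients cannot flow from a reward back to the model. Gumbel-Softmax replaces the hard sample with a continuous approximation, which makes the reward ‘differentiable’ with respect to the model’s outputs: the language model can be optimized directly with standard backpropagation. This differs from conventional RLHF, which typically relies on policy-gradient algorithms such as Proximal Policy Optimization (PPO) or on preference-pair methods such as Direct Preference Optimization (DPO), both of which need sampled outputs to be scored or ranked before the model can be updated.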
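To make the idea concrete, here is a minimal sketch of a Gumbel-Softmax relaxation in plain Python. It is not the paper's implementation: the four-token codebook, the linear `toy_reward`, and all numeric values are assumptions chosen for illustration, and a real system would use an autograd framework and a learned reward model. The sketch only shows that, once the token sample is relaxed, the reward has a well-defined gradient with respect to the logits.

```python
import math
import random


def sample_gumbel(n, rng):
    """Draw n samples of Gumbel(0, 1) noise as -log(-log(U)), U ~ Uniform(0, 1)."""
    return [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in range(n)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, noise, tau):
    """Relaxed one-hot sample: softmax((logits + noise) / tau)."""
    return softmax([(l + g) / tau for l, g in zip(logits, noise)])


def toy_reward(soft_token, token_scores):
    """Hypothetical linear reward: expected score under the relaxed token."""
    return sum(y * s for y, s in zip(soft_token, token_scores))


def toy_reward_grad(logits, noise, tau, token_scores):
    """Analytic gradient of toy_reward with respect to the logits."""
    y = gumbel_softmax(logits, noise, tau)
    r = toy_reward(y, token_scores)
    # d r / d logit_j = y_j * (s_j - r) / tau for a softmax followed by a dot product
    return [y_j * (s_j - r) / tau for y_j, s_j in zip(y, token_scores)]


rng = random.Random(0)
logits = [0.2, -0.1, 0.5, 0.0]        # toy LM logits over a 4-token codebook (assumed)
token_scores = [0.1, 0.9, 0.3, 0.2]   # made-up per-token reward scores (assumed)
tau = 1.0                             # temperature; lower values approach one-hot
noise = sample_gumbel(len(logits), rng)

soft_token = gumbel_softmax(logits, noise, tau)
grad = toy_reward_grad(logits, noise, tau, token_scores)

# One gradient-ascent step on the logits, reusing the same noise sample.
step = 0.1
new_logits = [l + step * g for l, g in zip(logits, grad)]
print("reward before:", toy_reward(soft_token, token_scores))
print("reward after: ", toy_reward(gumbel_softmax(new_logits, noise, tau), token_scores))
```

The temperature `tau` controls the trade-off: small values give samples close to one-hot tokens but with sharper, noisier gradients, while larger values give smoother gradients from a blurrier token distribution.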
Introducing the Multi-Task Reward Model
Beyond just improving efficiency, DiffRO also introduces a Multi-Task Reward (MTR) model. This model is designed to provide comprehensive feedback from various perspectives. It incorporates several ‘downstream tasks’ such as Automatic Speech Recognition (ASR), Speech Emotion Recognition (SER), Speech Quality Assessment (SQA), and even age and gender prediction. By integrating these tasks, the MTR model can guide the TTS system to generate audio that not only pronounces words accurately but also adheres to specific emotional or quality attributes.
For instance, if you want the TTS system to speak in a ‘happy’ tone, the MTR model can provide feedback based on its SER component, encouraging the system to produce speech tokens that are recognized as happy. This allows for ‘zero-shot’ control, meaning the system can generate speech with desired attributes even if it hasn’t been explicitly trained on data labeled with those specific emotions or qualities during the RL phase.
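One simple way to picture how several task heads could be combined into a single reward is a weighted sum of the log-probabilities each head assigns to the desired label. The sketch below is an assumption for illustration only: the function `multi_task_reward`, the head names, the weights, and the probabilities are all hypothetical, the ASR head is reduced to a "correct/incorrect" score for brevity, and the paper's actual MTR formulation may differ.

```python
import math


def multi_task_reward(task_probs, targets, weights=None):
    """Weighted sum of log-probabilities that each task head assigns to its target label.

    task_probs: {task_name: {label: probability}}, hypothetical head outputs
    targets:    {task_name: desired_label}
    weights:    optional {task_name: weight}; missing tasks default to 1.0
    """
    weights = weights or {}
    total = 0.0
    for task, target in targets.items():
        p = task_probs[task][target]
        total += weights.get(task, 1.0) * math.log(max(p, 1e-12))
    return total


# Made-up head outputs for two candidate token sequences.
neutral_candidate = {
    "asr": {"correct": 0.95, "incorrect": 0.05},
    "ser": {"happy": 0.10, "sad": 0.10, "neutral": 0.80},
}
cheerful_candidate = {
    "asr": {"correct": 0.93, "incorrect": 0.07},
    "ser": {"happy": 0.70, "sad": 0.05, "neutral": 0.25},
}

# Ask for accurate pronunciation and a happy tone.
targets = {"asr": "correct", "ser": "happy"}

r_neutral = multi_task_reward(neutral_candidate, targets)
r_cheerful = multi_task_reward(cheerful_candidate, targets)
print("neutral candidate reward: ", r_neutral)
print("cheerful candidate reward:", r_cheerful)
```

Changing the `ser` target from `"happy"` to `"sad"` changes which candidate is preferred without retraining any head, which is the intuition behind steering an attribute through the reward rather than through labeled TTS data.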
Promising Results and Future Directions
Experimental results demonstrate that DiffRO significantly boosts the pronunciation accuracy of TTS systems, achieving state-of-the-art Word Error Rate (WER) results on benchmarks like seed-tts-eval. When the MTR model is integrated, the system shows a remarkable ability to control emotional expression. For example, the research highlights how the system can learn to synthesize laughter, sobs, and breaths to convey emotion, even without explicit emotion-labeled data during the reinforcement learning phase.
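For readers unfamiliar with the metric, WER is the word-level edit distance between a reference text and a transcript, divided by the number of reference words: (substitutions + deletions + insertions) / reference length. The sketch below is a generic implementation, not the benchmark's own scoring script. In a TTS evaluation the hypothesis would come from an ASR transcript of the synthesized audio, and benchmarks apply their own text normalization, which is omitted here; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, and because insertions are counted, WER can exceed 1.0 when the transcript is much longer than the reference.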
While DiffRO shows strong performance in pronunciation and emotion control, the paper notes that controlling attributes like overall audio quality (Mean Opinion Score or MOS) and speaker characteristics (age and gender) is still challenging. This is because the final audio quality is heavily influenced by backend models like the Flow Matching (FM) model and vocoder, which are trained on clean audio and have denoising capabilities. Future work aims to incorporate more downstream tasks into the MTR model and explore applying DiffRO to the FM module to gain better control over speaker-related attributes.
Overall, DiffRO is a step toward LLM-based TTS systems that are more efficient to train, more accurate, and more controllable, and in turn toward more natural and expressive synthetic speech.


