
Fine-Grained Emotion Control in Synthetic Speech Through Feature Disentanglement

TLDR: This research introduces a novel emotional Text-To-Speech (TTS) method that generates natural and emotionally rich speech by predicting fine-grained, phoneme-level emotion embeddings. It effectively separates emotion from speaker-specific timbre using a mutual-information-guided disentanglement approach. The method, built on the FastSpeech 2 architecture, employs dedicated Timbre and Emotion Extractors and uses Mutual Information Neural Estimation (MINE) along with explicit emotion and speaker predictors to ensure distinct and independent style attributes. Experimental results show superior performance over existing baselines in both naturalness and emotional consistency, confirmed by objective metrics and t-SNE visualizations.

Deep learning has brought significant advancements to Text-To-Speech (TTS) technology, moving beyond early statistical models to produce more natural and expressive synthetic speech. The introduction of deep neural networks and, later, autoregressive and non-autoregressive generative models has greatly improved speech fidelity, intelligibility, and efficiency. However, achieving precise and expressive emotional TTS, especially in zero-shot settings where only a few seconds of reference speech are available, has remained a significant challenge.

Traditional emotional TTS methods often rely on encoding reference speech into a single, global style vector. While these approaches can capture the overall style, they frequently struggle to model the subtle, phoneme-level variations in emotion and prosody. This compression into a single global embedding risks losing crucial details, thereby limiting the expressiveness and control over the synthesized speech.

A new research paper, “Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement”, introduces a novel approach to address these limitations. The method focuses on two key innovations: predicting fine-grained, phoneme-level emotion embeddings and effectively separating these emotion embeddings from global timbre information through a process called mutual-information minimization.

The core of the method is a dedicated Style Encoder comprising two parallel components: a global Timbre Extractor and a phoneme-aware Emotion Extractor. The Timbre Extractor captures speaker-specific voice characteristics, which remain relatively stable across an utterance. The Emotion Extractor, in contrast, aligns the reference acoustics with the target phoneme sequence to produce one emotion embedding per phoneme, capturing nuanced emotional and prosodic variation at a fine-grained level.
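
To make the two-branch design concrete, here is a minimal PyTorch-style sketch of such a Style Encoder. The module names, dimensions, and the choice of a GRU front-end with cross-attention alignment are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Sketch: one global timbre vector plus phoneme-level emotion embeddings."""
    def __init__(self, mel_dim=80, hidden_dim=256, num_heads=4):
        super().__init__()
        # Shared front-end over the reference mel-spectrogram frames.
        self.ref_encoder = nn.GRU(mel_dim, hidden_dim, batch_first=True)
        # Timbre Extractor: average-pool reference frames into one global vector.
        self.timbre_proj = nn.Linear(hidden_dim, hidden_dim)
        # Emotion Extractor: phoneme queries attend over reference frames,
        # yielding one emotion embedding per target phoneme.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.emotion_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, ref_mel, phoneme_hidden):
        # ref_mel: (B, T_ref, mel_dim); phoneme_hidden: (B, T_phon, hidden_dim)
        ref_hidden, _ = self.ref_encoder(ref_mel)                 # (B, T_ref, H)
        timbre = self.timbre_proj(ref_hidden.mean(dim=1))         # (B, H) global timbre
        aligned, _ = self.cross_attn(phoneme_hidden, ref_hidden, ref_hidden)
        emotion = self.emotion_proj(aligned)                      # (B, T_phon, H) per-phoneme emotion
        return timbre, emotion
```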

To ensure that these two extractors capture distinct attributes, an unsupervised Mutual Information Neural Estimation (MINE) technique is employed. MINE explicitly pushes the timbre and emotion representations apart, ensuring that the timbre embedding retains only speaker-specific information, while the emotion embeddings capture only prosodic nuance. This allows the model to synthesize speech that is both consistent in its speaker’s voice and rich in emotional expression.
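
The MINE estimator follows the standard Donsker-Varadhan lower bound on mutual information (Belghazi et al., 2018). The sketch below shows how a statistics network could estimate the mutual information between the timbre vector and emotion embeddings pooled per utterance; the network sizes and pooling choice are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Statistics network for Mutual Information Neural Estimation."""
    def __init__(self, timbre_dim=256, emotion_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(timbre_dim + emotion_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, timbre, emotion):
        # timbre: (B, D_t); emotion: (B, D_e), e.g. phoneme embeddings averaged per utterance.
        joint = self.net(torch.cat([timbre, emotion], dim=-1))      # samples from p(t, e)
        shuffled = emotion[torch.randperm(emotion.size(0), device=emotion.device)]
        marginal = self.net(torch.cat([timbre, shuffled], dim=-1))  # samples from p(t)p(e)
        # Donsker-Varadhan lower bound on I(timbre; emotion).
        return joint.mean() - torch.log(torch.exp(marginal).mean() + 1e-8)
```

In this scheme the statistics network is trained to maximize the estimate (tightening the bound), while the Style Encoder is trained to minimize it, which is what pushes the timbre and emotion representations apart.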

The disentanglement process is further guided by explicitly predicting emotion and speaker labels from the respective emotion and timbre features. This provides clear optimization objectives, helping the system to effectively separate these distinct speech attributes. The model is built upon the FastSpeech 2 architecture, a well-known TTS backbone, and undergoes a two-stage training process to ensure clean and disentangled representations.
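
A hedged sketch of what such auxiliary predictors and a combined objective might look like is given below; the class counts, pooling, and loss weighting are illustrative assumptions rather than details taken from the paper.

```python
import torch.nn as nn

class DisentanglementHeads(nn.Module):
    """Auxiliary predictors: emotion labels from the emotion embeddings,
    speaker labels from the timbre embedding."""
    def __init__(self, hidden_dim=256, num_emotions=5, num_speakers=10):
        super().__init__()
        self.emotion_classifier = nn.Linear(hidden_dim, num_emotions)
        self.speaker_classifier = nn.Linear(hidden_dim, num_speakers)

    def forward(self, emotion_emb, timbre_emb):
        # emotion_emb: (B, T_phon, H) -> pool over phonemes before classifying.
        emotion_logits = self.emotion_classifier(emotion_emb.mean(dim=1))
        speaker_logits = self.speaker_classifier(timbre_emb)
        return emotion_logits, speaker_logits

# Illustrative combined objective on top of the FastSpeech 2 losses:
# total_loss = fastspeech2_loss                       # mel / duration / pitch / energy terms
#            + ce(emotion_logits, emotion_labels)     # emotion prediction from emotion features
#            + ce(speaker_logits, speaker_labels)     # speaker prediction from timbre features
#            + lambda_mi * mi_estimate                # minimized by the encoder (see MINE sketch)
```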

Experimental results demonstrate that this new method significantly outperforms several strong baseline TTS systems, including Global Style Token (GST), StyleSpeech, MIST, and DC Comix TTS. It achieves superior performance in both subjective evaluations (Mean Opinion Score for naturalness and Similarity MOS for style consistency) and objective metrics (mel-cepstral distortion and unweighted average accuracy for emotion recognition). Visualizations using t-SNE further confirm the effectiveness of the disentanglement strategy, showing tight, well-separated clusters for different emotion categories, unlike the scattered and overlapping embeddings from baseline models.
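
For readers who want to reproduce this kind of clustering check on their own embeddings, a short scikit-learn t-SNE sketch is shown below. It is a generic recipe, not the authors' plotting code, and the perplexity and utterance-level averaging are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_emotion_clusters(emotion_embeddings, emotion_labels):
    """Project utterance-level emotion embeddings to 2-D and color by emotion category."""
    # emotion_embeddings: (N, D) array, e.g. phoneme-level embeddings averaged per utterance.
    points = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(
        np.asarray(emotion_embeddings)
    )
    labels = np.asarray(emotion_labels)
    for emotion in sorted(set(emotion_labels)):
        mask = labels == emotion
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=str(emotion))
    plt.legend()
    plt.title("t-SNE of emotion embeddings")
    plt.show()
```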

This work highlights the significant potential of combining phoneme-level emotion modeling with principled feature disentanglement for creating highly expressive and high-fidelity emotional TTS systems. Looking ahead, the researchers plan to extend these techniques to multimodal generation and conversational speech dialogue systems, and to port their phoneme-level emotion embedding and disentanglement methods to more advanced diffusion-based and language-model-based TTS backbones.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
