
Unlocking Hidden Controls in Text-To-Speech Models with Repeated Fine-Tuning

TLDR: RepeaTTS is a novel fine-tuning method for Text-To-Speech models that discovers new, controllable speech features by analyzing the inherent variations in synthesized samples. It uses Principal Component Analysis (PCA) to identify latent features, which are then used as new labels for secondary fine-tuning. The method successfully improved controllability for a model not initially trained on emotions, uncovering emotional intensity and neutral/emotive speech features, though its effectiveness varied depending on the model’s initial training.

Text-To-Speech (TTS) models have become incredibly advanced, producing speech that sounds very natural. However, these models often come with a challenge: users have limited control over how the speech is delivered. While some models allow adjustments for things like speaking rate or perceived gender, this control is usually restricted to features the model was explicitly trained on. On the flip side, even with the same input, TTS models can produce variations in speech that are hard to predict or control, often reflecting biases or patterns from the training data.

A new research paper titled “RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning” introduces an innovative approach to tackle these issues. The core idea is to leverage the very “uncontrollable variance” of the model to discover new, hidden features that can then be controlled. This is achieved through a novel fine-tuning process.

How RepeaTTS Works

The method involves a multi-step process. First, thousands of speech samples are generated by the TTS model. These samples, even with identical inputs, will naturally show variations. The researchers then use a technique called Principal Component Analysis (PCA) to analyze these variations. PCA helps identify the “latent features” – underlying patterns or characteristics – that account for the most significant differences in the generated speech. Think of it as finding the most important dimensions along which the speech varies.
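The variance-analysis step described above can be sketched in a few lines. The feature matrix here is random stand-in data; in practice it would be prosodic or acoustic features extracted from thousands of samples synthesized from the same input, and the feature count and component count are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for per-sample feature vectors (e.g. pitch, energy, duration
# statistics) extracted from 2000 synthesized utterances of the same text.
features = rng.normal(size=(2000, 16))

# PCA finds the directions (principal components) along which the
# synthesized speech varies the most.
pca = PCA(n_components=4)
scores = pca.fit_transform(features)  # each sample's coordinate on each component

print(scores.shape)  # (2000, 4): 2000 samples, 4 latent dimensions
```

The first few components, ranked by `pca.explained_variance_ratio_`, are the candidate "hidden features" that explain most of the model's uncontrolled variation.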

Once these latent features are identified, they are incorporated as new "labels" for a secondary fine-tuning stage. This means the model is retrained, but this time, it learns to associate these newly discovered features with specific controls. The process is iterative: after one set of features is incorporated, the analysis can be repeated to find further features that explain the remaining variation.
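One simple way to turn a discovered component into a fine-tuning label is to attach each sample's component score to its training text, for example as a control prefix. This is a hypothetical sketch of that idea; the token format (`pc1=…`) is an assumption for illustration and not the paper's exact labeling scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
pc1_scores = rng.normal(size=2000)  # stand-in for first-component scores per sample

def make_training_pair(text: str, score: float) -> str:
    """Prefix the text with the sample's score on the discovered component,
    so the secondary fine-tuning stage can learn it as a control input."""
    return f"[pc1={score:+.2f}] {text}"

example = make_training_pair("Halló heimur", pc1_scores[0])
print(example)  # e.g. "[pc1=+0.35] Halló heimur"
```

At inference time, the user sets the prefix value directly, steering the model along the dimension that was previously uncontrollable.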

Evaluation and Key Findings

The proposed method was evaluated on two models trained on an expressive Icelandic speech corpus. One model, T3-emotion, was trained with explicit emotional labels, while the other, T3, was not. The researchers found that for the T3 model (trained without emotional labels), the method successfully uncovered both continuous features (like emotional intensity) and discrete ones (like neutral vs. emotive speech). This significantly improved the overall controllability of the model, allowing users to influence aspects of speech that were previously hidden or uncontrollable.

For instance, by analyzing the variations in the T3 model’s output, the researchers could identify a “low intensity,” “medium intensity,” and “high intensity” emotional spectrum. After fine-tuning, the model could then be prompted to generate speech with these specific emotional intensities. They also found clear clusters corresponding to neutral and emotive speech, which could then be used as control inputs.
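Discretizing a continuous component into a low/medium/high spectrum can be done by binning the component scores. The tertile cutoffs below are an illustrative assumption, not the thresholds used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
intensity = rng.normal(size=2000)  # stand-in for "intensity" component scores

# Split the continuous axis into three discrete control labels at tertiles.
lo, hi = np.quantile(intensity, [1 / 3, 2 / 3])
labels = np.where(intensity < lo, "low",
         np.where(intensity < hi, "medium", "high"))

print(sorted(set(labels)))  # ['high', 'low', 'medium']
```

The resulting labels can then serve as discrete control inputs during the secondary fine-tuning stage, just like the neutral/emotive clusters the researchers found.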

Interestingly, the method’s success varied. When applied to the T3-emotion model, which already had emotional labels, the feature discovery process was less effective in finding new prosodic features. Instead, it sometimes revealed correlations related to the recording environment of the original corpus, highlighting the method’s sensitivity to any variation in the speech signal.

Implications

This research demonstrates a promising path towards making Text-To-Speech models more controllable and user-friendly. By systematically exploring and leveraging the inherent variability of these models, RepeaTTS offers a way to uncover and integrate new control features, potentially leading to more nuanced and expressive synthetic speech. While challenges remain, particularly in distinguishing meaningful prosodic features from other variations, this work opens doors for future advancements in controllable TTS.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
