
Unlocking Hidden Controls in Text-To-Speech Models with Repeated Fine-Tuning

TLDR: RepeaTTS is a novel fine-tuning method for Text-To-Speech models that discovers new, controllable speech features by analyzing the inherent variations in synthesized samples. It uses Principal Component Analysis (PCA) to identify latent features, which are then used as new labels for secondary fine-tuning. The method successfully improved controllability for a model not initially trained on emotions, uncovering emotional intensity and neutral/emotive speech features, though its effectiveness varied depending on the model’s initial training.

Text-To-Speech (TTS) models have become incredibly advanced, producing speech that sounds very natural. However, these models often come with a challenge: users have limited control over how the speech is delivered. While some models allow adjustments for things like speaking rate or perceived gender, this control is usually restricted to features the model was explicitly trained on. On the flip side, even with the same input, TTS models can produce variations in speech that are hard to predict or control, often reflecting biases or patterns from the training data.

A new research paper titled “RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning” introduces an innovative approach to tackle these issues. The core idea is to leverage the very “uncontrollable variance” of the model to discover new, hidden features that can then be controlled. This is achieved through a novel fine-tuning process.

How RepeaTTS Works

The method involves a multi-step process. First, thousands of speech samples are generated by the TTS model. These samples, even with identical inputs, will naturally show variations. The researchers then use a technique called Principal Component Analysis (PCA) to analyze these variations. PCA helps identify the “latent features” – underlying patterns or characteristics – that account for the most significant differences in the generated speech. Think of it as finding the most important dimensions along which the speech varies.
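The variance-analysis step described above can be sketched in a few lines. The feature matrix here is random stand-in data; in practice it would be prosodic or acoustic features extracted from thousands of samples synthesized from the same input, and the feature count and component count are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for per-sample feature vectors (e.g. pitch, energy, duration
# statistics) extracted from 2000 synthesized utterances of the same text.
features = rng.normal(size=(2000, 16))

# PCA finds the directions (principal components) along which the
# synthesized speech varies the most.
pca = PCA(n_components=4)
scores = pca.fit_transform(features)  # each sample's coordinate on each component

print(scores.shape)  # (2000, 4): 2000 samples, 4 latent dimensions
```

The first few components, ranked by `pca.explained_variance_ratio_`, are the candidate "hidden features" that explain most of the model's uncontrolled variation.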

Once these latent features are identified, they are incorporated as new "labels" for a secondary fine-tuning stage. This means the model is retrained, but this time, it learns to associate these newly discovered features with specific controls. The process is iterative: after one set of features is incorporated, the analysis can be repeated to find further features that explain the remaining variation.
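One simple way to turn a discovered component into a fine-tuning label is to attach each sample's component score to its training text, for example as a control prefix. This is a hypothetical sketch of that idea; the token format (`pc1=…`) is an assumption for illustration and not the paper's exact labeling scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
pc1_scores = rng.normal(size=2000)  # stand-in for first-component scores per sample

def make_training_pair(text: str, score: float) -> str:
    """Prefix the text with the sample's score on the discovered component,
    so the secondary fine-tuning stage can learn it as a control input."""
    return f"[pc1={score:+.2f}] {text}"

example = make_training_pair("Halló heimur", pc1_scores[0])
print(example)  # e.g. "[pc1=+0.35] Halló heimur"
```

At inference time, the user sets the prefix value directly, steering the model along the dimension that was previously uncontrollable.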

Evaluation and Key Findings

The proposed method was evaluated on two models trained on an expressive Icelandic speech corpus. One model, T3-emotion, was trained with explicit emotional labels, while the other, T3, was not. The researchers found that for the T3 model (trained without emotional labels), the method successfully uncovered both continuous features (like emotional intensity) and discrete ones (like neutral vs. emotive speech). This significantly improved the overall controllability of the model, allowing users to influence aspects of speech that were previously hidden or uncontrollable.

For instance, by analyzing the variations in the T3 model’s output, the researchers could identify a “low intensity,” “medium intensity,” and “high intensity” emotional spectrum. After fine-tuning, the model could then be prompted to generate speech with these specific emotional intensities. They also found clear clusters corresponding to neutral and emotive speech, which could then be used as control inputs.
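Discretizing a continuous component into a low/medium/high spectrum can be done by binning the component scores. The tertile cutoffs below are an illustrative assumption, not the thresholds used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
intensity = rng.normal(size=2000)  # stand-in for "intensity" component scores

# Split the continuous axis into three discrete control labels at tertiles.
lo, hi = np.quantile(intensity, [1 / 3, 2 / 3])
labels = np.where(intensity < lo, "low",
         np.where(intensity < hi, "medium", "high"))

print(sorted(set(labels)))  # ['high', 'low', 'medium']
```

The resulting labels can then serve as discrete control inputs during the secondary fine-tuning stage, just like the neutral/emotive clusters the researchers found.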

Interestingly, the method’s success varied. When applied to the T3-emotion model, which already had emotional labels, the feature discovery process was less effective in finding new prosodic features. Instead, it sometimes revealed correlations related to the recording environment of the original corpus, highlighting the method’s sensitivity to any variation in the speech signal.

Implications

This research demonstrates a promising path towards making Text-To-Speech models more controllable and user-friendly. By systematically exploring and leveraging the inherent variability of these models, RepeaTTS offers a way to uncover and integrate new control features, potentially leading to more nuanced and expressive synthetic speech. While challenges remain, particularly in distinguishing meaningful prosodic features from other variations, this work opens doors for future advancements in controllable TTS.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
