Inworld AI Introduces TTS-1 and TTS-1-Max: Advanced Models for High-Fidelity, Controllable Speech Synthesis

TLDR: Inworld AI has unveiled TTS-1 and TTS-1-Max, two large Transformer-based text-to-speech models (1.6B and 8.8B parameters respectively) capable of generating high-resolution 48 kHz speech in 11 languages with fine-grained emotional and non-verbal control. Their state-of-the-art performance is achieved through a unique three-stage training process involving large-scale pre-training, supervised fine-tuning, and reinforcement learning alignment. The models feature an innovative audio codec for 48 kHz output and an optimized streaming inference pipeline for low latency. Ethical considerations are addressed through restricted model weight release, inaudible watermarking, and user consent for voice cloning.

Inworld AI has introduced its latest advancements in text-to-speech (TTS) technology with the release of Inworld TTS-1 and TTS-1-Max. These two Transformer-based models are designed to generate highly realistic and expressive speech, addressing the growing demand for high-quality audio in various applications, from interactive assistants to content creation.

The Inworld TTS-1 model, with 1.6 billion parameters, is optimized for efficiency and real-time speech synthesis, making it suitable for on-device use cases. Its larger counterpart, TTS-1-Max, boasts 8.8 billion parameters, focusing on achieving the utmost quality and expressiveness for more demanding applications. Both models are capable of generating high-resolution 48 kHz speech with low latency and support 11 languages, offering fine-grained emotional control and non-verbal vocalizations through special audio markups.

A Robust Training Methodology

The performance of Inworld TTS-1 and TTS-1-Max stems from a systematic three-stage training framework. The process begins with large-scale pre-training on over 1 million hours of raw audio mixed with text data, which establishes a strong foundational model. The models then undergo supervised fine-tuning (SFT) on 200,000 hours of high-quality, filtered audio-text pairs. The final stage is reinforcement learning (RL) alignment using Group Relative Policy Optimization (GRPO). In this stage the models are optimized against automatic quality signals, including word error rate (WER), speaker similarity (SIM), and DNSMOS scores, which helps reduce synthesis artifacts and align the output with human preferences.
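To make the RL stage more concrete, the following is a minimal Python sketch of the group-relative advantage at the core of GRPO: several candidates are sampled for the same prompt, each is scored, and each score is standardized against its own group. The `combined_reward` function, its weights, and the candidate metric values are illustrative assumptions; the article does not say how Inworld combines WER, SIM, and DNSMOS into a reward.

```python
from statistics import mean, pstdev


def combined_reward(wer, sim, dnsmos, w_wer=1.0, w_sim=1.0, w_mos=1.0):
    """Hypothetical scalar reward (assumed form, not Inworld's).

    Lower WER is better, so it enters with a negative sign.
    DNSMOS is on a 1-5 scale and is rescaled to roughly 0-1.
    """
    return -w_wer * wer + w_sim * sim + w_mos * (dnsmos / 5.0)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up metrics for four candidates sampled for the same text prompt.
candidates = [
    {"wer": 0.02, "sim": 0.80, "dnsmos": 4.0},
    {"wer": 0.10, "sim": 0.75, "dnsmos": 3.5},
    {"wer": 0.01, "sim": 0.85, "dnsmos": 4.2},
    {"wer": 0.30, "sim": 0.60, "dnsmos": 3.0},
]
rewards = [combined_reward(**c) for c in candidates]
advantages = group_relative_advantages(rewards)

for c, a in zip(candidates, advantages):
    print(c, round(a, 3))
```

Candidates that score above their group's average receive a positive advantage and are reinforced; those below receive a negative one. Because the baseline comes from the group itself, no separate value network is needed.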

High-Resolution Audio and Expressive Control

A key innovation is a novel audio codec built on the X-codec2 architecture and augmented with a super-resolution module. This allows the models to natively generate 48 kHz audio for high-fidelity speech. To keep volume consistent, especially in streaming applications, a root-mean-square (RMS) loudness loss term was introduced during training. This helps prevent sudden volume changes when audio segments are concatenated.
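The idea behind an RMS loudness term can be sketched in a few lines of Python. This is an illustration only: the article does not give the exact formulation, so the dB-domain absolute difference below, and the function names, are assumptions. In actual training the same computation would be written with differentiable tensor operations.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_db(samples, eps=1e-9):
    """RMS level in decibels; eps avoids log(0) on digital silence."""
    return 20.0 * math.log10(rms(samples) + eps)


def rms_loudness_loss(predicted, reference):
    """Assumed loss form: absolute RMS level difference in dB."""
    return abs(rms_db(predicted) - rms_db(reference))


# A 220 Hz tone at 48 kHz, and the same tone at half amplitude.
reference = [0.5 * math.sin(2 * math.pi * 220 * n / 48000) for n in range(4800)]
quieter = [0.5 * x for x in reference]

print(rms_loudness_loss(reference, reference))
print(rms_loudness_loss(quieter, reference))
```

Halving the amplitude shifts the RMS level by about 6 dB, so a penalty of this kind pushes decoded segments toward the loudness of their targets, which is what keeps adjacent streamed chunks from jumping in volume.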

Beyond high fidelity, the models offer extensive control over speech characteristics. Through the use of textual audio markups, users can guide the generation process to include specific speaking styles (e.g., angry, happy, whispering) and non-verbal vocalizations (e.g., breathe, laugh, sigh). This fine-grained control is achieved by training the models on paired neutral and stylized utterances from the same speaker, enabling them to learn stylistic nuances while preserving speaker identity.
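As an illustration of how such markups might be handled before synthesis, here is a small Python sketch that splits marked-up text into style, non-verbal, and plain-text segments. The square-bracket syntax and the `parse_markup` helper are assumptions made for this example; the article names the styles and vocalizations but does not specify the markup format.

```python
import re

# Tag names taken from the examples in the article.
STYLE_TAGS = {"angry", "happy", "whispering"}
NONVERBAL_TAGS = {"breathe", "laugh", "sigh"}

# Assumed syntax: a lowercase tag name in square brackets, e.g. [happy].
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def parse_markup(text):
    """Split marked-up text into (kind, value) segments.

    kind is "style", "nonverbal", or "text". Unknown bracketed
    words are left untouched inside the surrounding plain text.
    """
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        name = match.group(1)
        if name not in STYLE_TAGS and name not in NONVERBAL_TAGS:
            continue
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        kind = "style" if name in STYLE_TAGS else "nonverbal"
        segments.append((kind, name))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "[happy] I can't believe it worked! [laugh] Let's go."
for segment in parse_markup(example):
    print(segment)
```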

Efficient Streaming and Ethical Considerations

Inworld TTS-1 and TTS-1-Max support two modes of execution: instant voice cloning, which uses a reference audio and its transcript to produce new generations, and professional voice cloning, where the model’s SpeechLM is fine-tuned on user voice recordings for enhanced similarity. Both modes operate within a low-latency streaming inference pipeline, designed for real-time speech synthesis. Optimizations include concatenating audio segments at non-voicing regions to mitigate artifacts and extending the audio decoder’s context to improve speaker identity replication.
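The segment-joining optimization can be illustrated with a short Python sketch that looks for a quiet spot near a desired chunk boundary and cuts there. Using low frame energy as a stand-in for "non-voicing" is an assumption made here, as are the function names and parameters; the article does not describe how the pipeline detects these regions.

```python
def frame_energy(samples, start, frame_len):
    """Mean squared amplitude of one frame."""
    frame = samples[start:start + frame_len]
    return sum(x * x for x in frame) / len(frame)


def quietest_cut_point(samples, target, search_radius, frame_len):
    """Pick a cut index near `target` at the center of the lowest-energy frame.

    Frames are scanned in steps of frame_len within search_radius
    samples of the target. Ties go to the earliest frame.
    """
    lo = max(0, target - search_radius)
    hi = min(len(samples) - frame_len, target + search_radius)
    if hi < lo:
        # Signal too short to search; fall back to the clipped target.
        return min(max(target, 0), len(samples))
    best_start = min(
        range(lo, hi + 1, frame_len),
        key=lambda s: frame_energy(samples, s, frame_len),
    )
    return best_start + frame_len // 2


# Synthetic signal: loud, then a 200-sample pause, then loud again.
loud = [0.5 if n % 2 == 0 else -0.5 for n in range(1000)]
signal = loud + [0.0] * 200 + loud

cut = quietest_cut_point(signal, target=1000, search_radius=400, frame_len=100)
first_chunk, second_chunk = signal[:cut], signal[cut:]
print(cut)
```

Cutting inside a pause means both sides of the joint are near silence, so a small mismatch between consecutive chunks is much less likely to produce an audible click than a cut in the middle of a voiced sound.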

The development team collaborated with Modular to optimize the inference engine, resulting in significant latency improvements. The streaming API can deliver the first two seconds of synthesized audio approximately 70% faster than standard implementations, showcasing the efficiency of their architecture.

Recognizing the ethical implications of powerful voice cloning technology, Inworld AI has implemented several safeguards. The model weights are not publicly released to mitigate misuse risks. All audio generated for production applications includes an inaudible watermark for content authentication, helping to distinguish synthetic speech from human speech. Additionally, users must explicitly confirm they have the rights to any voice they intend to clone using the instant voice cloning feature.

Inworld AI’s commitment to advancing speech synthesis is evident in these models, which combine state-of-the-art quality with practical utility and a strong focus on responsible AI development. For more technical details, refer to the full research paper.
