TLDR: Inworld AI has unveiled TTS-1 and TTS-1-Max, two large Transformer-based text-to-speech models (1.6B and 8.8B parameters respectively) capable of generating high-resolution 48 kHz speech in 11 languages with fine-grained emotional and non-verbal control. Their state-of-the-art performance is achieved through a unique three-stage training process involving large-scale pre-training, supervised fine-tuning, and reinforcement learning alignment. The models feature an innovative audio codec for 48 kHz output and an optimized streaming inference pipeline for low latency. Ethical considerations are addressed through restricted model weight release, inaudible watermarking, and user consent for voice cloning.
Inworld AI has introduced its latest advancements in text-to-speech (TTS) technology with the release of Inworld TTS-1 and TTS-1-Max. These two Transformer-based models are designed to generate highly realistic and expressive speech, addressing the growing demand for high-quality audio in various applications, from interactive assistants to content creation.
The Inworld TTS-1 model, with 1.6 billion parameters, is optimized for efficiency and real-time speech synthesis, making it suitable for on-device use cases. Its larger counterpart, TTS-1-Max, boasts 8.8 billion parameters, focusing on achieving the utmost quality and expressiveness for more demanding applications. Both models are capable of generating high-resolution 48 kHz speech with low latency and support 11 languages, offering fine-grained emotional control and non-verbal vocalizations through special audio markups.
A Robust Training Methodology
The exceptional performance of Inworld TTS-1 and TTS-1-Max stems from a systematic three-stage training framework. This process begins with large-scale pre-training on over 1 million hours of raw audio mixed with text data, establishing a strong foundational model. Following this, the models undergo supervised fine-tuning (SFT) using 200,000 hours of high-quality, filtered audio-text pairs. The final stage involves reinforcement learning (RL) alignment, utilizing a technique called Group Relative Policy Optimization (GRPO). This crucial step fine-tunes the models against perceptual quality metrics, such as word error rate (WER), speaker similarity (SIM), and DNSMOS scores, helping to reduce synthesis artifacts and align the output with human preferences.
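The group-relative step in GRPO can be illustrated with a minimal sketch. It assumes several candidate syntheses are sampled for the same prompt and that each is scored with WER, SIM, and DNSMOS. The `combined_reward` function, its weights, and the division of DNSMOS by 5 are hypothetical choices for illustration, not the reward used by Inworld.

```python
import statistics

def combined_reward(wer, sim, dnsmos, w_wer=1.0, w_sim=1.0, w_mos=1.0):
    """Hypothetical scalar reward: lower WER is better, higher SIM and DNSMOS are better."""
    # DNSMOS is on a 1-5 scale; dividing by 5 is an assumed rescaling.
    return -w_wer * wer + w_sim * sim + w_mos * (dnsmos / 5.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidate syntheses for one prompt, each scored as (WER, SIM, DNSMOS).
samples = [(0.02, 0.80, 3.9), (0.10, 0.75, 3.5), (0.00, 0.85, 4.1), (0.30, 0.60, 3.0)]
rewards = [combined_reward(*s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's average receive a positive advantage and are reinforced; those below receive a negative one. No separate value network is needed because the group itself serves as the baseline.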
High-Resolution Audio and Expressive Control
A key innovation is a novel audio codec built on the X-codec2 architecture and augmented with a super-resolution module. This allows the models to generate 48 kHz audio natively. To keep volume consistent, especially in streaming applications, a root-mean-square (RMS) loudness loss term was introduced during training. This helps prevent sudden volume changes when audio segments are concatenated.
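As a rough illustration of what an RMS loudness term measures, the sketch below compares the RMS level of a decoded segment with that of its reference, in decibels. The article does not give the exact formulation, so the `loudness_loss` function and its squared-dB form are assumptions.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def rms_db(samples, eps=1e-9):
    """RMS level in decibels; eps avoids log(0) on silent input."""
    return 20.0 * math.log10(rms(samples) + eps)

def loudness_loss(decoded, reference):
    """Assumed penalty: squared difference between the two RMS levels in dB."""
    return (rms_db(decoded) - rms_db(reference)) ** 2

# One full period of a sine wave, and the same wave at half amplitude.
tone = [math.sin(2 * math.pi * i / 100) for i in range(100)]
quiet = [0.5 * x for x in tone]

same_loss = loudness_loss(tone, tone)    # identical loudness, so no penalty
drop_loss = loudness_loss(quiet, tone)   # about a 6 dB drop, so a clear penalty
```

A term of this kind penalizes segments that come out louder or quieter than their reference, which is what keeps concatenated streaming chunks at a steady level.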
Beyond high fidelity, the models offer extensive control over speech characteristics. Through the use of textual audio markups, users can guide the generation process to include specific speaking styles (e.g., angry, happy, whispering) and non-verbal vocalizations (e.g., breathe, laugh, sigh). This fine-grained control is achieved by training the models on paired neutral and stylized utterances from the same speaker, enabling them to learn stylistic nuances while preserving speaker identity.
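To make the idea of textual markups concrete, here is a small sketch that splits an annotated input into style tags, non-verbal tags, and plain text. The square-bracket syntax and the `split_markups` helper are hypothetical; the article names the styles and vocalizations but does not specify how they are written.

```python
import re

# Tag names come from the examples in the text; the [tag] syntax is an assumption.
STYLES = {"angry", "happy", "whispering"}
NONVERBAL = {"breathe", "laugh", "sigh"}
TAG = re.compile(r"\[(\w+)\]")

def split_markups(text):
    """Return a list of (kind, value) pairs in reading order."""
    parts = []
    pos = 0
    for match in TAG.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            parts.append(("text", before))
        name = match.group(1)
        if name in STYLES:
            kind = "style"
        elif name in NONVERBAL:
            kind = "nonverbal"
        else:
            kind = "unknown"
        parts.append((kind, name))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(("text", tail))
    return parts

parts = split_markups("[happy] We shipped it! [laugh] Finally.")
```

Here `parts` holds a style entry for `happy`, the first sentence, a non-verbal entry for `laugh`, and the closing word, in that order.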
Efficient Streaming and Ethical Considerations
Inworld TTS-1 and TTS-1-Max support two modes of execution: instant voice cloning, which uses a reference audio clip and its transcript to condition new generations, and professional voice cloning, where the model’s SpeechLM is fine-tuned on a user's voice recordings for closer similarity. Both modes run within a low-latency streaming inference pipeline designed for real-time speech synthesis. Optimizations include concatenating audio segments at non-voicing regions to mitigate artifacts and extending the audio decoder’s context to improve speaker identity replication.
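Joining segments at non-voicing regions can be sketched as a search for the lowest-energy frame near the intended boundary. This is a simplified, energy-based stand-in for voicing detection; `quietest_cut`, the 240-sample frame (5 ms at 48 kHz), and the search window are illustrative assumptions rather than details of the Inworld pipeline.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def quietest_cut(samples, target, frame_len=240, search_frames=10):
    """Return a sample index near `target` that falls in the quietest frame.

    Assumes `samples` contains at least one full frame.
    """
    energies = frame_energies(samples, frame_len)
    center = min(target // frame_len, len(energies) - 1)
    lo = max(0, center - search_frames)
    hi = min(len(energies), center + search_frames + 1)
    best = min(range(lo, hi), key=lambda i: energies[i])
    return best * frame_len + frame_len // 2  # cut in the middle of that frame

# Synthetic signal: loud, then a 10 ms pause, then loud again.
loud = [0.5 if i % 2 == 0 else -0.5 for i in range(2400)]
signal = loud + [0.0] * 480 + loud

cut = quietest_cut(signal, target=2000)  # moves the cut into the pause
```

With the synthetic signal above, the requested boundary at sample 2000 lies in loud audio, so the function shifts the cut into the silent gap that begins at sample 2400.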
The development team collaborated with Modular to optimize the inference engine, which significantly reduced latency. The streaming API can deliver the first two seconds of synthesized audio approximately 70% faster than standard implementations.
Recognizing the ethical implications of powerful voice cloning technology, Inworld AI has implemented several safeguards. The model weights are not publicly released to mitigate misuse risks. All audio generated for production applications includes an inaudible watermark for content authentication, helping to distinguish synthetic speech from human speech. Additionally, users must explicitly confirm they have the rights to any voice they intend to clone using the instant voice cloning feature.
Inworld AI’s commitment to advancing speech synthesis is evident in these models, which combine state-of-the-art quality with practical utility and a strong focus on responsible AI development. For more technical details, refer to the full research paper.


