
ParsVoice: Unlocking High-Quality Text-to-Speech for the Persian Language

TLDR: ParsVoice is introduced as the largest publicly available, high-quality Persian speech corpus for Text-to-Speech (TTS) synthesis, comprising 1,804 hours from over 470 speakers. It was created using an automated pipeline that processes audiobooks, incorporating advanced techniques for sentence segmentation, audio-text alignment, multi-dimensional quality assessment, and speaker identification, aiming to bridge the data gap for low-resource languages like Persian and accelerate speech technology development.

The Persian language, spoken by over 100 million people globally, has long faced a significant challenge in the realm of high-quality speech data, particularly for Text-to-Speech (TTS) synthesis. This scarcity has hindered the development of advanced Persian speech technologies, leaving the language underrepresented compared to its high-resource counterparts like English.

Addressing this critical gap, researchers Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, and Azadeh Shakery from the University of Tehran have introduced ParsVoice, the largest and most comprehensive Persian speech corpus specifically designed for modern TTS applications. This groundbreaking dataset offers a substantial leap forward, providing 1,804 hours of high-quality, clean speech from over 470 distinct speakers, making it comparable in speaker diversity and audio quality to major English corpora.

An Innovative Automated Pipeline

The creation of ParsVoice was made possible through a sophisticated, scalable, and automated pipeline that transforms raw audiobook content into TTS-ready data. This pipeline incorporates several novel techniques that can also be adapted for other low-resource languages.

Data Collection and Source Selection

The primary data source for ParsVoice was IranSeda, a platform hosting over 3,800 audiobooks. This choice was based on the platform’s content diversity, professional recording quality (44.1 kHz sampling rate), and open availability, ensuring the resulting dataset could be freely distributed and used for both academic research and practical speech technology development.

Intelligent Audio Segmentation

Raw audiobook files, often hours long, require precise segmentation into sentence-level chunks while preserving linguistic integrity. The ParsVoice pipeline employs a BERT-based sentence completion detector, fine-tuned on Persian, to identify and filter sentence fragments. This model, integrated into a three-phase segmentation process, first uses WebRTC Voice Activity Detection (VAD) to find silence-based boundaries, then transcribes segments using the Google Speech-to-Text API (chosen for its lower word error rate on Persian), and finally validates linguistic completeness, iteratively extending boundaries if necessary.
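The three-phase loop described above can be sketched as follows. This is a hypothetical illustration of the control flow only: the VAD, the Speech-to-Text call, and the BERT-based completeness check are stubbed out as plain functions (`transcribe`, `is_complete_sentence`), and per-frame energies stand in for real audio.

```python
# Hypothetical sketch of the three-phase segmentation loop:
# 1) silence-based boundaries, 2) transcription, 3) completeness
# validation with iterative boundary extension.

def find_silence_boundaries(energies, threshold=0.1):
    """Phase 1: candidate boundaries where energy drops below a silence
    threshold (a stand-in for WebRTC VAD)."""
    boundaries = [0]
    for i, e in enumerate(energies):
        if e < threshold and (i == 0 or energies[i - 1] >= threshold):
            boundaries.append(i)
    boundaries.append(len(energies))
    return sorted(set(boundaries))

def segment_audio(energies, transcribe, is_complete_sentence, max_extend=3):
    """Phases 2-3: transcribe each chunk and extend its end boundary to the
    next silence point until the transcript is a complete sentence (or the
    extension budget runs out)."""
    bounds = find_silence_boundaries(energies)
    segments, i = [], 0
    while i < len(bounds) - 1:
        start, j = bounds[i], i + 1
        for _ in range(max_extend):
            text = transcribe(start, bounds[j])
            if is_complete_sentence(text) or j == len(bounds) - 1:
                break
            j += 1  # extend over the next silence boundary
        segments.append((start, bounds[j], transcribe(start, bounds[j])))
        i = j
    return segments
```

In practice the stubs would be backed by WebRTC VAD frames, the Google Speech-to-Text API, and the fine-tuned Persian BERT detector; the sketch shows only how the three phases feed into each other.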

Boundary Optimization Algorithm

Even with accurate transcription, audio segments can contain unwanted silence, background noise, or acoustic artifacts at their start and end points. To address this, a boundary optimization algorithm uses a binary search strategy combined with linear fine-tuning. This ensures that each segment contains only essential speech content, maintaining transcription accuracy and improving TTS model performance.
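A minimal sketch of that idea, under assumed simplifications: a segment opens and closes with runs of low-energy frames, so a binary search can locate the first and last frames above a speech threshold, after which a short linear pass pulls the boundary outward over soft onsets and decays. Frame energies, the threshold values, and the `fine_window` parameter are all illustrative, not taken from the paper.

```python
# Hypothetical sketch of boundary optimization: binary search to find the
# speech region, then linear fine-tuning at the edges.

def first_speech(energies, threshold):
    """Binary search for the first frame at/above the threshold, assuming
    the segment opens with a contiguous run of silence frames."""
    lo, hi = 0, len(energies)
    while lo < hi:
        mid = (lo + hi) // 2
        if energies[mid] >= threshold:
            hi = mid
        else:
            lo = mid + 1
    return lo

def trim_segment(energies, threshold=0.1, fine_window=2):
    """Return (start, end) frame indices enclosing the speech content."""
    start = first_speech(energies, threshold)
    end = len(energies) - first_speech(energies[::-1], threshold)
    for _ in range(fine_window):  # linear fine-tuning: keep soft onsets
        if start > 0 and energies[start - 1] >= threshold / 2:
            start -= 1
        else:
            break
    for _ in range(fine_window):  # ...and soft decays at the tail
        if end < len(energies) and energies[end] >= threshold / 2:
            end += 1
        else:
            break
    return start, end
```

The binary search keeps the trimming cheap on long segments; the linear pass guards against clipping quiet word onsets that sit just under the main threshold.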

Multi-Dimensional Quality Assessment

Quality control is paramount for TTS training data. ParsVoice implements a comprehensive assessment across both audio and text dimensions. The Persian Text Quality Framework evaluates transcriptions based on character quality, length, repetition, and phonetic coverage, assigning a weighted score. Similarly, the Audio Quality Metrics assess recordings for clarity, absence of distortions, and suitability for speech processing, considering factors like signal-to-noise ratio, dynamic range, and background music presence.
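The weighted-score idea can be sketched as below. The metric names, the weights, and the distinct-word repetition metric are invented for illustration; the paper's actual dimensions and weighting are not reproduced here.

```python
# Hypothetical sketch of a weighted multi-dimensional quality score.
# Each dimension is a score in [0, 1]; weights sum to 1.

TEXT_WEIGHTS = {
    "char_quality": 0.3,      # valid Persian characters vs. noise
    "length": 0.2,            # within a usable sentence-length range
    "repetition": 0.2,        # penalize degenerate repeated text
    "phonetic_coverage": 0.3, # variety of phonemes represented
}

def weighted_score(metrics, weights):
    """Combine per-dimension scores into a single weighted score."""
    assert set(metrics) == set(weights), "each dimension needs a weight"
    return sum(weights[k] * metrics[k] for k in weights)

def repetition_score(text):
    """Illustrative repetition metric: share of distinct words."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0
```

An analogous weighted score over audio dimensions (signal-to-noise ratio, dynamic range, background-music presence) would follow the same shape, with each raw measurement normalized into [0, 1] first.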

Speaker Identification

To label speakers consistently across the corpus, a two-stage identification pipeline based on ECAPA-TDNN embeddings was used: first, local speaker diarization within each audiobook via clustering, then global speaker identification to merge local speakers into consistent identities across all audiobooks. The approach achieved 97.0% consistency with known narrator labels.
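The global grouping stage can be sketched as a greedy similarity match over speaker embeddings. This is a simplified stand-in: real ECAPA-TDNN embeddings would come from a pretrained model, the similarity threshold is invented, and production diarization would use a proper clustering algorithm rather than this first-match greedy pass.

```python
# Hypothetical sketch of global speaker grouping: assign each local-speaker
# embedding to the most similar existing identity, or open a new one.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def group_speakers(embeddings, threshold=0.75):
    """Greedy grouping: each embedding joins the closest identity whose
    centroid similarity clears the threshold, else starts a new identity."""
    identities = []  # list of (centroid, member indices)
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, (centroid, _) in enumerate(identities):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            identities.append((emb, [len(labels)]))
            labels.append(len(identities) - 1)
        else:
            identities[best][1].append(len(labels))
            labels.append(best)
    return labels
```

The same comparison logic underpins both stages: locally it separates narrators within one audiobook, globally it recognizes the same narrator reappearing across books.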

Final Data Cleaning and Preparation

The final stage involved creating a high-quality subset by removing audio files and text segments that did not meet specific quality score thresholds. Additionally, a fine-tuned ParsBERT model was used to restore missing punctuation, ensuring all text segments are properly punctuated and structurally complete.
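The threshold-based subset selection amounts to a simple filter over per-segment scores. The field names and threshold values below are illustrative assumptions, not the paper's actual cutoffs.

```python
# Hypothetical sketch of the final filtering step: keep only segments whose
# audio and text quality scores both clear their thresholds.

def select_high_quality(segments, audio_min=0.7, text_min=0.8):
    """Drop any segment failing either quality threshold."""
    return [s for s in segments
            if s["audio_score"] >= audio_min and s["text_score"] >= text_min]
```

Surviving segments would then pass through the punctuation-restoration model before being packaged into the released subset.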

Impact and Availability

The ParsVoice corpus, detailed in the research paper available at arXiv:2510.10774, represents a monumental effort to overcome data scarcity for Persian speech processing. By providing the largest publicly available, high-quality multi-speaker Persian speech dataset, it is set to accelerate research and development in Persian TTS and serve as a valuable template for similar initiatives in other low-resource languages. The complete dataset has been made publicly available, fostering an open environment for innovation in speech technology.

Dev Sundaram (https://blogs.edgentiq.com)
Dev Sundaram is an investigative tech journalist with a nose for exclusives and leaks. With stints in cybersecurity and enterprise AI reporting, Dev thrives on breaking big stories, from product launches and funding rounds to regulatory shifts, and giving them context. He believes journalism should push the AI industry toward transparency and accountability, especially as Generative AI becomes mainstream. You can reach out to him at: [email protected]
