
SloPalSpeech: Unlocking Advanced Speech Recognition for Slovak with Parliamentary Data

TL;DR: Researchers have introduced SloPalSpeech, a new 2,800-hour dataset of Slovak speech derived from parliamentary proceedings. This large-scale, meticulously processed dataset has been used to fine-tune OpenAI Whisper models, resulting in significant reductions in Word Error Rate (WER) for Slovak Automatic Speech Recognition (ASR). The fine-tuned Whisper-small model, for example, saw its WER drop by up to 70%, approaching the performance of the much larger Whisper-large-v3. The dataset, segmented transcripts, and fine-tuned models are publicly released to foster further research in low-resource language ASR.

Automatic Speech Recognition (ASR) technology has made incredible strides, especially with the advent of large multilingual models like OpenAI’s Whisper. However, the benefits of these advancements haven’t been equally distributed across all languages. Many so-called ‘low-resource’ languages, which lack extensive training data, still struggle to achieve high accuracy in ASR. Slovak, for instance, has historically faced this challenge, with existing public datasets offering only about 100 hours of speech data – a fraction of what’s needed for robust ASR systems.

To bridge this significant gap, researchers Erik Božík and Marek Šuppa have introduced SloPalSpeech, a groundbreaking new dataset comprising over 2,800 hours of Slovak speech. This massive collection is derived from publicly available parliamentary proceedings, offering a rich and formal source of spoken language. The creation of SloPalSpeech involved a sophisticated processing pipeline designed to transform long, raw recordings into clean, usable audio-transcript pairs, each no longer than 30 seconds, perfectly suited for training modern ASR models.

Building the Dataset: From Parliament to Processed Data

The journey to create SloPalSpeech began with identifying suitable data sources. The researchers turned to Slovak parliamentary hearings, which are not only publicly accessible but also regularly transcribed. They collected audio recordings from MediaPortál and official administrative transcripts from The Joint Czech and Slovak Digital Parliamentary Library. This involved overcoming several technical hurdles, such as extracting audio from HLS streams and converting DOCX transcripts into a parsable format.

A critical step was the parsing of transcripts to segment content based on speaker annotations and filter out non-verbatim text like transcriber’s notes. Due to inconsistencies in the original data, a heuristic-based approach was developed, leveraging a comprehensive list of National Council members to accurately identify speaker turns. Following this, a rigorous validation process was undertaken to ensure the collected transcripts accurately matched the audio, revealing and correcting issues like mismatched session lengths.
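The heuristic itself can be sketched in a few lines of Python, under simplifying assumptions: here a new turn opens when a line begins with a known member's name followed by a colon, and parenthesised transcriber notes are dropped. The names, the annotation format, and the function below are illustrative, not the authors' actual code (the real transcripts are far messier):

```python
import re

def split_speaker_turns(transcript, known_speakers):
    """Split a session transcript into (speaker, text) turns.

    Simplified sketch of the paper's heuristic: a line beginning with a
    known National Council member's name followed by ':' opens a new
    turn; parenthesised transcriber notes such as "(Potlesk.)" are
    filtered out as non-verbatim text.
    """
    pattern = re.compile(
        r"^(?P<speaker>" + "|".join(re.escape(s) for s in known_speakers) + r")\s*:"
    )
    turns, speaker, lines = [], None, []
    for raw in transcript.splitlines():
        # Remove transcriber notes in parentheses, then skip empty lines.
        line = re.sub(r"\([^)]*\)", "", raw).strip()
        if not line:
            continue
        match = pattern.match(line)
        if match:
            if speaker is not None:
                turns.append((speaker, " ".join(lines)))
            speaker = match.group("speaker")
            lines = [line[match.end():].strip()]
        elif speaker is not None:
            lines.append(line)
    if speaker is not None:
        turns.append((speaker, " ".join(lines)))
    return turns
```

A curated list of National Council members, as the paper describes, would feed the `known_speakers` argument.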

The raw audio and transcripts varied greatly in length, necessitating an innovative alignment and segmentation strategy. Traditional forced alignment methods proved unsuitable for the long-form, often imperfect parliamentary data. Instead, the team developed an anchor-based timestamping method. This involved generating a reference transcript with word-level timestamps using the WhisperX framework and then aligning it with the collected ground-truth transcripts. Anchors – words appearing in both versions – served as reliable reference points, allowing for the creation of strictly increasing timestamp sequences. Finally, segments were constructed, and a post-processing step involving re-applying the Whisper model and calculating Word Error Rate (WER) was used to filter out low-quality segments, ensuring the dataset’s overall integrity and quality.
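The anchoring idea can be illustrated with a small sketch. Here, anchors are exact word matches between the WhisperX-style reference (words with timestamps) and the ground-truth transcript, found with Python's `difflib`; the matcher choice and the data layout are assumptions, not the paper's implementation. The key property from the paper is preserved: anchor timestamps form a strictly increasing sequence.

```python
from difflib import SequenceMatcher

def anchor_timestamps(ref_words, gt_words):
    """Find anchor points between an ASR reference and a ground-truth transcript.

    ref_words: list of (word, start_time) pairs from a timestamped ASR pass.
    gt_words:  list of ground-truth words without timestamps.
    Returns (gt_index, time) anchors whose timestamps strictly increase,
    giving reliable reference points for cutting segments.
    """
    matcher = SequenceMatcher(
        a=[w for w, _ in ref_words], b=gt_words, autojunk=False
    )
    anchors = []
    last_t = float("-inf")
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            t = ref_words[block.a + k][1]
            if t > last_t:  # enforce a strictly increasing timestamp sequence
                anchors.append((block.b + k, t))
                last_t = t
    return anchors
```

Segments of up to 30 seconds could then be cut between consecutive anchors, with the WER-based filtering pass discarding segments where audio and text disagree too much.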

Transformative Results with Whisper Models

The true impact of SloPalSpeech was demonstrated through its use in fine-tuning several OpenAI Whisper models, including Whisper-small, medium, large-v3-turbo, and large-v3. The objective was to assess how this specialized, large-scale dataset could enhance Slovak ASR performance. The fine-tuning process involved careful setup, utilizing NVIDIA A10 GPUs and the HuggingFace Transformers library, with strategies adapted for different model sizes, including multi-GPU configurations for the largest Whisper-large-v3 model.
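A minimal sketch of such a setup with the HuggingFace Transformers library is shown below. The checkpoint name is a real Hub identifier, but every hyperparameter value here is an illustrative assumption, not the authors' reported configuration; imports are deferred so the config itself needs no external dependencies.

```python
# Illustrative fine-tuning configuration; all values are assumptions,
# not the hyperparameters reported in the paper.
FINETUNE_CONFIG = {
    "model_name": "openai/whisper-small",
    "language": "slovak",
    "task": "transcribe",
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,
    "warmup_steps": 500,
    "fp16": True,
}

def load_finetunable_whisper(cfg=FINETUNE_CONFIG):
    """Load a Whisper checkpoint and processor ready for Slovak fine-tuning."""
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        cfg["model_name"], language=cfg["language"], task=cfg["task"]
    )
    model = WhisperForConditionalGeneration.from_pretrained(cfg["model_name"])
    # Let the Slovak language token be set by the processor rather than
    # forcing decoder ids baked into the pretrained generation config.
    model.generation_config.forced_decoder_ids = None
    return model, processor
```

For the largest model, Whisper-large-v3, the paper notes a multi-GPU configuration was needed; the single-device sketch above would be wrapped accordingly.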

The results were remarkable. Benchmarking on standard Slovak datasets like Common Voice 21 and FLEURS showed significant reductions in Word Error Rate (WER) across all fine-tuned models. For instance, the fine-tuned Whisper-small model saw its WER drop by an impressive 65–70% compared to its baseline performance. This brought its accuracy close to that of the much larger, pre-trained Whisper-large-v3 model. Even the already highly capable Whisper-large-v3 model, trained on millions of hours of data, saw further WER reductions, particularly improving its handling of rare Slovak words.
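For reference, the WER metric behind these numbers is word-level edit distance normalised by reference length. A self-contained implementation (libraries such as jiwer provide the same computation) looks like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via standard word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

A 65–70% WER reduction means, for example, a baseline WER of 0.30 dropping to roughly 0.09–0.10 after fine-tuning.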

The Whisper-large-v3-turbo model emerged as a standout, offering an excellent balance of performance and efficiency. With significantly fewer parameters than large-v3, it achieved accuracy within about 1% of its larger counterpart, making it a strong candidate for practical applications.

Looking Ahead

The creation and public release of the SloPalSpeech dataset, along with the fine-tuned Whisper models and segmented transcripts (totaling 60 million words), marks a pivotal moment for Slovak ASR. This work not only elevates Slovak beyond its previous ‘low-resource’ status but also provides a robust methodology for aligning long-form audio with extensive transcripts, a technique that could be scaled to other low-resource languages in Europe and beyond.

While the fine-tuned models show a strong bias towards parliamentary speech, occasionally 'hallucinating' parliamentary terms, and exhibit an observed degradation in multilingual capability, these are known challenges in specialized model training. The researchers suggest future work could explore mitigation strategies, such as compression ratio checks during inference, and further research into maintaining multilingual performance while specializing in a target language.

This research, detailed in the paper SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data, represents a significant leap forward, demonstrating the power of dedicated, high-quality datasets in advancing ASR for languages that have historically been underserved.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
