
SloPalSpeech: Unlocking Advanced Speech Recognition for Slovak with Parliamentary Data

TL;DR: Researchers have introduced SloPalSpeech, a new 2,800-hour dataset of Slovak speech derived from parliamentary proceedings. This large-scale, meticulously processed dataset has been used to fine-tune OpenAI Whisper models, resulting in significant reductions in Word Error Rate (WER) for Slovak Automatic Speech Recognition (ASR). The fine-tuned Whisper-small model, for example, saw its WER drop by up to 70%, approaching the performance of the much larger Whisper-large-v3. The dataset, segmented transcripts, and fine-tuned models are publicly released to foster further research in low-resource language ASR.

Automatic Speech Recognition (ASR) technology has made incredible strides, especially with the advent of large multilingual models like OpenAI’s Whisper. However, the benefits of these advancements haven’t been equally distributed across all languages. Many so-called ‘low-resource’ languages, which lack extensive training data, still struggle to achieve high accuracy in ASR. Slovak, for instance, has historically faced this challenge, with existing public datasets offering only about 100 hours of speech data – a fraction of what’s needed for robust ASR systems.

To bridge this significant gap, researchers Erik Božík and Marek Šuppa have introduced SloPalSpeech, a groundbreaking new dataset comprising over 2,800 hours of Slovak speech. This massive collection is derived from publicly available parliamentary proceedings, offering a rich and formal source of spoken language. The creation of SloPalSpeech involved a sophisticated processing pipeline designed to transform long, raw recordings into clean, usable audio-transcript pairs, each no longer than 30 seconds, perfectly suited for training modern ASR models.

Building the Dataset: From Parliament to Processed Data

The journey to create SloPalSpeech began with identifying suitable data sources. The researchers turned to Slovak parliamentary hearings, which are not only publicly accessible but also regularly transcribed. They collected audio recordings from MediaPortál and official administrative transcripts from The Joint Czech and Slovak Digital Parliamentary Library. This involved overcoming several technical hurdles, such as extracting audio from HLS streams and converting DOCX transcripts into a parsable format.

A critical step was the parsing of transcripts to segment content based on speaker annotations and filter out non-verbatim text like transcriber’s notes. Due to inconsistencies in the original data, a heuristic-based approach was developed, leveraging a comprehensive list of National Council members to accurately identify speaker turns. Following this, a rigorous validation process was undertaken to ensure the collected transcripts accurately matched the audio, revealing and correcting issues like mismatched session lengths.
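The heuristic itself can be sketched in a few lines of Python, under simplifying assumptions: here a new turn opens when a line begins with a known member's name followed by a colon, and parenthesised transcriber notes are dropped. The names, the annotation format, and the function below are illustrative, not the authors' actual code (the real transcripts are far messier):

```python
import re

def split_speaker_turns(transcript, known_speakers):
    """Split a session transcript into (speaker, text) turns.

    Simplified sketch of the paper's heuristic: a line beginning with a
    known National Council member's name followed by ':' opens a new
    turn; parenthesised transcriber notes such as "(Potlesk.)" are
    filtered out as non-verbatim text.
    """
    pattern = re.compile(
        r"^(?P<speaker>" + "|".join(re.escape(s) for s in known_speakers) + r")\s*:"
    )
    turns, speaker, lines = [], None, []
    for raw in transcript.splitlines():
        # Remove transcriber notes in parentheses, then skip empty lines.
        line = re.sub(r"\([^)]*\)", "", raw).strip()
        if not line:
            continue
        match = pattern.match(line)
        if match:
            if speaker is not None:
                turns.append((speaker, " ".join(lines)))
            speaker = match.group("speaker")
            lines = [line[match.end():].strip()]
        elif speaker is not None:
            lines.append(line)
    if speaker is not None:
        turns.append((speaker, " ".join(lines)))
    return turns
```

A curated list of National Council members, as the paper describes, would feed the `known_speakers` argument.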

The raw audio and transcripts varied greatly in length, necessitating an innovative alignment and segmentation strategy. Traditional forced alignment methods proved unsuitable for the long-form, often imperfect parliamentary data. Instead, the team developed an anchor-based timestamping method. This involved generating a reference transcript with word-level timestamps using the WhisperX framework and then aligning it with the collected ground-truth transcripts. Anchors – words appearing in both versions – served as reliable reference points, allowing for the creation of strictly increasing timestamp sequences. Finally, segments were constructed, and a post-processing step involving re-applying the Whisper model and calculating Word Error Rate (WER) was used to filter out low-quality segments, ensuring the dataset’s overall integrity and quality.
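The anchoring idea can be illustrated with a small sketch. Here, anchors are exact word matches between the WhisperX-style reference (words with timestamps) and the ground-truth transcript, found with Python's `difflib`; the matcher choice and the data layout are assumptions, not the paper's implementation. The key property from the paper is preserved: anchor timestamps form a strictly increasing sequence.

```python
from difflib import SequenceMatcher

def anchor_timestamps(ref_words, gt_words):
    """Find anchor points between an ASR reference and a ground-truth transcript.

    ref_words: list of (word, start_time) pairs from a timestamped ASR pass.
    gt_words:  list of ground-truth words without timestamps.
    Returns (gt_index, time) anchors whose timestamps strictly increase,
    giving reliable reference points for cutting segments.
    """
    matcher = SequenceMatcher(
        a=[w for w, _ in ref_words], b=gt_words, autojunk=False
    )
    anchors = []
    last_t = float("-inf")
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            t = ref_words[block.a + k][1]
            if t > last_t:  # enforce a strictly increasing timestamp sequence
                anchors.append((block.b + k, t))
                last_t = t
    return anchors
```

Segments of up to 30 seconds could then be cut between consecutive anchors, with the WER-based filtering pass discarding segments where audio and text disagree too much.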

Transformative Results with Whisper Models

The true impact of SloPalSpeech was demonstrated through its use in fine-tuning several OpenAI Whisper models, including Whisper-small, medium, large-v3-turbo, and large-v3. The objective was to assess how this specialized, large-scale dataset could enhance Slovak ASR performance. The fine-tuning process involved careful setup, utilizing NVIDIA A10 GPUs and the HuggingFace Transformers library, with strategies adapted for different model sizes, including multi-GPU configurations for the largest Whisper-large-v3 model.
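A minimal sketch of such a setup with the HuggingFace Transformers library is shown below. The checkpoint name is a real Hub identifier, but every hyperparameter value here is an illustrative assumption, not the authors' reported configuration; imports are deferred so the config itself needs no external dependencies.

```python
# Illustrative fine-tuning configuration; all values are assumptions,
# not the hyperparameters reported in the paper.
FINETUNE_CONFIG = {
    "model_name": "openai/whisper-small",
    "language": "slovak",
    "task": "transcribe",
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,
    "warmup_steps": 500,
    "fp16": True,
}

def load_finetunable_whisper(cfg=FINETUNE_CONFIG):
    """Load a Whisper checkpoint and processor ready for Slovak fine-tuning."""
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        cfg["model_name"], language=cfg["language"], task=cfg["task"]
    )
    model = WhisperForConditionalGeneration.from_pretrained(cfg["model_name"])
    # Let the Slovak language token be set by the processor rather than
    # forcing decoder ids baked into the pretrained generation config.
    model.generation_config.forced_decoder_ids = None
    return model, processor
```

For the largest model, Whisper-large-v3, the paper notes a multi-GPU configuration was needed; the single-device sketch above would be wrapped accordingly.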

The results were remarkable. Benchmarking on standard Slovak datasets like Common Voice 21 and FLEURS showed significant reductions in Word Error Rate (WER) across all fine-tuned models. For instance, the fine-tuned Whisper-small model saw its WER drop by an impressive 65–70% compared to its baseline performance. This brought its accuracy close to that of the much larger, pre-trained Whisper-large-v3 model. Even the already highly capable Whisper-large-v3 model, trained on millions of hours of data, saw further WER reductions, particularly improving its handling of rare Slovak words.
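For reference, the WER metric behind these numbers is word-level edit distance normalised by reference length. A self-contained implementation (libraries such as jiwer provide the same computation) looks like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via standard word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

A 65–70% WER reduction means, for example, a baseline WER of 0.30 dropping to roughly 0.09–0.10 after fine-tuning.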

The Whisper-large-v3-turbo model emerged as a standout, offering an excellent balance of performance and efficiency. With significantly fewer parameters than large-v3, it achieved accuracy within about 1% of its larger counterpart, making it a strong candidate for practical applications.

Looking Ahead

The creation and public release of the SloPalSpeech dataset, along with the fine-tuned Whisper models and segmented transcripts (totaling 60 million words), marks a pivotal moment for Slovak ASR. This work not only elevates Slovak beyond its previous ‘low-resource’ status but also provides a robust methodology for aligning long-form audio with extensive transcripts, a technique that could be scaled to other low-resource languages in Europe and beyond.

While the fine-tuned models show a strong bias towards parliamentary speech, occasionally 'hallucinating' parliamentary terms, and exhibit an observed degradation in multilingual capability, these are known challenges in specialized model training. The researchers suggest future work could explore mitigation strategies, such as compression ratio checks during inference, and further research into maintaining multilingual performance while specializing in a target language.

This research, detailed in the paper SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data, represents a significant leap forward, demonstrating the power of dedicated, high-quality datasets in advancing ASR for languages that have historically been underserved.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
