
Leveraging TV Subtitles to Improve Speech Recognition Accuracy

TL;DR: This research introduces a novel method to enhance Automatic Speech Recognition (ASR) by using TV subtitles as context-rich prompts rather than as direct training targets. The approach is an iterative, weakly supervised training framework in which a pre-trained Whisper model refines its own generated pseudo transcripts, guided by subtitles. A Weighted Attention mechanism further improves accuracy during inference by emphasizing relevant subtitle tokens. Experiments on Flemish TV data demonstrate significant reductions in Word Error Rate (WER), particularly for rare and out-of-vocabulary words, showing that the method improves ASR data quality without requiring additional labeled data.

In the rapidly evolving landscape of artificial intelligence, foundation models like Whisper have made significant strides across many tasks, including Automatic Speech Recognition (ASR). However, these powerful models often struggle when applied to low-resource languages or specialized domains where high-quality labeled data is scarce. This scarcity leads to noticeable performance gaps, making it difficult for the models to generalize effectively.

A recent study, “Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR”, introduces an innovative method to tackle this problem by leveraging readily available TV subtitles. While subtitles are abundant, their inherent imprecision and lack of perfect alignment with spoken audio have traditionally limited their use as direct training targets for verbatim transcription. This research reimagines their role, transforming them into context-rich prompts that guide the ASR model rather than serving as strict supervision signals.

A New Approach to Weakly Supervised ASR

The core of this novel approach lies in its ability to handle the discrepancies between spoken audio and subtitle text. Instead of directly using subtitles for training, the method focuses on refining ‘pseudo transcripts’ – initial transcriptions generated by a pre-trained ASR model. Subtitles then act as guiding cues, facilitating an iterative refinement process for these pseudo transcripts.

The researchers developed a training methodology called Subtitle Prompting (SP) within a weakly supervised framework. They fine-tuned the Whisper model, using the generated pseudo transcripts as primary training targets. The subtitles were incorporated as contextual prompts for Whisper’s text decoder. This setup allows the model to progressively refine its understanding and generation of transcripts, gradually extracting more accurate information from the subtitles over multiple training iterations.
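To make the prompting side of this concrete, the sketch below shows how a subtitle can be supplied as a decoder prompt to Whisper using the Hugging Face transformers library. This is an illustrative approximation, not the authors' code: the checkpoint and helper names are placeholders, and the fine-tuning step itself is omitted.

```python
# A minimal sketch of subtitle prompting at inference time, using the
# Hugging Face transformers implementation of Whisper. This approximates
# the idea, not the authors' code: checkpoint and helper names are
# illustrative, and the fine-tuning step is omitted.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

def transcribe_with_subtitle_prompt(audio, sampling_rate, subtitle_text):
    """Decode a pseudo transcript while conditioning the decoder on a subtitle."""
    features = processor(audio, sampling_rate=sampling_rate,
                         return_tensors="pt").input_features
    # The subtitle is tokenized and prepended to the decoder context as a
    # prompt, so it biases generation without being forced into the output.
    prompt_ids = processor.get_prompt_ids(subtitle_text, return_tensors="pt")
    with torch.no_grad():
        predicted = model.generate(features, prompt_ids=prompt_ids,
                                   language="nl", task="transcribe")
    # Depending on the transformers version, the prompt tokens are returned
    # at the start of the sequence; strip them before decoding.
    new_tokens = predicted[:, prompt_ids.shape[-1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Because the subtitle only conditions the decoder's context, the model remains free to diverge from it wherever the audio disagrees, which is what allows the pseudo transcripts to improve over raw subtitles across iterations.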

Enhancing Accuracy with Weighted Attention

To further boost performance, the study introduces a Weighted Attention (WA) mechanism during the inference phase. Subtitles often contain extraneous information not directly reflected in the audio. The WA mechanism addresses this by selectively emphasizing relevant subtitle tokens while minimizing the influence of irrelevant ones. This is achieved by using the Gini coefficient, a measure adapted from economics, to quantify the distribution of cross-attention weights between subtitle tokens and speech frames. Tokens with highly focused attention weights are considered more relevant and are given greater emphasis, guiding the model to prioritize speech-relevant information.
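As a rough illustration of the scoring idea, the sketch below computes a Gini coefficient over each subtitle token's cross-attention distribution across speech frames: near-uniform (diffuse) attention scores close to 0, while sharply peaked attention approaches 1. The rescaling at the end is an assumed simplification; the paper's exact formulation of how the scores modulate the attention layers may differ.

```python
# A sketch of Gini-based relevance scoring for subtitle tokens. The exact
# way the paper folds these scores back into the attention layers may
# differ; the rescaling below is an assumed, simplified illustration.
import torch

def gini(p: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Gini coefficient of a 1-D non-negative weight vector.

    ~0 for diffuse (uniform) attention, approaching 1 for sharply
    focused attention on a single speech frame."""
    w, _ = torch.sort(p)
    n = w.numel()
    idx = torch.arange(1, n + 1, dtype=w.dtype)
    return ((2 * idx - n - 1) * w).sum() / (n * w.sum() + eps)

def weight_subtitle_tokens(cross_attn: torch.Tensor) -> torch.Tensor:
    """cross_attn: (subtitle_tokens, speech_frames) cross-attention map.

    Scales each token's row by its Gini score, so tokens with focused
    attention are emphasized and diffusely attending ones are damped."""
    scores = torch.stack([gini(row) for row in cross_attn])
    return cross_attn * scores.unsqueeze(1)
```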

Experimental Validation and Promising Results

The effectiveness of this approach was demonstrated through experiments conducted on a dataset of 760 hours of multi-genre Flemish TV recordings. The initial Word Error Rate (WER) of raw subtitles against manually annotated verbatim transcripts was a high 34.3%, underscoring the challenge of their direct use. However, by integrating SP training, the WER significantly decreased for both medium and large Whisper models, even after just one iteration of fine-tuning.
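For context, WER counts the word-level substitutions, insertions, and deletions needed to turn a hypothesis into the reference, divided by the reference length. A figure like the 34.3% above can be computed for any text pair in a few lines, for example with the jiwer library; the strings below are illustrative, not taken from the Flemish dataset.

```python
# Minimal WER computation with the jiwer library; the strings here are
# illustrative placeholders, not examples from the Flemish data.
import jiwer

reference = "the verbatim transcript of what the speaker actually said"
hypothesis = "the condensed subtitle text shown on screen"
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")
```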

A deeper analysis revealed that the method particularly benefited the transcription of rare and out-of-vocabulary words, which are typically challenging for ASR systems. This suggests that subtitle prompts not only help the model recall low-frequency words but also improve its generalization capabilities. The Weighted Attention mechanism further refined these results, consistently reducing WER across various test scenarios, with the best performance achieved when applied to all attention layers.

The iterative training process proved to be highly effective, with WERs consistently decreasing over successive cycles. After three iterations, the system achieved a remarkable WER of 10.34% on the evaluation set, showcasing the power of this iterative refinement driven by pseudo transcripts and subtitle prompts.

Conclusion

This research presents a significant step forward in enhancing the quality of ASR transcripts, especially in low-resource settings, without the need for additional verbatim labeled data. By cleverly re-purposing TV subtitles as contextual prompts and introducing a weighted attention mechanism, the study offers a robust and effective method for refining existing model transcripts and creating higher-quality datasets for weakly supervised ASR systems.
