
Leveraging TV Subtitles to Improve Speech Recognition Accuracy

TL;DR: This research introduces a novel method to enhance Automatic Speech Recognition (ASR) by using TV subtitles as context-rich prompts rather than as direct training targets. The approach is an iterative, weakly supervised training framework in which a pre-trained Whisper model refines its own generated pseudo transcripts, guided by subtitles. A Weighted Attention mechanism further improves accuracy during inference by emphasizing relevant subtitle tokens. Experiments on Flemish TV data demonstrate significant reductions in Word Error Rate (WER), particularly for rare and out-of-vocabulary words, showing that the method improves ASR data quality without requiring additional labeled data.

In the rapidly evolving landscape of artificial intelligence, foundation models like Whisper have made significant strides across many tasks, including Automatic Speech Recognition (ASR). However, these powerful models often struggle when applied to low-resource languages or specialized domains where high-quality labeled data is scarce. This scarcity leads to noticeable performance gaps, making it difficult for the models to generalize effectively.

A recent study, “Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR”, introduces an innovative method to tackle this problem by leveraging readily available TV subtitles. While subtitles are abundant, their inherent imprecision and lack of perfect alignment with spoken audio have traditionally limited their use as direct training targets for verbatim transcription. This research reimagines their role, transforming them into context-rich prompts that guide the ASR model rather than serving as strict supervision signals.

A New Approach to Weakly Supervised ASR

The core of this novel approach lies in its ability to handle the discrepancies between spoken audio and subtitle text. Instead of directly using subtitles for training, the method focuses on refining ‘pseudo transcripts’ – initial transcriptions generated by a pre-trained ASR model. Subtitles then act as guiding cues, facilitating an iterative refinement process for these pseudo transcripts.

The researchers developed a training methodology called Subtitle Prompting (SP) within a weakly supervised framework. They fine-tuned the Whisper model, using the generated pseudo transcripts as primary training targets. The subtitles were incorporated as contextual prompts for Whisper’s text decoder. This setup allows the model to progressively refine its understanding and generation of transcripts, gradually extracting more accurate information from the subtitles over multiple training iterations.
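To make the prompting side of this concrete, the sketch below shows how a subtitle can be supplied as a decoder prompt to Whisper using the Hugging Face transformers library. This is an illustrative approximation, not the authors' code: the checkpoint and helper names are placeholders, and the fine-tuning step itself is omitted.

```python
# A minimal sketch of subtitle prompting at inference time, using the
# Hugging Face transformers implementation of Whisper. This approximates
# the idea, not the authors' code: checkpoint and helper names are
# illustrative, and the fine-tuning step is omitted.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

def transcribe_with_subtitle_prompt(audio, sampling_rate, subtitle_text):
    """Decode a pseudo transcript while conditioning the decoder on a subtitle."""
    features = processor(audio, sampling_rate=sampling_rate,
                         return_tensors="pt").input_features
    # The subtitle is tokenized and prepended to the decoder context as a
    # prompt, so it biases generation without being forced into the output.
    prompt_ids = processor.get_prompt_ids(subtitle_text, return_tensors="pt")
    with torch.no_grad():
        predicted = model.generate(features, prompt_ids=prompt_ids,
                                   language="nl", task="transcribe")
    # Depending on the transformers version, the prompt tokens are returned
    # at the start of the sequence; strip them before decoding.
    new_tokens = predicted[:, prompt_ids.shape[-1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Because the subtitle only conditions the decoder's context, the model remains free to diverge from it wherever the audio disagrees, which is what allows the pseudo transcripts to improve over raw subtitles across iterations.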

Enhancing Accuracy with Weighted Attention

To further boost performance, the study introduces a Weighted Attention (WA) mechanism during the inference phase. Subtitles often contain extraneous information not directly reflected in the audio. The WA mechanism addresses this by selectively emphasizing relevant subtitle tokens while minimizing the influence of irrelevant ones. This is achieved by using the Gini coefficient, a measure adapted from economics, to quantify the distribution of cross-attention weights between subtitle tokens and speech frames. Tokens with highly focused attention weights are considered more relevant and are given greater emphasis, guiding the model to prioritize speech-relevant information.
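As a rough illustration of the scoring idea, the sketch below computes a Gini coefficient over each subtitle token's cross-attention distribution across speech frames: near-uniform (diffuse) attention scores close to 0, while sharply peaked attention approaches 1. The rescaling at the end is an assumed simplification; the paper's exact formulation of how the scores modulate the attention layers may differ.

```python
# A sketch of Gini-based relevance scoring for subtitle tokens. The exact
# way the paper folds these scores back into the attention layers may
# differ; the rescaling below is an assumed, simplified illustration.
import torch

def gini(p: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Gini coefficient of a 1-D non-negative weight vector.

    ~0 for diffuse (uniform) attention, approaching 1 for sharply
    focused attention on a single speech frame."""
    w, _ = torch.sort(p)
    n = w.numel()
    idx = torch.arange(1, n + 1, dtype=w.dtype)
    return ((2 * idx - n - 1) * w).sum() / (n * w.sum() + eps)

def weight_subtitle_tokens(cross_attn: torch.Tensor) -> torch.Tensor:
    """cross_attn: (subtitle_tokens, speech_frames) cross-attention map.

    Scales each token's row by its Gini score, so tokens with focused
    attention are emphasized and diffusely attending ones are damped."""
    scores = torch.stack([gini(row) for row in cross_attn])
    return cross_attn * scores.unsqueeze(1)
```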

Experimental Validation and Promising Results

The effectiveness of this approach was demonstrated through experiments conducted on a dataset of 760 hours of multi-genre Flemish TV recordings. The initial Word Error Rate (WER) of raw subtitles against manually annotated verbatim transcripts was a high 34.3%, underscoring the challenge of their direct use. However, by integrating SP training, the WER significantly decreased for both medium and large Whisper models, even after just one iteration of fine-tuning.
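For context, WER counts the word-level substitutions, insertions, and deletions needed to turn a hypothesis into the reference, divided by the reference length. A figure like the 34.3% above can be computed for any text pair in a few lines, for example with the jiwer library; the strings below are illustrative, not taken from the Flemish dataset.

```python
# Minimal WER computation with the jiwer library; the strings here are
# illustrative placeholders, not examples from the Flemish data.
import jiwer

reference = "the verbatim transcript of what the speaker actually said"
hypothesis = "the condensed subtitle text shown on screen"
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")
```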

A deeper analysis revealed that the method particularly benefited the transcription of rare and out-of-vocabulary words, which are typically challenging for ASR systems. This suggests that subtitle prompts not only help the model recall low-frequency words but also improve its generalization capabilities. The Weighted Attention mechanism further refined these results, consistently reducing WER across various test scenarios, with the best performance achieved when applied to all attention layers.

The iterative training process proved to be highly effective, with WERs consistently decreasing over successive cycles. After three iterations, the system achieved a remarkable WER of 10.34% on the evaluation set, showcasing the power of this iterative refinement driven by pseudo transcripts and subtitle prompts.

Conclusion

This research presents a significant step forward in enhancing the quality of ASR transcripts, especially in low-resource settings, without the need for additional verbatim labeled data. By cleverly re-purposing TV subtitles as contextual prompts and introducing a weighted attention mechanism, the study offers a robust and effective method for refining existing model transcripts and creating higher-quality datasets for weakly supervised ASR systems.
