TLDR: A new research paper introduces an AI-driven pipeline to support the documentation of SENĆOŦEN, an endangered Indigenous language. To cope with scarce data and extensive vocabulary variation, the system uses text-to-speech (TTS) to augment its audio data and fine-tunes pre-trained Speech Foundation Models (SFMs) for cross-lingual transfer learning. This approach significantly improves transcription accuracy, demonstrating a powerful tool for language preservation and revitalization efforts.
The SENĆOŦEN language, spoken by the W̱SÁNEĆ people on southern Vancouver Island, is facing significant challenges due to historical marginalization and a sharp decline in fluent speakers. In an effort to revitalize and preserve this vital part of Indigenous cultural heritage, the community is increasingly looking towards digital technology. A recent research paper explores how Automatic Speech Recognition (ASR) technology can play a crucial role in accelerating language documentation and the creation of educational resources for SENĆOŦEN.
Developing ASR systems for languages like SENĆOŦEN presents unique hurdles. Unlike widely spoken languages such as English, SENĆOŦEN has a severe scarcity of digitized materials, especially audio recordings with aligned transcriptions. It is also linguistically complex: the language is polysynthetic (single words can combine many morphemes, making them long and internally complex) and exhibits stress-driven metathesis (sounds within a word reorder depending on stress), which leads to extensive vocabulary variation. This complexity makes it difficult to build a comprehensive dictionary, resulting in many words being ‘out-of-vocabulary’ for ASR systems.
A Novel ASR-Driven Pipeline
To address these challenges, the researchers propose an ASR-driven documentation pipeline that leverages several techniques to make the most of the limited available data. It consists of four main stages (a code sketch of the synthesis stages follows the list):
- Training a Text-to-Speech (TTS) System: Using existing parallel audio and text data in SENĆOŦEN, a custom TTS system is trained. This system learns to convert written text into spoken audio.
- Generating Synthesized Audio: Once trained, the TTS system takes text-only data (like the extensive SENĆOŦEN dictionary) and generates corresponding synthesized audio. This process significantly augments the amount of audio data available for ASR training.
- Cross-Lingual Transfer Learning with Speech Foundation Models (SFMs): The original and newly synthesized audio data are then used to fine-tune pre-trained Speech Foundation Models. These SFMs, like Whisper, are large AI models initially trained on vast amounts of speech data from many languages. By fine-tuning them with SENĆOŦEN data, they can adapt their broad knowledge to the specific characteristics of the language, even with limited resources.
- Transcribing New Audio: Finally, the fine-tuned SFM is used to transcribe new SENĆOŦEN audio recordings. To further enhance accuracy, an external n-gram language model, trained on all available text data, is incorporated.
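To make the first two stages concrete, here is a minimal sketch of batch synthesis, assuming a custom SENĆOŦEN TTS checkpoint trained with the Coqui TTS toolkit. The checkpoint paths and the dictionary filename are hypothetical stand-ins; the paper's actual TTS setup may differ.

```python
# Sketch of stages 1-2: batch-synthesizing audio from text-only data
# with Coqui TTS. All file paths are hypothetical placeholders.
from pathlib import Path

from TTS.api import TTS

# Hypothetical custom TTS checkpoint trained on the parallel
# SENĆOŦEN audio/text data (stage 1 of the pipeline).
tts = TTS(
    model_path="sencoten_tts/model.pth",
    config_path="sencoten_tts/config.json",
)

out_dir = Path("synthesized_audio")
out_dir.mkdir(exist_ok=True)

# Stage 2: turn text-only dictionary entries into synthetic training audio.
with open("sencoten_dictionary.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        entry = line.strip()
        if not entry:
            continue
        tts.tts_to_file(text=entry, file_path=str(out_dir / f"entry_{i:06d}.wav"))
```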
Overcoming Data Scarcity and Linguistic Complexity
The Text-to-Speech system is the critical component for data augmentation. By converting thousands of dictionary entries and sentences into synthesized speech, the ASR training set grows from 1.7 hours of real audio to approximately 13.3 hours, 11.6 of which are synthesized, dramatically expanding the data the models can learn from.
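As a back-of-the-envelope check on those figures, here is a tiny sketch that tallies audio duration per directory with the soundfile library; the directory names are hypothetical stand-ins.

```python
# Tally real vs. synthesized training audio, in hours.
from pathlib import Path

import soundfile as sf

def total_hours(audio_dir: str) -> float:
    """Sum the durations of all .wav files in a directory, in hours."""
    return sum(sf.info(str(p)).duration for p in Path(audio_dir).glob("*.wav")) / 3600.0

real = total_hours("real_audio")              # ~1.7 h of recorded speech
synthetic = total_hours("synthesized_audio")  # ~11.6 h of TTS output
print(f"total ASR training audio: {real + synthetic:.1f} h")  # ~13.3 h in the paper
```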
Speech Foundation Models are particularly well-suited for low-resource languages because they can transfer knowledge from high-resource languages. The research explored both encoder-based SFMs (like Wav2Vec2) and encoder-decoder-based SFMs (like Whisper). The results showed that these models significantly outperformed traditional ASR systems, especially in recognizing words not present in the initial training set.
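As a rough illustration of this fine-tuning stage (not the paper's exact recipe), the sketch below runs a single gradient step of Whisper on one (audio, transcript) pair using Hugging Face transformers. The checkpoint name, the silent dummy audio, and the transcript string are all placeholders.

```python
# Minimal single-step fine-tuning sketch for Whisper.
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.train()

# Placeholder: one second of silence at 16 kHz standing in for a real clip.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("SENĆOŦEN transcript here", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_features=inputs.input_features, labels=labels).loss
loss.backward()   # gradients adapt the multilingual SFM toward SENĆOŦEN
optimizer.step()
```

A real run would iterate over the full ~13.3 hours of mixed real and synthesized audio in batches; this just shows the mechanics of one update.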
The integration of an external language model also proved vital. By using a larger language model trained on the full SENĆOŦEN dictionary, the system’s ability to predict the next word in a sequence improved, leading to better transcription accuracy.
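For encoder-based (CTC) models, this kind of language-model fusion can be illustrated with pyctcdecode, which mixes acoustic scores with a KenLM n-gram model during beam search. The toy label set, random logits, LM weights, and ARPA filename below are assumptions for demonstration, not the paper's configuration.

```python
# CTC beam-search decoding with an external n-gram LM via pyctcdecode.
import numpy as np
from pyctcdecode import build_ctcdecoder

vocab = ["", "a", "e", "n", "s", "t", " "]  # toy label set; index 0 is the CTC blank
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="sencoten_ngram.arpa",  # hypothetical KenLM n-gram file
    alpha=0.5,  # language-model weight
    beta=1.0,   # word-insertion bonus
)

# logits: (time, vocab) acoustic scores from a fine-tuned CTC model;
# random values here only to show the call shape.
logits = np.random.randn(50, len(vocab)).astype(np.float32)
print(decoder.decode(logits))
```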
Promising Results and Future Implications
Experiments on the SENĆOŦEN dataset yielded impressive results. The top-performing system achieved a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09%, despite an out-of-vocabulary rate of 57.02% on the test set, underscoring the system’s robustness to unseen words. After filtering out minor errors involving cedillas (a diacritic that is used inconsistently in SENĆOŦEN text data), the WER improved to 14.32% and the CER to 3.45%.
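Both metrics, and the cedilla filtering, are easy to reproduce in spirit with the jiwer library. The snippet below is a sketch assuming the cedilla appears as the combining character U+0327; the example strings are invented, not drawn from the dataset.

```python
# WER/CER scoring with and without cedilla normalization.
import unicodedata

import jiwer

def strip_cedillas(text: str) -> str:
    """Drop combining cedillas (U+0327) after decomposing to NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0327", ""))

reference = "s\u0327en"   # invented string: "s" with a combining cedilla
hypothesis = "sen"        # same word transcribed without the cedilla
print(jiwer.wer(reference, hypothesis))                  # 1.0: whole word counted wrong
print(jiwer.wer(strip_cedillas(reference), hypothesis))  # 0.0 once cedillas are filtered
print(jiwer.cer(reference, hypothesis))                  # small character-level penalty
```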
To make this technology accessible, the researchers developed a user-friendly, web-based interface. This interface allows community members and linguists to upload or speak SENĆOŦEN audio and receive automatic transcriptions, streamlining the documentation process. It also includes features for segmenting audio and flagging sections for further review by language experts.
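A toy version of such an interface can be assembled with Gradio, as sketched below. This is an illustration only: it uses a stock Whisper checkpoint rather than the paper's fine-tuned SENĆOŦEN model, and the actual tool's design may differ.

```python
# Minimal web transcription demo (Gradio 4.x style) as an illustration.
import gradio as gr
from transformers import pipeline

# Placeholder checkpoint; the real tool would load the fine-tuned SFM.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe(audio_path: str) -> str:
    """Transcribe an uploaded or recorded audio file."""
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="SENĆOŦEN transcription demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```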
This pioneering work represents the first comprehensive investigation of Speech Foundation Models for documenting Canadian Indigenous languages and the first ASR-driven documentation pipeline built specifically for SENĆOŦEN. The findings demonstrate the approach’s potential to significantly expedite the transcription process, offering invaluable support to ongoing SENĆOŦEN language revitalization efforts. For more details, refer to the original research paper.


