TLDR: Automatic Speech Recognition (ASR) systems underperform for non-native English speakers, especially lower-proficiency learners. This study introduces two strategies, proficiency-aware multitask learning and targeted data augmentation, that reduce word error rate by up to 29.4% relative and insertion/deletion errors by up to 58.6% relative. They also narrow performance gaps across proficiency levels, yielding more accurate and equitable ASR for L2 learners.
Automatic Speech Recognition (ASR) systems have become ubiquitous, powering everything from voice assistants to language learning platforms. However, these general-purpose systems often struggle when faced with atypical speakers, particularly non-native English (L2) learners. This performance gap not only introduces biases but also limits the potential of ASR in critical areas like education, where reliable speech recognition is vital for providing feedback to language learners.
The unique characteristics of L2 speech, such as accents and temporal disfluencies like pauses and hesitations, pose significant challenges for ASR models primarily trained on native (L1) speech. While advancements have been made in making ASR more robust to different accents, the issue of proficiency robustness – how well ASR performs across various learner proficiency levels – has remained a critical hurdle.
A recent study, "Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR," by Ling Sun, Charlotte Zhu, and Shuju Shi from Indiana University, addresses this challenge. Their work represents the first systematic investigation into adapting foundational ASR models with proficiency awareness, specifically targeting both the temporal and segmental deviations characteristic of L2 speech.
The researchers utilized the Speak & Improve (S&I) Corpus, a large dataset of L2 English learner speech graded according to the Common European Framework of Reference (CEFR) proficiency scale (A2–C1). This corpus, while reflecting real-world distributions, also presents an imbalance, with lower proficiency levels like A2 being significantly underrepresented.
Their findings revealed several crucial insights. Firstly, ASR errors are not merely a function of data availability but scale directly with CEFR proficiency levels. Lower-proficiency speakers consistently yielded higher Word Error Rates (WERs), indicating that proficiency is a key underlying factor in L2 ASR performance.
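For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions in a minimum-edit alignment, divided by the number of reference words. The following is a minimal sketch of that computation, not the study's evaluation code; the names `ErrorCounts` and `align_counts` and the example sentences are hypothetical, and real evaluations typically normalize text before scoring.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    substitutions: int
    deletions: int
    insertions: int
    ref_len: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / number of reference words
        return (self.substitutions + self.deletions + self.insertions) / self.ref_len


def align_counts(reference: str, hypothesis: str) -> ErrorCounts:
    """Word-level Levenshtein alignment that also tracks the error types."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = (total_cost, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match costs nothing
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Tuples compare by total cost first, so this picks a cheapest path.
            dp[i][j] = min(substitution, deletion, insertion)
    _, s, d, n = dp[-1][-1]
    return ErrorCounts(s, d, n, len(ref))


# Hypothetical example: a filler is inserted and a function word is dropped.
counts = align_counts("i went to the store", "i um went to store")
print(counts, counts.wer)
```

When several alignments share the same minimum cost, the split between error types can differ by tie-breaking rule, so insertion and deletion counts are only comparable across systems scored with the same tool.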
Secondly, the study demonstrated a significant risk in proficiency-agnostic adaptation. When the Whisper-small model was fine-tuned naively with LoRA, average WER fell but disparities widened: performance improved for higher-proficiency speakers, while WER for the lowest-proficiency learners (A2) rose by a relative 20-21%. This degradation was driven primarily by an increase in insertion errors, suggesting the model overfitted to the filler-like usage common in disfluent, lower-proficiency speech.
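The arithmetic behind "average improves, gap widens" can be illustrated with a small sketch. All per-level WER values below are hypothetical placeholders chosen for illustration, not figures from the paper, and the unweighted mean and max-minus-min gap are assumed summary statistics rather than the study's own definitions.

```python
from statistics import mean


def relative_change(before: float, after: float) -> float:
    """Signed relative change; a positive value means WER got worse."""
    return (after - before) / before


def proficiency_gap(wer_by_level: dict) -> float:
    """Spread between the worst and best CEFR level (assumed gap measure)."""
    return max(wer_by_level.values()) - min(wer_by_level.values())


# HYPOTHETICAL per-level WERs, not taken from the paper.
baseline = {"A2": 0.20, "B1": 0.15, "B2": 0.12, "C1": 0.10}
adapted = {"A2": 0.24, "B1": 0.12, "B2": 0.09, "C1": 0.07}

for level in baseline:
    change = relative_change(baseline[level], adapted[level])
    print(f"{level}: {change:+.1%}")

print("mean WER:", round(mean(baseline.values()), 4), "->", round(mean(adapted.values()), 4))
print("gap:", round(proficiency_gap(baseline), 4), "->", round(proficiency_gap(adapted), 4))
```

In this toy setup the mean WER drops even though the A2 WER rises, which is why reporting only an aggregate figure can hide a fairness regression.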
To counteract these issues, the researchers proposed two innovative, proficiency-aware strategies:
Proficiency-Aware Multitask Learning
This approach involved jointly optimizing ASR with an auxiliary proficiency classification task. By explicitly modeling heterogeneous speech properties across proficiency levels, the system could better condition its acoustic representations on these variations.
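A common way to express such a joint objective is a weighted sum of the ASR loss and a cross-entropy loss over proficiency classes. The sketch below assumes that form; the weight `aux_weight`, the four-class label set, and the function names are illustrative assumptions, since the summary does not state how the two losses were combined.

```python
import math

# Assumed label set, based on the A2-C1 range of the corpus described above.
CEFR_LEVELS = ["A2", "B1", "B2", "C1"]


def cross_entropy(logits: list, target_index: int) -> float:
    """Negative log-likelihood of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(asr_loss: float, proficiency_logits: list, level: str,
                   aux_weight: float = 0.1) -> float:
    """Total loss = ASR loss + aux_weight * proficiency classification loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    aux_loss = cross_entropy(proficiency_logits, CEFR_LEVELS.index(level))
    return asr_loss + aux_weight * aux_loss


# Hypothetical values for a single utterance labeled A2.
total = multitask_loss(asr_loss=2.3, proficiency_logits=[1.2, 0.4, -0.3, -1.0], level="A2")
print(total)
```

In a real training loop both terms would be differentiable tensors computed from shared encoder representations, so gradients from the proficiency head also shape the features used for recognition.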
Targeted Data Augmentation
Recognizing the scarcity of low-proficiency (A2) speech in the dataset, the team applied spectrogram masking (SpecAug) specifically to A2 speech. This method adds local variability without altering the underlying proficiency label, helping to mitigate class imbalance and improve robustness for these underrepresented learners.
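The idea of masking only the underrepresented class can be sketched as follows. This is an illustrative, dependency-free version that assumes a spectrogram stored as a list of frames, one time mask and one frequency mask per utterance, and zero as the fill value; the mask widths and the function names `spec_mask` and `augment_batch` are hypothetical, not the study's settings.

```python
import random


def spec_mask(spec, max_time_width=2, max_freq_width=2, rng=None):
    """Return a copy of spec ([time][freq]) with one time band and one
    frequency band set to zero, in the style of SpecAugment masking."""
    rng = rng or random.Random()
    n_frames, n_bins = len(spec), len(spec[0])
    masked = [row[:] for row in spec]  # copy so the input stays untouched

    # Time mask: zero out a run of consecutive frames.
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * n_bins

    # Frequency mask: zero out a run of consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for row in masked:
        for f in range(f_start, f_start + f_width):
            row[f] = 0.0
    return masked


def augment_batch(batch, target_level="A2", rng=None):
    """Mask only utterances at target_level; transcript and label are kept."""
    return [
        (spec_mask(spec, rng=rng) if level == target_level else spec, text, level)
        for spec, text, level in batch
    ]


# Hypothetical toy batch: (spectrogram, transcript, CEFR level).
toy_spec = [[1.0] * 4 for _ in range(6)]
batch = [(toy_spec, "i go school", "A2"), (toy_spec, "i went to school", "B2")]
augmented = augment_batch(batch, rng=random.Random(0))
```

Because only the acoustic features are perturbed, each masked A2 utterance keeps its transcript and proficiency label, so the extra variability can be added without relabeling.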
The results were compelling. Both proficiency-aware strategies, and especially their combination, led to substantial improvements. The combined model achieved the best performance, reducing the overall WER by 29.4% relative to the baseline. Crucially, these methods also reduced insertion and deletion errors by as much as 58.6% relative, effectively suppressing the time-sensitive error modes that disproportionately affected low-proficiency speakers. This led to a significant narrowing of proficiency gaps, resulting in more equitable outcomes across all learner groups.
In conclusion, this research underscores that proficiency is a critical dimension for developing fair and effective L2 ASR systems. While naive adaptation can exacerbate inequalities, proficiency-aware multitask learning and targeted data augmentation offer a robust path forward, enhancing both accuracy and fairness for language learners.


