TLDR: The paper introduces a “Cross-Speaker Joint Fine-Tuning” strategy for dysarthric speech recognition using the Chinese Dysarthria Speech Database (CDSD). It challenges conventional speaker-specific fine-tuning by demonstrating that training on multiple dysarthric speakers simultaneously improves individual recognition accuracy, enhances generalization, and reduces data dependence. The study also finds that for larger models, speech duration is more critical than speaker diversity, and character-level modeling units outperform direct phoneme-based units due to richer semantic information.
Individuals with dysarthria, a motor speech disorder, often face significant challenges in verbal communication due to impaired articulatory precision. This can lead to misunderstandings, social isolation, and psychological distress. Advancements in Dysarthric Speech Recognition (DSR) technology are crucial to help convert their impaired speech into text, thereby facilitating better comprehension among listeners.
While mainstream Automatic Speech Recognition (ASR) systems have achieved remarkable accuracy for typical speech, DSR research faces a primary hurdle: the acute scarcity of relevant datasets. Physiological constraints and demographic limitations make it difficult to collect large-scale data from dysarthric individuals. Existing English dysarthria corpora are limited in scale, and Chinese datasets like CUDYS and MSDM are also relatively small, often containing less than 10 hours of speech, which is insufficient for training comprehensive speech recognition models.
A recent and significant development is the release of the Chinese Dysarthria Speech Database (CDSD), the largest publicly available Mandarin dysarthric speech dataset to date. This database comprises recordings from 44 individuals, totaling 124 hours of dysarthric speech. Baseline experiments with the CDSD database using conventional ASR models revealed poor performance without fine-tuning, highlighting a severe incompatibility between pathological and normative speech features. Speaker-dependent fine-tuning showed some improvement for individual speakers but performed poorly on multi-speaker datasets, underscoring the substantial acoustic heterogeneity among individuals with dysarthria.
A New Strategy: Cross-Speaker Joint Fine-Tuning
Conventional DSR methods typically rely on speaker-specific fine-tuning, which requires extensive data for each patient and offers limited generalization across different dysarthric populations. To overcome these limitations, a new study proposes a “Cross-Speaker Joint Fine-Tuning” strategy. This innovative approach leverages inter-speaker pronunciation discrepancies as a form of intrinsic data augmentation, demonstrating that aggregating ultra-sparse samples from diverse dysarthric individuals can yield superior generalization capabilities compared to intensive single-speaker training.
The research, detailed in the paper "Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD database", conducted three critical experiments to validate its hypotheses.
Key Findings from the Experiments
The first experiment validated the multi-speaker cross-training strategy. It showed that for all seven speakers with 10 hours of speech data in CDSD PartB, their individual Character Error Rates (CERs) significantly decreased when using multi-speaker fine-tuning compared to fine-tuning with only their personal speech data. Interestingly, sequential fine-tuning (first with multi-speaker data, then speaker-specific refinement) sometimes increased CERs for certain speakers (04 and 06). Furthermore, the study found that simply expanding the speaker population in PartB did not always lead to improved performance for a target speaker, suggesting that the composition of the multi-speaker dataset is important.
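Because every comparison above is reported as a Character Error Rate, it may help to see how that metric is typically computed. The following is a minimal sketch, assuming the common corpus-level definition (total character edits divided by total reference characters); the function names are illustrative, and the paper's exact scoring script is not described in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits / total reference characters.

    Assumption: references and hypotheses are parallel lists of strings.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(
        edit_distance(list(r), list(h)) for r, h in zip(references, hypotheses)
    )
    return total_edits / total_chars
```

With this definition, a six-character reference containing one substituted character yields a CER of 1/6. Comparing the per-speaker value after speaker-specific fine-tuning with the value after multi-speaker fine-tuning is the kind of comparison the first experiment reports.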
The second experiment compared the effect of total data volume against the effect of speaker population size. It demonstrated that, for larger-scale models, total speech duration in the fine-tuning data was a more decisive factor for training effectiveness than speaker diversity.
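One way to separate duration from speaker diversity is to build fine-tuning subsets that hold total duration fixed while varying the number of speakers. The sketch below illustrates that idea only; it is not the paper's sampling protocol, and the utterance fields (`speaker`, `duration`) and the function names are hypothetical.

```python
from collections import defaultdict


def build_subset(utterances, speakers, total_seconds):
    """Select utterances so each chosen speaker contributes an equal share
    of `total_seconds`. Utterances are taken in their given order until
    adding one more would exceed that speaker's budget.
    """
    if not speakers:
        raise ValueError("at least one speaker is required")
    per_speaker_budget = total_seconds / len(speakers)

    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    subset = []
    for spk in speakers:
        used = 0.0
        for utt in by_speaker[spk]:
            if used + utt["duration"] > per_speaker_budget:
                break
            subset.append(utt)
            used += utt["duration"]
    return subset


def total_duration(subset):
    """Sum of utterance durations in seconds."""
    return sum(utt["duration"] for utt in subset)


# Toy corpus: 4 hypothetical speakers, 10 utterances of 5 seconds each.
corpus = [
    {"speaker": spk, "duration": 5.0}
    for spk in ("01", "02", "03", "04")
    for _ in range(10)
]
few_speakers = build_subset(corpus, ["01", "02"], 60.0)
many_speakers = build_subset(corpus, ["01", "02", "03", "04"], 60.0)
```

Here both subsets contain 60 seconds of speech, but one draws on two speakers and the other on four. Fine-tuning on each and comparing CERs would isolate the effect of speaker count, while varying `total_seconds` with a fixed speaker list would isolate the effect of duration.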
The third experiment explored the efficacy of different modeling units, comparing phoneme-based and character-based approaches. The results confirmed that direct full-model fine-tuning with phoneme-based units yielded suboptimal performance. This suggests that character-level representations inherently capture richer semantic and contextual information, which enhances model performance, and that naive full-model fine-tuning may disrupt pre-trained knowledge alignment.
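To make the difference between the two modeling units concrete, the sketch below converts the same transcript into character-level and phoneme-level target sequences. The tiny lexicon and its initial/final-with-tone inventory are assumptions made for illustration; the summary does not specify the paper's phoneme set or lexicon.

```python
# Hypothetical toy lexicon: each character maps to a pinyin initial and a
# toned final. A real system would use a full pronunciation lexicon.
TOY_LEXICON = {
    "你": ["n", "i3"],
    "好": ["h", "ao3"],
    "吗": ["m", "a5"],
}


def to_characters(text):
    """Character-level targets: one unit per written character."""
    return list(text)


def to_phonemes(text, lexicon=TOY_LEXICON):
    """Phoneme-level targets: each character expanded through the lexicon."""
    units = []
    for ch in text:
        if ch not in lexicon:
            raise KeyError(f"character {ch!r} is missing from the lexicon")
        units.extend(lexicon[ch])
    return units
```

For the transcript "你好吗", the character targets are three units, while the phoneme targets are six. The phoneme sequence is longer and its units carry no meaning on their own, which is consistent with the article's suggestion that character-level units bring more semantic and contextual information to the model.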
Implications for Future DSR Research
The findings suggest that the significant CER increases observed for some speakers after sequential fine-tuning might stem from conflicting acoustic characteristics between speakers. Future work should investigate these inter-speaker conflicts to optimize speaker selection criteria for cross-training. Additionally, further studies could systematically evaluate how speaker diversity and speech duration interact with different model scales. For phoneme-level modeling, identifying which model layers benefit most from phoneme-specific tuning could maximize cross-population parameter sharing.
In conclusion, this research provides valuable insights into enhancing dysarthric speech recognition. The proposed multi-speaker cross-training strategy offers a promising direction for improving adaptation efficacy, while the findings on data duration and modeling units highlight critical considerations for developing more robust and generalized DSR systems.


