TLDR: A new study comprehensively evaluates 15 state-of-the-art multimodal language models (MLMs) for pediatric speech pathology across five clinical tasks. It finds that no single model consistently excels, and many fall short of clinical reliability. Key challenges include systematic gender bias favoring male speakers, poor generalization to tonal languages, and degraded performance for younger children in audio-native models. Fine-tuning on domain-specific data significantly improves performance, but chain-of-thought prompting can sometimes hinder accuracy. The research highlights the potential of MLMs but underscores the need for further development and bias mitigation before clinical deployment.
Speech disorders affect millions of children in the U.S. alone, creating demand for speech-language pathologists (SLPs) that far outstrips the supply of qualified professionals. This gap highlights a critical need for technological solutions that support SLPs and improve access to care. Recent advances in multimodal language models (MLMs) offer a promising avenue, but their effectiveness in real-world clinical speech-language pathology settings has been largely unexplored.
A new study addresses this crucial gap by introducing the first comprehensive benchmark for evaluating MLMs in speech-language pathology. Researchers collaborated with domain experts to develop a taxonomy of real-world use cases for these models, leading to a benchmark of five core tasks, each with 1,000 manually annotated data points. The evaluation also incorporates robustness and sensitivity tests covering factors such as background noise, speaker gender, and accent.
Understanding the Core Tasks
The benchmark assesses models across five key clinical scenarios:
- Disorder Diagnosis: Distinguishing between typical and disordered speech.
- Transcription-Based Diagnosis: A baseline approach that diagnoses from transcribed text rather than raw audio.
- Transcription: Measuring the accuracy of automatic speech recognition (ASR) systems for children with disordered speech.
- Disorder Type Classification: Differentiating between articulation disorders (motor-based errors) and phonological disorders (rule-based sound pattern errors).
- Disorder Symptom Classification: Identifying specific symptoms like substitutions, omissions, additions, or stuttering.
Key Findings and Model Performance
The evaluation of 15 state-of-the-art MLMs revealed that no single model consistently outperforms the others across all tasks. Importantly, many models currently fall short of clinically acceptable performance thresholds, typically F1 scores in the range of 0.80 to 0.85 for FDA-approved diagnostic systems. For instance, in disorder diagnosis, no model exceeded a micro F1 score of 0.56. Interestingly, for many models a two-stage “transcribe-and-compare” approach (Transcription-Based Diagnosis) proved superior to direct acoustic reasoning, suggesting that complex diagnostic reasoning is easier to perform over structured text than over raw audio.
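To make these numbers concrete, here is a minimal sketch of how a micro-averaged F1 score is computed with scikit-learn; the labels are invented for illustration and are not the study's data:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and model predictions for binary
# disorder diagnosis (1 = disordered speech, 0 = typical speech).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

# Micro-averaging pools true/false positives and negatives across
# classes; for a binary task it reduces to plain accuracy.
micro_f1 = f1_score(y_true, y_pred, average="micro")
print(f"micro F1: {micro_f1:.3f}")  # 0.625, well below the 0.80-0.85 clinical bar
```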
Transcription accuracy varied widely, with Gemini and OpenAI models performing best. However, high-quality transcripts did not always correlate with diagnostic accuracy, indicating that good transcription does not automatically translate into reliable clinical reasoning.
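Transcription quality in this kind of evaluation is standardly measured by Word Error Rate (WER, revisited below for the cross-lingual results). A self-contained sketch of the computation, using an invented child utterance rather than benchmark data:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Reference = what the child actually said (verbatim, disordered);
# hypothesis = an ASR output that "cleans up" the productions. The
# normalization scores a high WER against the verbatim reference and,
# worse, erases exactly the cues a clinician needs.
print(word_error_rate("the wabbit is wunning", "the rabbit is running"))  # 0.5
```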
For fine-grained tasks like Disorder Type and Symptom Classification, models showed substantial performance gaps. GPT-4o led in symptom classification but still fell short of clinical actionability, and transcription-based models underperformed, suggesting that crucial acoustic cues are lost during transcription for these tasks.
Impact of Fine-tuning and Robustness
Fine-tuning MLMs on domain-specific data significantly improved performance, with some models improving by over 30% relative to their base versions. This highlights the effectiveness of adapting models to specialized speech pathology data. The researchers also publicly released their datasets, fine-tuned models, and benchmarking framework to support continued progress.
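The article does not spell out the fine-tuning recipe; as one common parameter-efficient approach, adapting an open ASR backbone with LoRA via Hugging Face transformers and peft might look like the sketch below (the checkpoint and hyperparameters are illustrative assumptions, not the study's setup):

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a general-purpose ASR backbone; the study's actual base
# checkpoints are not specified here.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Parameter-efficient fine-tuning: learn only low-rank updates to the
# attention projections, leaving the pretrained weights frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, train on domain-specific (disordered child speech)
# audio-text pairs with a standard Seq2SeqTrainer loop (omitted).
```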
The study also uncovered systematic disparities. Models consistently performed better on male speakers, revealing a gender performance gap. One plausible cause is training-data imbalance, since persistent speech-sound disorders are more common in boys. This finding underscores the need for targeted auditing and gender-balanced fine-tuning to ensure equitable diagnostic performance.
Cross-linguistic evaluations showed mixed results. While classification accuracy showed no clear trend across languages (English, French, Dutch), Word Error Rate (WER) was significantly worse for French and Dutch. This suggests that while higher-level acoustic features may support language-agnostic reasoning for classification, transcription is highly language-dependent. Furthermore, current models largely failed to generalize to tonal languages such as Taiwanese and Cantonese, often misdiagnosing typical speech as disordered because they conflate tonal variation with pathological patterns.
Age also played a role: audio-native models performed significantly worse for younger children (5-7 years old) in symptom classification, likely because models are optimized for adult speech. ASR+LLM pipelines, however, maintained more consistent accuracy across age ranges for classification tasks.
The Role of Reasoning and Ensembles
Chain-of-Thought (CoT) prompting, which encourages models to show their reasoning steps, systematically decreased F1 scores on symptom classification and produced mixed results on disorder type classification. This suggests that for tasks with narrow decision boundaries or large label spaces, CoT can introduce distractions. Even so, analyzing CoT traces provided valuable insight into model failure modes.
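The study's exact prompts are not reproduced here, but the contrast is easy to illustrate: for a narrow label space like symptom classification, the two prompting styles might differ only in the added reasoning instruction (task wording and labels below are assumptions):

```python
SYMPTOMS = ["substitution", "omission", "addition", "stuttering"]

def direct_prompt(transcript: str) -> str:
    # Narrow decision boundary: ask for the label and nothing else.
    return (
        f"Transcript of a child's utterance: {transcript!r}\n"
        f"Which symptom is present? Answer with exactly one of: {', '.join(SYMPTOMS)}."
    )

def cot_prompt(transcript: str) -> str:
    # Chain-of-thought variant: the extra reasoning text is where the
    # study observed distraction and lower F1 on symptom classification.
    return (
        f"Transcript of a child's utterance: {transcript!r}\n"
        "Think step by step about the sound pattern, then state which symptom "
        f"is present, choosing one of: {', '.join(SYMPTOMS)}."
    )
```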
Ensemble strategies, which combine predictions from multiple models, showed nuanced performance. While different ensembles achieved broadly similar F1 scores for disorder diagnosis, mixed-vendor ensembles (e.g., Google + OpenAI) performed substantially better at symptom identification. This indicates that effective ensembles may require combining models with diverse architectures rather than simply the strongest individual ones.
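The precise ensembling scheme is not detailed in this summary; a simple majority-vote combiner over per-model labels (model names hypothetical) is one minimal way to realize such an ensemble:

```python
from collections import Counter

def majority_vote(predictions: dict[str, str]) -> str:
    """Combine per-model labels by simple majority; ties resolve arbitrarily."""
    return Counter(predictions.values()).most_common(1)[0][0]

# Hypothetical per-model outputs for one audio clip; mixing vendors is
# where the study saw the largest gains on symptom identification.
preds = {
    "gemini-model": "omission",
    "openai-model": "omission",
    "other-model": "substitution",
}
print(majority_vote(preds))  # -> "omission"
```

Mixing vendors changes which errors are correlated across ensemble members, which is the intuition behind the mixed-vendor gains.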
Conclusion and Future Directions
The research concludes that while MLMs show great potential for supporting SLPs, even the best-performing models currently lack the clinical-grade reliability needed for deployment. The identified performance gaps, gender biases, and limitations with low-resource and tonal languages highlight critical areas for future research. The study emphasizes the need for continued development, targeted adaptation, and bias mitigation strategies to create robust and ethically sound AI systems for speech-language pathology.