TL;DR: This research systematically evaluates how large language models (LLMs) can be adapted to detect Alzheimer’s disease and related dementias (ADRD) from speech. It compares in-context learning, reasoning-augmented prompting, and fine-tuning strategies across both text-only and multimodal models. The key finding is that token-level fine-tuning is generally most effective, enabling smaller open-weight LLMs to match or surpass commercial models. While multimodal models currently lag, the study highlights the potential of adapted LLMs for scalable, accessible cognitive screening.
Alzheimer’s disease and related dementias (ADRD) represent a significant global health challenge, with millions of individuals affected and a large percentage remaining undiagnosed. Early and scalable detection methods are crucial to address this growing concern. Traditional screening often misses subtle cognitive changes, but advancements in natural language processing (NLP) and large language models (LLMs) offer a promising new avenue: analyzing spontaneous speech.
This research paper, titled “Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies,” explores how different strategies for adapting LLMs can improve the detection of ADRD from speech. The study was conducted by a team of researchers including Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, and Maryam Zolnoori from Columbia University Irving Medical Center and the School of Nursing at Columbia University.
The primary goal of this study was to systematically compare LLM adaptation strategies for identifying ADRD from speech recordings. The researchers used the DementiaBank speech corpus, analyzing audio-recorded speech from 237 participants, including both cognitively impaired (CI) and cognitively normal (CN) individuals. They evaluated nine text-only LLMs, ranging from 3-billion-parameter to 405-billion-parameter models, as well as three multimodal audio-text models.
Exploring Adaptation Strategies
The study investigated four main adaptation strategies:
- In-Context Learning (ICL): This involved providing the LLMs with a few labeled examples (demonstrations) to guide their predictions. The researchers tested different ways of selecting these examples, such as choosing the most similar, the least similar, or examples representative of the average characteristics of each class (cognitively impaired or normal); a minimal selection sketch follows this list.
- Reasoning-Augmented Prompting: This strategy aimed to strengthen the LLMs’ reasoning by supplying explicit rationales, generated either by the models themselves or by larger, more capable ‘teacher’ models. Techniques such as ‘self-consistency’ (sampling multiple predictions and aggregating them) and ‘Tree-of-Thought’ (multi-step reasoning) were also explored; a self-consistency sketch appears after this list.
- Parameter-Efficient Fine-Tuning: This involved training the LLMs directly on the task of classifying speech as CI or CN. Two methods were compared: ‘token-level’ fine-tuning, where the model learns to emit a specific label token, and ‘classification head’ fine-tuning, where a small classification layer is added on top of the LLM; both setups are sketched after this list.
- Multimodal Audio-Text Integration: This component evaluated models that could process both audio and text simultaneously to see if acoustic information added value beyond just the transcribed text.
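To make centroid-based demonstration selection concrete, here is a minimal Python sketch that, for each class, picks the transcripts whose embeddings lie closest to that class’s mean embedding. The sentence-transformers encoder and all function and variable names are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of centroid-based demonstration selection for ICL.
# Assumptions: transcripts are plain strings, labels are "CI"/"CN",
# and an off-the-shelf sentence encoder stands in for whatever the
# paper actually used.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_centroid_demos(transcripts, labels, k_per_class=2):
    """Pick the k transcripts per class nearest that class's embedding centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    embeddings = encoder.encode(transcripts, normalize_embeddings=True)
    demos = []
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        class_emb = embeddings[idx]
        centroid = class_emb.mean(axis=0)
        # With normalized embeddings, the dot product ranks by cosine similarity.
        sims = class_emb @ centroid
        for i in np.argsort(-sims)[:k_per_class]:
            demos.append((transcripts[idx[i]], label))
    return demos
```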
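Self-consistency is likewise simple to express: sample several completions at nonzero temperature and take a majority vote over the extracted labels. The sketch below assumes an OpenAI-compatible chat client and a deliberately crude label parser; both are placeholders rather than the paper’s exact setup.

```python
# Self-consistency sketch: sample n completions, majority-vote the labels.
# The client (OpenAI-compatible), model name, and label parsing are assumptions.
from collections import Counter

def self_consistent_label(client, prompt, n_samples=5, temperature=0.7):
    votes = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="llama-3-8b-instruct",       # assumed model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,           # >0 so the samples differ
        )
        text = resp.choices[0].message.content.lower()
        # Crude extraction; a real pipeline would parse the answer more carefully.
        votes.append("CI" if "impaired" in text else "CN")
    return Counter(votes).most_common(1)[0][0]  # majority label
```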
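The two fine-tuning setups differ mainly in the output layer: token-level tuning keeps the causal language-modeling head and trains the model to emit a label token, while the classification-head variant adds a small linear layer over the final hidden state. A rough sketch using Hugging Face transformers and peft (LoRA) follows; the base checkpoint and LoRA hyperparameters are placeholder assumptions, not the paper’s configuration.

```python
# Sketch contrasting token-level vs. classification-head fine-tuning with LoRA.
# The checkpoint and hyperparameters below are placeholders, not the paper's.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

BASE = "meta-llama/Llama-3.2-3B"  # assumed base checkpoint

# (a) Token-level: a causal LM trained so the next token after the prompt
# is a label token such as "CI" or "CN".
lm = AutoModelForCausalLM.from_pretrained(BASE)
lm = get_peft_model(lm, LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"]))

# (b) Classification head: a 2-way linear head over the final hidden state,
# trained with standard cross-entropy. (In practice, a pad token must be set
# for LLaMA-style models before sequence classification.)
clf = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)
clf = get_peft_model(clf, LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"]))
```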
Key Findings and Insights
The research yielded several important findings:
- Fine-Tuning Leads the Way: Overall, fine-tuning proved the most effective adaptation strategy. Notably, open-weight models like LLaMA 3B, LLaMA 70B, and LLaMA 8B achieved F1 scores (the harmonic mean of precision and recall) of 0.83, 0.83, and 0.81, respectively, matching or even outperforming commercial models like GPT-4o (F1 = 0.80). This suggests that specialized training can make accessible open-source models highly competitive.
- The Importance of Demonstration Selection: For in-context learning, the way demonstrations were selected significantly impacted performance. Examples chosen based on their ‘average similarity to class centroids’ (representing typical speech patterns for each group) consistently delivered the best results.
- Reasoning Benefits Smaller Models: Teacher-generated rationales, especially from powerful models like LLaMA 405B, improved the F1 scores of smaller models such as LLaMA 8B, indicating that structured reasoning can guide less capable models toward better predictions (see the sketch after this list). Self-consistency also helped stabilize predictions for the smallest model, LLaMA 3B.
- Tailoring Fine-Tuning: While token-level fine-tuning was generally superior, the study found that the ‘classification head’ approach dramatically improved models that struggled with token-level prediction, such as MedAlpaca 7B, raising its F1 score from 0.06 to 0.82. This highlights that the best fine-tuning method can be model-dependent.
- Multimodal Models Need More Work: Current multimodal LLMs, which integrate both audio and text, generally underperformed compared to the best text-only systems. Although Phi-4 Multimodal showed significant improvement after fine-tuning, it still lagged behind top text-based models. This suggests that better audio-text alignment and larger training datasets are needed for these models in clinical speech analysis.
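As an illustration of the teacher-rationale finding above, the sketch below has a large model write a rationale that is then prepended to a smaller model’s prompt before it emits a label. The prompts, model names, and chat client are illustrative assumptions, not the paper’s exact pipeline.

```python
# Teacher-generated rationale prompting: the teacher explains, the student decides.
# Client calls assume an OpenAI-compatible SDK; prompts and model names are placeholders.
TEACHER_PROMPT = ("Explain, step by step, which linguistic features of this "
                  "picture-description transcript suggest cognitive impairment "
                  "or normal cognition:\n\n{transcript}")
STUDENT_PROMPT = ("Rationale from an expert model:\n{rationale}\n\n"
                  "Transcript:\n{transcript}\n\n"
                  "Answer with exactly one label: CI or CN.")

def rationale_augmented_label(client, transcript):
    rationale = client.chat.completions.create(
        model="llama-3.1-405b-instruct",   # assumed teacher
        messages=[{"role": "user",
                   "content": TEACHER_PROMPT.format(transcript=transcript)}],
    ).choices[0].message.content
    answer = client.chat.completions.create(
        model="llama-3.1-8b-instruct",     # assumed student
        messages=[{"role": "user",
                   "content": STUDENT_PROMPT.format(rationale=rationale,
                                                    transcript=transcript)}],
        temperature=0,                     # deterministic final label
    ).choices[0].message.content
    return answer.strip()
```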
Future Directions
The study concludes that properly adapted open-weight LLMs offer a scalable and effective approach for early cognitive screening based on speech. These AI-powered tools can complement existing diagnostic methods by identifying subtle linguistic changes, potentially leading to earlier detection and improved patient care. Future research should focus on combining LLM-based speech analysis with biological data, evaluating performance across diverse populations to ensure fairness, and addressing practical implementation challenges in clinical settings.
For more detailed information, you can read the full research paper available at arXiv.org.


