TL;DR: This research systematically evaluates how large language models (LLMs) can be adapted to detect Alzheimer’s disease and related dementias (ADRD) from speech. It compares in-context learning, reasoning-augmented prompting, and fine-tuning strategies across both text-only and multimodal models. The key finding is that token-level fine-tuning is generally most effective, enabling smaller open-weight LLMs to match or surpass commercial models. While multimodal models currently lag, the study highlights the potential of adapted LLMs for scalable, accessible cognitive screening.
Alzheimer’s disease and related dementias (ADRD) represent a significant global health challenge, with millions of individuals affected and a large percentage remaining undiagnosed. Early and scalable detection methods are crucial to address this growing concern. Traditional screening often misses subtle cognitive changes, but advancements in natural language processing (NLP) and large language models (LLMs) offer a promising new avenue: analyzing spontaneous speech.
This research paper, titled “Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies,” explores how different strategies for adapting LLMs can improve the detection of ADRD from speech. The study was conducted by a team of researchers including Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sepehr Karimi, Sina Rashidi, Ali Zolnour, Maryam Dadkhah, Yasaman Haghbin, Hossein AzadMaleki, and Maryam Zolnoori from Columbia University Irving Medical Center and the School of Nursing at Columbia University.
The primary goal of this study was to systematically compare LLM adaptation strategies for identifying ADRD from speech recordings. The researchers used the DementiaBank speech corpus, analyzing audio-recorded speech from 237 participants, including both cognitively impaired (CI) and cognitively normal (CN) individuals. They evaluated nine text-only LLMs, ranging from 3-billion-parameter to 405-billion-parameter models, as well as three multimodal audio-text models.
Exploring Adaptation Strategies
The study investigated four main adaptation strategies:
- In-Context Learning (ICL): This involved providing the LLMs with a few labeled examples (demonstrations) to guide their predictions. The researchers tested different ways of selecting these examples, such as choosing the most similar, the least similar, or examples representative of the average characteristics of each class (cognitively impaired or normal); a minimal selection sketch follows this list.
- Reasoning-Augmented Prompting: This strategy aimed to strengthen the LLMs’ reasoning by supplying explicit rationales, generated either by the models themselves or by larger, more capable ‘teacher’ models. Techniques such as ‘self-consistency’ (sampling multiple predictions and aggregating them) and ‘Tree-of-Thought’ (multi-step reasoning) were also explored; a self-consistency sketch appears after this list.
- Parameter-Efficient Fine-Tuning: This involved training the LLMs directly on the task of classifying speech as CI or CN. Two methods were compared: ‘token-level’ fine-tuning, where the model learns to emit a specific label token, and ‘classification head’ fine-tuning, where a small classification layer is added on top of the LLM; both setups are sketched after this list.
- Multimodal Audio-Text Integration: This component evaluated models that could process both audio and text simultaneously to see if acoustic information added value beyond just the transcribed text.
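To make centroid-based demonstration selection concrete, here is a minimal Python sketch that, for each class, picks the transcripts whose embeddings lie closest to that class’s mean embedding. The sentence-transformers encoder and all function and variable names are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of centroid-based demonstration selection for ICL.
# Assumptions: transcripts are plain strings, labels are "CI"/"CN",
# and an off-the-shelf sentence encoder stands in for whatever the
# paper actually used.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_centroid_demos(transcripts, labels, k_per_class=2):
    """Pick the k transcripts per class nearest that class's embedding centroid."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    embeddings = encoder.encode(transcripts, normalize_embeddings=True)
    demos = []
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        class_emb = embeddings[idx]
        centroid = class_emb.mean(axis=0)
        # With normalized embeddings, the dot product ranks by cosine similarity.
        sims = class_emb @ centroid
        for i in np.argsort(-sims)[:k_per_class]:
            demos.append((transcripts[idx[i]], label))
    return demos
```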
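Self-consistency is likewise simple to express: sample several completions at nonzero temperature and take a majority vote over the extracted labels. The sketch below assumes an OpenAI-compatible chat client and a deliberately crude label parser; both are placeholders rather than the paper’s exact setup.

```python
# Self-consistency sketch: sample n completions, majority-vote the labels.
# The client (OpenAI-compatible), model name, and label parsing are assumptions.
from collections import Counter

def self_consistent_label(client, prompt, n_samples=5, temperature=0.7):
    votes = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="llama-3-8b-instruct",       # assumed model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,           # >0 so the samples differ
        )
        text = resp.choices[0].message.content.lower()
        # Crude extraction; a real pipeline would parse the answer more carefully.
        votes.append("CI" if "impaired" in text else "CN")
    return Counter(votes).most_common(1)[0][0]  # majority label
```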
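The two fine-tuning setups differ mainly in the output layer: token-level tuning keeps the causal language-modeling head and trains the model to emit a label token, while the classification-head variant adds a small linear layer over the final hidden state. A rough sketch using Hugging Face transformers and peft (LoRA) follows; the base checkpoint and LoRA hyperparameters are placeholder assumptions, not the paper’s configuration.

```python
# Sketch contrasting token-level vs. classification-head fine-tuning with LoRA.
# The checkpoint and hyperparameters below are placeholders, not the paper's.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

BASE = "meta-llama/Llama-3.2-3B"  # assumed base checkpoint

# (a) Token-level: a causal LM trained so the next token after the prompt
# is a label token such as "CI" or "CN".
lm = AutoModelForCausalLM.from_pretrained(BASE)
lm = get_peft_model(lm, LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"]))

# (b) Classification head: a 2-way linear head over the final hidden state,
# trained with standard cross-entropy. (In practice, a pad token must be set
# for LLaMA-style models before sequence classification.)
clf = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)
clf = get_peft_model(clf, LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"]))
```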
Key Findings and Insights
The research yielded several important findings:
- Fine-Tuning Leads the Way: Overall, fine-tuning proved the most effective adaptation strategy. Notably, open-weight models like LLaMA 3B, LLaMA 70B, and LLaMA 8B achieved F1 scores (the harmonic mean of precision and recall) of 0.83, 0.83, and 0.81, respectively, matching or even outperforming commercial models like GPT-4o (F1 = 0.80). This suggests that specialized training can make accessible open-source models highly competitive.
- The Importance of Demonstration Selection: For in-context learning, the way demonstrations were selected significantly impacted performance. Examples chosen based on their ‘average similarity to class centroids’ (representing typical speech patterns for each group) consistently delivered the best results.
- Reasoning Benefits Smaller Models: Teacher-generated rationales, especially from powerful models like LLaMA 405B, improved the F1 scores of smaller models such as LLaMA 8B, indicating that structured reasoning can guide less capable models toward better predictions (see the sketch after this list). Self-consistency also helped stabilize predictions for the smallest model, LLaMA 3B.
- Tailoring Fine-Tuning: While token-level fine-tuning was generally superior, the study found that the ‘classification head’ approach dramatically improved models that struggled with token-level prediction, such as MedAlpaca 7B, raising its F1 score from 0.06 to 0.82. This highlights that the best fine-tuning method can be model-dependent.
- Multimodal Models Need More Work: Current multimodal LLMs, which integrate both audio and text, generally underperformed compared to the best text-only systems. Although Phi-4 Multimodal showed significant improvement after fine-tuning, it still lagged behind top text-based models. This suggests that better audio-text alignment and larger training datasets are needed for these models in clinical speech analysis.
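As an illustration of the teacher-rationale finding above, the sketch below has a large model write a rationale that is then prepended to a smaller model’s prompt before it emits a label. The prompts, model names, and chat client are illustrative assumptions, not the paper’s exact pipeline.

```python
# Teacher-generated rationale prompting: the teacher explains, the student decides.
# Client calls assume an OpenAI-compatible SDK; prompts and model names are placeholders.
TEACHER_PROMPT = ("Explain, step by step, which linguistic features of this "
                  "picture-description transcript suggest cognitive impairment "
                  "or normal cognition:\n\n{transcript}")
STUDENT_PROMPT = ("Rationale from an expert model:\n{rationale}\n\n"
                  "Transcript:\n{transcript}\n\n"
                  "Answer with exactly one label: CI or CN.")

def rationale_augmented_label(client, transcript):
    rationale = client.chat.completions.create(
        model="llama-3.1-405b-instruct",   # assumed teacher
        messages=[{"role": "user",
                   "content": TEACHER_PROMPT.format(transcript=transcript)}],
    ).choices[0].message.content
    answer = client.chat.completions.create(
        model="llama-3.1-8b-instruct",     # assumed student
        messages=[{"role": "user",
                   "content": STUDENT_PROMPT.format(rationale=rationale,
                                                    transcript=transcript)}],
        temperature=0,                     # deterministic final label
    ).choices[0].message.content
    return answer.strip()
```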
Future Directions
The study concludes that properly adapted open-weight LLMs offer a scalable and effective approach for early cognitive screening based on speech. These AI-powered tools can complement existing diagnostic methods by identifying subtle linguistic changes, potentially leading to earlier detection and improved patient care. Future research should focus on combining LLM-based speech analysis with biological data, evaluating performance across diverse populations to ensure fairness, and addressing practical implementation challenges in clinical settings.
For more detailed information, you can read the full research paper available at arXiv.org.


