Assessing Cognitive Impairment with Large Language Models Across Languages

TLDR: A new benchmark, CogBench, evaluates large language models (LLMs) for detecting cognitive impairment from speech across English and Mandarin. The study reveals that while traditional models struggle with generalization, LLMs, especially when enhanced with Chain-of-Thought prompting or fine-tuned with LoRA, demonstrate improved adaptability and generalization. This research marks a significant step towards developing more robust and linguistically diverse AI-assisted tools for early cognitive screening, despite current limitations in distinguishing normal speech variations from pathological signs.

Detecting cognitive impairment early is crucial for managing conditions like dementia. Traditionally, this has involved structured assessments by clinicians, which can be time-consuming and not always accessible. However, advancements in artificial intelligence, particularly with large language models (LLMs), offer a promising new path for non-invasive screening through spontaneous speech.

A new study introduces CogBench, which the authors describe as the first benchmark specifically designed to test how well large language models can assess cognitive impairment from speech across different languages and clinical environments. This matters because current AI methods often struggle to generalize: a model trained in one setting may perform poorly in another, especially when the language changes.

The researchers used a consistent approach to evaluate models on three speech datasets: ADReSSo (English), NCMMSC2021-AD (Mandarin), and a newly collected Mandarin dataset called CIR-E. These datasets cover both binary (e.g., Alzheimer’s Disease vs. Non-Alzheimer’s Disease) and ternary (e.g., Alzheimer’s Disease, Mild Cognitive Impairment, Healthy Control) classification tasks.
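To make the two task formats concrete, here is a minimal Python sketch of how the label spaces and dataset languages could be represented. The dataset names and languages come from the description above; the label strings, the `TASKS` and `DATASET_LANGUAGE` names, and the `is_valid_prediction` helper are hypothetical. The article does not say which dataset uses which task, so no such mapping is assumed here.

```python
# Label spaces for the two task formats described above.
# The label strings are illustrative abbreviations, not the benchmark's own.
TASKS = {
    "binary": ("AD", "non-AD"),
    "ternary": ("AD", "MCI", "HC"),
}

# Dataset languages as stated in the article.
DATASET_LANGUAGE = {
    "ADReSSo": "English",
    "NCMMSC2021-AD": "Mandarin",
    "CIR-E": "Mandarin",
}


def is_valid_prediction(task: str, label: str) -> bool:
    """Return True if `label` belongs to the label space of `task`."""
    return label in TASKS[task]
```

Keeping the label space explicit makes it easy to reject a model output that falls outside the task, such as an "MCI" answer on a binary task.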

The study found that conventional deep learning models, while effective in their training domain, showed a substantial drop in performance when applied to new datasets or languages. This highlights a major challenge in deploying these models in diverse real-world scenarios.
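One simple way to quantify this kind of drop is to compare accuracy on the training domain with accuracy on a new dataset. The sketch below assumes accuracy as the metric, which the article does not specify; the function names are hypothetical and the labels are toy values invented to exercise the code, not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def generalization_gap(in_domain_acc, cross_domain_acc):
    """Accuracy lost when moving from the training domain to a new one."""
    return in_domain_acc - cross_domain_acc


# Toy labels, invented purely to exercise the functions (not study results).
in_true = ["AD", "HC", "AD", "HC"]
in_pred = ["AD", "HC", "AD", "AD"]      # 3 of 4 correct
out_true = ["AD", "HC", "AD", "HC"]
out_pred = ["HC", "HC", "HC", "AD"]     # 1 of 4 correct

gap = generalization_gap(accuracy(in_true, in_pred), accuracy(out_true, out_pred))
print(gap)  # 0.5
```

A larger gap means the model depends more heavily on properties of the dataset it was trained on.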

In contrast, large language models demonstrated better adaptability. The researchers explored different prompting strategies for LLMs. They found that Chain-of-Thought (CoT) prompting, which guides the model to reason step-by-step, significantly improved performance, particularly on the English dataset. This suggests that encouraging LLMs to think through their decisions can make them more effective in complex cognitive assessment tasks.
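The difference between a direct prompt and a Chain-of-Thought prompt can be sketched as follows. The prompt wording, the reasoning steps, and the function names are all assumptions made for illustration; the article does not reproduce the prompts used in the study.

```python
def build_direct_prompt(transcript: str, labels: list[str]) -> str:
    """Ask for a label with no intermediate reasoning."""
    options = ", ".join(labels)
    return (
        "You are assisting with cognitive screening research.\n"
        f"Transcript:\n{transcript}\n\n"
        f"Answer with exactly one label from: {options}."
    )


def build_cot_prompt(transcript: str, labels: list[str]) -> str:
    """Ask the model to reason step by step before giving a label."""
    options = ", ".join(labels)
    return (
        "You are assisting with cognitive screening research.\n"
        f"Transcript:\n{transcript}\n\n"
        "Think step by step:\n"
        "1. Note fluency features such as pauses and repetitions.\n"
        "2. Assess how complete and relevant the content is.\n"
        "3. Weigh the evidence for each label.\n"
        f"Finish with 'Final answer: <label>' using one of: {options}."
    )


def parse_final_answer(response: str, labels: list[str]):
    """Extract the label after 'Final answer:', or None if absent or invalid."""
    marker = "Final answer:"
    if marker not in response:
        return None
    candidate = response.rsplit(marker, 1)[1].strip().strip(".")
    return candidate if candidate in labels else None
```

Asking for a fixed "Final answer:" line keeps the free-form reasoning separate from the label, so the output can still be scored automatically.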

The study also investigated lightweight fine-tuning of LLMs using Low-Rank Adaptation (LoRA). This technique improved generalization without the high computational cost of retraining the entire model. Fine-tuned LLMs consistently outperformed LLMs used without fine-tuning across all datasets and generalized better than traditional small-scale models, especially on the new CIR-E test set.
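The idea behind LoRA can be illustrated with a small pure-Python sketch: the pretrained weight matrix `W` stays frozen, and only two small matrices `A` and `B` are trained, so the layer computes `W x + (alpha / r) * B (A x)`. The dimensions, values, and function names below are illustrative assumptions, not the configuration used in the study.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), with W frozen.

    W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return d_in * d_out, r * (d_in + d_out)


# Tiny example: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen pretrained weights
A = [[0.5, 0.5, 0.5]]                     # trainable, r x d_in
B = [[0.0], [0.0]]                        # trainable, zero at initialization
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B, x, alpha=8))  # [7.0, 2.0]
print(trainable_params(4096, 4096, 8))    # (16777216, 65536)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pretrained one. For a 4096 by 4096 matrix at rank 8, the trainable parameter count falls from about 16.8 million to 65,536, which is where the savings over full retraining come from.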

While LLMs show great promise, the research also identified limitations. Current LLMs can be overly sensitive to normal speech variations, sometimes misinterpreting natural pauses or minor disfluencies as signs of impairment. They might also over-rely on surface-level fluency, failing to detect deeper content deficits that are indicative of cognitive decline. For instance, a subject with Alzheimer’s might speak fluently but provide very brief or repetitive descriptions, which some LLMs could misclassify as normal.
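The fluent-but-repetitive case can be made concrete with a couple of crude content measures. This sketch is not from the study: the `content_stats` function and its two measures are hypothetical, and the regular expression only handles English text, so Mandarin transcripts would need word segmentation first.

```python
import re


def content_stats(transcript: str) -> dict:
    """Crude content measures for a picture-description transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    total = len(words)
    unique = len(set(words))
    return {
        "word_count": total,
        # Share of words that repeat an earlier word (0.0 for empty input).
        "repetition_ratio": (total - unique) / total if total else 0.0,
    }


fluent_but_repetitive = "the boy the boy is there the boy is there"
print(content_stats(fluent_but_repetitive))
# {'word_count': 10, 'repetition_ratio': 0.6}
```

A transcript like this one contains no pauses or disfluency markers, yet more than half of its words are repeats, which is the kind of content deficit a fluency-focused judgment could miss.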

Future work aims to address these challenges by integrating more patient information (like age, gender, and health records) into the assessment process and extracting specific acoustic features related to speech fluency and vocal intensity. This will help LLMs differentiate between normal variations and pathological signs, leading to more accurate and clinically useful assessment tools. For more details, you can refer to the full research paper: CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment.
