
AI Breakthrough: Large Language Models Accurately Measure Schizophrenia Risk Symptoms

TLDR: A new study demonstrates that large language models (LLMs) can accurately predict symptom severity in patients at clinical high risk for schizophrenia, using only clinical interview transcripts. The LLM’s performance, especially with semi-structured interviews, approaches human rater reliability. Key findings include the LLM’s ability to process foreign language transcripts and its improved accuracy when provided with a patient’s past clinical data. This research suggests LLMs could standardize and streamline symptom assessment, facilitating earlier and more precise interventions for individuals at risk.

Monitoring symptom severity in patients at clinical high risk (CHR) for schizophrenia is crucial for timely and effective treatment. Traditionally, tools like the Brief Psychiatric Rating Scale (BPRS) are used, but their administration requires lengthy, structured interviews, limiting their use in everyday clinical practice.

A recent study, titled “Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia” by Andrew X. Chen, Guillermo Horga, and Sean Escola, explores a promising new approach: leveraging large language models (LLMs) to predict BPRS scores directly from clinical interview transcripts. This innovative research utilized data from 409 CHR patients within the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort.

LLMs in Action: Predicting Symptom Severity

The researchers fed de-identified clinical interview transcripts into an LLM (specifically, OpenAI’s o3-mini-2025-01-31 model). These interviews were not specifically designed to measure BPRS scores, yet the LLM was tasked with predicting them in a “zero-shot” manner, meaning it received no fine-tuning on this dataset and no task-specific examples in its prompt. The study analyzed two types of interviews: semi-structured PSYCHS interviews and more free-form open interviews.
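To make this concrete, the sketch below shows what such a zero-shot setup could look like using the OpenAI Python SDK. The prompt wording, the subset of BPRS items, and the helper function predict_bprs are illustrative assumptions, not the authors’ actual protocol.

```python
# A minimal sketch of zero-shot BPRS prediction from a transcript, assuming the
# OpenAI Python SDK; prompt wording and output format are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Subset of BPRS items for illustration; the full scale has 18+ items.
BPRS_ITEMS = ["Anxiety", "Depression", "Suspiciousness", "Hallucinations"]

def predict_bprs(transcript: str) -> str:
    """Ask the model to rate each BPRS item (1-7) from a de-identified transcript."""
    prompt = (
        "You are a trained psychiatric rater. Read the clinical interview transcript "
        "below and rate each of the following BPRS items on a 1-7 scale, giving a "
        "brief justification for each rating.\n\n"
        f"Items: {', '.join(BPRS_ITEMS)}\n\n"
        f"Transcript:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="o3-mini-2025-01-31",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with a placeholder transcript file:
# print(predict_bprs(open("interview_transcript.txt").read()))
```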

The results were compelling. For PSYCHS transcripts, the LLM’s predictions showed a median concordance of 0.84 and an Intraclass Correlation Coefficient (ICC) of 0.73. These figures are remarkably close to reported human inter- and intra-rater reliability for BPRS assessments, which typically show a median concordance of 0.83 and an ICC of 0.70. This suggests that LLMs can infer symptom severity with an accuracy comparable to human experts.
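For readers less familiar with these agreement statistics, the sketch below shows one common way to compute a concordance coefficient (Lin’s CCC) and an ICC between LLM-predicted and human-rated scores. The study’s exact metric definitions, the ICC variant, and the example scores are assumptions made here for illustration.

```python
# A hedged sketch of quantifying agreement between LLM and human BPRS totals.
import numpy as np
import pandas as pd
import pingouin as pg

def lins_ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between two score vectors."""
    cov = np.cov(x, y, bias=True)[0, 1]
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Hypothetical per-patient total BPRS scores (not from the study).
human = np.array([32, 41, 28, 37, 45], dtype=float)
llm   = np.array([30, 43, 27, 39, 44], dtype=float)

print("Concordance (CCC):", round(lins_ccc(human, llm), 2))

# ICC treats the human rater and the LLM as two raters scoring the same patients.
long = pd.DataFrame({
    "patient": list(range(len(human))) * 2,
    "rater":   ["human"] * len(human) + ["llm"] * len(llm),
    "score":   np.concatenate([human, llm]),
})
icc = pg.intraclass_corr(data=long, targets="patient", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```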

However, the LLM’s performance was less accurate with open interview transcripts, which are less structured and vary widely in content. In these cases, the model tended to underestimate total BPRS scores, particularly for affective symptoms like anxiety and depression, and positive symptoms such as suspiciousness and hallucinations. This highlights the importance of structured information for accurate assessment.

Beyond English: Cross-Language Assessment

One of the most exciting findings was the LLM’s ability to assess symptoms from foreign language transcripts (primarily Spanish and Korean) using the same English-language prompt. The model performed comparably well, demonstrating a median concordance of 0.89 and an ICC of 0.70 for PSYCHS interviews in foreign languages. The LLM even seamlessly integrated foreign words into its English explanations, showcasing its latent cross-language capability. This feature could significantly standardize symptom assessment across diverse, multi-national research collaborations.

Integrating Longitudinal Data for Improved Accuracy

The study also explored how providing the LLM with previous interview transcripts and their corresponding BPRS scores (longitudinal data) could enhance performance. In patients with multiple time points, a “one-shot” learning approach, where the LLM was given a previous transcript and its true score, resulted in the most accurate predictions. This suggests that LLMs can adapt to an individual’s specific treatment course, a long-standing goal in precision psychiatry.
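A hedged sketch of how such a one-shot prompt might be assembled is shown below; the prompt structure and function names are illustrative, not taken from the paper.

```python
# A minimal sketch of the "one-shot" setup described above, again assuming the
# OpenAI Python SDK; the prompt layout is an assumption, not the authors' protocol.
from openai import OpenAI

client = OpenAI()

def predict_bprs_one_shot(prev_transcript: str, prev_scores: str, new_transcript: str) -> str:
    """Provide a prior interview and its true BPRS ratings as an in-context example,
    then ask for ratings on the patient's most recent interview."""
    prompt = (
        "You are a trained psychiatric rater scoring the BPRS (items rated 1-7).\n\n"
        "Example from an earlier interview with this patient:\n"
        f"Transcript:\n{prev_transcript}\n\n"
        f"True BPRS ratings:\n{prev_scores}\n\n"
        "Now rate the patient's most recent interview in the same format.\n\n"
        f"Transcript:\n{new_transcript}"
    )
    response = client.chat.completions.create(
        model="o3-mini-2025-01-31",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```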


Implications and Future Directions

This research provides a strong proof of concept that contemporary LLMs can approximate expert human ratings on complex psychiatric scales using routine clinical text, even across different languages. By potentially lowering the logistical barriers to structured symptom assessment, this approach could accelerate research, harmonize multi-site collaborations, and ultimately support earlier and more precise interventions for individuals at risk of schizophrenia.

While promising, the study acknowledges limitations, including sensitivity to prompt wording, the need for more complete data on other disease-specific scales, and the current sparsity of very long-term longitudinal data. Future work may involve integrating other data modalities like video or audio, and fine-tuning LLMs with larger datasets to further improve performance and detect subtle changes in symptom severity over time.

For more detailed information, you can read the full research paper here.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
