TLDR: A new framework combines traditional rating scales with text data scored by large language models (LLMs) to create augmented psychological tests. This approach, demonstrated with depression assessments, significantly improves measurement precision and accuracy by empirically selecting informative LLM-derived items, bypassing the need for extensive human-labeled data or complex rubrics. It offers a scalable way to leverage natural language for more holistic psychological evaluations.
Psychological assessments often rely on structured rating scales, which are effective but can miss the rich details found in a person’s natural language. Imagine trying to express complex feelings using only a few numbered options – a lot of nuance can be lost. A new study introduces an innovative framework that combines traditional rating scales with text data scored by large language models (LLMs) to create more comprehensive and accurate psychological tests.
This novel approach aims to capture the “lost nuance” in psychological assessments. Instead of forcing individuals to translate their complex emotions into simple numbers, the framework uses LLMs to derive new “LLM items” from qualitative responses, such as essays or sentence completions. These LLM items are then integrated with the original rating-scale items to form an “augmented test.” The researchers demonstrated this method using depression as a case study, applying it to a real-world sample of 693 upper secondary students and a synthetic dataset of 3,000 individuals.
The core idea behind this framework is to enhance existing measures without changing the underlying psychological construct they are designed to measure. The process begins with collecting responses to both traditional rating scales and qualitative text from the same individuals. An LLM then scores this qualitative text using various prompting strategies, generating a pool of candidate LLM items. The most informative of these candidate items are then empirically selected and integrated into the augmented test. This selection is based on how much psychometric information they provide about the target trait, rather than relying on pre-labeled data or complex, expert-created rubrics, which are common bottlenecks in traditional automated scoring.
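One way to picture the empirical selection step is through item information in item response theory. The sketch below is an illustration, not the authors' procedure: it assumes a two-parameter logistic (2PL) model with binary items, and the candidate names and parameter values are invented for the example. It ranks candidate LLM items by their average Fisher information over a grid of trait levels and keeps the top few.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta.

    a is the discrimination and b the difficulty (both assumed known here).
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def mean_information(a, b, grid):
    """Average item information across a grid of trait levels."""
    return sum(item_information(t, a, b) for t in grid) / len(grid)


def select_items(candidates, k, grid):
    """Return the names of the k candidates with the highest mean information."""
    ranked = sorted(
        candidates,
        key=lambda name: mean_information(*candidates[name], grid),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical candidate LLM items: name -> (discrimination a, difficulty b).
CANDIDATES = {
    "essay_prompt_v1": (1.8, 0.0),
    "essay_prompt_v2": (0.4, 0.2),
    "sentence_completion_v1": (1.2, -0.5),
    "sentence_completion_v2": (0.9, 2.5),
}

# Trait levels from -3.0 to 3.0 in steps of 0.1.
THETA_GRID = [x / 10.0 for x in range(-30, 31)]

selected = select_items(CANDIDATES, 2, THETA_GRID)
print(selected)
```

In this toy pool, the weakly discriminating candidate and the one that is only informative at extreme trait levels are dropped. No human-assigned labels enter the ranking, which mirrors the idea that selection rests on psychometric information alone.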
On held-out test sets, the augmented tests showed statistically significant improvements in both measurement precision (how consistently a test measures something) and accuracy (how close the measurement is to the true value). For the real-world data, the information gained from the LLM items was equivalent to adding about 6.3 average rating-scale items to the original 19-item test. In the synthetic data, the gain was larger still, equivalent to adding 16.0 items. The improvements were most noticeable in the early stages of computerized adaptive testing (CAT) simulations, where the LLM items provided an initial approximation of a respondent’s trait level and so made the subsequent selection of rating-scale items more efficient.
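The "equivalent items" figure can be read as a ratio: the information added by the LLM items divided by the information an average rating-scale item contributes. The sketch below illustrates that reading under assumptions that do not come from the study: information is treated as additive across items, as in standard item response theory, and the standard error of a trait estimate is taken as one over the square root of total information. The per-item information value is a placeholder; only the 19 items and the 6.3 figure come from the article.

```python
import math


def equivalent_items(info_gain, mean_item_info):
    """Express an information gain as a number of average items."""
    return info_gain / mean_item_info


def standard_error(total_info):
    """Standard error of a trait estimate given total test information."""
    return 1.0 / math.sqrt(total_info)


N_ITEMS = 19           # length of the original rating scale (from the article)
EQUIV_ITEMS = 6.3      # reported gain in average-item units (from the article)
MEAN_ITEM_INFO = 0.5   # hypothetical information per average item

base_info = N_ITEMS * MEAN_ITEM_INFO
augmented_info = base_info + EQUIV_ITEMS * MEAN_ITEM_INFO

# Both ratios are independent of the placeholder MEAN_ITEM_INFO.
relative_gain = augmented_info / base_info - 1.0
se_ratio = standard_error(augmented_info) / standard_error(base_info)

print(f"relative information gain: {relative_gain:.3f}")
print(f"standard error ratio: {se_ratio:.3f}")
```

Under these assumptions, 6.3 extra item-equivalents on a 19-item test correspond to roughly a one-third increase in information and a standard error about 13% smaller, whatever value the placeholder takes.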
The researchers highlight that their framework represents a conceptual shift in automated scoring. Instead of trying to make LLMs perfectly replicate human judgment – which often requires extensive human-labeled training data or detailed rubrics – their method focuses on empirically selecting LLM items that provide the most psychometric information. This computational approach minimizes the reliance on time-intensive human expertise, offering a scalable way to leverage the growing amount of transcribed text data.
The potential applications of this framework are broad. Beyond clinical health, where it could enhance psychiatric evaluations by integrating interview transcripts, it could be used in organizational psychology to assess leadership ability from peer feedback, or in marketing to refine seller proficiency scores by combining customer reviews with ratings. It could even derive more value from legacy social science datasets, such as creating a Governmental Trust measure from open-ended responses in election studies.
While the study focused on depression and a specific demographic, the authors suggest future research could explore larger and more diverse candidate LLM item pools, using different LLMs or a wider array of scoring instructions. They also recommend replicating the approach with diverse populations and other psychological constructs to examine its generalizability. This innovative framework offers a promising pathway towards more holistic psychological assessments that effectively utilize the rich information embedded in natural language.


