TLDR: A new research paper highlights the limitations of applying human-designed psychological questionnaires to Large Language Models (LLMs). The study found that traditional psychometric tests lack “ecological validity,” meaning they don’t accurately reflect how LLMs operate in real-world scenarios. Compared to ecologically valid assessments like Value Portrait, established questionnaires yield different psychological profiles, suffer from unstable measurements due to insufficient items, create false impressions of consistent traits (often from LLMs recognizing expected answers), and produce misleading or exaggerated results when LLMs are given specific personas. The findings urge researchers to be cautious and adopt more context-aware methods for evaluating LLM psychology.
A recent research paper, “Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models” by Dongmin Choi, Woojung Song, Jongwook Han, Eun-Ju Lee, and Yohan Jo, examines a critical issue in the evaluation of Large Language Models (LLMs): whether traditional psychological assessments are suitable for them.
For some time, researchers have been using established psychometric questionnaires, like the Big Five Inventory (BFI) for personality traits and the Portrait Values Questionnaire (PVQ) for values, to understand the psychological characteristics reflected in LLM responses. This trend emerged as LLMs evolved beyond simple assistants to handle complex interactions such as emotional support, role-playing, and ethical reasoning. However, a significant concern has been raised regarding the application of these human-designed questionnaires to AI: their lack of ecological validity.
What is Ecological Validity?
Ecological validity refers to how well survey questions reflect and resemble the real-world contexts in which LLMs actually operate and generate text in response to user queries. Traditional questionnaires, designed for humans, often present hypothetical self-report items that may not be applicable or meaningful in an LLM’s typical conversational environment. For instance, an item like “I see myself as someone who gets nervous easily” asks for a self-description that has little in common with the queries an LLM handles in real-world use.
The Research Approach
To address this gap, the researchers conducted a comprehensive comparative analysis between established questionnaires and ecologically valid ones. For the latter, they leveraged a dataset called Value Portrait (Han et al., 2025). Unlike traditional questionnaires, Value Portrait uses real-world queries and scenarios derived from human-LLM conversations and human-human advisory contexts (like Reddit and Dear Abby archives) to elicit value-laden responses. This approach aims to capture how LLMs express values and personality traits in more realistic settings.
The study involved 10 diverse LLMs, including models from the GPT-4, Gemini-2.5, Llama-3.1, Qwen-2.5, and Qwen3 families. Each model was prompted to respond to both established psychometric questionnaires (PVQ-21, PVQ-40, BFI, BFI-10) and Value Portrait items.
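To make the scoring step concrete, here is a minimal sketch of how Likert-style answers to a questionnaire could be turned into per-construct scores. It is not the paper's scoring code: the item list, the 1-to-5 scale, the `score_responses` helper, and the example ratings are all hypothetical, and averaging items per construct is just one common convention.

```python
from statistics import mean

# Hypothetical items on a 5-point Likert scale (assumption, not the BFI/PVQ keys).
# "reverse" marks reverse-coded items, where a high rating means a LOW trait level.
ITEMS = [
    {"id": "q1", "construct": "extraversion", "reverse": False},
    {"id": "q2", "construct": "extraversion", "reverse": True},
    {"id": "q3", "construct": "neuroticism", "reverse": False},
    {"id": "q4", "construct": "neuroticism", "reverse": True},
]

SCALE_MIN, SCALE_MAX = 1, 5


def score_responses(responses):
    """Average the ratings per construct, flipping reverse-coded items first."""
    by_construct = {}
    for item in ITEMS:
        rating = responses[item["id"]]
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating out of range for {item['id']}: {rating}")
        if item["reverse"]:
            # On a 1..5 scale this maps 1<->5 and 2<->4, and leaves 3 unchanged.
            rating = SCALE_MIN + SCALE_MAX - rating
        by_construct.setdefault(item["construct"], []).append(rating)
    return {construct: mean(vals) for construct, vals in by_construct.items()}


# Illustrative ratings only, not real model output.
example = {"q1": 4, "q2": 2, "q3": 1, "q4": 5}
profile = score_responses(example)
print(profile)
```

The same routine would be run once per model and per questionnaire, so that the resulting profiles can be compared across assessment types.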
Key Findings: Limitations of Established Questionnaires
The analysis revealed several significant limitations when using established psychological questionnaires for LLMs:
1. Different Psychological Profiles: The study found that established questionnaires yield substantially different psychological profiles for LLMs compared to ecologically valid ones. This suggests that results from traditional tests may not accurately translate to how LLMs behave in real-world contexts. For example, while some values like Universalism and Benevolence showed moderate consistency across both types of assessments (possibly due to LLM alignment processes), personality traits often showed low correlations, indicating a divergence in how these traits are expressed.
2. Insufficient Items for Stable Measurement: Established questionnaires generally exhibited higher measurement uncertainty, characterized by wider confidence intervals. This indicates that the limited number of items in these questionnaires might be insufficient for stably measuring psychological constructs in LLMs. Unlike humans, LLMs do not suffer from attention fatigue, implying that longer, more comprehensive assessments could be used to achieve more stable measurements.
3. False Impressions of Consistency: While LLMs often show high consistency in their responses to established questionnaires, this consistency appears to stem from the models recognizing what is being measured and providing expected answers, rather than possessing stable psychological constructs. In contrast, ecologically valid questionnaires showed lower consistency. The study also noted that LLMs struggled with reverse-coded items in some traditional tests, misinterpreting them as indicating high trait levels regardless of the item’s true direction.
4. Exaggerated Profiles for Persona-Prompted LLMs:
- Misleading Persona Understanding: When LLMs were prompted to adopt specific personas (e.g., hero vs. villain), established questionnaires produced theoretically consistent distinctions (e.g., villains higher in Power, heroes in Benevolence). However, Value Portrait assessments showed heroes scoring uniformly higher across *all* constructs, even those typically associated with villains. This suggests that LLMs might be pattern-matching learned associations in traditional tests while failing to apply these distinctions in real-world scenarios.
- Exaggerated Persona Bias: The research also found that established questionnaires exaggerated biases induced by demographic personas (gender, age, religion, political view, educational attainment) compared to human data and Value Portrait. This implies that the apparent biases detected by traditional questionnaires might not be directly expressed in real-world queries, highlighting the importance of selecting appropriate assessment tools.
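Two of the quantities discussed in the findings above, agreement between profiles and measurement uncertainty, can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's analysis: a plain Pearson correlation stands in for whatever agreement statistic the authors used, the confidence interval is a simple percentile bootstrap, and the item scores are synthetic.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_ci(item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a construct's item scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(item_scores)
    means = sorted(
        mean(rng.choices(item_scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic item scores (assumption): same spread, different number of items.
short_items = [1, 5, 2, 4]      # a construct measured with 4 items
long_items = short_items * 10   # the same construct measured with 40 items

short_ci = bootstrap_ci(short_items)
long_ci = bootstrap_ci(long_items)
print("4 items :", short_ci)
print("40 items:", long_ci)

# Hypothetical per-construct scores for one model under two assessment types.
established = [4.5, 4.0, 2.0, 3.5]
ecological = [3.0, 3.5, 3.2, 2.8]
print("profile correlation:", pearson(established, ecological))
```

With the same spread of item scores, the interval for the 40-item construct comes out narrower than for the 4-item one, which is the intuition behind the point about insufficient items. A low or negative profile correlation would correspond to the divergence between assessment types described in the first finding.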
Conclusion and Future Directions
Overall, the paper cautions against the uncritical use of established psychological questionnaires for LLMs. The findings suggest that these tools can lead to misleading conclusions about an LLM’s psychological characteristics, measurement stability, consistency, and persona understanding. The work provides crucial guidance for researchers aiming to investigate psychological constructs in LLM outputs, emphasizing the need for ecologically valid assessment methods that reflect how LLMs interact in real-world contexts.
For more details, see the full research paper.


