TLDR: This research introduces a computational framework called SIMPA for interpretable text-based personality assessment from social media. It addresses challenges like data scarcity and model interpretability by creating two new datasets, MBTI9k and PANDORA, from Reddit. The study demonstrates how linguistic features and demographic data can predict personality traits, highlights biases in gender classification linked to personality, and shows that SIMPA, by matching user statements to validated personality items, significantly improves prediction accuracy and interpretability for Big Five traits.
Understanding personality is fundamental to human interaction, influencing everything from teaching styles to team selection and personal relationships. Historically, personality assessment relied on observable cues like speech and language. With the advent of the digital age, particularly social media, automated methods for personality assessment have become increasingly important. However, this field faces significant challenges, including a scarcity of large, labeled datasets, a disconnect between personality psychology and natural language processing (NLP), and issues with model validity and interpretability.
A recent doctoral thesis by Matej Gjurković, titled "A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media", takes on these challenges. The research highlights the central role of language in personality theory and assessment, and moves beyond traditional methods to make use of the large volumes of text available online.
New Datasets from Reddit
The research identifies Reddit as a promising, yet underutilized, data source for personality analysis. Reddit’s user anonymity, diverse discussion topics across thousands of subreddits, and the sheer volume of user-generated text offer a rich environment for extracting personality cues. To overcome the limitations of existing datasets, two new datasets were developed: MBTI9k and PANDORA.
The MBTI9k dataset is based on the Myers-Briggs Type Indicator (MBTI) and contains over 22 million comments from more than 13,000 users. Users’ MBTI types were primarily identified through ‘flairs’ – self-declared labels in subreddits. The PANDORA dataset expands on this by integrating the more scientifically accepted Big Five personality model scores, along with crucial demographic information like gender, age, and location, from over 10,000 users and 17 million comments. These datasets are significant because they provide a large-scale, demographically enriched resource for personality research, allowing for a deeper understanding of how language, demographics, and personality interact.
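To make the flair-based labeling concrete, here is a minimal Python sketch of how a four-letter MBTI type might be pulled from a flair string. The regular expression, the function name mbti_from_flair, and the rule of discarding flairs that name more than one type are all assumptions made for illustration; the thesis's actual extraction pipeline may differ.

```python
import re
from typing import Optional

# An MBTI type is one letter from each of four pairs: I/E, N/S, T/F, J/P.
# Word boundaries keep the pattern from matching inside longer tokens.
MBTI_PATTERN = re.compile(r"\b([IE][NS][TF][JP])\b", re.IGNORECASE)


def mbti_from_flair(flair: str) -> Optional[str]:
    """Return the MBTI type declared in a flair, or None if absent or ambiguous.

    Hypothetical helper: not taken from the thesis.
    """
    found = {match.upper() for match in MBTI_PATTERN.findall(flair)}
    # Assumption: a flair naming several different types is too ambiguous to use.
    return found.pop() if len(found) == 1 else None


if __name__ == "__main__":
    for flair in ["INTP | 25 | M", "intj-a", "INTP or INTJ?", "just lurking"]:
        print(repr(flair), "->", mbti_from_flair(flair))
```

A conservative rule like this trades coverage for label quality, which matters when self-declared flairs are the only source of ground truth.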
Insights from Data Analysis
Analysis of these datasets revealed unique personality distributions among Reddit users. For instance, MBTI9k showed a dominance of introverted, intuitive, thinking, and perceiving types, suggesting that Reddit users tend to be more educated and intellectually engaged than the general population. PANDORA indicated that the average Reddit user has higher openness and lower extraversion, agreeableness, and conscientiousness compared to the general population. Demographic data also showed that most users are from English-speaking countries, with males being slightly younger and scoring lower on agreeableness than females.
The research also explored various linguistic and behavioral features for personality prediction. It found that different features are relevant for different personality dimensions. For example, social and family-related words are strong indicators of Extraversion, while complex and longer words are associated with Intuitive types. Temporal patterns, such as posting habits on certain days of the week or months, also showed correlations with personality traits.
Addressing Bias and Improving Prediction
Experiments demonstrated the potential for predicting Redditor personalities using relatively simple features. A key finding was the importance of considering demographic variables as confounding factors. For example, a gender classification model trained on Reddit data showed biases, misclassifying individuals with certain personality trait combinations more frequently. This highlights the need for comprehensive psychodemographic profiles to detect and mitigate biases in machine learning models, especially since Reddit data is often used to train conversational AI.
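One simple way to surface this kind of bias is to compare a classifier's error rate across personality groups. The sketch below assumes each record is a (group, true label, predicted label) triple; the function name, the group names, and the toy records are hypothetical and are not results from the thesis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (personality group, true label, predicted label)


def error_rate_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Compute the misclassification rate separately for each group."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        errors[group] += int(truth != predicted)
    return {group: errors[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Made-up toy records, only to show the expected input shape.
    toy_records = [
        ("high_agreeableness", "F", "F"),
        ("high_agreeableness", "M", "F"),
        ("low_agreeableness", "M", "M"),
        ("low_agreeableness", "F", "F"),
    ]
    print(error_rate_by_group(toy_records))
```

A large gap between groups would suggest that the model's errors are entangled with personality, which is the kind of confound a psychodemographic profile helps to expose.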
The study also explored predicting Big Five traits using MBTI data, leveraging the higher availability of MBTI labels. By treating this as a domain adaptation problem, the research showed that MBTI predictions could significantly assist in predicting Big Five traits, despite the MBTI’s questionable psychological validity.
The SIMPA Framework for Interpretable Assessment
To address the challenges of interpretability and validity, the thesis introduces the Statement-to-Item Matching Personality Assessment (SIMPA) framework. Inspired by the Realistic Accuracy Model (RAM), SIMPA breaks down personality assessment into four stages: relevance, availability, detection, and utilization, with a crucial feedback loop for iterative refinement.
SIMPA focuses on matching ‘Trait-Indicative Statements’ (TISes) – user-generated text reflecting personality – with ‘Trait-Relevant Statements’ (TRSes) – validated questionnaire items. This approach aims to provide transparent and explainable personality assessments by explicitly linking textual cues to specific personality traits at granular levels (nuances, facets, domains).
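The matching idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it scores similarity with bag-of-words cosine similarity, whereas the thesis may rely on a different relevance function; the function names, the 0.3 threshold, and the example items are hypothetical stand-ins written in the style of questionnaire items.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_statements(
    statements: List[str],
    items: Dict[str, str],  # item text -> trait label
    threshold: float = 0.3,  # assumed cut-off, not a value from the thesis
) -> List[Tuple[str, str, str, float]]:
    """Link each user statement to its most similar item, if similar enough."""
    matches = []
    for statement in statements:
        vec = bag_of_words(statement)
        best = max(items, key=lambda item: cosine(vec, bag_of_words(item)))
        score = cosine(vec, bag_of_words(best))
        if score >= threshold:
            # Keeping the matched item makes every trait score traceable.
            matches.append((statement, best, items[best], score))
    return matches


if __name__ == "__main__":
    example_items = {
        "Love large parties": "Extraversion",
        "Get chores done right away": "Conscientiousness",
    }
    comments = ["I really love large parties with friends", "The weather is nice"]
    for match in match_statements(comments, example_items):
        print(match)
```

Because each match records which item a statement was linked to, any downstream trait estimate can be explained by pointing at the specific statement and item pairs behind it.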
The implementation of SIMPA explored different types of TRSes, including items from the IPIP-NEO questionnaire, expert-crafted statements (eTRSes), and statements generated by Large Language Models (LLMs). Expert-crafted eTRSes proved more effective at matching social media language. LLMs could both generate and evaluate TRSes, but their performance was only comparable to that of an average human annotator, which suggests that expert oversight is still needed.
In supervised personality assessment, incorporating SIMPA’s TIS-based features significantly improved the state-of-the-art prediction accuracy for all Big Five traits on the PANDORA dataset, particularly for Extraversion. The framework also showed promise for unsupervised personality assessment, demonstrating convergent and discriminant validity.
Future Directions
While the research acknowledges limitations, such as the inherent biases in social media data and the simplifications made in the SIMPA implementation, it lays a strong foundation for future work. The SIMPA framework is highly versatile and can be applied to various constructs beyond personality, especially those with complex taxonomies and layered cues. Future research will focus on refining the relevance scoring function, leveraging advanced LLMs for more effective TIS detection, and developing agentic workflows for dynamic adaptation.
This work represents a significant step towards developing automated personality assessment systems that combine advanced NLP technologies with the rigorous scientific standards of personality psychology, offering more interpretable, valid, and efficient tools for understanding human personality from digital footprints.


