Unlocking Personality: A New Framework for Text-Based Assessment from Social Media

TLDR: This research introduces a computational framework called SIMPA for interpretable text-based personality assessment from social media. It addresses challenges like data scarcity and model interpretability by creating two new datasets, MBTI9k and PANDORA, from Reddit. The study demonstrates how linguistic features and demographic data can predict personality traits, highlights biases in gender classification linked to personality, and shows that SIMPA, by matching user statements to validated personality items, significantly improves prediction accuracy and interpretability for Big Five traits.

Understanding personality is fundamental to human interaction, influencing everything from teaching styles to team selection and personal relationships. Historically, personality assessment relied on observable cues like speech and language. With the advent of the digital age, particularly social media, automated methods for personality assessment have become increasingly important. However, this field faces significant challenges, including a scarcity of large, labeled datasets, a disconnect between personality psychology and natural language processing (NLP), and issues with model validity and interpretability.

A recent doctoral thesis by Matej Gjurković, titled "A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media", takes on these challenges directly. The research highlights the critical role of language in personality theory and assessment, moving beyond traditional methods to leverage the vast amounts of text data available online.

New Datasets from Reddit

The research identifies Reddit as a promising, yet underutilized, data source for personality analysis. Reddit’s user anonymity, diverse discussion topics across thousands of subreddits, and the sheer volume of user-generated text offer a rich environment for extracting personality cues. To overcome the limitations of existing datasets, two new datasets were developed: MBTI9k and PANDORA.

The MBTI9k dataset is based on the Myers-Briggs Type Indicator (MBTI) and contains over 22 million comments from more than 13,000 users. Users’ MBTI types were primarily identified through ‘flairs’ – self-declared labels in subreddits. The PANDORA dataset expands on this by integrating the more scientifically accepted Big Five personality model scores, along with crucial demographic information like gender, age, and location, from over 10,000 users and 17 million comments. These datasets are significant because they provide a large-scale, demographically enriched resource for personality research, allowing for a deeper understanding of how language, demographics, and personality interact.
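To make the flair-based labeling concrete, here is a minimal sketch of how an MBTI type might be pulled out of a flair string. The function name `mbti_from_flair`, the regular expression, and the rule of discarding flairs that mention more than one type are all assumptions made for illustration; the thesis's actual extraction procedure may differ.

```python
import re

# The 16 MBTI types are every combination of I/E, N/S, T/F and P/J.
MBTI_PATTERN = re.compile(r"\b([IE][NS][TF][PJ])\b", re.IGNORECASE)


def mbti_from_flair(flair):
    """Return the MBTI type named in a flair, or None if absent or ambiguous.

    Hypothetical helper: the ambiguity rule below is an assumption,
    not the procedure used to build MBTI9k.
    """
    found = {match.upper() for match in MBTI_PATTERN.findall(flair or "")}
    # Keep the label only when exactly one distinct type is mentioned.
    return found.pop() if len(found) == 1 else None


if __name__ == "__main__":
    for flair in ["INTP | 5w4", "intj", "ENFP or INFP?", "just lurking"]:
        print(repr(flair), "->", mbti_from_flair(flair))
```

A conservative rule like this trades coverage for label quality: flairs that name two types are dropped rather than guessed at.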

Insights from Data Analysis

Analysis of these datasets revealed unique personality distributions among Reddit users. For instance, MBTI9k showed a dominance of introverted, intuitive, thinking, and perceiving types, suggesting that Reddit users tend to be more educated and intellectually engaged than the general population. PANDORA indicated that the average Reddit user has higher openness and lower extraversion, agreeableness, and conscientiousness compared to the general population. Demographic data also showed that most users are from English-speaking countries, with males being slightly younger and scoring lower on agreeableness than females.

The research also explored various linguistic and behavioral features for personality prediction. It found that different features are relevant for different personality dimensions. For example, social and family-related words are strong indicators of Extraversion, while complex and longer words are associated with Intuitive types. Temporal patterns, such as posting habits on certain days of the week or months, also showed correlations with personality traits.
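As a rough illustration of the kind of linguistic features described above, the sketch below computes average word length and the share of social or family-related words in a text. The word list `SOCIAL_WORDS` is a tiny hypothetical stand-in, not a validated lexicon, and the feature names are assumptions rather than the thesis's own.

```python
import re

# Hypothetical, deliberately tiny word list used only for illustration.
SOCIAL_WORDS = {"friend", "friends", "family", "party", "mom", "dad", "sister", "brother"}


def simple_features(text):
    """Compute two toy features: mean word length and social-word ratio."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"avg_word_len": 0.0, "social_ratio": 0.0}
    social = sum(1 for word in words if word in SOCIAL_WORDS)
    return {
        "avg_word_len": sum(len(word) for word in words) / len(words),
        "social_ratio": social / len(words),
    }


if __name__ == "__main__":
    print(simple_features("My friends and family"))
```

In practice such features would be aggregated over all of a user's comments before being passed to a classifier or regressor.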

Addressing Bias and Improving Prediction

Experiments demonstrated the potential for predicting Redditor personalities using relatively simple features. A key finding was the importance of considering demographic variables as confounding factors. For example, a gender classification model trained on Reddit data showed biases, misclassifying individuals with certain personality trait combinations more frequently. This highlights the need for comprehensive psychodemographic profiles to detect and mitigate biases in machine learning models, especially since Reddit data is often used to train conversational AI.
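One simple way to surface this kind of bias is to compare a classifier's error rate across personality groups. The sketch below assumes you already have true labels, predicted labels, and a group tag per user; the function name, the group tags, and the toy data are hypothetical and are not results from the thesis.

```python
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Return the misclassification rate for each group label."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        errors[group] += int(truth != pred)
    return {group: errors[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Toy, made-up data: gender labels, predictions, and an agreeableness bucket.
    y_true = ["f", "f", "m", "m"]
    y_pred = ["f", "m", "m", "m"]
    groups = ["high_agreeableness", "low_agreeableness",
              "high_agreeableness", "low_agreeableness"]
    print(error_rate_by_group(y_true, y_pred, groups))
```

A large gap between groups would suggest that the classifier's errors are tied to personality, which is the pattern the thesis reports for gender classification.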

The study also explored predicting Big Five traits using MBTI data, leveraging the higher availability of MBTI labels. By treating this as a domain adaptation problem, the research showed that MBTI predictions could significantly assist in predicting Big Five traits, despite the MBTI’s questionable psychological validity.

The SIMPA Framework for Interpretable Assessment

To address the challenges of interpretability and validity, the thesis introduces the Statement-to-Item Matching Personality Assessment (SIMPA) framework. Inspired by the Realistic Accuracy Model (RAM), SIMPA breaks down personality assessment into four stages: relevance, availability, detection, and utilization, with a crucial feedback loop for iterative refinement.

SIMPA focuses on matching ‘Trait-Indicative Statements’ (TISes) – user-generated text reflecting personality – with ‘Trait-Relevant Statements’ (TRSes) – validated questionnaire items. This approach aims to provide transparent and explainable personality assessments by explicitly linking textual cues to specific personality traits at granular levels (nuances, facets, domains).
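The matching idea can be sketched in a few lines: score a user statement against each questionnaire item and keep the best match along with its trait. The example below uses a bag-of-words cosine similarity purely as a stand-in; the similarity measure, the function names, and the two example items are assumptions for illustration, and SIMPA's actual matching and relevance scoring are more sophisticated.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match_statement(statement, items):
    """Return (trait, item_text, score) for the item most similar to the statement.

    `items` is a list of (trait, item_text) pairs.
    """
    vec = bag_of_words(statement)
    scored = [(trait, text, cosine(vec, bag_of_words(text))) for trait, text in items]
    return max(scored, key=lambda entry: entry[2])


if __name__ == "__main__":
    # Illustrative questionnaire-style items, not the thesis's TRS set.
    items = [
        ("Extraversion", "I am the life of the party"),
        ("Neuroticism", "I get stressed out easily"),
    ]
    print(match_statement("I am always the life of every party I go to", items))
```

Because every prediction traces back to a specific matched item, the output can be shown to a reader as evidence, which is where the interpretability comes from.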

The implementation of SIMPA explored different types of TRSes, including items from the IPIP-NEO questionnaire, expert-crafted statements (eTRSes), and statements generated by Large Language Models (LLMs). Expert-crafted eTRSes proved more effective in matching social media language, and while LLMs could generate and evaluate TRSes, their performance was comparable to an average human annotator, suggesting the continued need for expert oversight.

In supervised personality assessment, incorporating SIMPA’s TIS-based features significantly improved the state-of-the-art prediction accuracy for all Big Five traits on the PANDORA dataset, particularly for Extraversion. The framework also showed promise for unsupervised personality assessment, demonstrating convergent and discriminant validity.

Future Directions

While the research acknowledges limitations, such as the inherent biases in social media data and the simplifications made in the SIMPA implementation, it lays a strong foundation for future work. The SIMPA framework is highly versatile and can be applied to various constructs beyond personality, especially those with complex taxonomies and layered cues. Future research will focus on refining the relevance scoring function, leveraging advanced LLMs for more effective TIS detection, and developing agentic workflows for dynamic adaptation.

This work represents a significant step towards developing automated personality assessment systems that combine advanced NLP technologies with the rigorous scientific standards of personality psychology, offering more interpretable, valid, and efficient tools for understanding human personality from digital footprints.
