TLDR: Large Language Models (LLMs) struggle to accurately recognize and align with fine-grained human emotions, often missing contextual cues. A new dataset, EXPRESS, with self-disclosed emotions from Reddit, was used to evaluate 14 LLMs. While few-shot learning showed some improvement, Chain-of-Thought prompting did not help. Experts even slightly preferred LLM predictions over self-disclosed ones in some cases, highlighting the complexity and subjectivity of emotion recognition.
Large Language Models (LLMs) have become highly capable at understanding natural language, leading to their increased use in fields like mental health research. While these models can classify emotions into broad categories, a new study reveals a significant gap in their ability to align with the fine-grained, nuanced emotions that humans express.
A recent research paper, titled “Fluent but Unfeeling: The Emotional Blind Spots of Language Models,” delves into this critical issue. Authored by Bangzhao Shu, Isha Joshi, Melissa Karnaze, Anh C. Pham, Ishita Kakkar, Sindhu Kothe, Arpine Hovasapian, and Mai ElSherief, the study introduces a novel benchmark to evaluate LLMs’ emotional intelligence.
The Challenge of Fine-Grained Emotion
Current methods for evaluating LLMs often rely on limited, predefined emotion categories or crowdsourced annotations, which can be unreliable and miss the subtle complexities of human feelings. Furthermore, most benchmarks focus on short sentences, overlooking the rich context found in longer emotional disclosures.
To address these limitations, the researchers developed EXPRESS (EXperiences and PRocessed Emotions in Self-disclosure Stories). This unique dataset comprises 33,679 human experiences and their self-disclosed emotions, curated from Reddit communities. What makes EXPRESS stand out is its inclusion of 251 fine-grained emotion labels, far exceeding the typical handful of emotions found in other datasets. The average post length is also significantly longer, at 259 words, providing ample context for emotional understanding.
How LLMs Were Evaluated
The study employed a comprehensive evaluation framework. LLMs were tasked with predicting masked emotion words within the Reddit posts. To move beyond simple lexical matches, the predicted and self-disclosed emotions were decomposed into 10 dimensions grounded in established emotion theory: the eight basic emotions of Plutchik’s Wheel (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) plus two sentiment dimensions (positive and negative).
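To make the 10-dimension decomposition concrete, here is a minimal Python sketch. It assumes each emotion word is looked up in a lexicon and turned into a binary vector. The names `DIMENSIONS`, `TOY_LEXICON`, and `to_vector` are hypothetical, and the word-to-dimension assignments are illustrative guesses rather than values from the paper or any published lexicon.

```python
# Dimension order: eight Plutchik basic emotions, then two sentiment dimensions.
DIMENSIONS = (
    "anger", "anticipation", "disgust", "fear", "joy",
    "sadness", "surprise", "trust", "positive", "negative",
)

# Hypothetical toy lexicon: these assignments are illustrative assumptions,
# not entries taken from the paper or from a real emotion lexicon.
TOY_LEXICON = {
    "furious": {"anger", "negative"},
    "angry": {"anger", "negative"},
    "anxious": {"fear", "anticipation", "negative"},
    "grateful": {"joy", "trust", "positive"},
}


def to_vector(word):
    """Return a 10-dimensional binary tuple for an emotion word.

    Words missing from the toy lexicon map to an all-zero vector.
    """
    active = TOY_LEXICON.get(word.lower().strip(), set())
    return tuple(1 if dim in active else 0 for dim in DIMENSIONS)


print(to_vector("grateful"))
```

Under this toy lexicon, "furious" and "angry" share the same vector, so a prediction can miss the exact word while still matching at the basic emotion and sentiment level.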
Fourteen prevalent language models were tested, including masked language models (like RoBERTa and Longformer), Seq2Seq models (Flan-T5 family), and causal language models (Llama3.1, Gemma2, GPT-3.5-turbo, and GPT-4o). These models were evaluated under various prompt settings: zero-shot (no examples), few-shot (with random or similar examples), and Chain-of-Thought (CoT) prompting, where models are instructed to reason step-by-step.
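The three prompt settings can be sketched roughly as follows. This is only an illustration: the `<mask>` placeholder, the instruction wording, and the word-overlap similarity used to pick "similar" examples are all assumptions, since the article does not give the paper's actual templates or similarity measure.

```python
MASK = "<mask>"  # hypothetical placeholder for the hidden emotion word


def jaccard(a, b):
    """Word-overlap similarity, a simple stand-in for a real similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def pick_similar(post, pool, k):
    """Select the k (text, emotion) examples from the pool most similar to the post."""
    return sorted(pool, key=lambda ex: jaccard(post, ex[0]), reverse=True)[:k]


def build_prompt(post, examples=(), chain_of_thought=False):
    """Assemble a zero-shot, few-shot, or CoT prompt for one masked post."""
    parts = [f"Predict the emotion word hidden behind {MASK} in the post."]
    for text, emotion in examples:  # an empty sequence gives the zero-shot setting
        parts.append(f"Post: {text}\nEmotion: {emotion}")
    parts.append(f"Post: {post}")
    if chain_of_thought:
        parts.append("Think step by step about the context, then give one emotion word.")
    else:
        parts.append("Emotion:")
    return "\n\n".join(parts)


# Made-up posts for illustration only.
pool = [
    ("I failed my exam and feel <mask>", "devastated"),
    ("My dog learned a trick and I feel <mask>", "proud"),
]
post = "I failed my driving test and feel <mask>"
print(build_prompt(post, examples=pick_similar(post, pool, k=1)))
```

Passing no examples gives the zero-shot prompt, passing randomly drawn examples gives the random few-shot setting, and `chain_of_thought=True` swaps the answer cue for a reasoning instruction.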
Key Findings: Emotional Blind Spots Revealed
The results highlighted that accurately predicting human self-disclosed emotions remains a significant challenge for LLMs. Lexical accuracy (exact word match) and vector accuracy (match at the basic emotion/sentiment level) were generally low, indicating that models often fail to align with human emotions, even at a fundamental level.
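The difference between the two metrics is easy to see in a small Python sketch. It assumes that vector accuracy counts a hit only when the predicted and self-disclosed words map to exactly the same set of dimensions; the paper's precise definition may differ. The mapping `TOY_VECTORS` and the example words are made up for illustration.

```python
def lexical_accuracy(predicted, disclosed):
    """Share of posts where the predicted word equals the self-disclosed word."""
    hits = sum(p.lower() == d.lower() for p, d in zip(predicted, disclosed))
    return hits / len(disclosed)


def vector_accuracy(predicted, disclosed, to_vector):
    """Share of posts where both words map to the same emotion/sentiment vector."""
    hits = sum(to_vector(p) == to_vector(d) for p, d in zip(predicted, disclosed))
    return hits / len(disclosed)


# Hypothetical mapping from words to active dimensions (illustrative only).
TOY_VECTORS = {
    "furious": frozenset({"anger", "negative"}),
    "angry": frozenset({"anger", "negative"}),
    "anxious": frozenset({"fear", "negative"}),
    "grateful": frozenset({"joy", "positive"}),
    "heartbroken": frozenset({"sadness", "negative"}),
}


def toy_to_vector(word):
    """Look up a word; raises KeyError for words outside the toy mapping."""
    return TOY_VECTORS[word.lower()]


predicted = ["angry", "anxious", "grateful"]
disclosed = ["furious", "anxious", "heartbroken"]

print(lexical_accuracy(predicted, disclosed))
print(vector_accuracy(predicted, disclosed, toy_to_vector))
```

In this toy data, one of three predictions matches lexically, while two of three match at the vector level, because "angry" and "furious" share the same dimensions.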
Several factors influenced performance:
- Model Architecture and Size: Masked language models, despite their smaller size, performed comparably to some larger causal language models like GPT-4o, likely due to their specialization in mask-filling tasks. Generally, performance improved as model size increased within causal language model families.
- Chain-of-Thought (CoT) Prompting: Surprisingly, CoT prompting consistently worsened performance across all tested models. This suggests that for subjective and context-sensitive tasks like emotion recognition, step-by-step reasoning might lead models to over-rely on prior knowledge rather than the specific contextual cues provided.
- Few-Shot Learning: In contrast, few-shot prompting, especially when provided with examples of similar emotional experiences, significantly improved emotion recognition. This indicates that LLMs can sharpen their emotion predictions when given relevant in-context examples.
The error analysis revealed that LLMs frequently overused certain emotion words like ‘anxious,’ ‘grateful,’ and ‘frustrated,’ while humans expressed these feelings with more diverse and subtle terms. Models also tended to reduce emotional intensity, predicting ‘angry’ instead of ‘furious,’ for example.
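One simple way to surface this kind of overuse is to compare word counts between model predictions and self-disclosed labels. The sketch below is not the paper's analysis, just a rough approximation of the idea; the function names and the word lists are made up for illustration.

```python
from collections import Counter


def overused_words(predicted, disclosed):
    """Words predicted more often than they were self-disclosed, with the surplus count."""
    pred_counts = Counter(w.lower() for w in predicted)
    gold_counts = Counter(w.lower() for w in disclosed)
    # Counter subtraction keeps only words with a positive difference.
    return (pred_counts - gold_counts).most_common()


def vocabulary_size(words):
    """Number of distinct emotion words, a crude proxy for lexical diversity."""
    return len({w.lower() for w in words})


# Made-up example data.
predicted = ["anxious", "anxious", "anxious", "angry", "grateful"]
disclosed = ["anxious", "uneasy", "apprehensive", "furious", "grateful"]

print(overused_words(predicted, disclosed))
print(vocabulary_size(predicted), vocabulary_size(disclosed))
```

Here the predictions collapse three distinct human terms into "anxious" and soften "furious" to "angry", so the predicted vocabulary is smaller than the disclosed one.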
Human Experts Weigh In
In a fascinating qualitative analysis, emotion experts were asked to judge whether the LLM-predicted emotion or the human self-disclosed emotion was more plausible in a given context. Unexpectedly, experts slightly preferred the LLM’s predicted emotions over the self-disclosed ones in some instances. However, the agreement among coders was low, underscoring the inherent subjectivity and complexity of human emotion. This suggests that while LLMs can generate theoretically consistent emotions, they sometimes miss subtle contextual cues that are intuitive to humans.
Implications for the Future
The EXPRESS dataset and the evaluation framework provide crucial insights into the limitations of LLMs in fine-grained emotion alignment. This research is particularly valuable for the development of emotion-aware AI systems, especially in sensitive applications like mental health support tools. The findings emphasize the importance of using ecologically valid, self-disclosed emotions as benchmarks and training material to improve model alignment with human experiences.
While LLMs show promise in learning emotional intelligence, further research is needed to address their contextual understanding and biases. Full details are available in the research paper, “Fluent but Unfeeling: The Emotional Blind Spots of Language Models.”


