TLDR: A new study auditing Facial Emotion Recognition (FER) datasets and models reveals two critical issues: a significant number of posed expressions in datasets claiming to be ‘in-the-wild,’ which can lead to inaccurate real-world performance predictions; and a concerning racial bias where FER models frequently misclassify non-white individuals and those with darker skin tones as displaying negative emotions, even when they are smiling or neutral. The research highlights the potential for real-world harm and calls for a re-evaluation of FER applications, suggesting a shift towards understanding facial expressions as social communication rather than indicators of inner emotional states.
Facial Emotion Recognition (FER) algorithms are designed to classify human facial expressions into emotions like happiness, sadness, or anger. These algorithms hold promise for various applications, particularly in human-computer interaction. However, a recent audit of state-of-the-art FER datasets and models has brought to light significant challenges related to data collection practices and inherent biases.
One major hurdle facing FER algorithms is a drop in performance when detecting spontaneous, real-world expressions compared to posed, intentional ones. This discrepancy matters because many datasets, despite claiming to contain “in-the-wild” images, actually include a substantial number of posed expressions. The study found that 46.5% of images in AffectNet and 35.3% in RAF-DB, two widely used FER datasets, were posed. As a result, benchmark scores on these datasets may overstate how well models will perform when deployed in real-life scenarios, where spontaneous expressions are more common.
To address the challenge of identifying posed expressions, the researchers proposed a new methodology. This method draws on existing work, such as identifying genuine smiles by specific facial muscle movements (e.g., raised cheeks), and introduces new criteria for non-smiling expressions. These criteria include recognizing actors in movie scenes, identifying plain, mono-color backgrounds often used for stock images, and observing subjects looking directly at the camera in very well-lit, artificial environments. While individual factors might not be conclusive, a combination of these elements can indicate a high likelihood of a posed image.
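To make the combination-of-cues idea concrete, here is a minimal sketch of how such a rule-based check might look, assuming per-image boolean cues are already available (e.g., from manual review or auxiliary classifiers). The field names and the two-cue threshold are illustrative assumptions, not the paper's exact procedure:

```python
from dataclasses import dataclass

@dataclass
class ImageAnnotation:
    """Illustrative per-image cues; in practice these might come from
    manual review or auxiliary classifiers (assumed inputs)."""
    smiling: bool
    cheeks_raised: bool      # AU6-style cue for a genuine (Duchenne) smile
    known_movie_scene: bool  # actor performing in a film still
    plain_background: bool   # mono-color backdrop typical of stock images
    direct_gaze: bool        # subject looking straight into the camera
    studio_lighting: bool    # bright, artificial illumination

def likely_posed(ann: ImageAnnotation, min_cues: int = 2) -> bool:
    """Flag an image as likely posed when enough weak cues co-occur.

    No single cue is conclusive on its own; a smile without raised
    cheeks counts as one cue (a non-genuine smile).
    """
    cues = [
        ann.smiling and not ann.cheeks_raised,
        ann.known_movie_scene,
        ann.plain_background,
        ann.direct_gaze and ann.studio_lighting,
    ]
    return sum(cues) >= min_cues
```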
Beyond performance, a critical ethical concern for FER algorithms is their tendency to perform worse for people of certain racial groups and skin tones. Prior research has indicated that facial recognition algorithms often show reduced accuracy for individuals with darker skin tones. This study extends that concern to emotion recognition, conducting a comprehensive fairness audit of two state-of-the-art FER models trained on AffectNet and RAF-DB, respectively. The models were tested on the FairFace dataset, which provides balanced race labels.
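An audit of this kind can be expressed as a simple evaluation loop. The sketch below assumes a `model.predict(path) -> str` interface and a FairFace-style table with `image_path` and `race` columns; these names, and the set of emotions counted as negative, are assumptions for illustration rather than the authors' code:

```python
import pandas as pd

# Emotion labels treated as negative for the audit (illustrative set).
NEGATIVE = {"anger", "sadness", "disgust", "fear", "contempt"}

def negative_rate_by_race(model, fairface: pd.DataFrame) -> pd.DataFrame:
    """Run one FER model over FairFace images and tabulate, per race
    group, how often it predicts a negative emotion."""
    preds = fairface["image_path"].map(model.predict)  # one label per image
    df = fairface.assign(negative=preds.isin(NEGATIVE))
    return df.groupby("race")["negative"].mean().rename("negative_rate").to_frame()
```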
The findings revealed a concerning racial bias. The audited FER models were significantly more likely to predict negative emotions, such as anger or sadness, for individuals labeled as non-white or determined to have darker skin, even when those individuals were smiling or had a neutral expression. For instance, across both models, among the samples assigned a negative prediction, 23.4% of those labeled White were actually smiling, compared to 33% for Black, 35.6% for East Asian, and 37.7% for Southeast Asian individuals. Similar trends were observed for neutral faces being misclassified as showing negative emotions, and the bias was more pronounced for darker skin tones.
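Continuing the sketch above, the reported disparity corresponds to a simple conditional rate: among samples the model labeled with a negative emotion, the share that were actually smiling, broken down by race group. A boolean `smiling` column (e.g., from manual annotation or a smile detector) is an assumed input:

```python
def smiling_given_negative(df: pd.DataFrame) -> pd.Series:
    """Among negatively-predicted samples, the fraction that were actually
    smiling, per race group. Assumes boolean `negative` and `smiling`
    columns, e.g., produced alongside the audit loop above."""
    return df[df["negative"]].groupby("race")["smiling"].mean()

# Rates rising from roughly 0.23 (White) to 0.38 (Southeast Asian)
# would correspond to the disparity reported in the study.
```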
The presence of such biases in FER models carries serious real-world implications. In social contexts, human emotion perception, though flawed, can lead to significant consequences; for example, legal actors may issue harsher sentences to defendants whose natural facial expressions are perceived as angry. Similarly, societal biases can influence judgments, as seen in schools where Black children are more frequently perceived as angry than white children. If FER technology is deployed in applications like automated interviews or crowd security without addressing these biases, it could perpetuate and amplify existing societal harms.
The researchers strongly encourage a re-evaluation of how FER technology is framed. Instead of viewing it as a tool to reveal innermost emotional states, they suggest adopting Fridlund’s Behavioral Ecology Theory, which posits that facial expressions are socially motivated and performative. This perspective would position FER technology as a tool for understanding intentionally presented social cues, making it better suited for communication applications rather than high-stakes security or evaluative contexts.
The challenges highlighted in this audit underscore the difficulties in collecting and annotating large-scale machine learning datasets without inadvertently incorporating social and cultural biases. The study serves as a crucial reminder for FER researchers to be more cautious about the framing and deployment of their technology, advocating for its use in ways that detect and transmit intentionally expressed social cues rather than inferring deep emotional states. For more details, you can refer to the full research paper here.


