From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models

1Lake Wales High School, 2Thomas Jefferson Senior High School, 3Bellarmine College Preparatory

Figure: Datasets used in this project.

What is FER?

Human communication relies heavily on facial expressions to convey emotions, intentions, and reactions without verbal cues. Facial expression recognition (FER) technology capitalizes on this natural form of communication by analyzing visual media and key facial features, such as the eyes, eyebrows, and mouth, to determine an individual's emotional state. FER encompasses three core stages: face detection, feature extraction, and expression classification, which together distinguish between seven major emotions (anger, disgust, fear, happiness, neutrality, sadness, and surprise).
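As a concrete illustration, the sketch below wires these three stages together, assuming OpenCV's bundled Haar cascade for face detection and a pluggable `classify_expression` model standing in for feature extraction and classification; both choices are illustrative rather than the tools used in this study.

```python
import cv2

# Stage 1: face detection, here with OpenCV's bundled Haar cascade (illustrative choice).
FACE_DETECTOR = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutrality", "sadness", "surprise"]

def recognize_expressions(image_path, classify_expression):
    """Run the three FER stages on one image.

    `classify_expression` stands in for stages 2-3 (feature extraction and
    expression classification); any trained model mapping a face crop to one
    of the seven EMOTIONS can be plugged in here.
    """
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = FACE_DETECTOR.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    labels = []
    for (x, y, w, h) in faces:
        face_crop = gray[y:y + h, x:x + w]              # isolate the detected face
        labels.append(classify_expression(face_crop))   # one of the seven EMOTIONS
    return labels
```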

270% increase in FER usage over the past four years.

Abstract

This study addresses the racial biases in facial expression recognition (FER) systems within Large Multimodal Foundation Models (LMFMs). Despite advances in deep learning and the availability of diverse datasets, FER systems often exhibit higher error rates for individuals with darker skin tones.

Existing research predominantly focuses on traditional FER models (CNNs, RNNs, ViTs), leaving a gap in understanding racial biases in LMFMs. We benchmark four leading LMFMs (GPT-4o, PaliGemma, Gemini, and CLIP) to assess their performance in facial emotion detection across racial demographics. A linear classifier trained on CLIP embeddings achieves accuracies of 95.9% on RADIATE, 90.3% on Tarr, and 99.5% on Chicago Face.
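A minimal sketch of such a linear probe on frozen CLIP embeddings, assuming OpenAI's open-source `clip` package and scikit-learn's logistic regression; the backbone choice and the train/test split supplied by the caller are assumptions rather than the study's exact setup.

```python
import clip                       # OpenAI's open-source CLIP package (assumed)
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # backbone choice is illustrative

def embed(image_paths):
    """Encode images with the frozen CLIP image encoder."""
    with torch.no_grad():
        batch = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
        feats = model.encode_image(batch)
        return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

def linear_probe_accuracy(train_paths, train_labels, test_paths, test_labels):
    """Fit a linear classifier on frozen embeddings and report test accuracy."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embed(train_paths), train_labels)
    return probe.score(embed(test_paths), test_labels)
```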

Furthermore, we find that Anger is misclassified as Disgust 2.1 times more often for Black Females than for White Females. This study highlights the need for fairer FER systems and establishes a foundation for developing unbiased, accurate FER technologies.
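The 2.1x figure is a ratio of per-group confusion rates; the hedged sketch below shows how such a ratio can be computed from per-group predictions (function names and the grouping variables are illustrative, not the study's code).

```python
from sklearn.metrics import confusion_matrix

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutrality", "sadness", "surprise"]

def confusion_rate(y_true, y_pred, actual, predicted):
    """P(model predicts `predicted` | ground truth is `actual`) within one group."""
    cm = confusion_matrix(y_true, y_pred, labels=EMOTIONS)
    i, j = EMOTIONS.index(actual), EMOTIONS.index(predicted)
    return cm[i, j] / cm[i].sum()

def confusion_ratio(group_a, group_b, actual="anger", predicted="disgust"):
    """How much more often a given confusion occurs for group A than for group B.

    `group_a` and `group_b` are (y_true, y_pred) label lists for each
    demographic group, e.g. Black Female vs. White Female images.
    """
    rate_a = confusion_rate(*group_a, actual, predicted)
    rate_b = confusion_rate(*group_b, actual, predicted)
    return rate_a / rate_b
```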

Related Works

Deep Learning Architectures: In recent years, researchers have turned to Convolutional Neural Networks (CNNs) and hybrid models such as CNN-LSTM (Long Short-Term Memory) for facial emotion recognition. These architectures are effective at extracting features from facial images, but they often require large amounts of labeled data for training and are prone to overfitting on limited datasets. Although LMFMs show potential due to their large-scale pretraining, we find that they struggle to capture subtle nuances in facial expressions.
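For reference, a minimal PyTorch sketch of a CNN-LSTM hybrid of the kind mentioned above, where the CNN extracts per-frame features and the LSTM models their order over time; the layer sizes and input resolution are illustrative assumptions, not the configurations from prior work.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """CNN extracts per-frame features; LSTM models their temporal order."""
    def __init__(self, num_emotions=7, hidden_size=128):
        super().__init__()
        self.cnn = nn.Sequential(                        # per-frame feature extractor
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),                                # -> 64 * 4 * 4 features per frame
        )
        self.lstm = nn.LSTM(64 * 4 * 4, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_emotions)

    def forward(self, frames):                           # frames: (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                        # logits over the seven emotions

# Example: a batch of 8 clips, each with 16 grayscale 48 x 48 frames.
logits = CNNLSTM()(torch.randn(8, 16, 1, 48, 48))
```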

Dataset Selection: Factors such as illumination, noise, and blur can impair FER performance. Additionally, upscaling low-resolution FER datasets to the 224 x 224 pixel input size expected by many neural networks often results in detail loss and reduced classification accuracy. To ensure a fair evaluation of LMFMs, we employ high-resolution, uniform datasets.
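As an illustration of this resizing concern, a short sketch that prepares face images at the 224 x 224 input size common to vision backbones; the library and interpolation choices are assumptions rather than the study's exact preprocessing.

```python
import warnings
from PIL import Image

TARGET_SIZE = (224, 224)  # input resolution expected by many vision backbones

def prepare_face_image(path):
    """Load a face image and resize it to 224 x 224 for model input."""
    image = Image.open(path).convert("RGB")
    if min(image.size) < min(TARGET_SIZE):
        # Upscaling a low-resolution crop only interpolates pixels; it cannot
        # recover detail, which is what degrades classification accuracy.
        warnings.warn(f"{path} is below 224 x 224 and will be upscaled")
    return image.resize(TARGET_SIZE, Image.BICUBIC)
```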