ABSTRACT
Facial emotion recognition (FER) plays a critical role in nonverbal
communication, human-computer interaction, and mental health applications. This
study introduces the Convolutional Neural Network-10 (CNN-10) model, designed
specifically for robust facial expression analysis, and compares its performance
with Vision Transformers (ViT), VGG19, and InceptionV3 on widely used datasets,
including CK+, FER-2013, and JAFFE. The CNN-10 model leverages dynamic
feature selection and local-global context integration to achieve state-of-the-art
performance, with accuracy scores of 99.9% on CK+, 84.3% on FER-2013, and
95.4% on JAFFE, outperforming traditional models. Through innovative
architecture and data augmentation, the study addresses challenges such as
incomplete face detection, image diversity, and lighting variability. Moreover, the
proposed CNN-10 model demonstrates robustness in recognizing both subtle and
complex facial expressions, benefiting from hierarchical feature learning and
reduced computational overhead compared to traditional deep learning models.
Additionally, the integration of ViT introduces a novel attention mechanism that
enhances global context understanding, though it remains computationally more
demanding. Future work includes improving temporal emotion recognition,
enhancing model interpretability, and optimizing the framework for real-time
deployments.
TABLE OF CONTENTS
CHAPTER NO.  TITLE

             ABSTRACT
             LIST OF FIGURES
             LIST OF ABBREVIATIONS
1            INTRODUCTION
             1.1 Evolution of Facial Emotion Recognition
             1.2 Computer Vision and Machine Learning
             1.3 Deep Learning Revolution
             1.4 Challenges in Real-World Implementation
2            LITERATURE REVIEW
3            PROPOSED SYSTEM
             3.1 Block Diagram
             3.2 Face Detection and Preprocessing
             3.3 CNN-10 Model for Emotion Recognition
             3.4 Emotion Classification
             3.5 Post-Processing and Output
4            CONCLUSION
             REFERENCES
LIST OF FIGURES
FIGURE NO.  FIGURE NAME

3.1         Block diagram of facial emotion recognition
LIST OF ABBREVIATIONS
AI Artificial intelligence
BNN Binary Neural Network
CNN Convolutional Neural Network
DL Deep Learning
DLSTA Deep Learning Assisted Semantic Text Analysis
EEG Electroencephalography
FER Facial Emotion Recognition
FLN Fast Learning Network
GSR Galvanic Skin Response
HOG Histogram of Oriented Gradients
KNN K-Nearest Neighbour
LBP Local Binary Pattern
LDA Linear Discriminant Analysis
ML Machine Learning
NLP Natural language processing
PCA Principal Component Analysis
RAF-DB Real-world Affective Face Database
ReLU Rectified Linear Unit
ResNet50 Residual Network
RNNs Recurrent Neural Networks
SVM Support Vector Machine
VGG-16 Visual Geometry Group
ViT Vision Transformer
CHAPTER 1
INTRODUCTION
Facial expressions are a universal and powerful means of non-verbal
communication, transcending cultural and linguistic boundaries. They provide a
window into an individual's emotions, often revealing more about their feelings and
intentions than spoken words. This intrinsic human ability to interpret emotions
from facial cues plays a critical role in everyday interactions, influencing social
bonds, decision-making, and the perception of trustworthiness. In recent decades,
the importance of this non-verbal form of communication has extended into the
realm of technology, leading to significant advancements in Facial Emotion
Recognition (FER). FER is an interdisciplinary field that intersects psychology,
computer vision, and artificial intelligence (AI), focusing on the automation of
emotion detection from facial expressions.
The increasing prevalence of artificial intelligence in our daily lives has
created a demand for systems that can interact with humans in a more natural and
empathetic way. FER systems aim to bridge this gap by enabling machines to
"read" and respond to human emotions, mimicking human-like interactions. This
capability is particularly critical in applications like healthcare, security, marketing,
education, and entertainment. For instance, in healthcare, FER is used to monitor
patients' emotional states, particularly those unable to articulate their feelings, such
as children or individuals with cognitive impairments. It also aids in diagnosing
mental health conditions like depression or anxiety. In security, FER can be applied
in surveillance systems to detect suspicious behaviors by analyzing facial cues.
Marketing strategies often rely on FER to gauge consumer reactions to
advertisements or products, allowing businesses to tailor their offerings more
effectively. Similarly, in education, FER enhances virtual learning environments by
identifying students' levels of engagement or frustration, enabling adaptive
teaching methods.
1.1 Evolution of Facial Emotion Recognition
The evolution of Facial Emotion Recognition (FER) has been a dynamic
journey, progressing through different stages marked by advancements in
technology, algorithms, and data availability. The recognition of human emotions
from facial expressions is one of the most critical aspects of human-computer
interaction (HCI), social robotics, and psychological studies. It allows machines to
interpret emotional cues, enabling more natural and intuitive interactions with
users. The evolution of FER has transformed from early, basic methods to
sophisticated deep learning-based systems. Here is a comprehensive look at the key
stages in the development of FER.
In the early years, before the digital era of deep learning, FER systems were
highly dependent on psychological theories, particularly Paul Ekman’s model of
basic emotions. Ekman identified six universal emotions: happiness, sadness,
anger, fear, surprise, and disgust. This model became the foundation of early FER
systems, where the focus was on identifying these emotions based on facial
expressions. Initially, FER relied on manually defined features to classify
emotions. Facial landmarks, such as the positions of the eyes, eyebrows, nose, and
mouth, were extracted using rudimentary techniques. These features were assumed
to provide enough information to distinguish between different emotional states.
For example, a smile might indicate happiness, while furrowed brows might
signify anger. These systems, however, were highly dependent on manual input
and had limited accuracy. The approaches were sensitive to variations in lighting,
facial orientation, and the subject's appearance. They were also unable to handle
the complexity of subtle facial expressions and posed a challenge in terms of
scalability. In this period, FER research mainly focused on psychophysiological
studies and basic computational models, lacking the computational power
necessary for large-scale image analysis. The primary challenge was to understand
how facial expressions related to specific emotional states and how to translate this
understanding into a computational model.
1.2 Computer Vision and Machine Learning
The early 2000s marked a significant shift in the development of FER as
computer vision techniques became more sophisticated. At this point, researchers
began utilizing more advanced image processing algorithms to automatically detect
and extract facial features from images. This was a significant leap forward
compared to earlier manual methods. Technologies like Principal Component
Analysis (PCA) and Linear Discriminant Analysis (LDA) became widely used for
feature extraction.
PCA was used to reduce the dimensionality of facial images, enabling the
identification of the most significant features from large datasets of facial
expressions. LDA, on the other hand, was employed to maximize the separation
between different emotional states, improving classification accuracy. With the
advent of these methods, machine learning algorithms such as Support Vector
Machines (SVM), k-Nearest Neighbors (k-NN), and decision trees began to play a
crucial role in FER systems. These algorithms were employed to classify different
facial expressions based on the features extracted using PCA, LDA, or similar
techniques. The models were trained using labeled emotion datasets, such as the
Cohn-Kanade (CK+) dataset, which provided a rich source of annotated images
with facial expressions and corresponding emotional labels. While these methods
showed improvements over manual feature extraction, they were still limited in
handling variations such as facial occlusion, changes in lighting conditions, and
dynamic facial expressions. Moreover, they relied heavily on predefined features
and human-designed algorithms, which made the systems less adaptable to new or
unseen data.
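To make this classical pipeline concrete, the following minimal sketch applies PCA for dimensionality reduction and an SVM for emotion classification using scikit-learn. The random arrays X and y are hypothetical stand-ins for flattened face images and emotion labels, and the component count and kernel are illustrative choices rather than settings reported in the studies discussed here.

```python
# Hedged sketch of the classical PCA + SVM pipeline described above.
# X: (n_samples, n_pixels) flattened grayscale face images; y: integer emotion labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 48 * 48))        # placeholder data standing in for face images
y = rng.integers(0, 6, size=200)      # placeholder labels for six basic emotions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# PCA keeps the most significant components; the SVM separates the emotion classes.
model = make_pipeline(PCA(n_components=50), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```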
1.3 Deep Learning Revolution
The 2010s saw the arrival of deep learning, which revolutionized the field of
FER. This shift marked a move away from traditional feature extraction methods to
automated feature learning, where deep learning models such as Convolutional
Neural Networks (CNNs) could automatically learn to detect and classify facial
expressions from raw image data. CNNs are a class of deep neural networks
designed for processing grid-like data, such as images. They can learn hierarchical
features by using convolutional layers, pooling layers, and fully connected layers.
CNN-based models dramatically improved the accuracy of FER systems by
eliminating the need for manual feature extraction. These models automatically
learned complex features directly from pixel values in images, enabling them to
handle a broader variety of facial expressions. This approach also proved effective
in identifying subtle facial expressions that earlier systems struggled to recognize.
Large-scale datasets, such as FER2013, containing over 35,000 facial images with
labeled emotions, played a crucial role in the advancement of deep learning-based
FER systems. These datasets provided the volume of labeled data necessary to train
CNNs and allowed for more robust and accurate recognition across diverse facial
expressions, lighting conditions, and facial variations. One of the significant
innovations in this era was the development of deep transfer learning. This
approach involves training a neural network on a large, diverse dataset and then
fine-tuning it for a specific task with a smaller dataset. Transfer learning allowed
FER systems to leverage pre-trained networks, such as VGGNet, ResNet, and
Inception, which had already been trained on large image databases like ImageNet.
These networks were then adapted to recognize facial expressions, significantly
reducing the amount of labeled data required for effective training.
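As a hedged illustration of this transfer-learning idea (not the exact procedure of any particular study), the sketch below loads an ImageNet-pretrained VGG16 backbone in TensorFlow/Keras, freezes it, and attaches a small emotion-classification head; the input size, class count, and head design are assumptions made for the example.

```python
# Hedged sketch of deep transfer learning for FER: reuse an ImageNet-pretrained
# backbone and train only a small emotion-classification head on top of it.
import tensorflow as tf

NUM_EMOTIONS = 7  # e.g. angry, disgust, fear, happy, sad, surprise, neutral

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # train_ds/val_ds are hypothetical
```

Freezing the backbone keeps the labeled-data requirement small; unfreezing the top convolutional blocks for a second, low-learning-rate pass is a common refinement.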
Despite these advancements, deep learning-based FER systems still faced
challenges. They were computationally expensive, requiring high-performance
GPUs to train deep neural networks. Additionally, these models were prone to
overfitting, especially when trained on small or unbalanced datasets, and struggled
with real-time applications due to their high computational demand. Another
important development is the use of attention mechanisms, particularly the
Transformer architecture, which enables models to focus on the most relevant
regions of the face when processing facial expressions. These models have shown
promising results in enhancing the accuracy and efficiency of FER systems.
1.4 Challenges in Real-World Implementation
The real-world implementation of Facial Emotion Recognition (FER)
systems faces several challenges, despite advancements in technology. One of the
primary challenges is the variability in facial expressions. Human expressions can
differ significantly due to cultural background, personal habits, and emotional
intensity. A smile, for example, may indicate happiness in one culture but could be
interpreted differently in another. Subtle emotions, such as frustration or confusion,
are often harder to detect, especially when the expression is faint or masked by
other emotions. Additionally, certain facial expressions may overlap, making it
difficult to distinguish between similar emotions, such as anger and disgust, or joy
and surprise. FER systems must account for this variability by training on diverse
datasets that capture a wide range of cultural and emotional contexts. However,
even with diverse datasets, the system may struggle with recognizing less obvious
emotions or emotional changes over time, especially when the expressions are
fleeting or ambiguous.
Another significant challenge is the impact of environmental factors and
lighting conditions. Poor lighting can obscure important facial features, making it
difficult for the system to accurately detect facial expressions. For instance, dim or
uneven lighting can cast shadows on the face, while extreme lighting conditions
like backlighting or overly bright settings can distort or wash out key features. In
real-world scenarios, individuals may be captured under a variety of lighting
conditions, such as natural light, artificial lights, or in low-light environments.
These variations complicate the FER task, as the system must be able to adapt to
these environmental differences. Additionally, background clutter and other
distractions, such as moving objects or the presence of other people, can interfere
with facial feature detection and expression recognition. Therefore, FER systems
must incorporate techniques to filter out noise and focus on the relevant facial
features, which requires advanced image processing capabilities.
CHAPTER 2
LITERATURE REVIEW
Ajay B S & Rao (2021) proposed real-time emotion detection on an edge computing device to detect passenger anomalies, exploring the integration of facial
emotion recognition (FER) systems in public transportation to enhance passenger
safety. The authors highlight the significance of recognizing human emotions
through facial expressions, which can serve as a critical tool for immediate
assistance in emergencies, particularly in shared mobility contexts like taxis and
rideshares. The study emphasizes the limitations of traditional FER methods in
uncontrolled environments and proposes a novel approach using a Binary Neural
Network (BNN) combined with Local Binary Pattern (LBP) preprocessing,
implemented on an FPGA platform for real-time performance. The results indicate
an accuracy of 81% on the JAFFE dataset and 75% in real-time applications,
showcasing the potential of this technology to identify passenger anomalies
effectively. The findings suggest that this system could significantly improve safety
protocols in connected mobility networks, addressing a gap in current public
transportation safety measures.
Wang et al. (2019) present a novel approach to emotion recognition using
EEG signals, emphasizing the transition from traditional methods to bio-signal
assisted techniques. The authors highlight the significance of emotions in human
cognition and behavior, noting that EEG is a fast, non-invasive method with
excellent temporal resolution for monitoring brain activity. The study utilizes the
DEAP dataset, which includes recordings from 32 participants reacting to music-
video clips, and aims to improve emotion detection accuracy through a multi-
channel fused processing method using convolutional neural networks (CNNs).
The results indicate a mean classification accuracy of 83.88%, showcasing the
effectiveness of the proposed hardware design in real-time emotion detection. The
authors also discuss the challenges of subject-independent emotion recognition and
suggest the need for custom preprocessing engines to enhance accuracy in future
work.
Vincentius Haposan et al. (2024) delve into the advancements in Facial
Expression Recognition (FER), transitioning from traditional edge detection
techniques to sophisticated machine learning approaches. It highlights the
effectiveness of hybrid models combining Convolutional Neural Networks (CNNs)
and Recurrent Neural Networks (RNNs) for enhanced emotion detection,
particularly in video data. The study introduces four previously unexplored
emotions (amusement, enthusiasm, awe, and liking) using the Emotion Wearable
Dataset 2020, which enriches the emotional spectrum in FER research. The custom
CNN-RNN model developed in this study achieved an accuracy of 63%, while the
MobileNetV2-RNN and InceptionV3-RNN models reached 59% and 66%
accuracy, respectively. The implications of this research extend across various
fields, including healthcare and psychology, emphasizing the importance of
understanding human emotional responses in digital interactions and real-world
applications.
Emmanuel Gbenga Dada et al. (2023) explore the advancements in facial
emotion recognition through deep learning techniques. Various studies have been
conducted in this domain, highlighting the effectiveness of different models,
including dynamic deep learning features and custom techniques for recognizing
facial expressions. Notably, the authors evaluated their CNN-10 model on the CK+, FER-2013, and JAFFE datasets, achieving accuracy rates of 99.9%, 84.3%, and 95.4%, respectively, showcasing its robustness compared to other
architectures like VGG19 and InceptionV3. However, limitations such as the small
dataset size and challenges in distinguishing between certain emotions were noted
in previous works. The paper emphasizes the need for improved methods to
address these challenges and enhance classification accuracy, contributing to the
ongoing research in emotion recognition and its applications in human-computer
interaction and cognitive computing. The findings indicate that while significant
progress has been made, further advancements are necessary to refine the models
and their performance in diverse and complex datasets.
Sowmya B J et al. (2023) explore the multifaceted nature of emotions,
defining them as reflections or actualizations of feelings, which can be genuine or
simulated. It highlights the significance of emotions across various fields, including
psychology, healthcare, biomedical engineering, and neurology, emphasizing the
growing interest in emotion recognition within biomedical engineering. Recent
advancements in this area have focused on motion detection and automated
detection of psychological disorders, utilizing diverse tools such as multimodal
approaches, galvanic skin response (GSR), electroencephalography (EEG), facial
expressions, and visual scanning actions. The paper also notes the increasing
application of deep learning techniques, particularly in image grading, which have
shown promising results in emotion recognition tasks. The research presented in
this paper contributes to the understanding of how emotions can be quantified and
analyzed, paving the way for further developments in emotion recognition
technologies.
Manimohan P et al. (2024) propose a robust facial emotion detection system
that leverages advanced deep learning techniques, specifically Convolutional
Neural Networks (CNNs) and transfer learning. This system is designed to classify
facial expressions into five distinct emotional categories: neutral, surprise, sad,
happy, and angry. To enhance the model's performance, the authors utilized a
diverse dataset of over 10,000 annotated facial images and implemented data
augmentation techniques, such as rotation, translation, and flipping, to improve
training robustness. They also fine-tuned a pre-trained CNN model, ResNet50, to
leverage its learned features, which significantly contributed to the system's
accuracy. The proposed system is evaluated under various conditions, including
different lighting, poses, and facial occlusions, demonstrating its potential for real-
time emotion monitoring in applications like human-computer interaction and
mental health assessment. The authors acknowledge the challenges faced, such as
variability in emotional expression and the need for diverse datasets, while
emphasizing the system's capability to enhance user experiences and improve the
understanding of human emotions in practical settings.
Rajesh K M et al. (2016) proposed a robust method for real-time face
recognition and emotion detection using support vector machines (SVMs). For face
recognition, they used the Fisherface algorithm which combines principal
component analysis (PCA) and linear discriminant analysis (LDA) to achieve high
accuracy. For emotion detection, they extracted facial features like eyes, nose, lips,
and face contour using the open-source software dlib, and then used SVM
classifiers to predict emotions like happy, sad, angry, fear, disgust, and surprise.
The authors tested their algorithms on publicly available datasets like CK, CK+,
and IMM, and reported good detection accuracy and low processing time. They
also discussed the implementation using OpenCV and Python, and mentioned the
potential for future work like integrating the system into Android platforms to
improve accessibility.
Fatemeh Shahrabi Farahani et al. (2013) proposed a new fuzzy-based
method for emotion recognition from facial features like eye opening, mouth
opening, eye opening/width ratio, and mouth width. They used a combination of
different color spaces like YCbCr, L*a*b, and HSV to detect the face, eyes, and
mouth regions. Employing Mamdani-type fuzzy rules, the facial attributes were
mapped to the emotion space. Evaluating the proposed method on the Ebner's
facial expression database, the authors achieved an average accuracy of 78.8% in
recognizing six basic emotions: happiness, neutrality, sadness, disgust, fear, and
anger. Compared to other methods like LBP, SVM, AdaBoost, and Boosted-LBP,
the proposed fuzzy-based approach showed better performance in recognizing
happy, fear, and sad emotions, but struggled with neutral and disgust emotions.
Rohit Pathar et al. (2019) proposed a Convolutional Neural Network (CNN)
model to recognize human facial emotions from the FER2013 dataset. They
explored different network architectures, including a shallow CNN with one
convolutional layer and a deep CNN with 8 convolutional layers. The shallow
model achieved an accuracy of 47.94%, while the deep model reached an
impressive accuracy of 89.98%. The authors observed that the deep model tended
to overfit the training data, and they used techniques like dropout to combat
overfitting. The results were visualized through graphs, confusion matrices, and
real-time implementation on a webcam, demonstrating the model's ability to
accurately recognize multiple facial emotions simultaneously. The authors suggest
that with more training data and computational resources, the model could
potentially achieve state-of-the-art accuracies of around 95%.
Kanagaraju P et al. (2022) used the VGG16 architecture and trained it on the
FER-2013 dataset, which contains over 35,000 facial expression images across 6
emotion classes. After training and validation, the proposed VGG16 model
achieved a validation accuracy of 64.52%. The authors analyzed the confusion
matrix to identify which emotions were easier or more difficult to distinguish. The
results showed the model performed well on recognizing neutral, happy, sad, angry,
and surprised expressions, but had more difficulty with the fear emotion class. The
authors also discussed plans for future work, including exploring other deep
learning and machine learning algorithms to potentially improve the accuracy
further. Additionally, they aim to develop a user-friendly web application that can
play solution videos based on the detected emotions (e.g. playing a calming video
if sadness or anger is detected). Overall, this work demonstrates the feasibility of
using a CNN-based approach for real-time facial emotion recognition with
reasonable accuracy.
Renuka S Deshmukh et al. (2017) explore the necessary characteristics of
the training dataset, feature representations and machine learning algorithms for a
facial emotion recognition system that operates reliably in more realistic
conditions. A new database called the Realworld Affective Face Database (RAF-
DB) is presented, which contains about 30,000 diverse facial images from social
networks. Crowdsourcing results suggest that real-world expression recognition is
a typical imbalanced multi-label classification problem, and the balanced, single-
label datasets currently used in the literature could potentially lead the research into
misleading algorithmic solutions. To address this, a deep learning architecture
called DeepEmo is proposed, which learns highly effective high-level feature
representations for recognizing real-world facial expressions.
Arpita Gupta et al. (2020) propose a deep learning framework for facial
emotion recognition that combines convolutional neural networks (CNN), residual
networks (ResNet), and an attention mechanism. The proposed model outperforms
existing CNN-based networks, achieving 85.76% training accuracy and 64.40%
testing accuracy on the FER dataset. The key components of the model are: 1)
Convolutional layers with 3x3 kernels and 32-64 filters, 2) Residual connections to
address the vanishing gradient problem, and 3) An attention block to provide better
visual perceptibility. The authors claim the proposed model has greater
applicability for real-time facial emotion detection applications.
Jia Guo (2021) proposes a Deep Learning Assisted Semantic Text Analysis
(DLSTA) model for detecting human emotions from text analysis of big data. The
model effectively integrates both semantic and syntactic features of the text to
improve the efficiency of emotion detection, leveraging natural language
processing (NLP) techniques and word embeddings. These embeddings are widely
used in NLP tasks such as sentiment analysis and machine translation, enhancing
the model's ability to understand and classify emotions. Experimental results show
that the DLSTA model achieves a significantly higher emotion detection rate of
97.22% and classification accuracy of 98.02%, outperforming existing approaches.
This demonstrates the model’s effectiveness in detecting human emotions from
textual data. Additionally, the integration of both semantic and syntactic features
allows the model to handle complex textual data, making it more accurate and
reliable for emotion analysis in large datasets. However, the model's reliance on
high-quality, labeled datasets and the computational cost may limit its applicability
in real-time or large-scale environments.
Majid Razaq Mohamed Alsemawi et al. (2023) present a system for
recognizing human facial emotions from images using the Histograms of Oriented
Gradients (HOG) feature extraction technique and the Fast Learning Network
(FLN) algorithm for classification. The system was evaluated using the Yale Faces
database, which contains 165 facial emotion images across 11 different
expressions. By extracting HOG features from preprocessed facial images and
feeding them into the FLN classifier, the system achieved an accuracy of 95.04%
and precision, recall, F-measure, and G-mean of 72.73% using 500 hidden neurons
in the FLN. The results demonstrate the effectiveness of the FLN algorithm for
facial emotion recognition, showing a significant improvement over traditional
machine learning techniques. The system's ability to process facial images
efficiently highlights its potential for real-world applications, particularly in
environments where accurate facial emotion recognition is crucial, such as in
human-computer interaction and security systems. However, the approach could be
further optimized by testing it on a wider variety of facial emotion datasets to
ensure robustness across different settings and populations.
Amit Pandey et al. (2022) propose a facial emotion detection and recognition
approach based on a convolutional neural network (CNN) model that effectively
extracts facial features for emotion classification. The method processes training
sample images by directly inputting pixel values and applies background removal
to improve the accuracy of emotion classification. This approach involves several
stages, including face detection, feature extraction, and expression classification,
utilizing techniques such as support vector machines and neural networks. The
paper reviews various facial emotion datasets and compares the performance of the
proposed method with existing approaches, noting that it outperformed them,
particularly in handling non-frontal or angularly captured photos. The model's
ability to accurately classify emotions from non-frontal facial images makes it
promising for real-world applications, where such images are common. However,
challenges remain in optimizing the model for diverse environmental conditions
and in enhancing its real-time performance for large-scale applications.
CHAPTER 3
PROPOSED SYSTEM
The proposed system for facial emotion recognition (FER) leverages
advanced deep learning models to detect and classify human emotions from facial
expressions. The system utilizes the Convolutional Neural Network-10 (CNN-10)
architecture, known for its efficiency and high accuracy in facial feature extraction,
alongside the Vision Transformer (ViT), which enhances the model's ability to
understand the global context of facial expressions. By integrating both models, the
system aims to improve emotion recognition performance under diverse conditions,
such as varying lighting, facial orientations, and partial occlusions. This system is
designed to handle complex datasets like CK+, FER-2013, and JAFFE, which
include diverse facial expressions from various subjects. It includes preprocessing
steps such as image normalization, data augmentation, and cross-validation to
enhance robustness. The system is aimed at real-time applications, particularly in
areas like mental health monitoring, human-computer interaction, and behavioral
analysis. The system employs CNN-10 and Vision Transformer (ViT) models to classify emotions based on facial expressions. The
proposed system addresses challenges like image diversity, partial occlusions, and
varying lighting by combining CNN-10's dynamic feature selection, which focuses
on key facial regions such as the eyes and mouth, with ViT's attention mechanisms
for capturing global and local interactions across facial regions. This approach
ensures robust and scalable performance, supported by data augmentation to
enhance model generalization. Evaluated on standard datasets (CK+, FER-2013,
and JAFFE), the system demonstrates superior accuracy, achieving 99.9%, 84.3%,
and 95.4%, respectively, outperforming traditional models like VGG19 and
InceptionV3. These results highlight the system's potential for applications in
human-computer interaction, mental health monitoring, and emotion-driven AI
systems.
3.1 Block Diagram
Facial emotion recognition is a critical application of deep learning in areas
like human-computer interaction, mental health assessment, and behavioral
analysis. The proposed system utilizes advanced models such as CNN-10 and
Vision Transformer (ViT) to extract local and global features from facial
expressions, ensuring accurate and robust classification of emotions. The process is
enhanced with data augmentation techniques and a systematic workflow, as
detailed in the block diagram below.
Figure 3.1 Block diagram of facial emotion recognition
The block diagram comprises the following key components:
1. Input Dataset: Images of facial expressions are collected from standard
datasets like CK+, FER-2013, and JAFFE. These datasets contain labeled
expressions, such as happiness, sadness, anger, and fear.
2. Data Augmentation: Enhances the dataset by applying transformations such
as rotation, scaling, flipping, and cropping. This increases diversity and
makes the model resilient to variations in lighting, orientation, and
occlusions (a minimal augmentation sketch follows this list).
3. Feature Extraction: the CNN-10 module focuses on extracting local spatial
features from specific facial regions like the eyes, mouth, and nose, while the
ViT module captures global context and relationships between facial regions
using attention mechanisms for a holistic understanding of expressions.
4. Classification: The extracted features are passed through classification
layers, which assign the emotion to predefined categories (e.g., happiness,
anger, sadness, etc.).
5. Output Predictions: The system outputs the recognized emotion for each
input image, ready for use in applications such as emotion-aware systems or
cognitive computing.
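The augmentation transforms named in step 2 can be realized with standard library utilities. The following minimal sketch, assuming TensorFlow/Keras preprocessing layers and placeholder 48x48 grayscale inputs, is illustrative rather than the exact pipeline used in the evaluated system.

```python
# Minimal sketch of the augmentation transforms listed in step 2 (rotation,
# scaling/zoom, flipping, and cropping). Parameters are illustrative assumptions.
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),   # rotations up to roughly 36 degrees
    tf.keras.layers.RandomZoom(0.2),       # mild scaling in and out
    tf.keras.layers.RandomCrop(44, 44),    # random crop from the 48x48 input
    tf.keras.layers.Resizing(48, 48),      # restore the expected input size
])

images = tf.random.uniform((8, 48, 48, 1))  # placeholder batch of face crops
augmented = augment(images, training=True)  # random transforms are active in training mode
print(augmented.shape)                      # (8, 48, 48, 1)
```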
3.2 Face Detection and Preprocessing
1. Face Detection:
Face detection is the first critical step in the system. Various algorithms can
be used for this task, but for the proposed system, a reliable face detection
technique (e.g., Haar cascades) is employed to locate the face in the image.
Once detected, the system isolates the face region for further analysis; a
minimal sketch of this step and the normalization in step 3 appears after this list.
2. Face Alignment:
After detecting the face, alignment techniques ensure that the face is
normalized and properly oriented. This helps the model focus on the facial
landmarks (eyes, eyebrows, nose, mouth) rather than irrelevant background
features. This step is essential to handle variations in head poses.
3. Image Normalization:
The system normalizes the image to standardize lighting conditions and
reduce any noise that may affect the emotion recognition process.
Techniques such as histogram equalization and contrast adjustment are
applied to enhance the facial features.
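A minimal sketch of detection and normalization is given below, assuming OpenCV and its bundled frontal-face Haar cascade; the input filename and the 48x48 output size are placeholders, and landmark-based alignment (step 2) is omitted for brevity.

```python
# Hedged sketch of face detection (Haar cascade) and image normalization
# (grayscale, histogram equalization, resize), as described in steps 1 and 3.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("sample_face.jpg")              # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face = gray[y:y + h, x:x + w]                  # isolate the detected face region
    face = cv2.equalizeHist(face)                  # normalize lighting and contrast
    face = cv2.resize(face, (48, 48))              # standard input size for the classifier
    cv2.imwrite("face_preprocessed.png", face)
```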
3.3 CNN-10 Model for Emotion Recognition
The core of the proposed system is the CNN-10 model, which is a custom
convolutional neural network architecture specifically designed for facial emotion
recognition. CNN-10 is a simplified yet powerful network that efficiently extracts
relevant facial features to classify emotions. CNN-10 is designed with 10
convolutional layers, making it deeper than typical shallow networks but less
complex than very deep models like ResNet or VGG. The layers in CNN-10
progressively learn complex representations of the input facial images. The key
features of CNN-10 include the following (a hedged code sketch follows the list):
1. Convolutional Layers: These layers detect local features in the input image,
such as edges, shapes, and textures. The convolutional layers gradually build
more abstract and complex representations of the face.
2. Activation Function: The ReLU activation function is used after each
convolutional layer to introduce non-linearity and allow the model to learn
more complex patterns in the data.
3. Batch Normalization: Batch normalization is employed to stabilize the
training process and improve the convergence rate of the model by
normalizing the activations of each layer.
4. Pooling Layers: Max pooling is used to down-sample the feature maps and
reduce the spatial dimensions of the input, preserving only the most
significant features. This reduces computational complexity and allows the
model to generalize better.
5. Fully Connected Layers: The output of the convolutional layers is passed
through a series of fully connected layers. These layers integrate the learned
features and perform the final classification.
6. Output Layer: The final layer is a softmax layer that outputs the probability
distribution for each emotion class. Based on the highest probability, the
system classifies the facial expression into one of the predefined emotional
categories such as happiness, sadness, anger, fear, disgust, or surprise.
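The following Keras sketch assembles a ten-convolutional-layer network with the components listed above; the filter counts, kernel sizes, dense-layer width, and six-class output are illustrative assumptions, not the published CNN-10 configuration.

```python
# Illustrative sketch of a 10-convolutional-layer CNN with ReLU, batch
# normalization, max pooling, fully connected layers, and a softmax output.
import tensorflow as tf

def conv_block(filters, n_convs):
    """n_convs convolution+batch-norm+ReLU stages followed by one max pooling."""
    layers = []
    for _ in range(n_convs):
        layers += [
            tf.keras.layers.Conv2D(filters, 3, padding="same"),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.ReLU(),
        ]
    layers.append(tf.keras.layers.MaxPooling2D())
    return layers

model = tf.keras.Sequential(
    [tf.keras.Input(shape=(48, 48, 1))]
    + conv_block(32, 2) + conv_block(64, 2)      # 4 convolutional layers
    + conv_block(128, 3) + conv_block(256, 3)    # 6 more, 10 in total
    + [
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(6, activation="softmax"),  # six emotion classes
    ]
)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Grouping the convolutions into blocks with pooling between them mirrors the hierarchical feature learning described above: early blocks capture edges and textures, later blocks capture higher-level facial structure.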
3.4 Emotion Classification
The CNN-10 model is trained on labeled datasets such as CK+, FER-2013,
and JAFFE to recognize facial emotions. During training, the network learns to
associate specific facial feature patterns with different emotions. The classification
process works as follows:
1. Feature Extraction: CNN-10 extracts hierarchical features from the face
image, starting from basic edges and textures to more complex facial
features like the shape of the mouth or the furrow of the brow.
2. Emotion Prediction: After passing through several layers of convolution and
pooling, the network produces a final output in the form of class
probabilities. The model assigns a label (emotion) to the facial expression
based on the highest probability (see the prediction sketch after this list).
3. Emotion Classes: The system classifies the facial expression into one of the
following primary emotional categories:
Happy
Sad
Anger
Fear
Disgust
Surprise
The model is designed to handle subtle variations in facial expressions,
recognizing both obvious and nuanced emotional states.
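As a hedged illustration of the prediction step, the helper below maps a softmax probability vector to an emotion label; it assumes a trained Keras classifier such as the Section 3.3 sketch and a preprocessed 48x48 grayscale face crop, both of which are placeholders here.

```python
# Hedged sketch of the prediction step: softmax class probabilities from the
# trained network are mapped to the highest-probability emotion label.
import numpy as np

EMOTIONS = ["Happy", "Sad", "Anger", "Fear", "Disgust", "Surprise"]

def predict_emotion(model, face):
    """face: preprocessed 48x48 grayscale crop scaled to [0, 1];
    model: a trained Keras classifier such as the Section 3.3 sketch."""
    batch = face[np.newaxis, :, :, np.newaxis].astype("float32")  # shape (1, 48, 48, 1)
    probabilities = model.predict(batch, verbose=0)[0]            # softmax output
    index = int(np.argmax(probabilities))
    return EMOTIONS[index], float(probabilities[index])

# Example usage (assuming `model` and a preprocessed `face` array are available):
# label, confidence = predict_emotion(model, face)
# print(f"Predicted emotion: {label} ({confidence:.2%})")
```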
3.5 Post-Processing and Output
Once the emotion has been classified, the system proceeds to the post-
processing module. This component provides the user with the following outputs:
1. Emotion Label: The predicted emotion (e.g., Happy, Sad, Angry).
2. Confidence Score: The probability associated with the predicted emotion,
indicating the confidence of the model in its classification.
3. Emotion Visualization: The system can generate a graphical representation
of the detected emotion, possibly with a color-coded scale or emotional
heatmap. This output can be displayed on a graphical user interface (GUI)
for the end-user, which is especially useful in applications such as customer
feedback analysis, psychological studies, or human-computer interaction.
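One possible rendering of these outputs, assuming matplotlib and a placeholder probability vector, is sketched below: it prints the emotion label with its confidence score and draws a simple bar-chart visualization of the class probabilities.

```python
# Hedged sketch of the post-processing output: predicted label, confidence
# score, and a bar-chart visualization of all class probabilities.
import matplotlib.pyplot as plt
import numpy as np

EMOTIONS = ["Happy", "Sad", "Anger", "Fear", "Disgust", "Surprise"]
probabilities = np.array([0.72, 0.05, 0.08, 0.04, 0.03, 0.08])  # placeholder softmax output

label = EMOTIONS[int(np.argmax(probabilities))]
confidence = float(probabilities.max())
print(f"Emotion label: {label}  |  Confidence score: {confidence:.2%}")

plt.bar(EMOTIONS, probabilities, color="steelblue")
plt.ylabel("Predicted probability")
plt.title(f"Detected emotion: {label} ({confidence:.0%})")
plt.tight_layout()
plt.show()
```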
CHAPTER 4
CONCLUSION
Facial Emotion Recognition (FER) has become an essential technology for
understanding human emotions based on facial expressions. As an integral aspect
of non-verbal communication, facial expressions provide valuable insights into a
person’s emotional state, which can enhance interactions between humans and
machines. The development of FER systems, especially those based on deep
learning models like Convolutional Neural Networks (CNNs), has significantly
improved the accuracy and applicability of emotion detection. The Convolutional
Neural Network-10 (CNN-10) architecture, designed specifically for facial emotion
recognition, offers a balanced approach by combining computational efficiency
with high accuracy. CNN-10 simplifies the standard CNN architecture, making it
suitable for real-time applications that require both speed and precision. This model
is optimized to focus on crucial facial regions such as the eyes and mouth, which
are the most expressive areas for detecting emotions, while maintaining a
lightweight structure that reduces computational overhead.
By leveraging this advanced architecture, FER systems are better equipped
to handle challenges such as varying lighting conditions, head poses, and
occlusions. Additionally, CNN-10 improves the robustness of emotion recognition
across diverse datasets, ensuring that the model generalizes well to real-world
conditions. The system's ability to classify emotions such as happiness, sadness,
anger, fear, disgust, and surprise with high accuracy positions CNN-10 as a
valuable tool in a variety of applications. The potential applications of facial
emotion recognition are vast: in healthcare, it can assist in monitoring mental
health by detecting signs of stress or depression; in marketing, it can gauge
consumer reactions to products; and in human-computer interaction, it can enhance
user experiences by enabling systems to respond to emotional cues. Furthermore,
ethical considerations around privacy and consent in the use of emotion data need
to be addressed as FER systems become more integrated into everyday life.
REFERENCES
1. Ajay B. S. & Rao, M. (2021) “Binary neural network-based real-time emotion
detection on an edge computing device to detect passenger anomaly”, 34th
International Conference on VLSI Design and 20th International Conference
on Embedded Systems (VLSID), Vol. 42, No. 1, pp. 421–425.
2. Amit Pandey et al. (2022) “Facial emotion detection and recognition”,
International Journal of Engineering Applied Sciences and Technology, Vol.
7, No. 1, pp. 176–179.
3. Arpita Gupta et al. (2020) “Deep self-attention network for facial emotion
recognition”, Computer Science, Vol. 171, No. 1, pp. 1527–1534.
4. Emmanuel Gbenga Dada et al. (2023) “Facial emotion recognition and
classification using the convolutional neural network-10 (CNN-10)”, Research
Article, Vol. 42, No. 1, pp. 421–425.
5. Fatemeh Shahrabi Farahani et al. (2013) “A fuzzy approach for facial
emotion recognition”, 13th Iranian Conference on Fuzzy Systems (IFSC),
Vol. 42, No. 1, pp. 421–425.
6. Jia Guo (2021) “Deep learning approach to text analysis for human emotion
detection from big data”, Journal of Intelligent Systems, Vol. 31, No. 1, pp.
113–126.
7. Kanagaraju et al. (2022) “Emotion detection from facial expression using
image processing”, International Journal of Health Sciences, Vol. 6, No. S6,
pp. 1368–1379.
8. Majid Razaq Mohamed Alsemawi et al. (2023) “Emotions recognition from
human facial images based on fast learning network”, Indonesian Journal of
Electrical Engineering and Computer Science, Vol. 30, No. 3, pp. 1478–
1487.
9. Manimohan et al. (2024) “Human emotion detection using CNN and transfer
learning”, International Research Journal on Advanced Engineering and
Management, Vol. 2, No. 5, pp. 1365–1371.
10. Rajesh K. M. et al. (2016) “A robust method for face recognition and face
emotion detection system using support vector machines”, International
Conference on Electronics, Communication, Computer Technology
(ICEECCOT), Vol. 42, No. 1, pp. 421–425.
11. Renuka S Deshmukh et al. (2017) “Facial emotion recognition system
through machine learning approach”, International Conference on Intelligent
Computing and Control Systems (ICICCS), Vol. 1, No. 1, pp. 421–425.
12. Rohit Pathar et al. (2019) “Human emotion recognition using convolutional
neural network in real time”, 1st International Conference on Innovations in
Information and Communication Technology (ICIICT), Vol. 1, No. 1, pp.
421–425.
13. Sowmya, B. J. et al. (2023) “Machine learning model for emotion detection
and recognition using an enhanced convolutional neural network”, Research
Article, Vol. 42, No. 1, pp. 421–425.
14. Vincentius Haposan et al. (2024) “Detection of human emotions through
facial expressions using hybrid convolutional neural network-recurrent
neural network algorithm”, Intelligent Systems with Applications, Vol. 21,
No. 4, pp. 200339.
15. Wang et al. (2019) “Design of intelligent EEG system for human emotion
recognition with convolutional neural network”, IEEE International
Conference on Artificial Intelligence Circuits and Systems (AICAS), Vol. 42,
No. 1, pp. 421–425.