Project Report Final - Speech Emotion Recognition
PROJECT REPORT ON
MAJOR PROJECT
Submitted By
Miss. KAVITHA
Department of CSE
CERTIFICATE
Certified that the project work entitled “SPEECH EMOTION RECOGNITION” is a bonafide
work carried out by ABIN K SHAJI, AKHIL ASOKAN, NASHAL AHMAD, VIGNESH
PRABHAKARAN, bearing USNs 4SH20CS003, 4SH20CS005, 4SH20CS041, and 4SH20CS070
respectively, in partial fulfilment of the VTU CBCS subject Major Project, and for the
award of degree of Bachelor of Engineering in Computer Science and Engineering of the
Visvesvaraya Technological University, Belagavi during the year 2023-2024. It is certified
that all corrections / suggestions indicated for Internal Assessment have been incorporated in
the report deposited in the departmental library. The project report has been approved as it
satisfies the academic requirements in respect of project work prescribed for the degree of
Bachelor of Engineering.
EXTERNAL VIVA
1.
2.
ACKNOWLEDGEMENT
Abin K Shaji
Akhil Asokan
Nashal Ahmad
Vignesh Prabhakaran
DECLARATION
We, Abin K Shaji, Akhil Asokan, Nashal Ahmad, and Vignesh Prabhakaran, bearing USNs 4SH20CS003, 4SH20CS005, 4SH20CS041, and 4SH20CS070 respectively, students of 8th semester Bachelor of Engineering, Computer Science and Engineering, Shree Devi Institute of Technology, Mangalore, declare that the project work entitled “Speech Emotion Recognition” has been duly executed by us under the guidance of Miss Kavitha, Asst. Professor, Department of Computer Science and Engineering, Shree Devi Institute of Technology, Mangalore, and submitted in fulfilment of the requirements of the 8th semester Major Project of Bachelor of Engineering in Computer Science and Engineering during the year 2023-2024.
Date:
Place: Mangalore
ABSTRACT
Speech emotion recognition has emerged as a critical research area with applications spanning human-computer interaction, affective computing, and mental health diagnostics. This project investigates the feasibility and efficacy of utilizing machine learning techniques to discern emotional states from speech signals. A comprehensive dataset encompassing a diverse range of emotions is employed to train and evaluate the models, ensuring robustness and generalizability. The performance of each model is assessed using metrics such as accuracy, precision, and recall. The project also explores the impact of factors such as language, gender, and cultural background on the accuracy of emotion recognition systems; insights gained from this analysis contribute to the development of more inclusive and adaptable models. Overall, the project provides valuable insights into the state-of-the-art techniques and challenges in emotion recognition through speech processing, and the findings lay the groundwork for future research endeavors aimed at enhancing the accuracy and applicability of such systems in real-world scenarios.
TABLE OF CONTENTS
1.1 Introduction
2.1 Objective
5 METHODOLOGY
6 RESULTS
7 CONCLUSION
REFERENCES
LIST OF FIGURES
6.8 Result 1
6.9 Result 2
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Speech is the most elementary form of human communication. To enrich interaction, one needs to recognize and understand the emotion of another person and know how to react to it. Unlike machines, we humans can naturally recognize the nature and emotion of speech. Can a machine also detect emotion from speech? This can be made possible using machine learning: machines need a specific model for detecting the emotion in speech, and such a model can be implemented using machine learning.
Speech emotion recognition is a very useful and important topic in today's world. A machine that detects the emotion in human speech can prove useful in various industries. A basic application is in the health sector, where it can be used to detect depression, anxiety, stress, etc. in a patient. It can also be used in domains such as crime investigation, where emotions recognized from speech can help distinguish between victims and criminals.
Emotions can be of various types such as happy, sad, angry, and disgusted, depending on the feeling and frame of mind of the person. In our study, we have used various datasets with different emotions. We have also combined four datasets into a single dataset and then applied it to the model so that the efficiency of the model can be improved and there is more variety in the data points. This has also helped eliminate overfitting in our model.
Despite significant progress in the field, speech emotion recognition still poses several
challenges. The variability and subjectivity inherent in human emotions, as well as the influence of
factors such as language, culture, and individual differences, present obstacles to building robust
and generalizable emotion recognition systems. Additionally, the presence of speech-related
phenomena such as background noise, speaker variability, and emotional masking further
complicates the task of emotion recognition.
In light of these challenges, this project aims to investigate the feasibility and efficacy of utilizing machine learning techniques for speech emotion recognition. The following limitations of the study are acknowledged:
1. Data Availability and Quality: One of the primary limitations of speech emotion recognition
studies is the availability and quality of annotated datasets. Limited access to diverse and well-
labeled datasets may restrict the generalizability of the findings and the robustness of the
developed models.
2. Subjectivity and Variability of Emotions: Emotions are inherently subjective and complex,
making their recognition from speech signals a challenging task. The variability in emotional
expression across individuals, cultures, and contexts introduces ambiguity and difficulty in
accurately categorizing emotions.
3. Speech Variability and Noise: Variability in speech characteristics, such as accent, pitch,
intonation, and speaking rate, can affect the performance of emotion recognition systems.
Additionally, the presence of background noise and environmental factors can further obscure
emotional cues in speech signals, leading to decreased accuracy.
4. Limited Emotional Range in Datasets: Many existing datasets for speech emotion recognition
focus on a limited range of basic emotions (e.g., happiness, sadness, anger), neglecting more
nuanced and complex emotional states. This limitation may restrict the applicability of the
developed models to real-world scenarios where emotions are multifaceted.
5. Speaker Dependency: Emotion recognition systems may exhibit bias or reduced accuracy when
confronted with speech from speakers who were not represented adequately in the training data.
Speaker-dependent models may struggle to generalize to new speakers or demographic groups,
limiting their practical utility.
CHAPTER 2
2.1 OBJECTIVE
The objective of this project is to develop a robust and accurate system for automatically
recognizing and classifying emotions from speech signals. The system will utilize machine learning
algorithms and signal processing techniques to analyze acoustic features of speech, such as pitch,
intensity, and spectral characteristics, in order to classify emotions into predefined categories such
as happiness, sadness, anger, and neutrality. The project aims to contribute to advancements in
human-computer interaction, affective computing, and the development of emotionally intelligent
systems.
The purpose of this study is to detect emotion from speech using machine learning algorithms. In detail, this document provides a general description of our project, including user requirements, product perspective, an overview of requirements, and general constraints. In addition, it also provides the specific requirements and functionality needed for this project, such as the interface, functional requirements, and performance requirements.
Automatic speech emotion recognition is an active research area in the field of human-computer interaction (HCI) with a wide range of applications. The features extracted in our project work are mainly statistics of pitch and energy as well as spectral features.
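As a brief illustration of how such acoustic features can be obtained, the following is a minimal sketch using the librosa library; the file name is a placeholder, and the frequency limits and feature set are illustrative rather than the exact configuration used in the project.

import numpy as np
import librosa

# Load an audio file (the path is a placeholder).
y, sr = librosa.load("speech_sample.wav", sr=22050)

# Frame-wise pitch estimate (fundamental frequency) using the YIN algorithm.
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)

# Energy-related feature: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)[0]

# Spectral features: centroid and bandwidth per frame.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]

# Simple statistics of each feature stream give a fixed-length,
# utterance-level feature vector.
features = np.array([
    f0.mean(), f0.std(),
    rms.mean(), rms.std(),
    centroid.mean(), centroid.std(),
    bandwidth.mean(), bandwidth.std(),
])
print(features)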
The ability to accurately recognize and classify emotions conveyed through speech is crucial for
various applications, including human-computer interaction, customer service, mental health
monitoring, and sentiment analysis in social media. However, existing speech emotion recognition
systems often face challenges such as limited accuracy, robustness to noise, and generalization
across different speakers and languages. Furthermore, there is a need for real-time processing
capabilities to enable seamless integration into interactive systems. Therefore, this project seeks to
address these challenges by developing a novel speech emotion recognition system that achieves
high accuracy, robustness, and efficiency across diverse speech samples and environmental
conditions.
CHAPTER 3
LITERATURE REVIEW
Md. Rayhan Ahmed et al. [1] used four deep neural network-based models built using LFABs. Model-A uses seven LFABs followed by FCN layers and a softmax layer for classification, Model-B uses LSTM and FCNs, Model-C uses GRU and FCNs, and Model-D combines the three individual models by adjusting their weights. From each audio file in the dataset, they hand-craft five categories of features, including MFCC, LMS, ZCR, and RMSE. These features are used as inputs to a one-dimensional (1D) CNN architecture to further extract hidden local features from the speech files. To obtain additional contextual long-term representations of the local features learned by the 1D CNN block, they extended their experiment by incorporating LSTM and GRU after the CNN block, which further improved accuracy. After applying data augmentation (DA), they observe that all four models perform very well on the SER task of detecting emotions from raw speech audio. Among the four models, the ensemble Model-D achieves a state-of-the-art weighted average accuracy of 99.46% on the TESS dataset.
A novel paradigm for emotion identification in the presence of noise and interference was put forward by Shibani Hamsa et al. [3]. To examine the speaker's emotions, their method takes into account the speaker's energy, time, and spectral factors. However, instead of the gammatone filter bank and short-time Fourier transform (STFT) frequently employed in the literature, they adopt a novel wavelet packet transform (WPT)-based cochlear filter bank. When tested on three different speech corpora in two different languages, their system, which combines this representation with a random forest classifier, performs better than other existing algorithms and is less prone to stress and noise. All metrics (accuracy, precision, recall, and F1 score) on the RAVDESS and SUSAS datasets score above 80%.
A data imbalance processing approach based on the selective interpolation synthetic minority oversampling technique (SISMOTE) is suggested by Zhen-Tao Liu et al. [4] to reduce the influence of sample imbalance on emotion identification outcomes. In order to remove redundant features with inadequate emotional representation, a feature selection approach based on analysis of variance and gradient boosted decision trees (GBDT) is also provided. The results of speech emotion recognition tests on the CASIA, Emo-DB, and SAVEE databases demonstrate that their technique achieves average recognition accuracies of 90.28% (CASIA), 75.00% (SAVEE), and 85.82% (Emo-DB), which is superior to some state-of-the-art methods.
Dr. Nilesh Shelke et al. [2] used the RAVDESS, TESS, and SAVEE datasets for classification. Their purpose is to drive the modernization of current approaches and technology enabling EDS and to provide assistance in all areas of computing and technology. Analytics complement the emotions extracted from the databases, layers, and model libraries created for emotion recognition from speech. The work mainly focuses on data collection, feature extraction, and the results of automatic emotion detection. An intermodal recognition system is preferred over a unimodal solution because it offers higher classification accuracy. Accuracy depends on the number of emotions detected, the features extracted, the classification method, and the stability of the database.
To accomplish efficient speech emotion identification, Apeksha Aggarwal et al. [5] presented two alternative feature extraction strategies. First, bidirectional feature extraction utilizing super convergence is presented to extract two sets of latent features from voice data, with Principal Component Analysis (PCA) used to produce the first set of features. The second method involves extracting the Mel spectrogram image from the audio file and feeding the 2D image into a pre-trained VGG-16 model. In this study, several algorithms are used in comprehensive experimentation and a rigorous comparative analysis of the feature extraction approaches across two datasets (RAVDESS and TESS). The RAVDESS dataset offered significantly higher accuracy.
A voice analysis-based emotion recognition system was proposed by Noushin Hajarolasvadi and Hasan Demirel [7]. In order to extract an 88-dimensional vector of audio characteristics, including Mel-frequency cepstrum coefficients (MFCC), pitch, and intensity for each frame, they first partition each audio signal into overlapping frames of identical duration. For every frame, a spectrogram is created concurrently. As the last preprocessing step, the k most representative frames are selected from each audio signal by applying k-means clustering to the extracted characteristics of all the frames. The corresponding series of spectrograms is then represented as a 3D tensor of keyframes. Instead of using the entire set of spectrograms corresponding to the speech frames, they selected the k best frames to represent the entire speech signal. They then compared the proposed 3D-CNN results with 2D-CNN results and demonstrated that the proposed method outperforms pre-trained 2D networks.
K. A. Darshan and Dr. B. N. Veerappa [11] document the development of speech emotion recognition systems using CNNs. They design a model that can recognize the emotion of an audio sample, and various parameters are changed to improve the accuracy of the model. The paper also aims to identify the factors that affect model accuracy and the key factors needed to improve model efficiency. It concludes with a discussion of various CNN architectures and the parameter settings needed to improve accuracy, as well as potential areas for improvement.
CHAPTER 4
SYSTEM REQUIREMENTS
• RAM: 8GB
• CPU: Intel Core i3
• Disk: Minimum 512GB
TECHNIQUE USED
The basic idea of machine learning is to provide training data to a learning algorithm. The learning algorithm then generates a new set of rules based on inferences from the data, in essence producing a new algorithm, formally referred to as the machine learning model. Instead of programming the computer at each step of the way, this approach gives the machine instructions that allow it to learn from data without new step-by-step instructions from the programmer. Several issues need to be considered when addressing AI, including socioeconomic effects; issues of transparency, bias, and accountability; new uses for data; considerations of security and safety; ethical issues; how AI enables the creation of new ecosystems; concerns regarding responsibility; and its potentially disruptive effects on social and economic structures.
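As a simple illustration of this workflow (with hypothetical toy data and scikit-learn assumed available; this is not the project's actual classifier), the sketch below trains a learning algorithm on labelled feature vectors and then uses the resulting model to classify a new sample.

from sklearn.ensemble import RandomForestClassifier

# Toy training data: each row is a feature vector, each label an emotion class.
X_train = [[0.2, 0.7], [0.9, 0.1], [0.3, 0.8], [0.8, 0.2]]
y_train = ["happy", "angry", "happy", "angry"]

# The learning algorithm infers a set of rules (the model) from the data.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

# The trained model can classify unseen samples without new
# step-by-step instructions from the programmer.
print(clf.predict([[0.25, 0.75]]))  # expected output: ['happy']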
CHAPTER 5
METHODOLOGY
In Speech Emotion Recognition (SER) methodology, data collection involves gathering diverse
speech recordings with labeled emotions, followed by preprocessing steps like feature extraction,
normalization, and segmentation. Model selection encompasses traditional machine learning or deep
learning models, or hybrid approaches. Training involves splitting data, training the chosen model(s),
and fine-tuning hyperparameters to prevent overfitting, while evaluation employs metrics like
accuracy and F1-score. Post-processing and performance tuning refine results, with deployment in
real-world applications following. Continual improvement considers ethical implications and
incorporates new research findings to enhance SER performance iteratively.
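As one concrete example of the preprocessing stage described above, the following minimal sketch (using a hypothetical, randomly generated feature matrix) normalizes each extracted feature to zero mean and unit variance.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row of extracted features per utterance.
rng = np.random.default_rng(0)
features = rng.random((100, 40))

# Scale every feature to zero mean and unit variance. In practice the scaler
# is fitted on the training split only and then reused on the test split,
# so that no information leaks from the evaluation data.
scaler = StandardScaler()
features_normalized = scaler.fit_transform(features)

print(features_normalized.mean(axis=0).round(3))  # approximately 0 for every feature
print(features_normalized.std(axis=0).round(3))   # approximately 1 for every feature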
Librosa Model
The recordings that we gathered were split into two sets, one for training data and another for test data to be used in the classification test. The models are trained with the Python TensorFlow library. Accuracy is then measured, and the most appropriate models are chosen for use in recognition.
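A minimal sketch of such a split, using placeholder arrays in place of the actual extracted features and labels, could look like the following.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: one MFCC sequence and one emotion label per recording.
rng = np.random.default_rng(0)
X = rng.random((200, 200, 40))       # (clips, frames, MFCC coefficients), hypothetical
y = rng.integers(0, 7, size=200)     # hypothetical emotion class indices

# Hold out 20% of the recordings as test data for the classification test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)   # (160, 200, 40) (40, 200, 40)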
In the implementation, features are extracted from the audio data using Mel-frequency cepstral coefficients (MFCCs), which are commonly used for speech and audio processing tasks. A classification model is then built using a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units.
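The exact architecture is not reproduced here; the following is a minimal sketch of this idea, using librosa for MFCC extraction and TensorFlow/Keras for the LSTM classifier, with the number of emotion classes, frame count, and layer sizes chosen purely for illustration.

import numpy as np
import librosa
import tensorflow as tf

NUM_CLASSES = 7   # assumed number of emotion categories
N_MFCC = 40       # MFCC coefficients per frame
MAX_FRAMES = 200  # pad/truncate every clip to this many frames

def extract_mfcc(path):
    """Return an (MAX_FRAMES, N_MFCC) MFCC sequence for one audio file."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

# LSTM-based classifier over the MFCC frame sequence.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_FRAMES, N_MFCC)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then call model.fit(X_train, y_train, validation_data=(X_test, y_test), ...)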
In this progression, it demonstrates a pipeline for processing audio data, extracting features, and
training a model for emotion detection. It's a common approach in the field of affective computing,
which focuses on developing systems that can recognize, interpret, process, and simulate human
emotions.
Model Evaluation: After training the model, evaluate its performance on a separate validation set
or through cross-validation. This typically involves calculating metrics such as accuracy, precision,
recall, and F1-score for each emotion class.
Confusion Matrix: Examine the confusion matrix to see how well the model is performing for
each emotion category. The confusion matrix shows the true positives, false positives, true
negatives, and false negatives for each class, providing insights into which emotions are being
correctly classified and which are being confused with others.
Class-wise Metrics: Calculate metrics like precision, recall, and F1-score for each emotion class.
Precision measures the proportion of true positive predictions among all positive predictions for a
given emotion, recall measures the proportion of true positive predictions among all actual
instances of that emotion, and F1-score is the harmonic mean of precision and recall.
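A minimal sketch of these computations with scikit-learn, assuming the test split and trained model from the earlier sketches, could look like this.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Class predictions from the trained model (names follow the earlier sketches).
y_pred = np.argmax(model.predict(X_test), axis=1)

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred))

# Rows are true emotions, columns are predicted emotions.
print(confusion_matrix(y_test, y_pred))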
Visualization: Visualize the model's performance using plots such as ROC curves (Receiver
Operating Characteristic) or precision-recall curves. These can provide a visual understanding of
the trade-off between true positive rate and false positive rate, or between precision and recall,
respectively.
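For a multi-class problem, ROC curves are typically drawn one class at a time (one-vs-rest). The following sketch, assuming the probability outputs of the model from the earlier sketches, plots the curve for a single emotion class.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

class_id = 0  # index of the emotion class to inspect (illustrative)
y_score = model.predict(X_test)[:, class_id]            # predicted probability of that class
y_true = (np.asarray(y_test) == class_id).astype(int)   # one-vs-rest ground truth

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"class {class_id} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()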
Error Analysis: Examine instances where the model misclassifies emotions and try to understand
why. This could involve listening to audio samples, analyzing the features extracted from those
samples, and identifying potential reasons for misclassification.
Comparison with Baselines: Compare the performance of the model with baseline methods or previous studies in the field. This can help contextualize the results and understand whether the model is performing competitively.
The use case diagram depicts the interaction between the user, the emotion detection system, and
external sources of audio data. The user selects an audio file, which is then analysed for emotions by
the Emotion Detector system using the Audio System (Librosa). Finally, the results of emotion
classification are provided to the user for viewing. External audio sources, such as repositories or
databases, can provide additional audio files for analysis.
CHAPTER 6
RESULTS
Fig. 6.4
CHAPTER 7
CONCLUSION
Overall, this project contributes to the ongoing research efforts in affective computing
and lays the groundwork for the development of more sophisticated and emotionally
intelligent systems that can better understand and respond to human emotions conveyed
through speech.
REFERENCES
[1] Shelke, N., Wadyalkar, V., Kotangale, D., Kuyate, N., Nerkar, A., & Gour, N. (n.d.). A Novel Approach to Emotion Detection from Speech.
[2] Kumar, A., & Iqbal, J. L. M. (2019). Machine Learning Based Emotion Recognition using Speech Signal. International Journal of Engineering and Advanced Technology (IJEAT), 9, 2249–8958.
[3] Mittal, R., Vart, S., Shokeen, P., & Kumar, M. (2022). Speech Emotion Recognition.
[4] Tzirakis, P., Zhang, J., & Schuller, B. W. (2018). End-to-end speech emotion recognition using deep neural networks. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5089–5093.