Efficient Speech Emotion Recognition
Presented By: Samir Kumar Majhi
Presentation Outline
1. Introduction
2. Literature Review
3. Proposed Model
4. Dataset Used
5. Results
6. Comparison Analysis
Introduction
Speech Emotion Recognition (SER) is the task of identifying how a person feels from their voice. It analyzes acoustic cues such as pitch, loudness, and speaking style to classify emotions like happiness, sadness, anger, or fear. SER is applied in customer service, healthcare, and human-computer interaction to make communication more natural. With advances in machine learning, SER systems are becoming increasingly accurate at helping machines understand human emotions.
Literature Review

1. Ala Saleh Alluhaidan, Oumaima Saidani et al., "Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network", Applied Sciences.
• Proposed method: A Convolutional Neural Network (CNN) using hybrid features (MFCC and time-domain features) to improve emotion detection accuracy.
• Merits: Achieves high accuracy (97%, 93%, and 92% on the Emo-DB, SAVEE, and RAVDESS datasets respectively), outperforming traditional methods.
• Limitation: Limited to audio-based datasets; needs further exploration for multimodal or real-world complex emotion detection scenarios.

2. Samuel Kakuba, Alwin Poulose, Dong Seog Han et al., "Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features", IEEE Access.
• Proposed method: CoSTGA model using multi-level fusion of spatial, temporal, and semantic features.
• Merits: Achieves 75.50% weighted accuracy and 75.82% unweighted accuracy, showing improved robustness.
• Limitation: Limited to speech-based emotion recognition; may need further exploration with additional modalities for real-world applications.

3. Samuel Kakuba, Alwin Poulose, and Dong Seog Han (Senior Member, IEEE), "Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features", IEEE Access.
• Proposed method: Uses CNNs within a deep learning model called CoSTGA that learns features from the audio and the text of speech at the same time.
• Merits: Deep learning with multi-level fusion offers comprehensive understanding, real-world applicability, enhanced accuracy, and the ability to effectively ignore background noise in speech emotion recognition tasks.
• Limitation: Complex architectures and multi-level fusion techniques have large data requirements and a potential lack of interpretability due to the complexity of the models.

4. Cheng Lu, Wenming Zheng (Senior Member, IEEE), Hailun Lian, Yuan Zong (Member, IEEE), Chuangao Tang, Sunan Li, and Yan Zhao, "Speech Emotion Recognition via an Attentive Time–Frequency Neural Network", IEEE Transactions on Computational Social Systems, Vol. 10, No. 6, December 2023.
• Proposed method: Introduces an Attentive Time-Frequency Neural Network that captures time-frequency patterns in speech signals to identify emotional states.
• Merits: Achieves state-of-the-art performance in speech emotion recognition by leveraging attention mechanisms to focus on relevant time-frequency features.
• Limitation: While achieving high accuracy, the computational complexity of the model is higher than that of simpler models, limiting its real-time application in resource-constrained environments.
Dataset Used: RAVDESS
The Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS) contains 1,440 speech samples from
24 actors expressing 8 emotions (neutral, calm, happy, sad,
angry, fearful, disgust, and surprised). It is widely used for
emotion recognition research due to its high-quality,
labeled emotional expressions, supporting the training of
machine learning models like DNNs and CNNs.
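For illustration, below is a minimal sketch of how emotion labels can be read from RAVDESS files, assuming the standard filename convention in which the third hyphen-separated field encodes the emotion; the directory path and helper name are placeholders.

```python
# Sketch: mapping RAVDESS filenames to emotion labels.
# Assumes the standard RAVDESS naming scheme, where the third
# hyphen-separated field encodes the emotion
# (e.g. "03-01-05-01-02-01-12.wav" -> code "05" = angry).
from pathlib import Path

EMOTIONS = {
    "01": "neutral", "02": "calm",    "03": "happy",   "04": "sad",
    "05": "angry",   "06": "fearful", "07": "disgust", "08": "surprised",
}

def load_ravdess_labels(root: str) -> list[tuple[str, str]]:
    """Return (file_path, emotion_label) pairs for every .wav file under root."""
    pairs = []
    for wav in Path(root).rglob("*.wav"):
        emotion_code = wav.stem.split("-")[2]   # third field of the filename
        pairs.append((str(wav), EMOTIONS[emotion_code]))
    return pairs

# Example usage (the directory name is illustrative):
# samples = load_ravdess_labels("data/RAVDESS")
```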
Our Final Proposed Model
Steps involved (illustrative sketches of the feature-extraction and model stages follow this list):
1. Audio Input – Raw audio is fed into the system.
2. MFCC Extraction – Extracts Mel-Frequency Cepstral Coefficients, which capture the spectral shape of the voice.
3. RMS Extraction – Measures the loudness (energy) of the signal.
4. ZCR Extraction – Counts how often the signal changes sign (zero-crossing rate).
5. Feature Fusion – Combines the MFCC, RMS, and ZCR features into one feature vector.
6. CNN Block – Learns useful patterns from the combined features.
7. Flattening – Converts the learned feature maps into a single vector.
8. Dense Layers – Learn higher-level representations of the emotions.
9. Dropout – Helps prevent the model from overfitting.
10. Softmax Output – Selects the emotion with the highest probability.
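A minimal sketch of steps 1-5 using librosa is shown below; the number of MFCCs and the mean-pooled concatenation of the three feature types are assumptions, since the slides do not fix these choices.

```python
# Sketch of steps 1-5: extract MFCC, RMS, and ZCR with librosa and fuse them.
# Frame parameters, n_mfcc, and mean-pooling fusion are assumptions; the slides
# only state that the three feature types are combined.
import librosa
import numpy as np

def extract_fused_features(wav_path: str, sr: int = 22050, n_mfcc: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)                    # step 1: audio input
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # step 2: MFCCs
    rms = librosa.feature.rms(y=y)                           # step 3: loudness
    zcr = librosa.feature.zero_crossing_rate(y)              # step 4: zero-crossing rate
    # step 5: fuse by averaging each feature over time and concatenating
    return np.concatenate([mfcc.mean(axis=1), rms.mean(axis=1), zcr.mean(axis=1)])
```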
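A corresponding sketch of steps 6-10 as a 1-D CNN in Keras; the layer sizes, kernel widths, and dropout rate are illustrative choices rather than the exact configuration behind the reported results.

```python
# Sketch of steps 6-10: a 1-D CNN over the fused feature vector.
# Layer sizes, kernel widths, and the dropout rate are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_ser_model(input_dim: int = 42, num_emotions: int = 8) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(input_dim, 1)),                      # fused features as a 1-D sequence
        layers.Conv1D(64, kernel_size=5, activation="relu"),     # step 6: CNN block
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),                                        # step 7: flattening
        layers.Dense(128, activation="relu"),                    # step 8: dense layers
        layers.Dropout(0.3),                                     # step 9: dropout
        layers.Dense(num_emotions, activation="softmax"),        # step 10: softmax output
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In this sketch, the fused feature vectors from the previous snippet would be reshaped to (num_samples, input_dim, 1) before being passed to model.fit.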
RESULTS AND DISCUSSION
• Model performance metrics: Accuracy (Acc), Precision, Recall, and F1-score (a sketch of computing them follows this list).
• Comparison across feature combinations:
  • MFCC only: 85.30% accuracy.
  • RMS + ZCR: 88.40% accuracy.
  • MFCC + RMS + ZCR (Fusion Model): 99.31% accuracy.
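A minimal sketch of computing these metrics on a held-out test set with scikit-learn; y_test and y_pred are placeholders for the true and predicted emotion labels.

```python
# Sketch: accuracy, precision, recall, and F1 on a held-out test set.
# y_test and y_pred are placeholders for true and predicted emotion labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report_metrics(y_test, y_pred):
    acc = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0
    )
    print(f"Accuracy:  {acc:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-score:  {f1:.4f}")
```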
Comparison Analysis
Existing Model (MFCC + VGGish):
• Uses MFCC features with a pre-trained model called VGGish.
• VGGish is designed for general audio tasks, not emotion recognition.
• Accuracy is lower because it is not tailored to emotion detection.
Proposed Model (MFCC + RMS + ZCR + CNN):
• Combines MFCC, RMS, and ZCR features.
• Uses a CNN to learn richer patterns from the fused features.
• Achieves very high accuracy (99.31%) for emotion recognition.
• Performs better because it combines complementary types of information and is designed specifically for emotion recognition.