Speech Emotion Detection
The Neural Voices
Avirup Das & Subhradip Bhattacharyya
[email protected] & [email protected]
May 1, 2025
Abstract
This project focuses on developing a deep learning-based Speech Emotion Recognition
(SER) system using benchmark datasets such as RAVDESS, CREMA-D, and SAVEE.
We performed extensive feature extraction, including Mel-Frequency Cepstral Coefficients
(MFCCs), to capture the emotional characteristics of speech signals. Both Artificial Neural
Networks (ANNs) and Convolutional Neural Networks (CNNs) were designed and trained
to classify emotions such as happiness, sadness, anger, and fear. To improve generalization,
we implemented data augmentation techniques and conducted comparative experiments
between the original and augmented datasets. The models were evaluated using accu-
racy, precision, recall, and F1-score to assess their performance comprehensively. Our
results demonstrate the potential of deep learning models to significantly enhance human-
computer interaction by enabling machines to effectively interpret human emotions from
speech.
1 Introduction
What?
This project involves building a Speech Emotion Recognition (SER) system using deep
learning techniques. The objective is to classify human emotions, such as happiness,
sadness, anger, and fear, based on audio recordings of speech.
Why?
Understanding emotions from speech is essential for improving human-computer interac-
tion. It has significant applications in areas such as virtual assistants, mental health moni-
toring, customer service automation, and intelligent tutoring systems. Machines equipped
with emotional awareness can respond more naturally and empathetically, enhancing the
user experience.
How?
We used benchmark speech emotion datasets, including RAVDESS, CREMA-D, and SAVEE,
which provide labeled audio samples for various emotional states. From these audio
recordings, we extracted meaningful features such as Mel-Frequency Cepstral Coefficients
(MFCCs). Using these features, we trained both Artificial Neural Networks (ANNs) and
Convolutional Neural Networks (CNNs) to classify emotions. Additionally, data augmen-
tation techniques were applied to improve model generalization. The trained models were
evaluated using standard classification metrics such as accuracy, precision, recall, and F1-
score to assess their performance.
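As a concrete illustration of the modelling step, the following is a minimal sketch of a fully
connected ANN and a one-dimensional CNN operating on per-clip MFCC vectors in Keras.
The layer sizes, the assumption of 40 MFCC coefficients per clip, and the 8 emotion classes
are illustrative placeholders, not the exact configurations used in our experiments.

# Minimal sketch of an ANN and a 1-D CNN over MFCC feature vectors.
# NUM_MFCC and NUM_CLASSES are assumed values, not project-specific settings.
from tensorflow.keras import layers, models

NUM_MFCC = 40      # assumed number of MFCC coefficients per clip
NUM_CLASSES = 8    # assumed number of emotion labels

def build_ann():
    """Fully connected baseline over a flat MFCC vector."""
    return models.Sequential([
        layers.Input(shape=(NUM_MFCC,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_cnn():
    """1-D CNN treating the MFCC vector as a short sequence."""
    return models.Sequential([
        layers.Input(shape=(NUM_MFCC, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])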
2 Literature Review
• Early Approaches: Traditional SER methods used handcrafted features like pitch,
energy, and spectral features, classified using models such as Support Vector Machines
(SVM) and Hidden Markov Models (HMM). These approaches were often limited in
performance due to shallow representations.
• Deep Learning Models:
– Trigeorgis et al. (2016) introduced a Convolutional Recurrent Neural Network
(CRNN) for end-to-end emotion recognition from raw audio, combining CNNs
for spectral features and RNNs for temporal dynamics.
– Neumann and Vu (2017) implemented attention mechanisms in deep networks
to improve SER on benchmark datasets.
• State-of-the-Art on RAVDESS:
– The VQ-MAE-S-12 (Frame) model with the Query2Emo framework currently
holds the highest accuracy.
– This model leverages vector-quantized masked autoencoders and transformer-
based architectures to learn robust representations in a self-supervised fashion.
• State-of-the-Art on CREMA-D:
– The best-performing model is based on a Vision Transformer (ViT) with verti-
cally long patches, which treats speech spectrograms as images.
– This method excels in capturing both spectral and temporal features using
attention-based mechanisms.
• Limitations of Existing Work:
– Most SOTA models require extensive pretraining on large corpora and rely on
heavy computational resources.
– These methods are typically optimized for a single dataset and do not offer
generalizability insights across multiple datasets.
• What we try to do:
– We conduct a comparative study using Artificial Neural Networks (ANNs) and
Convolutional Neural Networks (CNNs) across three standard datasets: RAVDESS,
CREMA-D, and SAVEE.
– We extract MFCC features from raw audio and apply data augmentation techniques
(e.g., noise addition, pitch shift) to enhance diversity and generalization, as
illustrated in the sketch following this list.
– Unlike transformer-based models, our approach is lightweight and reproducible
in resource-constrained environments.
– We evaluate model performance using standard classification metrics (accuracy,
precision, recall, F1-score), and analyze the impact of augmentation and model
architecture on cross-dataset performance.
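The sketch below illustrates the kind of augmentation referred to in the list above, using
additive Gaussian noise and pitch shifting with librosa; the noise factor, the semitone shift,
and the file path are illustrative assumptions rather than our exact settings.

# Minimal augmentation sketch: additive noise and pitch shifting.
# noise_factor, n_steps, and the file path are illustrative assumptions.
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Add low-amplitude Gaussian noise to the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("example.wav", sr=22050)   # hypothetical clip
augmented_clips = [add_noise(y), shift_pitch(y, sr)]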
3 Proposed Methodology
This section outlines the methodological framework followed in the development of the
Speech Emotion Recognition (SER) system. A sequence of preprocessing, modeling, and
evaluation steps was performed to construct and validate models capable of classifying
speech audio signals into discrete emotional categories. Below are the detailed components
of the proposed methodology:
3.1 Feature Extraction using MFCC
Mel-Frequency Cepstral Coefficients (MFCCs) were extracted from each audio sample.
MFCCs are a widely adopted feature representation in speech and audio analysis as they
effectively encode the short-term power spectrum of sound based on human auditory per-
ception. The extraction process involved:
• Pre-emphasis of the audio signal to amplify high-frequency components.
• Framing and windowing to divide the signal into overlapping segments.
• Applying the Fast Fourier Transform (FFT) to obtain the power spectrum.
• Mapping powers to the Mel scale using triangular filter banks.
• Taking the logarithm of Mel spectrum energies.
• Applying the Discrete Cosine Transform (DCT) to decorrelate features.
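In practice these steps are typically bundled into a single library call. The sketch below,
assuming librosa, an explicit pre-emphasis step, 40 coefficients, and time-averaging into a
fixed-length vector (all illustrative choices), shows one way to obtain per-clip MFCC features.

# Minimal MFCC extraction sketch: librosa performs the framing, windowing,
# FFT, Mel filter-bank mapping, log, and DCT steps internally.
# The coefficient count (40) and the time-averaging are illustrative choices.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40, sr=22050):
    """Load a clip and return a fixed-length MFCC feature vector."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y)                       # amplify high frequencies
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, num_frames)
    return np.mean(mfcc, axis=1)                             # average over time

features = extract_mfcc("example.wav")   # hypothetical path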