
SER Final Report

All my presentation links:
https://drive.google.com/drive/folders/1StI5socw_u5xd0_whrFUFWzLFfJbgqoC?usp=sharing

My drive link:
https://drive.google.com/drive/folders/1jPYbcaIs7ZMQkYcAah66bcFNR63NJHMr?usp=sharing

I initially attempted to use the SpeechBrain framework for speech emotion recognition on the challenging CREMA-D dataset. Despite these initial efforts, the approach did not yield promising results. Recognizing the need for an alternative, I switched to fine-tuning the Wav2Vec2 model for speech emotion recognition on the CREMA-D and RAVDESS datasets, with the aim of accurately classifying emotions in speech. Through experimentation, I achieved strong results on the test set, with accuracies of 79% and 81% on CREMA-D and RAVDESS, respectively. The model weights can be accessed in my Hugging Face repository, and comprehensive information, including the datasets and all trials, is available in the drive and presentation links provided above.
 The code for fine-tuning Wav2Vec2 on RAVDESS can be accessed here; similarly, the code for CREMA-D can be found here.
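For orientation, here is a minimal sketch of fine-tuning Wav2Vec2 as an audio classifier with the Hugging Face transformers library. It is illustrative only and not the exact code from my notebooks: the base checkpoint, label list, and training loop are assumptions.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# CREMA-D's six emotion labels; RAVDESS would use its eight labels instead.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=len(EMOTIONS))

def forward_step(waveform_16khz, label_id):
    # One supervised step: raw 16 kHz audio in, cross-entropy loss out.
    inputs = extractor(waveform_16khz, sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs, labels=torch.tensor([label_id]))
    return outputs.loss, outputs.logits

# In the real experiments this step would sit inside a training loop (or a
# Trainer) over the CREMA-D or RAVDESS training split.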

Additionally, I think that incorporating the visual component through multimodal emotion recognition, particularly by utilizing videos, has the potential to further enhance the performance on both datasets.

1. Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

During the implementation of the paper, I encountered difficulties, particularly in the video-stream section. The results I obtained were not accurate, and one major issue was that the model seemed unable to learn effectively. Consequently, I decided to shift my focus to the audio part. However, I encountered further challenges in the audio section, particularly when applying augmentation techniques to the spectrogram features. Additionally, some of the test samples were mistakenly included in the training data, leading to inflated accuracy scores that did not reflect the true performance of the model.

To fix these issues, I ran many different experiments and made changes to my approach. One important change was to separate the test and training data before applying any augmentation techniques, making sure that augmented samples did not overlap between the two sets. This helped prevent misleading results.
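As a concrete illustration of this fix, the sketch below splits the file list first and only then augments the training portion. The names used here (all_files, labels, load_spectrogram, augment) are placeholders rather than my actual code.

from sklearn.model_selection import train_test_split

# Split on the original files BEFORE any augmentation is applied.
train_files, test_files, train_labels_raw, test_labels = train_test_split(
    all_files, labels, test_size=0.2, stratify=labels, random_state=42)

train_specs, train_labels = [], []
for path, label in zip(train_files, train_labels_raw):
    spec = load_spectrogram(path)            # original training sample
    train_specs += [spec, augment(spec)]     # augmented copy stays in train only
    train_labels += [label, label]

# The test set is built from untouched, never-augmented files.
test_specs = [load_spectrogram(p) for p in test_files]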

The figure above illustrates the model architecture from the paper.

I experimented with different settings and values for the hyperparameters in the data preprocessing and the triplet loss function, but the paper did not provide clear information about them. Unfortunately, these experiments did not lead to good results, especially when I worked with the CREMA-D dataset. To address this, I switched to the RAVDESS dataset and conducted several trials on it. Here are some of the best outcomes from these experiments in the following table, while the remaining results are stored in my drive:

Model (trials on RAVDESS) | Result on test (5-fold CV)
Paper model "AT" (ResNet18 + attention) with augmentation (3 ways) | ~71%
Paper model "AT" (ResNet18 + attention) without augmentation | ~61%
Paper model "AT" (ResNet18 + attention) with augmentation (add noise only) | ~71%
Paper model "AT" (ResNet18 + attention) with augmentation (3 new ways) | ~77%
In the previous table, the results were based on the RAVDESS dataset
using only the spectrogram feature. However, I conducted further
experiments by incorporating the MFCC feature. The updated table below
showcases the results obtained:

Model (trials on RAVDESS) | Result on test (5-fold CV)
Paper model "AT" (ResNet18 + attention), spectrogram + MFCC features, with 3 new augmentations | ~81%
Edited model with MFCC only and new augmentation | ~95%
Edited model: the ResNet weights were initialized with the pre-trained weights from ImageNet, a pooling layer was added after the ResNet features, and normal temporal self-attention was employed instead of multi-head self-attention.
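A rough PyTorch sketch of this edited model is given below. The layer sizes and the pooling choice are my own illustrative assumptions, and the attention here is a simple attention-weighted temporal pooling standing in for the temporal self-attention described above, not the exact configuration from the experiments.

import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class EditedAudioModel(nn.Module):
    def __init__(self, num_classes=8):  # RAVDESS has 8 emotion classes
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # ImageNet init
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.pool = nn.AdaptiveAvgPool2d((1, None))   # pool the frequency axis, keep time
        self.attn_score = nn.Linear(512, 1)           # simple temporal attention scores
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, spec):                          # spec: (B, 3, freq, time)
        f = self.features(spec)                       # (B, 512, F', T')
        f = self.pool(f).squeeze(2).transpose(1, 2)   # (B, T', 512)
        w = torch.softmax(self.attn_score(f), dim=1)  # attention weights over time
        context = (w * f).sum(dim=1)                  # weighted temporal pooling
        return self.classifier(context)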

I made several modifications to the model implementation, including changes to both the ResNet architecture and the attention module. After these modifications, I observed a noticeable improvement in accuracy, reaching around 87%. However, when I attempted to switch to the ResNet50 and ResNet34 architectures, the results actually worsened instead of improving. I then decided to incorporate the MFCC features into the model, and surprisingly, this adjustment led to a remarkable accuracy of 95%.
 The "AT" notebooks on RAVDESS can be accessed here (check_mfcc only).
I also explored a speaker-independent evaluation with the same dataset. However, the results were not as promising, with an accuracy of approximately 60% on the held-out test actors.
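For reference, a speaker-independent split on RAVDESS can be made by holding out whole actors, so that no speaker appears in both training and test. The sketch below parses the actor ID from the standard RAVDESS filename convention; the particular held-out actors are an arbitrary choice for illustration, not necessarily the ones I used.

from pathlib import Path

TEST_ACTORS = {"23", "24"}  # example: reserve the last two actors for testing

def split_by_actor(wav_paths):
    train, test = [], []
    for p in map(Path, wav_paths):
        actor_id = p.stem.split("-")[-1]  # RAVDESS filenames end with the actor number
        (test if actor_id in TEST_ACTORS else train).append(str(p))
    return train, test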

2. Vision Transformers (ViT) and CoordViT


 In my initial experiment, I followed this repository from GitHub and applied it to the spectrogram and MFCC features of the RAVDESS dataset that I had saved from previous trials. However, the results I obtained on the test set were only about 60% accurate.
 After that, I tried the ViTCN model from this paper on CREMA-D. However, I encountered issues related to CUDA, which required me to reduce the batch size to 1. Although this adjustment allowed the model to run without errors, it significantly increased the training time, and the results after the initial two epochs were unsatisfactory.
 In my search for alternative solutions, I discovered another GitHub repository that also utilized Vision Transformers (ViT). I decided to try it out, and fortunately it ran smoothly without the previous issues. However, the results obtained from applying this model to both the CREMA-D and RAVDESS datasets were still below expectations, with an accuracy of around 48%.
 I explored another method called CoordViT, which I found in a research paper at the top of the CREMA-D leaderboard. This approach involves adding coordinates to each pixel of the input batches during training (a rough sketch of the idea follows after this list). I conducted various trials with different preprocessing techniques, augmentations, and implementation approaches, but the results consistently fell around 49%.

 Throughout these trials, I made several adjustments to the preprocessing techniques, including changes to how the data was split into small segments and how the spectrograms were generated. These modifications aimed to improve the preprocessing pipeline and enhance the performance of the ViT models. Despite these efforts, the results remained unsatisfactory, indicating the need for further investigation and experimentation to achieve better accuracy.
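As mentioned above, the core idea behind the coordinate-based approach, as I understand it, is to attach position information to the input before the transformer sees it. The helper below is a generic CoordConv-style sketch of that idea, not the paper's exact implementation.

import torch

def add_coord_channels(spec_batch):
    # spec_batch: (B, C, H, W) -> (B, C + 2, H, W) with x/y coordinates in [-1, 1].
    b, _, h, w = spec_batch.shape
    ys = torch.linspace(-1, 1, h, device=spec_batch.device)
    xs = torch.linspace(-1, 1, w, device=spec_batch.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")         # (H, W) coordinate grids
    coords = torch.stack([yy, xx]).expand(b, -1, -1, -1)   # (B, 2, H, W)
    return torch.cat([spec_batch, coords], dim=1)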

3. Mixing ResNet18 with TCN and Coord techniques

I decided to give ResNet a final attempt using the same TCN (Temporal Convolutional Network) and coord approaches, on CREMA-D only. Here are some of the results in the following table:

Model (on CREMA-D) | Test results
ResNet18 only (spec size (299, 200), no resize, no normalize, fixed length, edited first conv. layer) | ~54%
ResNet18 with coord approach | ~54%
ResNet18 + coord (resizing, batch size changed from 16 to 32) | ~54%
ResNet18 + coord + TCN (no resizing, each segment shape is (24, 200)) | ~42%
ResNet18 + coord + TCN (resizing) | ~45%
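For clarity, the "+ coord + TCN" rows combine ResNet18 features with a small temporal convolutional head, roughly as sketched below. The channel sizes and dilation schedule are illustrative assumptions, not the exact configuration used in these runs; the coord variant would additionally pass the input through a coordinate-channel helper like the one sketched in the previous section, with the first conv layer widened to accept the extra channels.

import torch.nn as nn
from torchvision.models import resnet18

class ResNetTCN(nn.Module):
    def __init__(self, num_classes=6):  # CREMA-D has 6 emotion classes
        super().__init__()
        backbone = resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.tcn = nn.Sequential(   # minimal TCN: dilated 1-D convolutions over time
            nn.Conv1d(512, 256, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, spec):                   # spec: (B, 3, freq, time)
        f = self.features(spec)                # (B, 512, F', T')
        f = f.mean(dim=2)                      # pool the frequency axis -> (B, 512, T')
        f = self.tcn(f)                        # dilated temporal convolutions
        return self.classifier(f.mean(dim=2))  # average over time, then classify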

As we can see from these results, the previous methods did not perform well on CREMA-D. In response, I decided to explore Wav2Vec2 instead, and the outcomes were exceptionally promising, as demonstrated on the first page of this report.
