
SER Final Report

All my presentation links:
https://drive.google.com/drive/folders/1StI5socw_u5xd0_whrFUFWzLFfJbgqoC?usp=sharing

My drive link:
https://drive.google.com/drive/folders/1jPYbcaIs7ZMQkYcAah66bcFNR63NJHMr?usp=sharing

I initially attempted to use the SpeechBrain framework for speech emotion recognition on the challenging CREMA-D dataset. Despite these initial efforts, the approach did not yield promising results. Recognizing the need for an alternative, I switched to fine-tuning the Wav2Vec2 model for speech emotion recognition on the CREMA-D and RAVDESS datasets, with the aim of accurately classifying emotions in speech. Through experimentation, I achieved strong results on the test set, with accuracies of 79% and 81% on CREMA-D and RAVDESS, respectively. The model weights can be accessed in my Hugging Face repository, and comprehensive information, including the datasets and all trials, is available in the drive and presentation links provided above.
 The code for fine-tuning Wav2Vec2 on RAVDESS can be accessed here; similarly, the code for CREMA-D can be found here.
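For orientation, here is a minimal sketch of fine-tuning Wav2Vec2 as an audio classifier with the Hugging Face transformers library. It is illustrative only and not the exact code from my notebooks: the base checkpoint, label list, and training loop are assumptions.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# CREMA-D's six emotion labels; RAVDESS would use its eight labels instead.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad"]

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=len(EMOTIONS))

def forward_step(waveform_16khz, label_id):
    # One supervised step: raw 16 kHz audio in, cross-entropy loss out.
    inputs = extractor(waveform_16khz, sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs, labels=torch.tensor([label_id]))
    return outputs.loss, outputs.logits

# In the real experiments this step would sit inside a training loop (or a
# Trainer) over the CREMA-D or RAVDESS training split.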

Additionally, I think that incorporating the visual component through multimodal emotion recognition, particularly by utilizing videos, has the potential to further enhance the performance on both datasets.

1. Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

During the implementation of the paper, I encountered difficulties, particularly in the video-stream section. The results I obtained were not accurate, and one major issue was that the model seemed unable to learn effectively. Consequently, I decided to shift my focus to the audio part. However, I encountered further challenges in the audio section, particularly when applying augmentation techniques to the spectrogram features. Additionally, some of the test samples were mistakenly included in the training data, leading to inflated accuracy scores that did not reflect the true performance of the model.

To fix these issues, I ran many different experiments and made changes to my approach. One important change was to separate the test and training data before applying any augmentation techniques, making sure that augmented samples did not overlap between the two sets. This helped prevent misleading results.
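As a concrete illustration of this fix, the sketch below splits the file list first and only then augments the training portion. The names used here (all_files, labels, load_spectrogram, augment) are placeholders rather than my actual code.

from sklearn.model_selection import train_test_split

# Split on the original files BEFORE any augmentation is applied.
train_files, test_files, train_labels_raw, test_labels = train_test_split(
    all_files, labels, test_size=0.2, stratify=labels, random_state=42)

train_specs, train_labels = [], []
for path, label in zip(train_files, train_labels_raw):
    spec = load_spectrogram(path)            # original training sample
    train_specs += [spec, augment(spec)]     # augmented copy stays in train only
    train_labels += [label, label]

# The test set is built from untouched, never-augmented files.
test_specs = [load_spectrogram(p) for p in test_files]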

The figure above illustrates the model architecture from the paper.

I experimented with different settings and values for the hyperparameters in the data preprocessing and the triplet loss function, but the paper did not provide clear information about them. Unfortunately, these experiments did not lead to good results, especially when I worked with the CREMA-D dataset. To address this, I switched to the RAVDESS dataset and conducted several trials on it. Here are some of the best outcomes from these experiments in the following table, while the remaining results are stored in my drive:

Model (trials on RAVDESS) | Result on test (5-fold CV)
Paper model "AT" (ResNet18 + attention) with augmentation (3 ways) | ~71%
Paper model "AT" (ResNet18 + attention) without augmentation | ~61%
Paper model "AT" (ResNet18 + attention) with augmentation (add noise only) | ~71%
Paper model "AT" (ResNet18 + attention) with augmentation (3 new ways) | ~77%
In the previous table, the results were based on the RAVDESS dataset
using only the spectrogram feature. However, I conducted further
experiments by incorporating the MFCC feature. The updated table below
showcases the results obtained:

Model (trials on RAVDESS) | Result on test (5-fold CV)
Paper model "AT" (ResNet18 + attention), spectrogram + MFCC features, with 3 new augmentations | ~81%
Edited model with MFCC only and new augmentation | ~95%
Edited model: the ResNet weights were initialized with the pre-trained weights from ImageNet, a pooling layer was added after the ResNet features, and normal temporal self-attention was employed instead of multi-head self-attention.
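A rough PyTorch sketch of this edited model is given below. The layer sizes and the pooling choice are my own illustrative assumptions, and the attention here is a simple attention-weighted temporal pooling standing in for the temporal self-attention described above, not the exact configuration from the experiments.

import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class EditedAudioModel(nn.Module):
    def __init__(self, num_classes=8):  # RAVDESS has 8 emotion classes
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # ImageNet init
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.pool = nn.AdaptiveAvgPool2d((1, None))   # pool the frequency axis, keep time
        self.attn_score = nn.Linear(512, 1)           # simple temporal attention scores
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, spec):                          # spec: (B, 3, freq, time)
        f = self.features(spec)                       # (B, 512, F', T')
        f = self.pool(f).squeeze(2).transpose(1, 2)   # (B, T', 512)
        w = torch.softmax(self.attn_score(f), dim=1)  # attention weights over time
        context = (w * f).sum(dim=1)                  # weighted temporal pooling
        return self.classifier(context)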

I made several modifications to the model implementation, including changes to both the ResNet architecture and the attention module. After these modifications, I observed a noticeable improvement in accuracy, reaching around 87%. However, when I attempted to switch to the ResNet50 and ResNet34 architectures, the results actually worsened instead of improving. I then decided to incorporate the MFCC features into the model, and surprisingly, this adjustment led to a remarkable accuracy of 95%.
 The "AT" notebooks on RAVDESS can be accessed here (check_mfcc only).
I also explored a speaker-independent evaluation with the same dataset. However, the results were not as promising, with an accuracy of approximately 60% on the held-out test actors.
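For reference, a speaker-independent split on RAVDESS can be made by holding out whole actors, so that no speaker appears in both training and test. The sketch below parses the actor ID from the standard RAVDESS filename convention; the particular held-out actors are an arbitrary choice for illustration, not necessarily the ones I used.

from pathlib import Path

TEST_ACTORS = {"23", "24"}  # example: reserve the last two actors for testing

def split_by_actor(wav_paths):
    train, test = [], []
    for p in map(Path, wav_paths):
        actor_id = p.stem.split("-")[-1]  # RAVDESS filenames end with the actor number
        (test if actor_id in TEST_ACTORS else train).append(str(p))
    return train, test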

2. Vision Transformers (ViT) and CoordViT


 In my initial experiment, I followed this repository from GitHub and applied it to the spectrogram and MFCC features of the RAVDESS dataset that I had saved from previous trials. However, the results I obtained on the test set were only about 60% accurate.
 After that, I tried the ViTCN model from this paper on CREMA-D. However, I encountered issues related to CUDA, which required me to reduce the batch size to 1. Although this adjustment allowed the model to run without errors, it significantly increased the training time, and the results after the initial two epochs were unsatisfactory.
 In my search for alternative solutions, I discovered another GitHub repository that also utilized Vision Transformers (ViT). I decided to try it out, and fortunately it ran smoothly without the previous issues. However, the results obtained from applying this model to both the CREMA-D and RAVDESS datasets were still below expectations, with an accuracy of around 48%.
 I explored another method called CoordViT, which I found in a research paper at the top of the CREMA-D leaderboard. This approach involves adding coordinates to each pixel of the input batches during training (a rough sketch of the idea follows after this list). I conducted various trials with different preprocessing techniques, augmentations, and implementation approaches, but the results consistently fell around 49%.

 Throughout these trials, I made several adjustments to the preprocessing techniques, including changes to how the data was split into small segments and how the spectrograms were generated. These modifications aimed to improve the preprocessing pipeline and enhance the performance of the ViT models. Despite these efforts, the results remained unsatisfactory, indicating the need for further investigation and experimentation to achieve better accuracy.
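As mentioned above, the core idea behind the coordinate-based approach, as I understand it, is to attach position information to the input before the transformer sees it. The helper below is a generic CoordConv-style sketch of that idea, not the paper's exact implementation.

import torch

def add_coord_channels(spec_batch):
    # spec_batch: (B, C, H, W) -> (B, C + 2, H, W) with x/y coordinates in [-1, 1].
    b, _, h, w = spec_batch.shape
    ys = torch.linspace(-1, 1, h, device=spec_batch.device)
    xs = torch.linspace(-1, 1, w, device=spec_batch.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")         # (H, W) coordinate grids
    coords = torch.stack([yy, xx]).expand(b, -1, -1, -1)   # (B, 2, H, W)
    return torch.cat([spec_batch, coords], dim=1)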

3. Mixing ResNet18 with TCN and Coord techniques

I decided to give ResNet a final attempt using the same TCN (Temporal Convolutional Network) and coord approaches, on CREMA-D only. Here are some of the results in the following table:

Model (on CREMA-D) | Test results
ResNet18 only (spec size (299, 200), no resize, no normalize, fixed length, edited first conv. layer) | ~54%
ResNet18 with coord approach | ~54%
ResNet18 + coord (resizing, batch size changed from 16 to 32) | ~54%
ResNet18 + coord + TCN (no resizing, each segment shape is (24, 200)) | ~42%
ResNet18 + coord + TCN (resizing) | ~45%
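For clarity, the "+ coord + TCN" rows combine ResNet18 features with a small temporal convolutional head, roughly as sketched below. The channel sizes and dilation schedule are illustrative assumptions, not the exact configuration used in these runs; the coord variant would additionally pass the input through a coordinate-channel helper like the one sketched in the previous section, with the first conv layer widened to accept the extra channels.

import torch.nn as nn
from torchvision.models import resnet18

class ResNetTCN(nn.Module):
    def __init__(self, num_classes=6):  # CREMA-D has 6 emotion classes
        super().__init__()
        backbone = resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.tcn = nn.Sequential(   # minimal TCN: dilated 1-D convolutions over time
            nn.Conv1d(512, 256, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, spec):                   # spec: (B, 3, freq, time)
        f = self.features(spec)                # (B, 512, F', T')
        f = f.mean(dim=2)                      # pool the frequency axis -> (B, 512, T')
        f = self.tcn(f)                        # dilated temporal convolutions
        return self.classifier(f.mean(dim=2))  # average over time, then classify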

As we can see from these results, the previous methods did not perform well on CREMA-D. In response, I decided to explore Wav2Vec2 instead, and the outcomes were exceptionally promising, as demonstrated on the first page of this report.
