Speech Emotion Detection
The Neural Voices
Avirup Das & Subhradip Bhattacharyya
[email protected] & [email protected]
May 1, 2025
Abstract
This project focuses on developing a deep learning-based Speech Emotion Recognition
(SER) system using benchmark datasets such as RAVDESS, CREMA-D, and SAVEE.
We performed extensive feature extraction, including Mel-Frequency Cepstral Coefficients
(MFCCs), to capture the emotional characteristics of speech signals. Both Artificial Neural
Networks (ANNs) and Convolutional Neural Networks (CNNs) were designed and trained
to classify emotions such as happiness, sadness, anger, and fear. To improve generalization,
we implemented data augmentation techniques and conducted comparative experiments
between the original and augmented datasets. The models were evaluated using accu-
racy, precision, recall, and F1-score to assess their performance comprehensively. Our
results demonstrate the potential of deep learning models to significantly enhance human-
computer interaction by enabling machines to effectively interpret human emotions from
speech.
1 Introduction
What?
This project involves building a Speech Emotion Recognition (SER) system using deep
learning techniques. The objective is to classify human emotions, such as happiness,
sadness, anger, and fear, based on audio recordings of speech.
Why?
Understanding emotions from speech is essential for improving human-computer interac-
tion. It has significant applications in areas such as virtual assistants, mental health moni-
toring, customer service automation, and intelligent tutoring systems. Machines equipped
with emotional awareness can respond more naturally and empathetically, enhancing the
user experience.
How?
We used benchmark speech emotion datasets, including RAVDESS, CREMA-D, and SAVEE,
which provide labeled audio samples for various emotional states. From these audio
recordings, we extracted meaningful features such as Mel-Frequency Cepstral Coefficients
(MFCCs). Using these features, we trained both Artificial Neural Networks (ANNs) and
Convolutional Neural Networks (CNNs) to classify emotions. Additionally, data augmen-
tation techniques were applied to improve model generalization. The trained models were
evaluated using standard classification metrics such as accuracy, precision, recall, and F1-
score to assess their performance.
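As a concrete illustration of the modelling step, the following is a minimal sketch of a fully
connected ANN and a one-dimensional CNN operating on per-clip MFCC vectors in Keras.
The layer sizes, the assumption of 40 MFCC coefficients per clip, and the 8 emotion classes
are illustrative placeholders, not the exact configurations used in our experiments.

# Minimal sketch of an ANN and a 1-D CNN over MFCC feature vectors.
# NUM_MFCC and NUM_CLASSES are assumed values, not project-specific settings.
from tensorflow.keras import layers, models

NUM_MFCC = 40      # assumed number of MFCC coefficients per clip
NUM_CLASSES = 8    # assumed number of emotion labels

def build_ann():
    """Fully connected baseline over a flat MFCC vector."""
    return models.Sequential([
        layers.Input(shape=(NUM_MFCC,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_cnn():
    """1-D CNN treating the MFCC vector as a short sequence."""
    return models.Sequential([
        layers.Input(shape=(NUM_MFCC, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])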
2 Literature Review
• Early Approaches: Traditional SER methods used handcrafted features like pitch,
energy, and spectral features, classified using models such as Support Vector Machines
(SVM) and Hidden Markov Models (HMM). These approaches were often limited in
performance due to shallow representations.
• Deep Learning Models:
– Trigeorgis et al. (2016) introduced a Convolutional Recurrent Neural Network
(CRNN) for end-to-end emotion recognition from raw audio, combining CNNs
for spectral features and RNNs for temporal dynamics.
– Neumann and Vu (2017) implemented attention mechanisms in deep networks
to improve SER on benchmark datasets.
• State-of-the-Art on RAVDESS:
– The VQ-MAE-S-12 (Frame) model with the Query2Emo framework currently
holds the highest accuracy.
– This model leverages vector-quantized masked autoencoders and transformer-
based architectures to learn robust representations in a self-supervised fashion.
• State-of-the-Art on CREMA-D:
– The best-performing model is based on a Vision Transformer (ViT) with verti-
cally long patches, which treats speech spectrograms as images.
– This method excels in capturing both spectral and temporal features using
attention-based mechanisms.
• Limitations of Existing Work:
– Most SOTA models require extensive pretraining on large corpora and rely on
heavy computational resources.
– These methods are typically optimized for a single dataset and do not offer
generalizability insights across multiple datasets.
• What we try to do:
– We conduct a comparative study using Artificial Neural Networks (ANNs) and
Convolutional Neural Networks (CNNs) across three standard datasets: RAVDESS,
CREMA-D, and SAVEE.
– We extract MFCC features from raw audio and apply data augmentation techniques
(e.g., noise addition, pitch shift) to enhance diversity and generalization, as
illustrated in the sketch following this list.
– Unlike transformer-based models, our approach is lightweight and reproducible
in resource-constrained environments.
– We evaluate model performance using standard classification metrics (accuracy,
precision, recall, F1-score), and analyze the impact of augmentation and model
architecture on cross-dataset performance.
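The sketch below illustrates the kind of augmentation referred to in the list above, using
additive Gaussian noise and pitch shifting with librosa; the noise factor, the semitone shift,
and the file path are illustrative assumptions rather than our exact settings.

# Minimal augmentation sketch: additive noise and pitch shifting.
# noise_factor, n_steps, and the file path are illustrative assumptions.
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Add low-amplitude Gaussian noise to the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("example.wav", sr=22050)   # hypothetical clip
augmented_clips = [add_noise(y), shift_pitch(y, sr)]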
3 Proposed Methodology
This section outlines the methodological framework followed in the development of the
Speech Emotion Recognition (SER) system. A sequence of preprocessing, modeling, and
evaluation steps was performed to construct and validate models capable of classifying
speech audio signals into discrete emotional categories. Below are the detailed components
of the proposed methodology:
3.1 Feature Extraction using MFCC
Mel-Frequency Cepstral Coefficients (MFCCs) were extracted from each audio sample.
MFCCs are a widely adopted feature representation in speech and audio analysis as they
effectively encode the short-term power spectrum of sound based on human auditory per-
ception. The extraction process involved:
• Pre-emphasis of the audio signal to amplify high-frequency components.
• Framing and windowing to divide the signal into overlapping segments.
• Applying the Fast Fourier Transform (FFT) to obtain the power spectrum.
• Mapping powers to the Mel scale using triangular filter banks.
• Taking the logarithm of Mel spectrum energies.
• Applying the Discrete Cosine Transform (DCT) to decorrelate features.
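In practice these steps are typically bundled into a single library call. The sketch below,
assuming librosa, an explicit pre-emphasis step, 40 coefficients, and time-averaging into a
fixed-length vector (all illustrative choices), shows one way to obtain per-clip MFCC features.

# Minimal MFCC extraction sketch: librosa performs the framing, windowing,
# FFT, Mel filter-bank mapping, log, and DCT steps internally.
# The coefficient count (40) and the time-averaging are illustrative choices.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40, sr=22050):
    """Load a clip and return a fixed-length MFCC feature vector."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y)                       # amplify high frequencies
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, num_frames)
    return np.mean(mfcc, axis=1)                             # average over time

features = extract_mfcc("example.wav")   # hypothetical path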