Form No: Prj_S 02
Date: 20.01.2025
QIS College of Engineering and Technology
(Autonomous)
Project Summary Report
Department: CSE - DS Section: 2
Project Domain: Machine Learning & Deep Learning
Functional Domain:
Mentor Name: Dr. Y. Sowjanya Kumari Batch Number: 12
Name: N. Satya Sai Umesh Chandra Roll No: 21491A4490
Finalized Title: Speech Emotion Recognition System
Abstract/Summary: In the realm of human-machine interface applications, emotion recognition
from speech signals has been a research focus for several years. Emotions play an essential role in
human communication and expression, making their recognition crucial for applications involving
human-computer interaction. This project explores Speech Emotion Recognition (SER), aiming to
classify emotional states from speech signals. The proposed system uses Mel-frequency cepstral
coefficients (MFCC), Chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid
features to analyze audio inputs. A Deep Neural Network (DNN) is employed for classification,
trained using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). This
approach aims to address the limitations of traditional methods, improving accuracy while reducing
computational complexity.
Existing Method: Traditional models for emotion recognition primarily relied on machine learning
algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN). These models
demonstrated limited accuracy and were computationally intensive. Although deep learning models
have been explored in the past, they typically required extensive datasets and high-performance
hardware, which posed challenges in terms of scalability and practical implementation.
Proposed Method: The proposed approach introduces a deep neural network-based framework for
emotion recognition, beginning with audio pre-processing to remove noise and enhance quality. Key
features such as MFCC, Chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid
are extracted to capture essential timbral, tonal, and harmonic information from the audio. A custom-
designed deep neural network with multiple hidden layers and ReLU activation functions is trained
using the Adam optimizer and Categorical Cross-Entropy loss function for multi-class classification.
The RAVDESS dataset ensures balanced representation of emotional classes, and regularization
techniques like dropout prevent overfitting. Model performance is evaluated using metrics such as
accuracy, precision, recall, and F1-score, making the system effective and efficient for real-world
applications.
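As an illustration of the evaluation step described above, the sketch below computes accuracy, precision, recall, and F1-score with scikit-learn; the names model, X_test, and y_test are placeholders for the trained network and a held-out test split, not the project's actual code.

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # model, X_test, y_test are placeholders: a trained classifier and a
    # held-out test split with one-hot encoded emotion labels.
    y_prob = model.predict(X_test)            # class probabilities per sample
    y_pred = np.argmax(y_prob, axis=1)        # predicted emotion indices
    y_true = np.argmax(y_test, axis=1)        # ground-truth emotion indices

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='macro'))
    print("Recall   :", recall_score(y_true, y_pred, average='macro'))
    print("F1-score :", f1_score(y_true, y_pred, average='macro'))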
Technique(s) Used:
Feature extraction:
Mel-frequency cepstral coefficients (MFCC): Captures the timbral aspects of speech.
Chromagram: Identifies tonal content and harmonic structure.
Mel-scaled spectrogram: Provides a visual representation of frequencies over time.
Spectral contrast: Differentiates between peaks and valleys in the spectrum.
Tonal centroid: Represents tonal information in a compact form.
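A minimal sketch of how these five features could be extracted with Librosa (the audio library listed under Technology Used); the function name and the 40-coefficient MFCC setting are illustrative assumptions, not the project's exact configuration.

    import numpy as np
    import librosa

    def extract_features(path):
        y, sr = librosa.load(path, sr=None)        # keep the native sample rate
        stft = np.abs(librosa.stft(y))
        mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
        mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr).T, axis=0)
        tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).T, axis=0)
        # Time-averaged features concatenated into one vector (40 + 12 + 128 + 7 + 6 = 193 values)
        return np.hstack([mfcc, chroma, mel, contrast, tonnetz])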
Classification:
Architecture: Multi-layer perceptrons with activation functions such as ReLU.
Optimization: Stochastic Gradient Descent (SGD) or Adam optimizer for efficient
training.
Loss Function: Categorical Cross-Entropy for multi-class classification.
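The sketch below shows one way the described network could be defined in Keras with ReLU hidden layers, dropout, the Adam optimizer, and Categorical Cross-Entropy; the layer sizes, dropout rate, and 193-dimensional input are assumptions for illustration, not the project's exact architecture.

    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout

    def build_model(input_dim=193, num_classes=8):
        # num_classes=8 assumes the eight RAVDESS emotion categories (including neutral)
        model = Sequential([
            Input(shape=(input_dim,)),
            Dense(256, activation='relu'),
            Dropout(0.3),                     # regularization against overfitting
            Dense(128, activation='relu'),
            Dropout(0.3),
            Dense(num_classes, activation='softmax'),
        ])
        model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])
        return model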
Technology Used:
Programming language: Python (v3.6+): For scripting and implementing algorithms.
Deep learning libraries: TensorFlow and Keras: For designing and training the neural network.
Audio Processing Libraries:
Librosa: For extracting audio features and preprocessing.
NumPy and Pandas: For data manipulation and analysis.
Visualization Tools: Matplotlib and Seaborn: For plotting audio features and model
performance.
Development Environment: Jupyter Notebook or PyCharm for code development and testing.
Data Sets used:
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS):
The RAVDESS dataset is a validated multimodal dataset consisting of 24 professional
actors (12 male, 12 female) vocalizing two lexically-matched statements in a neutral
North American accent.
The dataset includes emotional expressions such as calm, happy, sad, angry, fearful,
surprise, and disgust in both speech and song modalities.
Each expression is available at two intensity levels and is balanced in terms of gender
distribution.
It is widely used in emotion recognition research for its clarity, diversity, and balanced
representation.
Format: WAV files with a 48 kHz sample rate.
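For reference, RAVDESS encodes the emotion as the third hyphen-separated field of each file name (per the dataset's documentation); the helper below is an illustrative sketch of turning a file name into an emotion label, not code from the project itself.

    # Emotion codes as documented for RAVDESS file names, e.g.
    # "03-01-06-01-02-01-12.wav" -> third field "06" -> "fearful".
    EMOTIONS = {
        '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
        '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised',
    }

    def emotion_from_filename(filename):
        code = filename.split('-')[2]
        return EMOTIONS[code]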
Mentor HoD CSCD Examiner