Emotion Recognition Using Speech Processing
Abstract— Emotion identification from voice processing has long been a significant factor in applications involving human-machine interfaces. Emotions are pertinent to the psychological well-being of humans and serve as an agent for communicating one's viewpoint or mental state to others. Speech Emotion Recognition (SER) is a means of detecting the speaker's emotional state from the speech signal. Any computer system with limited processing resources may be programmed to sense or generate a few universal feelings, such as Neutral, Anger, Happiness, and Sadness, as needed. The following characteristics are extracted in this work: Mel-frequency cepstral coefficients (MFCC), Chromagram, Mel-scaled spectrogram, Spectral contrast, and Tonal Centroid. The emotion in this study is recognised using a deep neural network, and the categorization of the speech in the output layer is done using Softmax. We used the 1440 audio recordings from 24 individuals in the RAVDESS speech dataset for training. The speech DNN has a 96% accuracy rate for emotion identification, which is greater than that of other algorithms such as KNN, LDA, and SMO. Automatic emotion detection from human speech is becoming increasingly prevalent these days as it boosts interaction between computers and humans.

Keywords— Emotion recognition, Deep neural network, Audio files, Chromagram, Spectrogram.

I. INTRODUCTION
Nowadays, many programmes make use of human-machine interaction, and speech is one of the interaction media. The primary challenge in human-machine interaction is the ability to understand feelings in speech. Furthermore, a variety of natural cues can be used to assess emotion [1][2]. The main objective of this work is to identify emotions in speech. When two people engage with one another, they can quickly identify the underlying emotion in the other person's words. The intent of an emotion recognition system is to resemble how people perceive things [3].

Speech emotion recognition has several applications. Decision making can be significantly influenced by emotions, and a system can respond appropriately if emotion can be accurately detected in speech. Medical science, robotics engineering, contact-centre applications, and other fields may all gain from a robust emotion recognition system [4][5]. Humans possess the ability to rapidly recognise speaker emotion, but a great deal of practice and observation is required to accomplish this. Humans examine several aspects of an expression first and then, via observation or prior experience, identify the speaker's emotion.

Hence, it is necessary to develop an algorithm that is human-like and is capable of accurately and quickly detecting emotions. Multiple approaches have been put forth in this discipline to identify a speaker's emotional state from the speech or voice signal. Anger, joy, sorrow, surprise, indifference, disgust, fear, tension, etc. are a few universal emotions. Numerous automated systems have been recommended by academics during the past 20 years. The kinds of characteristics that these various systems utilise to categorise speech signals vary; Mel-frequency cepstral coefficients (MFCC) are among the most frequently utilised spectral characteristics. Our research revealed that the artificial neural network used to interpret speech emotion has a significant advantage throughout the recognition phase. Built-in emotion identification from human speech is becoming more prevalent nowadays as it improves links between humans and machinery.

II. LITERATURE SURVEY
The authors [6] discuss using deep stride convolutional neural networks (DSCNN) for the segmentation phase and the initial processing of speech signals to remove unhelpful noise and obtain a clean voice signal. The goal is to create a voice emotion identification system employing an unaltered spectrogram and an updated DSCNN structure.

Numerous analyses of various CNN and LSTM-RNN models were done in [15]. When contrasted with alternative LSTM and RNN designs, the CNN architectures performed better. Multiple approaches are offered in [3] for the spectrogram- and phoneme-based categorization of emotions.

[13] demonstrated an implementation of an end-to-end deep neural network for emotion identification. The paralinguistic information found in the words was used by the researchers to create convolutional and recurrent networks. A vector of attributes was created from the activations of the last fully-connected hidden layer in [14], where the authors studied the extensive spectral properties created by passing the spectrum images through AlexNet for linguistic emotion identification.

A novel approach for categorising utterances according to mood was proposed in [12]. The method used a categorical HMM as the classifier and short-time log frequency power coefficients (LFPC) to characterise the speech signals. This technique divided the emotions into six groups, then trained and tested the system on its own dataset. LFPC is contrasted with the mel-frequency cepstral coefficients (MFCC) and the linear prediction cepstral coefficients (LPCC) in order to assess the effectiveness of the suggested strategy. The results show that an average and highest identification rate of 78% was attained. Additionally, the data show that LFPC is a superior feature to conventional characteristics for emotion categorization [12].
III. METHODOLOGY
Deep neural networks are used in the proposed technique for speech emotion detection. To extract information from the recorded audio files, this technique combines Mel-frequency cepstral coefficient (MFCC), Chromagram, Mel-scaled spectrogram, Spectral contrast, and Tonal Centroid characteristics. The design divides the speech utterances into eight categories: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. To derive the voice intensity characteristics, a five-layer DNN is trained.

Fig. 1. Steps involved in speech emotion recognition

1) MFCC

Fig. 2. Block diagram of MFCC

The MFCC feature extraction process consists of a few steps, as discussed below.
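As a rough sketch of this feature set, the snippet below extracts the five feature groups with the librosa library and averages each over time to obtain one fixed-length vector per recording; the exact dimensions and the time-averaging shown here (a common 193-value layout) are assumptions, since the paper does not state them.

    import numpy as np
    import librosa

    def extract_features(path, n_mfcc=40):
        """Return one feature vector (MFCC, chroma, mel, contrast, tonnetz) for an audio file."""
        y, sr = librosa.load(path, sr=None)          # keep the file's native sampling rate
        stft = np.abs(librosa.stft(y))               # magnitude spectrogram for chroma/contrast

        mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
        chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
        mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
        tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)

        # Concatenate the five groups into a single vector (40 + 12 + 128 + 7 + 6 = 193 values here).
        return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])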
Pre-emphasis: Pre-emphasis is applied so that the signal carries more energy at higher frequencies. This procedure involves passing the speech sample through a first-order filter that boosts its high-frequency content, and this increase in energy provides additional detail. The filter is

K(z) = 1 − pz⁻¹

where p is the pre-emphasis coefficient.
Framing: This step segments the speech sample into 20–40 ms frames. This procedure is required to fix the size of the speech segments, because the duration of a person's utterance can vary. Although the speech signal is inherently non-stationary (its characteristics fluctuate over time), over a sufficiently short duration it behaves as a stationary signal.
Windowing: This step is carried out after framing. Windowing lessens the signal discontinuities at the beginning and end of every frame. The method uses a 10 ms frame shift, which means that a portion of each preceding frame's content is repeated in the subsequent one.
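A minimal NumPy sketch of these three pre-processing steps is given below. The 25 ms frame length and 10 ms shift follow the values quoted in this section and in the DCT step; the 0.97 pre-emphasis coefficient and the Hamming window are assumptions, since the paper does not specify them.

    import numpy as np

    def preemphasis_frame_window(signal, sr, p=0.97, frame_ms=25, shift_ms=10):
        """Apply pre-emphasis, split into overlapping frames, and window each frame."""
        # Pre-emphasis: y[n] = x[n] - p * x[n-1]  (transfer function K(z) = 1 - p z^-1)
        emphasized = np.append(signal[0], signal[1:] - p * signal[:-1])

        frame_len = int(sr * frame_ms / 1000)   # e.g. 25 ms frames
        shift     = int(sr * shift_ms / 1000)   # e.g. 10 ms frame shift, so frames overlap
        n_frames  = 1 + max(0, (len(emphasized) - frame_len) // shift)

        window = np.hamming(frame_len)          # assumed window; tapers the frame edges
        frames = np.stack([
            emphasized[i * shift : i * shift + frame_len] * window
            for i in range(n_frames)
        ])
        return frames                           # shape: (n_frames, frame_len)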
Fast Fourier Transform (FFT): The frequency spectrum of each frame is produced using the fast Fourier transform (FFT). The FFT converts the samples of each frame from the time domain to the frequency domain.
The FFT is used to identify all the frequencies present in a frame. A Mel-scale filter bank of 20–30 triangular filters is then applied to each frame; the Mel-scale filters determine how much energy is present in each band of the frame.

The following formula can be used to convert a normal frequency f (in Hz) to the Mel scale m:

Mel(f) = 1125 * ln(1 + f/700)
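As a quick check of this mapping, the snippet below converts a frequency to the Mel scale and back; the inverse formula is simply the algebraic inversion of the expression above.

    import numpy as np

    def hz_to_mel(f):
        """Convert frequency in Hz to the Mel scale: Mel(f) = 1125 * ln(1 + f / 700)."""
        return 1125.0 * np.log(1.0 + f / 700.0)

    def mel_to_hz(m):
        """Inverse mapping, obtained by solving the equation above for f."""
        return 700.0 * (np.exp(m / 1125.0) - 1.0)

    print(hz_to_mel(1000.0))              # ~998.2 mel
    print(mel_to_hz(hz_to_mel(4000.0)))   # round-trips back to 4000.0 Hz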
Calculation of log energy: Once the filter bank energies of every frame have been obtained, the log function is applied to them. This is again inspired by the mammalian sense of hearing: human hearing does not operate on a linear scale, and the human ear cannot detect large energy fluctuations when the audio level is already high. The logarithmic calculation of the energy therefore yields the qualities that a person can plainly hear.

Discrete Cosine Transform (DCT): In the final stage, the DCT is computed from the logarithmic filter bank values. We used a 10 ms frame shift and 25 ms frames, and 26 band-pass filters were utilised. 13 MFCCs were calculated for every frame, and we additionally estimated the energy contained in the frame itself. Along with the 13 MFCC characteristics, we estimated 13 velocity components and 13 acceleration components by calculating the time variations of the energy and the MFCCs.

The delta characteristics are generated from the preceding and following M frames, where Cm(t) denotes the static coefficient of frame t:

dm(t) = Σ_{n=1..M} n · (Cm(t+n) − Cm(t−n)) / (2 · Σ_{n=1..M} n²)
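A small sketch of this delta computation over an MFCC matrix is shown below, using the standard weighted-difference form with M preceding and M following frames; applying it twice yields the acceleration (delta-delta) coefficients.

    import numpy as np

    def delta(coeffs, M=2):
        """Delta features from a (n_frames, n_coeffs) matrix of static coefficients C_m(t)."""
        padded = np.pad(coeffs, ((M, M), (0, 0)), mode="edge")   # repeat edge frames
        denom = 2 * sum(n * n for n in range(1, M + 1))
        out = np.zeros_like(coeffs, dtype=float)
        for t in range(coeffs.shape[0]):
            # weighted difference of the M following and M preceding frames
            out[t] = sum(
                n * (padded[t + M + n] - padded[t + M - n]) for n in range(1, M + 1)
            ) / denom
        return out

    # velocity = delta(mfcc); acceleration = delta(velocity)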
2) Spectral Contrast
The spectral contrast of a sound measures its harmonic power at each instant. Because most sound recordings contain tones whose energy varies as time passes, it is difficult to characterise this variation with a single energy value; the fluctuation can instead be measured using spectral contrast. Clear, narrow-band signals usually have high contrast values, whereas broad-band (noise-like) signals typically have low contrast values. Here, the difference between the mean energy in the spectral peaks and the mean energy in the valleys, or flat regions, of each band is used to determine the contrast.
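The sketch below computes this peak-to-valley measure with librosa's spectral_contrast routine; the synthetic test tone and the default band count are placeholders, not values from the paper.

    import numpy as np
    import librosa

    y = librosa.tone(440, sr=22050, duration=2.0)   # simple synthetic tone as a stand-in signal
    S = np.abs(librosa.stft(y))                     # magnitude spectrogram

    # One contrast value per band per frame: peak energy minus valley energy within the band.
    contrast = librosa.feature.spectral_contrast(S=S, sr=22050)   # default: 6 bands + 1 = 7 rows
    print(contrast.shape)           # (7, n_frames)
    print(contrast.mean(axis=1))    # per-band average, as used in the feature vector above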
3) Tonal Centroid
The tonal centroid helps in detecting harmonic change in musical audio, i.e., variations due to tones in the audio. The tonal centroid is related to the following characteristics:

4) Frequency and Pitch
Frequency, and its equivalent subjective property pitch, make up the most fundamental tonality-related component of a signal. Pitch is one of the primary perceptual qualities of a sound, in addition to loudness, duration, timbre, and spatial position. A basic waveform is represented by a sinusoidal function x(t) = a·sin(2πft + φ), where a is the peak amplitude, t is the time (in seconds), f is the frequency, and φ is the initial phase. The number of cycles repeated each second is the frequency f of this simple waveform.

5) Melody
A melody is typically thought of as a series of pitched sounds. According to this explanation, a melody is "a collection of pitched sounds arranged in musical time in accordance with particular cultural norms and constraints."

6) Harmony
The term "harmony" refers to the simultaneous combination of sounds, known as chords, and, over time, to chord progressions. In addition to describing pitches and keys, the word primarily refers to the system of rules governing how they are combined. Harmony, in this second meaning, has a wide body of theory behind it. We consider only the components of the harmonic content that concern how pitches are combined into harmonies and how this relates to the tonality of the piece. As we have observed, harmony and melody both generally relate to groups of notes that sound either simultaneously or sequentially; because they interact, it can be challenging to distinguish between harmony and melody.

7) Tonality
The ambiguity of the concept of tonality is one of the causes of the absence of consensus across the various methodologies for tonality induction. Castil-Blaze coined the term "tonality" for the first time in 1821. Today, it frequently refers to a set of relationships among a number of pitches that produce harmonies and melodies and that include a tonic, or central pitch class, as their most significant (or most stable) element. In the broadest sense, it describes the organisation of pitch phenomena. The tonal centroids, in turn, aid in harmonic change detection.

D. System Design:
1) Application for Emotion Recognition Model:
The user uploads an audio file that contains speech from a person, and the model is then utilised to forecast the speaker's emotional state.

The model is served in a web application built using the Flask framework. The system and the user are its two components. The UI has a mechanism for managing user registration and authentication. Initially, an unfamiliar user needs to enter their information, which usually consists of a name, email, and password. A MySQL database stores this individual's information. Someone who has already registered, and whose information is already in the system's MySQL database, may access the web-based application using authorised login details. They are only given permission to use the programme to forecast feelings when they have properly logged in.
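A minimal sketch of the prediction endpoint of such a Flask application is given below, assuming that login handling is already in place and that a trained Keras model and the feature-extraction helper from the earlier sketch are available; the route name, model file, module name, and label ordering are illustrative assumptions, not the paper's actual code.

    import numpy as np
    from flask import Flask, request, jsonify
    from tensorflow.keras.models import load_model
    from features import extract_features   # assumed module holding the librosa helper sketched above

    app = Flask(__name__)
    model = load_model("emotion_dnn.h5")     # assumed filename for the trained 5-layer DNN
    EMOTIONS = ["neutral", "calm", "happy", "sad",
                "angry", "fearful", "disgust", "surprised"]

    @app.route("/predict", methods=["POST"])
    def predict():
        # Save the uploaded recording, extract its feature vector, and run the DNN.
        audio = request.files["audio"]
        audio.save("upload.wav")
        features = extract_features("upload.wav")
        probs = model.predict(features[np.newaxis, :])[0]    # softmax probabilities
        return jsonify(emotion=EMOTIONS[int(np.argmax(probs))])

    if __name__ == "__main__":
        app.run()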
Fig. 3. Overall workflow of the emotion recognition application

E. Model for Deep Neural Network to Recognise Emotion:
Deep learning can boost efficiency while using fewer processing resources. Deep neural networks (DNNs) are frequently utilised in the field of deep learning to develop models for tasks that typical machine learning algorithms cannot perform, or find difficult.

1) Architecture Diagram of DNN
Five layers of a Sequential() model are produced. The model is trained for 700 epochs on the training dataset, and the model that performs best in terms of test accuracy is used for prediction. We utilise softmax to classify the feelings in the output layer. Softmax accepts a vector of numbers and transforms them into probabilities, which then form the output of the classification.
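A sketch of such a five-layer Sequential model is shown below. The five layers, the 700 training epochs, the softmax output over eight emotion classes, and a categorical classification loss come from the text; the 193-dimensional input, hidden-layer widths, ReLU activations, and Adam optimiser are assumptions filled in for illustration.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    def build_model(n_features=193, n_classes=8):
        """Five-layer DNN ending in a softmax over the eight emotion categories."""
        model = Sequential([
            Dense(256, activation="relu", input_shape=(n_features,)),  # assumed widths/activations
            Dense(256, activation="relu"),
            Dense(128, activation="relu"),
            Dense(64, activation="relu"),
            Dense(n_classes, activation="softmax"),                    # probability per emotion
        ])
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",   # assumed categorical classification loss
                      metrics=["accuracy"])
        return model

    # model = build_model()
    # model.fit(X_train, y_train, epochs=700, validation_data=(X_val, y_val))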
F. Performance Metrics:
1) Hyper Parameters
Classification rate: On the test data, the classification rate is calculated by summing the entries on the diagonal of the confusion matrix and dividing the result by the total number of samples.

2) Recall
Recall is determined as the number of true positives divided by the total number of actual positives, i.e., Recall = True Positives / (True Positives + False Negatives). The result is a value ranging from 0.0, for no recall, to 1.0 for complete or perfect recall.

3) Precision Value
Improving precision reduces the number of false positives, while improving recall reduces the number of false negatives.

4) F-Measures
Recall and precision may be combined into one metric using the F-measure, which incorporates both properties. The F-measure is the harmonic mean of precision and recall: F = 2 · (Precision · Recall) / (Precision + Recall).

5) ROC
The Receiver Operating Characteristic (ROC) curve estimates classifier quality by plotting the true-positive rate against the false-positive rate as the decision threshold is varied.
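The sketch below computes these quantities from predicted and true labels with scikit-learn; the tiny label vectors are placeholder data, not results from the paper.

    import numpy as np
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # Placeholder predictions over 8 emotion classes (0..7); real labels come from the test split.
    y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 0, 1])
    y_pred = np.array([0, 1, 2, 3, 4, 5, 6, 6, 0, 2])

    cm = confusion_matrix(y_true, y_pred)
    classification_rate = np.trace(cm) / cm.sum()     # sum of diagonal entries / total samples
    recall    = recall_score(y_true, y_pred, average="macro")
    precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
    f_measure = f1_score(y_true, y_pred, average="macro")   # harmonic mean of precision and recall

    print(classification_rate, recall, precision, f_measure)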
For the network we operate, 700 epochs were set. The final output nodes use the "softmax" activation function, and we computed the loss with a categorical classification loss. Training accuracy peaked at 96% after 700 epochs, whereas test accuracy peaked at 80%. The confusion matrix for the training data is represented graphically in the figures.

Fig. 6. Accuracy graph and table

TABLE I. CONFUSION MATRIX FOR VALIDATION DATA

The true recognition of each emotion is represented by the confusion matrix's diagonal components. Our network can identify the correct emotion with a high degree of accuracy for the majority of emotions.

IV. CONCLUSIONS
Deep learning algorithms can produce effective outcomes. We have successfully created a deep learning model for emotion identification that scored 96% on tests. Please be aware that emotion prediction is subjective and that different people may grade the same recording with different feelings. This is also the cause of the algorithm's occasionally inconsistent output when trained on human-rated emotions. As the model was trained using the RAVDESS dataset, the speaker's accent might also produce unexpected results, since the model is trained on a North American accent database.

REFERENCES
[1] J. Rong, G. Li, and Y.-P. P. Chen, "Acoustic feature selection for automatic emotion recognition from speech," Inf. Process. Manag., vol. 45, no. 3, pp. 315–328, May 2009.
[2] S. Benk, Y. Elmir, and A. Dennai, "A study on automatic speech recognition," vol. 10, no. 3, pp. 77–85, 2019, doi: 10.6025/jitr/2019/10/3/77-85.
[3] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," in Proc. Interspeech, 2018, doi: 10.21437/Interspeech.2018-1811.
[4] M. M. H. El Ayadi, M. S. Kamel, and F. Karray, "Speech emotion recognition using Gaussian mixture vector autoregressive models," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2007, vol. 4, pp. IV-957–IV-960.
[5] Mustaqeem and S. Kwon, "A CNN-assisted enhanced audio signal processing for speech emotion recognition," Sensors, vol. 20, no. 1, 2020, doi: 10.3390/s20010183.
[6] S. Emerich, E. Lupu, and A. Apatean, "Emotions recognition by speech and facial expressions analysis," in Proc. 17th European Signal Processing Conference, 2009.
[7] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, 2011.
[8] A. B. Ingale and D. S. Chaudhari, "Speech emotion recognition," International Journal of Soft Computing and Engineering (IJSCE), vol. 2, no. 1, Mar. 2012.
[9] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," 2013, pp. 1–9.
[10] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," 2013, pp. 1–9.
[11] Chiriacescu, "Automatic emotion analysis based on speech," M.Sc. thesis, Delft University of Technology, 2009.
[12] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, no. 4, pp. 603–623, Nov. 2003.
[13] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Proc. Interspeech, 2017, doi: 10.21437/Interspeech.2017-200.
[14] J.-H. Yeh, T.-L. Pao, C.-Y. Lin, Y.-W. Tsai, and Y.-T. Chen, "Segment-based emotion recognition from continuous Mandarin Chinese speech," Comput. Human Behav., vol. 27, no. 5, pp. 1545–1552, Sep. 2011.
[15] P. Shen, Z. Changjun, and X. Chen, "Automatic speech emotion recognition using support vector machine," in Proc. Int. Conf. Electronic and Mechanical Engineering and Information Technology, 2011.