
2023 3rd International Conference on Intelligent Technologies (CONIT)

Karnataka, India. June 23-25, 2023

Emotion Recognition Using Speech Processing


S. Sharanyaa, IT Department, Panimalar Engineering College, Poonamalle, Chennai, India. [email protected]
Tini J Mercy, IT Department, Panimalar Engineering College, Poonamalle, Chennai, India. [email protected]
Samyukthaa V. G, IT Department, Panimalar Engineering College, Poonamalle, Chennai, India. [email protected]
DOI: 10.1109/CONIT59222.2023.10205935

Abstract— For many years, emotion identification from voice processing has been a significant factor in applications involving human-machine interfaces. Emotions are pertinent to the psychological well-being of humans and serve as an agent for communicating one's viewpoint or mental condition to others. Speech Emotion Recognition (SER) is a manner of detecting the speaker's emotional state from the speech signal. Any computer system with limited processing resources may be programmed to sense or generate a few universal feelings, like Neutral, Anger, Happiness, and Sadness, as needed. The following characteristics are extracted in this work: Mel-frequency cepstral coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid. The emotion in this study is recognised using a deep neural network, and the categorisation of the speech in the output layer is done using Softmax. We used the combined 1440 audio recordings from 24 individuals in the RAVDESS speech dataset for training. The DNN has a 96% accuracy rate for emotion identification, which is greater than other algorithms like KNN, LDA, and SMO. Automatic emotion detection from human speech is becoming increasingly prevalent these days as it boosts interactions between computers and humans.

Keywords— Emotion recognition, deep neural network, audio files, chromagram, spectrogram.

I. INTRODUCTION

Nowadays, many programmes make use of human-machine interaction, and speech is one of the interaction media. The primary challenge in human-machine interaction is the ability to understand feelings in speech; furthermore, a variety of natural cues can be used to assess emotion [1][2]. The main objective of this work is to identify emotions in speech. When two people engage with one another, they can quickly identify the underlying emotion in the other person's words; the intent of an emotion recognition system is to resemble this human perception [3]. Speech emotion recognition has several applications. Decision making can be significantly influenced by emotions, and a system can respond appropriately if emotion can be accurately detected in speech. Medical science, robotics engineering, contact-centre applications, and other fields may all gain from a robust emotion recognition system [4][5]. Humans possess the ability to rapidly recognise speaker emotion, although a great deal of practice and observation is required to accomplish this: humans examine several aspects of an expression first and then, via observation or prior experience, identify the speaker's emotion.

Hence, it is necessary to develop an algorithm that is human-like and is capable of accurately and quickly detecting emotions. Multiple approaches have been put forth in this discipline to identify a speaker's emotional state from the speech or voice signal. Anger, joy, sorrow, surprise, indifference, disgust, fear, tension, etc. are a few universal emotions. Numerous automated systems have been recommended by academicians during the past 20 years, and the kinds of characteristics that these systems utilise to categorise speech signals vary. Mel-frequency cepstral coefficients (MFCC) are among the most frequently utilised spectral characteristics. Our research revealed that the artificial neural network used to interpret speech emotion has a significant advantage throughout the recognition phase. Built-in emotion identification from human speech is becoming more prevalent nowadays as it improves links between humans and machinery.

II. LITERATURE SURVEY

The authors of [6] discuss using deep stride convolutional neural networks (DSCNN) for the segmentation phase and the initial processing of speech signals to remove unhelpful noise and give a clean voice. The goal is to create a voice emotion identification system employing an unaltered spectrogram and an updated DSCNN structure.

Numerous analyses of various CNN and LSTM-RNN models were carried out in [15]; when contrasted with the alternative LSTM and RNN designs, the CNN architectures performed better. Multiple hypotheses are offered in [3] for the spectrum- and phoneme-based categorisation of emotions.

An implementation of an end-to-end deep neural network for emotion identification was demonstrated in [13]; the paralinguistic information found in the words was used by the researchers to create convolutional and recurrent networks. A vector of attributes was created from the activations of the last fully-connected hidden layer in [14], where the authors studied the rich spectral properties obtained by passing the spectrum images through AlexNet for linguistic emotion identification.

A novel approach for categorising utterances according to mood was also proposed. The algorithm used a categorical HMM and short-time log frequency power coefficients (LFPC) to characterise the classifiers and the speech signals, respectively. This technique divided the emotions into six groups, then trained and tested the system on a dedicated dataset. LFPC is contrasted with the mel-frequency cepstral coefficients (MFCC) and the linear prediction cepstral coefficients (LPCC) in order to assess the effectiveness of the suggested strategy. The results show that an average and highest identification rate of 78% was attained. Additionally, the data show that LFPC is a superior feature to conventional characteristics for emotion categorisation [12].

Without relying on any cultural or linguistic information, Rong et al. [1] proposed an ensemble random forest to trees (ERFTrees) technique featuring a very large set of characteristics for emotion identification. The technique is aimed at small datasets that have many features. An investigation analysing Chinese speakers' emotional utterances was conducted to assess the suggested technique, and the findings show that it improved the rate of emotion identification. Additionally, ERFTrees outperforms the emerging ISOMap and well-known dimensionality-reduction techniques like PCA and multidimensional scaling (MDS). Whereas the weakest configuration only achieves 16% with 84 parameters on the natural dataset, the best accuracy, with 16 features on the female dataset, reached an optimal score of 82.54%.

For the classification problem of speech emotion recognition, Ayadi et al. [4] suggested a Gaussian mixture vector autoregressive (GMVAR) technique that blends GMM and linear autoregressive models. The main concept behind GMVAR is its capacity to model data distributions with several modes and to construct a system that depends on the voice-set attributes. The Berlin emotional dataset was utilised to assess GMVAR. The results indicate that the classification rate may approach 76%, compared to 67% for feedback neural networks, 67% for k-NN, and 71% for HMM. In comparison to HMM, this technique has the benefit of adequately separating high-arousal from low-arousal states and neutral feelings [4].
III. METHODOLOGY

Deep neural networks are used in the suggested technique for voice emotion detection. To build the feature set from the recording files, the technique combines Mel-frequency cepstral coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid characteristics. The design divides the speech recordings into 8 categories: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. To classify the derived voice characteristics, a 5-layer deep neural network (DNN) is trained.

Fig. 1. Steps involved in speech emotion recognition

A. Data Collection:

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, distributed as .wav files, was used as the source data in this study. The librosa package in Python is used for accessing the audio files. We picked the speech section, which has 24 performers (balanced by gender) and 1440 audio recordings.

B. Data Preprocessing:

Data preprocessing is a method for transforming unclean information into accurate data sets. "Purifying the information" refers to the removal of voids, the replacement of empty strings with useful values, the removal of identical entries, the removal of outliers, and the removal of unneeded attributes. Categorical variables are translated to numerical values if the dataset contains any such information. In sum, our model is 5 layers deep.

C. Feature Extraction:

Mel-frequency cepstral coefficients (MFCC) and chromagram, together with spectral contrast and tonal centroid characteristics, are the features that were extracted; a sketch of this extraction step follows.
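As a minimal illustration of the preprocessing and feature-extraction steps above, the sketch below loads one RAVDESS .wav file with librosa and stacks the feature sets named in this paper into a single vector. Averaging each feature over time, the placeholder file path, and the label-encoding example are assumptions made here for illustration; the paper does not specify these details.

import numpy as np
import librosa
from sklearn.preprocessing import LabelEncoder

def extract_features(path):
    """Return one feature vector for a single audio file."""
    y, sr = librosa.load(path, sr=None)   # keep the native sample rate
    # The five feature sets used in this work (librosa default parameters).
    mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr).T, axis=0)
    chroma   = np.mean(librosa.feature.chroma_stft(y=y, sr=sr).T, axis=0)
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr).T, axis=0)
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).T, axis=0)
    return np.hstack([mfcc, chroma, mel, contrast, tonnetz])

# Hypothetical usage: one RAVDESS recording and some placeholder labels.
vector = extract_features("path/to/ravdess_recording.wav")
labels = ["happy", "sad", "angry"]                      # placeholder categorical labels
numeric_labels = LabelEncoder().fit_transform(labels)   # categorical -> numerical codes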
1) Mel-frequency cepstral coefficients

Fig. 2. Block diagram of MFCC

The MFCC feature extraction process consists of a few steps, as discussed below.

Pre-emphasis: Pre-emphasis is applied so that the signal retains more energy at high frequencies. The spoken sample is passed through a first-order filter to boost this energy, and the boosted energy provides additional detail. The filter is

K(z) = 1 − p·z⁻¹

Framing: This step involves segmenting the voice sample into 20–40 ms frames. The procedure is required to fix the length of the speech segments, because the duration of a person's utterance can fluctuate. Although the voice signal is essentially non-stationary (its characteristics fluctuate over time), over a short duration it behaves as a stationary signal.

Windowing: This step is carried out after the framing procedure. The signal discontinuities at the beginning and end of every frame are lessened by the windowing procedure. The window is shifted in steps of 10 ms, which means that part of each frame's content is repeated in the subsequent one.

Fast Fourier Transform (FFT): The frequency spectrum of each frame is produced using the fast Fourier transform (FFT). The FFT converts the samples of a frame from the time domain to the frequency domain.

To identify all the frequencies present in a frame, an FFT is utilised. Each frame then has 20–30 triangular filters from the Mel-scale filter bank applied to it; the amount of energy present in a particular band of the frame is determined by the Mel-scale filters.

The following calculation can be used to translate a normal frequency f (in Hz) to the Mel scale m:

Mel(f) = 1125 · ln(1 + f/700)

For example, f = 1000 Hz maps to roughly 1000 mel.

Calculation of log energy: The log function is applied to the filter-bank energies of every frame once they have been obtained. The mammalian sense of hearing is the inspiration here: human hearing does not operate on a linear scale, and the ear cannot resolve large energy fluctuations when the audio level is already high. Taking the logarithm of the energies therefore yields qualities closer to what a person actually hears.

Discrete Cosine Transform (DCT): The logarithmic filter-bank values are used to compute the DCT in the final stage. We utilised a 10 ms hop and 25 ms frames, and 26 band-pass filters were applied; 13 MFCC were calculated for every frame. In addition to the 13 MFCC characteristics of every frame, we estimated the energy contained in the frame itself, and we estimated 13 velocity (delta) and 13 acceleration (delta-delta) components by calculating the time variations of the energy and MFCC values. These stages are illustrated in the sketch below.
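The following NumPy sketch walks through the stages just listed (pre-emphasis, framing, windowing, FFT, Mel filter bank, log, DCT) using the frame sizes and filter counts quoted above. The pre-emphasis factor of 0.97 and the 512-point FFT are common defaults assumed here, not values taken from the paper; librosa.feature.mfcc performs an equivalent computation in a single call.

import numpy as np
from scipy.fftpack import dct

def mfcc_pipeline(signal, sr, frame_ms=25, hop_ms=10, n_filters=26, n_mfcc=13, p=0.97):
    # 1. Pre-emphasis: apply K(z) = 1 - p*z^-1 in the time domain.
    emphasized = np.append(signal[0], signal[1:] - p * signal[:-1])

    # 2. Framing: split the signal into overlapping fixed-length frames.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # 3. Windowing: taper each frame to soften edge discontinuities.
    frames = frames * np.hamming(frame_len)

    # 4. FFT: power spectrum of each frame.
    n_fft = 512
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # 5. Mel filter bank: triangular filters spaced using Mel(f) = 1125*ln(1 + f/700).
    mel_pts = np.linspace(0, 1125 * np.log(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (np.exp(mel_pts / 1125) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    energies = power @ fbank.T

    # 6. Log of the filter-bank energies, then 7. DCT; keep the first n_mfcc values.
    return dct(np.log(energies + 1e-10), type=2, axis=1, norm='ortho')[:, :n_mfcc]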
The delta characteristics are generated from the preceding and following M frames, where Cm(t) denotes the static coefficient of frame t.
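The delta equation itself did not survive extraction. Assuming the usual regression definition that this passage appears to describe, the delta of coefficient Cm(t) over M preceding and following frames would be

ΔCm(t) = [ Σ k·(Cm(t+k) − Cm(t−k)) for k = 1..M ] / [ 2·Σ k² for k = 1..M ]

with the acceleration (delta-delta) coefficients obtained by applying the same operation to the deltas.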
2) Spectral Contrast

The spectral contrast of a sound measures its harmonic power at each epoch. Because most sound recordings contain tones whose energy varies as time passes, it is challenging to gauge this variation directly; the energy fluctuation may instead be measured using spectral contrast. Clear, narrow-band signals often have high contrast values, while broad-band sound is typically associated with low contrast values. Here, the difference between the mean energy in the peak-energy (top) portion and the valley or flat portion of the spectrum is used to determine the contrast.

3) Tonal Centroid

The tonal centroid features help in detecting harmonic change in musical audio, i.e. variations due to tones in the audio. The tonal centroid characteristics are included among the extracted features.

4) Frequency and Pitch

Frequency, and its equivalent subjective property pitch, make up the most fundamental tonality-related signal attributes. Along with loudness, duration, timbre, and spatial position, pitch is one of the primary perceptual qualities of sound. A basic waveform is represented by a sinusoidal function x(t) = a·sin(2πft + φ), whereby a is the maximum amplitude, t is the time (in seconds), f is the frequency, and φ is the initial phase. The number of cycles repeated each second is the frequency f of this simple pattern.

5) Melody

A melody is typically thought of as a series of pitched sounds. According to this explanation, a melody is "a collection of pitched sounds arranged in musical time in accordance with particular cultural norms and constraints."

6) Harmony

The term "harmony" refers to the simultaneous combination of sounds, known as chords, and, over time, to chord progressions. In addition to describing pitches and keys, the word primarily refers to the system of rules governing how they are combined; harmony in this second sense has its own extensive body of theory. We only took note of the components of the harmonic content that concern how pitches are put together into harmonies and how this relates to the item's tonality. As we have observed, harmony and melody generally relate to groups of notes that are either simultaneous or sequential; because they interact, it can be challenging to distinguish between harmony and melody in this manner.

7) Tonality

The ambiguity of the concept of tonality is one of the causes of the absence of consensus across various methodologies for tonality induction. Castil-Blaze coined the phrase "tonality" for the first time in 1821. Today, it frequently refers to a set of relationships among a number of pitches that produce harmonies and melodies and include a tonic, or core pitch class, as the most significant (or permanent) component. In the broadest sense, it describes the organisation of pitch phenomena. The tonal centroids, taken in sequence, aid in harmonic change detection.

D. System Design:

1) Application for the Emotion Recognition Model:

The user uploads an audio file that contains speech from a person, and the model is then utilised to forecast the speaker's emotional state.

The model is used in a web application built using the Flask framework. The system and the user are its two components. The UI has a method for managing user login and authentication: initially, an unfamiliar user needs to enter their information, usually consisting of a name, email, and password, and a MySQL database stores this individual's information. Someone who has already registered, and whose information is already in the system's MySQL database, may access the web application using their authorised login details. They are only given permission to use the programme to forecast feelings when they properly log in. A sketch of the prediction endpoint follows.
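A minimal sketch of what the prediction endpoint of such a Flask application could look like is shown below. The route name, the saved model file, the emotion list, and the reuse of an extract_features() helper (see the earlier feature-extraction sketch) are placeholders; the registration and MySQL login handling described above are omitted.

import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("emotion_dnn.h5")   # hypothetical trained model file
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgusted", "surprised"]

@app.route("/predict", methods=["POST"])
def predict():
    # The uploaded speech file is saved, converted to features, and classified.
    request.files["audio"].save("upload.wav")
    features = extract_features("upload.wav")          # helper from the earlier sketch
    probabilities = model.predict(features[np.newaxis, :])[0]
    return jsonify({"emotion": EMOTIONS[int(np.argmax(probabilities))]})

if __name__ == "__main__":
    app.run(debug=True)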

Fig. 3. Overall Workflow of the Emotion Recognition Application

E. Model for the Deep Neural Network to Recognise Emotion:

Deep learning can boost efficiency while using fewer processors. In order to build models for tasks that typical machine learning algorithms cannot handle, or handle only with difficulty, deep neural networks (DNN) are frequently utilised in the field of deep learning.

1) Architecture Diagram of DNN

Fig. 4. Layers in the architecture diagram

Our dataset of 1440 audio recordings was divided into training data (1008 audio files) and testing data (432 audio files). In this case, the training dataset contains 80% of the data.

Five layers of a Sequential() model are produced, and the network is trained for 700 epochs on the training dataset. The model that performs best in terms of accuracy when tested is the one we use for prediction. We utilise softmax to classify the feelings in the output layer: softmax accepts a vector of numbers and transforms it into probabilities, which are then used as the output of the classifier. A sketch of one such model follows.
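The paper states only that a five-layer Sequential() model with a softmax output over the 8 emotion classes is trained; the layer widths, activations, optimiser, and loss below are illustrative assumptions rather than values reported by the authors.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_model(n_features, n_classes=8):
    # Five dense layers; the final softmax layer outputs class probabilities.
    model = Sequential([
        Dense(256, activation="relu", input_shape=(n_features,)),
        Dense(128, activation="relu"),
        Dense(64, activation="relu"),
        Dense(32, activation="relu"),
        Dense(n_classes, activation="softmax"),
    ])
    # Categorical cross-entropy is a standard choice for one-hot emotion labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model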
F. Performance Metrics:

1) Classification Rate

On the test data, the classification rate is calculated by summing the entries on the diagonal of the confusion matrix and dividing the result by the number of samples.

2) Recall

Recall is determined as the number of true positives divided by the total number of actual positives: Recall = True Positives / (True Positives + False Negatives). The outcome is a number that ranges from 0.0, for no recall, to 1.0 for complete or perfect recall.

3) Precision

Precision is the number of true positives divided by the number of predicted positives: Precision = True Positives / (True Positives + False Positives). Improving precision reduces the number of false positives, while improving recall reduces the number of false negatives.

4) F-Measure

Recall and precision may be combined into one metric using the F-measure, which incorporates both attributes: the F-measure is the harmonic mean of precision and recall, F = 2 · (Precision · Recall) / (Precision + Recall).

5) ROC

The receiver operating characteristic (ROC) summarises the trade-off between the true-positive rate and the false-positive rate as the decision threshold is varied.
Fig. 5. Performance Metrics
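For reference, the metrics defined above can be computed with scikit-learn; the label arrays below are placeholders, not results from the paper.

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 2, 1, 1, 0, 2]   # placeholder ground-truth emotion codes
y_pred = [0, 2, 1, 0, 0, 2]   # placeholder model predictions

print("classification rate:", accuracy_score(y_true, y_pred))
print("recall             :", recall_score(y_true, y_pred, average="macro"))
print("precision          :", precision_score(y_true, y_pred, average="macro"))
print("f-measure          :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))   # diagonal entries are correct classifications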

G. Experiments and Results

We initially separated the entire dataset into two groups of 80% and 20%: 80% of the data was used for training and 20% for validation. Then, for every file in the training and test sets, we calculated the MFCC characteristics and gave the deep neural network the retrieved characteristics as its inputs. We utilise a DNN with five convolutional layers.

For the network we operate, 700 training epochs were set. The final output nodes use the "softmax" activation function that we have employed, and we also computed the classification loss. Training accuracy peaked at 96% after 700 iterations, whereas test accuracy peaked at 80%. The confusion matrix for the training data is represented graphically in the figures below.
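A sketch of this experimental procedure under the stated 80/20 split and 700 epochs is given below; the feature matrix is random placeholder data standing in for the real extracted features, and build_model() is the hypothetical helper from the earlier model sketch.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Placeholder data: one feature vector and one emotion code (0-7) per recording.
rng = np.random.default_rng(0)
X = rng.normal(size=(1440, 180)).astype("float32")
y = to_categorical(rng.integers(0, 8, size=1440), num_classes=8)

# 80% of the files for training, 20% held out for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = build_model(n_features=X.shape[1])   # hypothetical helper defined earlier
model.fit(X_train, y_train, epochs=700, batch_size=32,
          validation_data=(X_test, y_test))
print("test accuracy:", model.evaluate(X_test, y_test)[1])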
Fig. 6. Accuracy graph and table

TABLE I. CONFUSION MATRIX FOR VALIDATION DATA

The real recognition of emotion is represented by the confusion matrix's diagonal components. Our network can identify the correct emotion with a high degree of accuracy for the majority of emotions.
IV. CONCLUSIONS

Deep learning algorithms can produce effective outcomes. We have successfully created a deep learning model for emotion identification that scored 96% on tests. Please be aware that emotion prediction is subjective and that different people may rate the same recording with different feelings; this is also the cause of the algorithm's occasionally inconsistent output when trained on human-rated emotions. As the model was trained using the RAVDESS dataset, the speaker's accent might also produce unexpected results, since the model is trained on a North American accent database.

REFERENCES

[1] J. Rong, G. Li, and Y.-P. P. Chen, "Acoustic feature selection for automatic emotion recognition from speech," Inf. Process. Manag., vol. 45, no. 3, pp. 315–328, May 2009.
[2] S. Benk, Y. Elmir, and A. Dennai, "A Study on Automatic Speech Recognition," vol. 10, pp. 77–85, 2019, doi: 10.6025/jitr/2019/10/3/77-85.
[3] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," 2018, doi: 10.21437/Interspeech.2018-1811.
[4] M. M. H. El Ayadi, M. S. Kamel, and F. Karray, "Speech emotion recognition using Gaussian mixture vector autoregressive models," in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), 2007, vol. 4, pp. IV-957–IV-960.
[5] Mustaqeem and S. Kwon, "A CNN-assisted enhanced audio signal processing for speech emotion recognition," Sensors (Switzerland), 2020, doi: 10.3390/s20010183.
[6] S. Emerich, E. Lupu, and A. Apatean, "Emotions recognition by speech and facial expressions analysis," in 17th European Signal Processing Conference, 2009.
[7] M. E. Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, 2019.
[8] A. B. Ingale and D. S. Chaudhari, "Speech emotion recognition," International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, vol. 2, no. 1, March 2012.
[9] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," 2013, pp. 1–9.
[10] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," 2013, pp. 1–9.
[11] Chiriacescu, "Automatic emotion analysis based on speech," M.Sc. thesis, Delft University of Technology, 2009.
[12] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Commun., vol. 41, no. 4, pp. 603–623, Nov. 2003.
[13] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," 2017, doi: 10.21437/Interspeech.2017-200.
[14] J.-H. Yeh, T.-L. Pao, C.-Y. Lin, Y.-W. Tsai, and Y.-T. Chen, "Segment-based emotion recognition from continuous Mandarin Chinese speech," Comput. Human Behav., vol. 27, no. 5, pp. 1545–1552, Sep. 2011.
[15] P. Shen, Z. Changjun, and X. Chen, "Automatic speech emotion recognition using support vector machine," in International Conference on Electronic and Mechanical Engineering and Information Technology, 2011.

