
A

Project Report
On

“SPEECH TO EMOTION RECOGNITION”

Submitted in partial fulfillment of
the requirements for the 8th Semester Sessional
Examination of

BACHELOR OF TECHNOLOGY
IN

Computer Science and Engineering


By
D SHIVA SATWIK (20UG010391)
L VAMSI (20UG010413)

Under the esteemed guidance of

Mr. Sitanshu Kar

SCHOOL OF ENGINEERING AND TECHNOLOGY


Department of Computer Science and Engineering
GIET University, GUNUPUR – 765022
2023-24
GIET UNIVERSITY, GUNUPUR
School of Engineering and Technology
Department of Computer Science & Engineering
Approved by Govt. of Odisha

CERTIFICATE

This is to certify that the project work entitled “SPEECH

TO EMOTION RECOGNITION” is done by D SHIVA SATWIK (20UG010391),

L VAMSI(20UG010413) in partial fulfillment of the requirements for

the 8th Semester Sessional Examination of Bachelor of Technology in

Computer Science and Engineering during the academic year 2023-24.

This work is submitted to the department as a part of the evaluation

of 8th Semester Major Project.

Mr. Sitanshu Kar              Mr. D Anil Kumar              Dr. K. Murali Gopal

Project Supervisor            Project Coordinator           HoD, CSE
ACKNOWLEDGEMENT

We express our sincere gratitude to our project guide, Mr.
Sitanshu Kar of Computer Science and Engineering, for giving us
the opportunity to accomplish this project. With his active
support and guidance, this project report has been successfully
completed.

We also thank Dr. K. Murali Gopal, HoD, CSE, and Mr. D Anil
Kumar, Project Coordinator, for their consistent support,
guidance, and help.

D SHIVA SATWIK (20UG010391)

L VAMSI (20UG010413)

ABSTRACT

Communication is the key to expressing one's thoughts and ideas clearly. Among all forms of
communication, speech is the most preferred and powerful form of communication among humans. The era of
the Internet of Things (IoT) is rapidly advancing, bringing more intelligent systems into everyday use.
These applications range from simple wearables and widgets to complex self-driving vehicles and automated
systems employed in various fields. Intelligent applications are interactive, require minimal user effort to
function, and mostly operate on voice-based input.
This creates the necessity for these computer applications to comprehend human speech completely. A
speech percept can reveal information about the speaker, including gender, age, language, and emotion.
Several existing speech recognition systems used in IoT applications are integrated with an emotion detection
system in order to analyze the emotional state of the speaker. The performance of the emotion detection
system can greatly influence the overall performance of the IoT application and can extend the functionality
of these applications in many ways.

This research presents a speech emotion detection system with improvements over an existing system
in terms of data, feature selection, and methodology, aiming to classify speech percepts by emotion
more accurately.

CONTENTS:

1. Introduction
2. System Analysis
3. Methodology
4. DFD Diagram
5. Modules
6. Dataset
7. Feature Extraction
8. Algorithms
9. Classification Report
10. System Design
11. Analysis
12. Coding
13. Conclusion
14. References


1. INTRODUCTION
Speech emotion recognition is the task of predicting a speaker's emotion from their speech, along
with the accuracy of that prediction. It enables better human-computer interaction. Although it is
difficult to predict a person's emotion, since emotions are subjective and annotating audio is challenging,
Speech Emotion Recognition (SER) makes this possible.

Animals such as dogs, elephants, and horses use the same cues to understand human emotion.
Several characteristics of speech help predict emotion: tone, pitch, expression, behavior, and so on.
Closely related speech-analysis tasks include:

• Speaker Identification

• Speech Recognition

• Speech Emotion Detection

1.1. PURPOSE
• The primary objective of SER is to improve the human-machine interface.

• It can also be used to monitor the psychophysiological state of a person, as in lie detectors.

• More recently, speech emotion recognition has also found applications in medicine and forensics.

1.2. PROJECT SCOPE


This project covers the complete pipeline of a speech emotion recognition system: taking
recorded speech as input, extracting emotion-relevant acoustic features, training several
machine learning classifiers on the RAVDESS dataset, and comparing their classification
accuracy. The scope is limited to recognizing the emotional categories recorded in that dataset.

1.3. EXISTING SYSTEM


The speech emotion detection system is implemented as a Machine Learning (ML)
model. The steps of implementation are comparable to any other ML project, with additional
fine-tuning procedures to make the model function better. The flowchart gives a pictorial
overview of the process (see Figure 3.1). The first step is data collection, which is of prime
importance: the model being developed will learn from the data provided to it, and all the
decisions and results that the developed model produces are guided by that data. The second
step, called feature engineering, is a collection of several machine learning tasks that are
executed over the collected data; these procedures address data representation and data
quality issues. The third step is often considered the core of an ML project: an algorithm-based
model is developed that uses an ML algorithm to learn about the data and train itself to respond
to any new data it is exposed to. The final step is to evaluate the functioning of the built model.
Very often, developers repeat the steps of developing a model and evaluating it in order to
compare the performance of different algorithms; the comparison results help to choose the
ML algorithm most relevant to the problem.

1.4. PROPOSED SYSTEM


In this study, we present an automatic speech emotion recognition (SER) system
that uses machine learning algorithms to classify emotions. The performance of the
emotion detection system can greatly influence the overall performance of the application
and can extend the functionality of these applications in many ways.
This research presents a speech emotion detection system with improvements over an existing
system in terms of data, feature selection, and methodology, aiming to classify speech
percepts by emotion more accurately.

2. SYSTEM ANALYSIS
2.1. HARDWARE REQUIREMENTS
Processor Brand : Intel

Processor Type : Core i3

Processor Speed : 2 GHz

Processor Count : 1

RAM Size : 8 GB

Memory Technology : DDR4

Computer Memory Type : DDR4 SDRAM

Hard Drive Size : 160 GB

2.2. SOFTWARE REQUIREMENTS


Operating system : Windows 10 or 11

Application server : Jupyter Notebook

Frontend : Machine learning using Python

Datasets : RAVDESS dataset


3. METHODOLOGY
The speech emotion detection system is implemented as a Machine Learning (ML)
model. The steps of implementation are comparable to any other ML project, with
additional fine-tuning procedures to make the model function better. The flowchart
gives a pictorial overview of the process (see Figure 3.1). The first step is data
collection, which is of prime importance: the model being developed will learn from the
data provided to it, and all the decisions and results that the developed model produces
are guided by that data. The second step, called feature engineering, is a collection of
several machine learning tasks that are executed over the collected data; these procedures
address data representation and data quality issues. The third step is often considered
the core of an ML project: an algorithm-based model is developed that uses an ML
algorithm to learn about the data and train itself to respond to any new data it is
exposed to. The final step is to evaluate the functioning of the built model. Very often,
developers repeat the steps of developing a model and evaluating it in order to compare
the performance of different algorithms; the comparison results help to choose the
ML algorithm most relevant to the problem.

4. DFD DIAGRAM

The DFD is also called a bubble chart. It is a simple graphical formalism that can
be used to represent a system in terms of the input data to the system, the various
processing carried out on this data, and the output data generated by the system. The data
flow diagram (DFD) is one of the most important modeling tools. It is used to model the
system components: the system processes, the data used by the processes, the external
entities that interact with the system, and the information flows in the system. A DFD
shows how information moves through the system and how it is modified by a series of
transformations; it is a graphical technique that depicts the information flow and the
transformations applied as data moves from input to output. A DFD may be used to
represent a system at any level of abstraction and may be partitioned into levels that
represent increasing information flow and functional detail.

Figure 3.1: Flow of implementation

5. MODULES
• Speech Input Module

• Feature Extraction and Selection

• Classification

• Recognized Emotional Output

5.1. MODULE DESCRIPTION


• Speech Input Module: The input to the system is speech captured as audio.
An equivalent digital representation of the received audio is then produced
through a sound-file library.

• Feature Extraction and Selection: Speech has many emotion-relevant
characteristics, and emotion relevance is used to select the extracted speech
features. The whole procedure, from speech feature extraction to the selection
of features corresponding to emotions, revolves around the speech signal.

• Classification Module: Finding a set of significant emotions for
classification is the main concern in a speech emotion recognition system.
A typical set of emotions contains various emotional states, which makes
classification a complicated task.

• Recognized Emotional Output: Fear, surprise, anger, joy, disgust, and
sadness are the primary emotions, and the naturalness of the database is
the basis for evaluating a speech emotion recognition system.

6. DATASET

6.1. RAVDESS DATASET


(The Ryerson Audio-Visual Database of Emotional Speech and Song)

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains
7,356 files (total size: 24.8 GB). The database contains recordings of 24 professional actors (12 female,
12 male) vocalizing two lexically matched statements in a neutral North American accent. Speech
includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains
calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of
emotional intensity (normal, strong), with an additional neutral expression. All conditions are
available in three modality formats: audio-only (16-bit, 48 kHz .wav), audio-video (720p H.264,
AAC 48 kHz, .mp4), and video-only (no sound). Note that there are no song files for Actor_18.

• The size of the dataset is large enough for the model to be trained effectively; the more data
a model is exposed to, the better it performs.

• All basic emotional categories are present. Combinations of these emotions can be used for
further research, such as sarcasm and depression detection.

• Data is collected from two different age groups, which will improve the classification.

• The audio files are mono signals, which ensures an error-free conversion with most
programming libraries.
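
Each RAVDESS filename encodes its labels as seven hyphen-separated numeric fields, the third of
which is the emotion code (01 = neutral through 08 = surprised). A minimal sketch of turning
filenames into emotion labels; the dataset path is a hypothetical example:

```python
import glob
import os

# Emotion codes used in the third field of every RAVDESS filename.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_of(path):
    """Return the emotion label encoded in a RAVDESS filename."""
    code = os.path.basename(path).split("-")[2]
    return EMOTIONS[code]

# Hypothetical location; point this at the unpacked RAVDESS folders.
files = glob.glob("ravdess/Actor_*/*.wav")
labels = [emotion_of(f) for f in files]
```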


7. FEATURE EXTRACTION

7.1. THE PROCESS:

Speech is a varying sound signal. Humans are capable of modifying the sound signal
using their vocal tract, tongue, and teeth to pronounce phonemes. Features are a way to
quantify data; a better representation of the speech signal, one that captures the most
information, is obtained by extracting features common among speech signals. Some
characteristics of good features include [14]:

• The features should be independent of each other. Most features in a feature vector are
correlated with each other, so it is crucial to select a subset of features that are
individual and independent of each other.

• The features should be informative in context. Only those features that are more
descriptive of the emotional content should be selected for further analysis.

• The features should be consistent across all data samples. Features that are unique and
specific to certain data samples should be avoided.

• The values of the features should be processed. The initial feature selection can
result in a raw feature vector that is unmanageable; the process of feature engineering
removes outliers, missing values, and null values.

The features in a speech percept that are relevant to the emotional content can be grouped into
two main categories:

• Prosodic features

• Phonetic features

The prosodic features are energy, pitch, tempo, loudness, formants, and intensity. The
phonetic features are mostly related to the pronunciation of words in a given language.
Therefore, for the purpose of emotion detection, the analysis is performed on the prosodic
features or a combination of them; pitch and loudness are the features most relevant to the
emotional content.


7.2. MEL FREQUENCY CEPSTRUM COEFFICIENTS (MFCC) FEATURES

A subset of features used for speech emotion detection is grouped under a category called
the Mel Frequency Cepstrum Coefficients (MFCC) [16]. They can be explained as follows:

• The word Mel refers to the scale used in frequency-versus-pitch measurement (see Figure 2)
[16]. A value measured on the frequency scale can be converted to the Mel scale using the
formula m = 2595 log10(1 + f/700).

• The word Cepstrum refers to the Fourier transform of the log spectrum of the speech
signal.
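
As an illustration, the Mel conversion and MFCC extraction can be sketched as follows; librosa
is assumed as the audio library, and the file name is hypothetical:

```python
import math

import librosa
import numpy as np

def hz_to_mel(f):
    """Mel scale conversion: m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(hz_to_mel(1000))  # ~1000: 1000 Hz is the reference point of the Mel scale

# MFCCs of one recording, averaged over time into a fixed-length vector.
y, sr = librosa.load("speech.wav", sr=None)          # hypothetical file
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, n_frames)
mfcc_vector = np.mean(mfccs.T, axis=0)               # shape: (40,)
```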

8. ALGORITHMS
8.1. MLP CLASSIFIER
MLPClassifier stands for Multi-Layer Perceptron classifier, which, as the name suggests,
is built on a neural network. Unlike other classification algorithms such as Support Vector
Machines or the Naive Bayes classifier, MLPClassifier relies on an underlying neural network
to perform the task of classification.

We will use the confusion matrix to determine the accuracy, which is measured as the total
number of correct predictions divided by the total number of predictions.

A multi-layer rather than a single-layer network is required since a single-layer perceptron
(SLP) can only compute a linear decision boundary, which is not flexible enough for most realistic
learning problems. For a problem that is linearly separable (that is, capable of being perfectly
separated by a linear decision boundary), the perceptron convergence theorem guarantees
convergence. In its simplest form, SLP training is based on the simple idea of adding or subtracting
a pattern from the current weights when the target and predicted classes disagree; otherwise the
weights are unchanged. For a non-linearly separable problem, this simple algorithm can go on
cycling indefinitely.

The modification known as the least mean square (LMS) algorithm uses a mean squared error cost
function to overcome this difficulty, but since there is only a single perceptron, the decision
boundary is still linear. An MLP is a universal approximator [6] that typically uses the same
squared error function as LMS.

However, the main difficulty with the MLP is that the learning algorithm has a complex error
surface, which can become stuck in local minima. Unlike the SLP case, no MLP learning algorithm
is guaranteed to converge. The popular MLP backpropagation algorithm has two phases: the first
is a forward pass, a forward simulation of the current training pattern that enables the error
to be calculated; it is followed by a backward pass, which calculates, for each weight in the
network, how a small change will affect the error function. The derivative calculation is based
on the chain rule, and training typically proceeds by changing the weights in proportion to the
derivatives.
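
A minimal training sketch using scikit-learn's MLPClassifier follows; the hyperparameters are
illustrative assumptions rather than the report's exact settings, and X, y are the feature matrix
and emotion labels from the earlier steps:

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X: (n_samples, n_features) features, y: emotion labels (see Sections 6-7).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

mlp = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01, batch_size=256,
                    learning_rate="adaptive", max_iter=500)
mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_test)
print(accuracy_score(y_test, y_pred))   # correct predictions / all predictions
print(confusion_matrix(y_test, y_pred))
```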


Fig 8.1: MLP Classifier Confusion Matrix


8.2. XGBOOST CLASSIFIER

• XGBoost is an optimized distributed gradient boosting library designed to be highly
efficient, flexible, and portable. It implements machine learning algorithms under the
Gradient Boosting framework.

• XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many
data science problems quickly and accurately. The same code runs in major distributed
environments (Hadoop, SGE, MPI) and can scale beyond billions of examples.

• We will use the confusion matrix to determine the accuracy, which is measured as the total
number of correct predictions divided by the total number of predictions.
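
A usage sketch with the xgboost scikit-learn interface; the parameters are illustrative, and
the labels are integer-encoded first because XGBClassifier expects numeric classes:

```python
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode the string emotion labels as integers for XGBoost.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6)
xgb.fit(X_train, y_train_enc)

# Decode predictions back to emotion names before comparing.
y_pred = le.inverse_transform(xgb.predict(X_test))
print(confusion_matrix(y_test, y_pred))
```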

Fig 8.2: XGBoost Classifier Confusion Matrix


8.3. LGBM CLASSIFIER


• LightGBM is a fast, distributed, high-performance gradient boosting framework based on
decision tree algorithms, used for ranking, classification, and many other machine learning
tasks. Another reason LightGBM is so popular is its focus on accuracy of results. LightGBM
also supports GPU learning, so data scientists widely use it for data science application
development.

• We will use the confusion matrix to determine the accuracy, which is measured as the total
number of correct predictions divided by the total number of predictions.
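
An analogous sketch with LightGBM's scikit-learn interface; parameters are illustrative, and
the train/test splits come from the earlier sketches:

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import confusion_matrix

lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.1)
lgbm.fit(X_train, y_train)  # the wrapper handles string labels internally

y_pred = lgbm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
```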

Fig 8.3: LGBM Classifier Confusion Matrix


8.4. RANDOMFOREST CLASSIFIER

Random Forest is a classifier that builds a number of decision trees on various subsets of
the given dataset and averages their results to improve the predictive accuracy on that dataset.
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on
the concept of ensemble learning, the process of combining multiple classifiers to solve a
complex problem and improve the performance of the model.

The term "Random Forest classifier" refers to the classification algorithm made up of
several decision trees. The algorithm uses randomness when building each individual tree to
promote uncorrelated trees, and then uses the forest's predictive power to make accurate decisions.

Random forest classifiers fall under the broad umbrella of ensemble-based learning methods. They
are simple to implement, fast in operation, and have proven extremely successful in a variety
of domains. The key principle underlying the random forest approach is the construction
of many "simple" decision trees in the training stage and a majority vote (mode) across them in
the classification stage. Among other benefits, this voting strategy corrects for the undesirable
tendency of decision trees to overfit training data. In the training stage, random forests apply
the general technique known as bagging to the individual trees in the ensemble: bagging
repeatedly selects a random sample with replacement from the training set and fits a tree to each
sample. Each tree is grown without any pruning. The number of trees in the ensemble is a free
parameter that is readily learned automatically using the so-called out-of-bag error.

We will use the confusion matrix to determine the accuracy, which is measured as the total
number of correct predictions divided by the total number of predictions.

Much like naive Bayes- and k-nearest neighbor-based algorithms, random forests are
popular in part due to their simplicity on the one hand and their generally good performance
on the other. However, unlike the former two approaches, random forests exhibit a degree of
unpredictability in the structure of the final trained model, an inherent consequence of the
stochastic nature of tree building. One of the key reasons why this characteristic of random
forests can be a problem is regulatory: clinical adoption often demands a high degree of
repeatability, not only in the ultimate performance of an algorithm but also in the mechanics
of how a specific decision is made.
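
The out-of-bag estimate mentioned above is exposed directly by scikit-learn; a brief sketch,
with the tree count an illustrative choice:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# oob_score=True scores each sample with the trees whose bootstrap
# sample left it out, giving a built-in validation estimate.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

print(rf.oob_score_)  # out-of-bag accuracy estimate
print(confusion_matrix(y_test, rf.predict(X_test)))
```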

Fig 8.4: Random Forest Classifier Confusion Matrix


8.5. KNN CLASSIFIER

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms, based on the
supervised learning technique. It is simple to implement and robust to noisy training data,
and it can be more effective when the training data is large.

The concept of the k-nearest neighbor classifier can hardly be described more simply than by
the old saying, found in many languages and cultures: tell me who your friends are, and I will
tell you who you are. The concept is part of everyday judgment. Imagine you meet a group of
people who are all very young, stylish, and sporty. They talk about their friend Ben, who isn't
with them. What is your mental image of Ben? Right, you imagine him as being young, stylish,
and sporty as well. If you then learn that Ben lives in a neighborhood where people vote
conservative, the average income is above 200,000 dollars a year, and both of his neighbors
make even more than 300,000 dollars per year, what do you think of Ben? Most probably, you do
not consider him an underdog, and you may suspect he is a conservative as well.

The principle behind nearest neighbor classification consists in finding a predefined
number, k, of training samples closest in distance to a new sample that has to be classified.
The label of the new sample is determined from these neighbors. k-nearest neighbor classifiers
have a fixed, user-defined constant for the number of neighbors to consider. There are also
radius-based neighbor learning algorithms, which use a varying number of neighbors based on
the local density of points: all the samples inside a fixed radius. The distance can, in
general, be any metric measure; the standard Euclidean distance is the most common choice.
Neighbors-based methods are known as non-generalizing machine learning methods, since they
simply "remember" all of their training data. Classification is computed by a majority vote
of the nearest neighbors of the unknown sample.

The k-NN algorithm is among the simplest of all machine learning algorithms, but despite
its simplicity, it has been quite successful in a large number of classification and regression
problems, such as character recognition and image analysis.

k-NN is a type of instance-based learning, or lazy learning. In machine learning, lazy
learning is understood to be a learning method in which generalization of the training data
is delayed until a query is made to the system. Eager learning, by contrast, generalizes the
training data before receiving queries. In other words, the function is only approximated
locally, and all computations are deferred until the actual classification is performed.

We will use the confusion matrix to determine the accuracy, which is measured as the total
number of correct predictions divided by the total number of predictions.
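
A brief sketch; k = 5 is an illustrative choice, and scikit-learn's default Minkowski metric
with p = 2 is exactly the standard Euclidean distance:

```python
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# A lazy learner: fit() only stores the training data; the distance
# computations happen when predict() is called.
knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
knn.fit(X_train, y_train)

print(confusion_matrix(y_test, knn.predict(X_test)))
```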

Fig 8.5: KNN Classifier Confusion Matrix


9. CLASSIFICATION REPORT

9.1. MLP CLASSIFIER

A classification report is used to measure the quality of predictions from a classification
algorithm: how many predictions are correct and how many are incorrect. More specifically,
the true positives, false positives, true negatives, and false negatives are used to compute
the metrics of the classification report.
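
scikit-learn derives these metrics directly from the predictions; a short sketch using the
test labels and predictions from Section 8:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1-score, and support, all computed from
# the true/false positive and negative counts.
print(classification_report(y_test, y_pred))
```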

Table 9.1: MLPClassifier Classification Report


9.2. XGBOOST CLASSIFIER

The classification report for the XGBoost classifier, computed in the same way, is shown in
Table 9.2.

Table 9.2: XGBClassifier Classification Report


9.3. LGBM CLASSIFIER

The classification report for the LGBM classifier is shown in Table 9.3.


Table 9.3: LGBMClassifier Classification Report

9.4. RANDOMFOREST CLASSIFIER


The classification report for the Random Forest classifier is shown in Table 9.4.


Table 9.4: Random Forest Classifier Classification Report

9.5. KNN CLASSIFIER


The classification report for the KNN classifier is shown in Table 9.5.


Table 9.5: KNN Classifier Classification Report

10. SYSTEM DESIGN


10.1. INPUT DESIGN
The input design is the link between the information system and the user. It comprises the
specifications and procedures for data preparation, the steps necessary to put transaction data
into a usable form for processing. This can be achieved by having the computer read data from
a written or printed document or by having people key the data directly into the system. The
design of input focuses on controlling the amount of input required, controlling errors,
avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed
in such a way that it provides security and ease of use while retaining privacy. Input design
considered the following: what data should be given as input; how the data should be arranged
or coded; the dialog to guide operating personnel in providing input; and the methods for
preparing input validations and the steps to follow when errors occur.

10.2. OUTPUT DESIGN


A quality output is one that meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users
and to other systems through outputs. In output design, it is determined how the information
is to be displayed for immediate need, as well as the hard-copy output. Output is the most
important and direct source of information for the user; efficient and intelligent output
design improves the system's relationship with the user and supports decision-making. The
output of an information system should accomplish one or more of the following objectives:
convey information about past activities, current status, or projections of the future; signal
important events, opportunities, problems, or warnings; trigger an action; or confirm an action.

11. ANALYSIS
11.1. FINAL REPORT
In Speech to Emotion Recognition, after analyzing these five classifiers we find that the MLP
classifier gives the highest accuracy. The XGBoost and LGBM classifiers give nearly as high
accuracy, as shown in the bar chart below.

Figure 11.1: Final Report Analysis

12. CODING

Loading Libraries:
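
A representative set of imports for the pipeline described in this report, assuming librosa
for audio and scikit-learn for modeling:

```python
import glob
import os

import librosa
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
```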

Loading an Audio File:
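
A sketch of loading a single RAVDESS recording; the path is a hypothetical example:

```python
import librosa

# Load at the native sampling rate; RAVDESS audio is 48 kHz mono WAV.
y, sr = librosa.load("ravdess/Actor_01/03-01-01-01-01-01-01.wav", sr=None)
print(y.shape, sr)                       # number of samples, sampling rate
print(librosa.get_duration(y=y, sr=sr))  # length in seconds
```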

Feature Preprocessing:
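
A sketch of typical preprocessing: holding out a test split and standardizing the features;
the split ratio is an illustrative assumption:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) extracted features, y: emotion labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the scaler on training data only to avoid leaking test statistics.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```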

Feature Extraction:
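
A sketch of a feature extraction helper that combines the MFCC, chroma, and Mel-spectrogram
features discussed in Section 7, each averaged over time into a fixed-length vector; the exact
feature mix and dataset path are assumptions:

```python
import glob

import librosa
import numpy as np

def extract_features(path):
    """Time-averaged MFCC, chroma, and Mel-spectrogram features of one file."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
    return np.hstack([mfcc, chroma, mel])  # 40 + 12 + 128 = 180 values

# Labels for these files come from the filename parsing sketch in Section 6.
X = np.array([extract_features(f) for f in glob.glob("ravdess/Actor_*/*.wav")])
```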

Evaluation:
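
A sketch of the evaluation step, scoring each trained classifier from Section 8 on the same
held-out test set; the model and encoder variables come from the earlier sketches:

```python
from sklearn.metrics import accuracy_score

predictions = {
    "MLP": mlp.predict(X_test),
    "XGBoost": le.inverse_transform(xgb.predict(X_test)),  # decode labels
    "LightGBM": lgbm.predict(X_test),
    "RandomForest": rf.predict(X_test),
    "KNN": knn.predict(X_test),
}
for name, y_pred in predictions.items():
    print(f"{name}: {accuracy_score(y_test, y_pred):.3f}")
```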

Testing:
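
A sketch of testing the best-performing model on one unseen recording; the file path is
hypothetical, and extract_features, scaler, and mlp come from the sketches above:

```python
# Extract, scale, and classify a single new recording.
features = extract_features("my_recording.wav").reshape(1, -1)
features = scaler.transform(features)  # apply the same standardization
print(mlp.predict(features)[0])        # e.g. "happy"
```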


13. CONCLUSION
The emerging growth and development in the fields of AI and machine learning have led to
a new era of automation. Most of these automated devices work based on voice commands from
the user. Many advantages can be built over the existing systems if, besides recognizing the
words, the machines could comprehend the emotion of the speaker (user). Some applications of a
speech emotion detection system are computer-based tutorial applications, automated call-center
conversations, diagnostic tools used in therapy, and automatic translation systems.

In this report, the steps of building a speech emotion detection system were discussed in
detail, and experiments were carried out to understand the impact of each step. Initially, the
limited number of publicly available speech databases made it challenging to implement a
well-trained model. Next, several novel approaches to feature extraction had been proposed in
earlier works, and selecting the best approach required many experiments. Finally, classifier
selection involved learning about the strengths and weaknesses of each classification algorithm
with respect to emotion recognition. From the experimentation, it can be concluded that an
integrated feature space produces a better recognition rate than a single feature.

14. REFERENCES
• Code for Interview YouTube channel.

• Soegaard, M. and Friis Dam, R. (2013). The Encyclopedia of Human-Computer Interaction, 2nd ed.

• Deitel, P.J. Internet & World Wide Web: How to Program.

• Nwe, T.L., Foo, S.W., and De Silva, L.C., "Speech emotion recognition using hidden Markov
models," Speech Communication, vol. 41, no. 4, pp. 603–623, Nov. 2003.

• www.data-flair.training.com

• www.researchgate.net
