Article
Modeling of Recommendation System Based on Emotional
Information and Collaborative Filtering
Tae-Yeun Kim 1 , Hoon Ko 2 , Sung-Hwan Kim 1 and Ho-Da Kim 1, *
1 National Program of Excellence in Software Center, Chosun University, Gwangju 61452, Korea;
[email protected] (T.-Y.K.); [email protected] (S.-H.K.)
2 IT Research Institute, Chosun University, Gwangju 61452, Korea; [email protected]
* Correspondence: [email protected]; Tel.: +82-62-230-7096
Abstract: Emotion information represents a user’s current emotional state and can be used in a variety
of applications, such as cultural content services that recommend music according to user emotional
states and user emotion monitoring. To increase user satisfaction, recommendation methods must
understand and reflect user characteristics and circumstances, such as individual preferences and
emotions. However, most recommendation methods do not reflect such characteristics accurately
and are unable to increase user satisfaction. In this paper, six human emotions (neutral, happy,
sad, angry, surprised, and bored) are broadly defined to consider user speech emotion information
and recommend matching content. The “genetic algorithms as a feature selection method” (GAFS)
algorithm was used to classify normalized speech according to speech emotion information. We used
a support vector machine (SVM) algorithm and selected an optimal kernel function for recognizing
the six target emotions. Performance evaluation results for each kernel function revealed that the
radial basis function (RBF) kernel function yielded the highest emotion recognition accuracy of
86.98%. Additionally, content data (images and music) were classified based on emotion information
using factor analysis, correspondence analysis, and Euclidean distance. Finally, speech information that was classified based on emotions and emotion information that was recognized through a collaborative filtering technique were used to predict user emotional preferences and recommend content that matched user emotions in a mobile application.

Keywords: collaborative filtering; emotion recognition; support vector machine algorithm; speech emotion information
Research on recognizing the emotions contained in speech is very active. The results of previous speech recognition studies can serve as
a starting point for speech-based emotion recognition. However, previous studies vary
widely in terms of their selection of feature extraction and pattern recognition algorithms.
Regarding the selection of feature vectors, speech recognition methods mainly use elements
that model phonemes, whereas emotion recognition uses prosody elements. In addition
to feature selection, the selection of pattern recognition algorithms is an important aspect.
Different pattern recognition algorithms may be selected according to the methods used
for modeling emotions based on extracted features [7,8]. Emotion information represents
a user’s current emotional state and can be used in a variety of applications, such as
cultural content services that recommend music according to user emotional states and
user emotion monitoring [9].
Research on recommendation techniques that consider user tendencies to incorpo-
rate various user requirements effectively is also underway. Application programs that
include recommendation techniques are used to predict items that will interest users and
recommend those items [10,11]. Typical recommendation techniques include content-based filtering and collaborative filtering. Content-based recommendation techniques directly analyze content
to examine the similarities between content items and between content items and user
preferences. New content is then recommended based on the results of this analysis. Col-
laborative filtering analyzes users who have tendencies that are similar to those of other
users and estimates their content preferences [12,13]. To increase user satisfaction, recom-
mendation techniques must understand and reflect user characteristics and circumstances,
such as individual preferences and emotions. However, most recommendation techniques
do not consider these characteristics and are unable to increase user satisfaction.
Emotion recognition is a technology that identifies the emotional state by analyzing
information related to speech and gestures. The gestures can vary according to the culture.
In adults, emotion-related information extracted from speech is more consistent than that
obtained from facial expressions or gestures, because adults tend to control their
emotions. The objective of speech emotion recognition (SER) is to extract features from
speech signals and then define, learn, and classify emotion models [14]. For emotion
modeling, the hidden Markov model (HMM) was mainly used in the past. However, the
recent emergence of deep neural networks (DNNs) and recurrent neural networks (RNNs)
has enabled remarkable progress in research on recognition systems of time-series data,
such as speech signals [15]. The study by Issa et al. introduced an architecture that extracted
mel-frequency cepstral coefficients (MFCC), chromagram, mel-scale spectrogram, Tonnetz
representation, and spectral contrast features from speech files and used them as inputs
for one-dimensional convolutional neural networks (CNN). In addition, an incremental
method that employed samples from the datasets of the Ryerson Audio-Visual Database
of Emotional Speech and Song (RAVDESS), Berlin (EMO-DB), and Interactive Emotional
Dyadic Motion Capture (IEMOCAP) to modify the initial model for improving emotion
recognition and classification accuracy was used [16]. The study by Sajjad et al. proposed a
framework for speech emotion recognition (SER) using key sequence segment selection
based on the radial basis function network (RBFN) similarity measure. The selected
sequence was converted into a spectrogram by applying the short-time Fourier
transform (STFT) algorithm and transferred to a CNN model to extract distinctive and
prominent features from the speech spectrogram. By normalizing the CNN function
and supplying it to a bi-directional long short-term memory (BiLSTM), time information for
recognizing the final emotional state was learned [17]. The study by Wang et al. proposed
a bimodal fusion algorithm for realizing voice emotion recognition by a weighted decision
fusion method for facial expressions and voice information. This algorithm achieved facial
emotion recognition by combining CNN and long short-term memory (LSTM) RNN, and
then transformed the speech signal into an image using MFCC [18]. These studies used
frame unit features, pronunciation unit features, and a combination of LSTM RNN, DNN,
and the simple concentration structure model and conducted a performance evaluation
using datasets such as EMO-DB or IEMOCAP.
With the recent rapid developments in deep learning, excellent performance has been
achieved after actual implementation, and many studies on artificial intelligence (AI) are
being actively conducted. However, selecting the feature vectors of speech signals that
express emotions well is as important as selecting the accurate classification engine in a
speech-signal-based emotion recognition system. These systems show a lower recognition
rate than other emotion recognition systems, such as facial expression recognition, not
because of the low performance of the system itself but due to the inefficient extraction and
selection of speech features.
Therefore, the present study aims to find an efficient and appropriate feature vector
set for emotion classification with the goal of improving the performance of the emotion
recognition system using speech signals, and it expects to achieve a higher emotion recog-
nition rate. The speech data used in this study are from a Korean-style emotion speech
database appropriate for Korean language and culture. The emotions were categorized
into the six categories of neutral, happy, sad, angry, surprised, and bored. A total of
2400 files with 400 data for each emotion were used as data for this recognition system,
consisting of an equivalent proportion of male and female speech. An SVM classifier was
used as the classification algorithm in this study. For image emotion information, 20 color
emotion models were selected as the representative elements. Factor and correspondence
analyses were conducted using a five-point-scale questionnaire survey, and the emotional
spaces for each color were generated and measured. Furthermore, for music emotion
information, the Euclidean distance was used to recommend the appropriate music for
the current emotion according to the speech emotion information of the user based on
their emotion history. Thus, using the property of emotional information, i.e., the preferred
item varying according to the emotion of the user, we attempted to propose a system that
recommends different content according to the emotion of the user. This was conducted by
merging collaborative filtering with static emotional information that was received in real
time from users. We also attempted to improve the performance through experiments.
In the preprocessing stage, the system processes the speech signals such that they are useful for the next process based on the language and speaker information.
End-Point Detection
The speech of a speaker vocalized through a microphone includes silence and noise
sections in addition to speech sections, which include language or speaker information.
In the end-point extraction process, it is necessary to distinguish between the noise and
speech sections from the input signals. The performance of the recognition system strongly
depends on the accuracy of the end-point extraction, and generally, parameters such as
the short-section log energy and zero crossing rate are used. The log energy
is used to distinguish between the speech and noise sections, while the zero crossing rate is
used to distinguish between voiced and unvoiced sound sections. In
noiseless speech signals, the end-points can be extracted fairly accurately using
only the log energy and zero crossing rate. However, if noise is present,
end-point extraction becomes very difficult.
The short-section log energy is the energy of a certain short section (frame). The
end-points are extracted using the large energy change between the silent and speech sections,
based on the fact that the energy value of a silent section is lower than that of a speech
section. If the short-section energy is E_f, it is obtained by the following Equation (1).
$$E_f = 10 \log \left( \sum_{n=0}^{N-1} x^2(n) \right) \quad (1)$$
where N is the total number of samples in one frame, and x(n) is the nth sample value of
the input speech.
Similar to the log energy, the zero crossing rate is calculated for each frame. It
represents the number of times the speech signal input in one frame crosses the horizontal
axis (zero point) and is used to distinguish between voiced and unvoiced sound sections.
A voiced sound section has a small zero crossing rate because its energy is concentrated
in the low-frequency band, whereas an unvoiced sound section has a large zero crossing rate.
Furthermore, the zero crossing rate in a silent section varies with the surrounding environment;
it is generally smaller than that of unvoiced sounds and larger than that of voiced sounds.
When the zero crossing rate in each frame is Z, it is expressed as Equation (2).

$$Z = \frac{1}{2} \sum_{n=0}^{N-1} \left| \mathrm{sgn}[x(t-n)] - \mathrm{sgn}[x(t-n-1)] \right|, \qquad \mathrm{sgn}[x(t)] = \begin{cases} 1, & x(t) > 0 \\ -1, & \text{otherwise} \end{cases} \quad (2)$$

where N is the total number of samples in one frame, and x(n) is the nth sample value of the input speech.
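A minimal NumPy sketch of the two frame-level measures in Equations (1) and (2); the small log offset and the sign convention for zero samples are assumptions for illustration.

```python
import numpy as np

def frame_log_energy(frame):
    """Short-time log energy of one frame, Eq. (1): E_f = 10*log10(sum of x^2(n))."""
    return 10.0 * np.log10(np.sum(frame.astype(float) ** 2) + 1e-12)  # offset avoids log(0)

def zero_crossing_rate(frame):
    """Zero crossing count of one frame, Eq. (2), using sgn(x) = +1 if x > 0 else -1."""
    signs = np.where(frame > 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs)))
```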
This information can then be used for emotion recognition. Additionally, the Gaussian
mixture model (GMM), hidden Markov model (HMM), support vector machine (SVM),
and artificial neural network (ANN) algorithms are used in speech recognition and speaker
recognition as identification methods during the pattern recognition stage [21,22].
As representative features of speech signals, we extracted pitch and energy, which
contain prosodic features, as well as MFCCs, which contain phoneme features. We also
calculated the delta values of each feature coefficient. In this study, the average, standard
deviation, and maximum values of each feature coefficient were calculated and used as
emotion recognition features to perform optimization via the “genetic algorithms as a
feature selection method” (GAFS). Additionally, we used the SVM classifier to perform
pattern recognition. Specifically, the accuracy of each kernel function in the SVM clas-
sifier was analyzed to identify features that can be used to classify and recognize each
emotion accurately. A speech emotion database was constructed from the speech emotion
information learned in this manner.
Preprocessing Process
The speech preprocessing used to extract reliable feature vectors consists of dividing the speech signal into frame units, applying a Hamming window, and performing end-point detection.
First, input speech signals are sampled at 16 kHz and quantized using the 16-bit pulse-code modulation method before feature extraction. A Wiener filter is then used to remove noise from the sampled speech signals.
We used the Hamming window when extracting pitches from the sampled speech
signals [23,24]. Additionally, a Hamming window that overlaps neighboring frames by
50% is also applied. Next, end-point detection is performed to distinguish speech intervals
and non-speech intervals in the speech signals and extract feature vectors from only the
speech intervals. This prevents poor system performance based on invalid speech analysis
and feature vector extraction during non-speech intervals.
Feature Extraction
To perform speech-based emotion recognition, it is necessary to identify how each
emotion affects speech precisely. The emotions contained in speech are largely expressed
through prosody information, such as pitch changes, energy changes, and pronunciation
speed changes. To perform emotion recognition, it is necessary to identify the features in
speech that accurately reflect this prosody information and perform appropriate modeling.
Regarding the correlation between prosody information and emotional speech, it is
known from a statistical perspective that happy or angry speech generally has high energy
and pitch with rapid pronunciation speed, whereas sad or bored speech generally has low
energy and pitch with slow pronunciation speed [25,26]. The pitch and energy levels of
speech can be modeled using statistical information, such as the average pitch and average
energy of all pronounced speech intervals.
To create speech feature vectors, we extracted each frame unit’s pitch and energy,
which include prosodic features, as well as MFCCs, which include phoneme features. We
also calculated feature coefficient delta values. Ultimately, the average, standard deviation,
and maximum values of each feature coefficient were calculated and feature vectors were
created from these values.
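The sketch below shows one way the frame-level features described above could be turned into a single utterance-level vector; the stacking order and names are illustrative, not the authors' exact layout.

```python
import numpy as np

def utterance_features(frame_features):
    """frame_features: (n_frames, n_dims) array of per-frame pitch, energy, and MFCC values.

    Appends delta (frame-to-frame difference) coefficients, then summarizes each
    dimension by its mean, standard deviation, and maximum value.
    """
    deltas = np.diff(frame_features, axis=0, prepend=frame_features[:1])
    feats = np.hstack([frame_features, deltas])
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0), feats.max(axis=0)])
```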
We used the MFCC, a representative speech feature extraction method, to extract the
speech features. The MFCC was chosen as the feature vector because the nonlinear mel scale
reflects human hearing characteristics well, is robust to noise, and readily separates
fundamental frequency information. The MFCC was created based on the fact that the human
hearing organs are sensitive to the low-frequency band but relatively insensitive to the
high-frequency band. It is a speech feature expressed on the mel scale. The mel scale,
named by Stevens and colleagues, expresses the relationship between the physical frequency
of a sound and its perceived pitch.
In order to obtain the MFCC feature, the speech signals pass through an anti-aliasing
filter and are then converted to digital signals x(n) through analog–digital conversion.
The digital speech signals pass through a digital pre-emphasis filter that has high-pass
filter characteristics.
This filter was used due to the following reasons. First, high-pass filtering is done
to model the frequency characteristics of human outer/middle ears. This compensates
for the attenuation of 20 dB/decade by radiation from the lips and only the vocal tract
characteristics are obtained from the speech. Furthermore, it also compensates to some
extent for the sensitivity of the hearing system to the spectrum range above 1 kHz. The
characteristic of the pre-emphasis filter, H(z), is expressed as Equation (3), where a is in the
range of 0.95–0.98.
$$H(z) = 1 - az^{-1}, \quad 0.9 \le a \le 1.0 \quad (3)$$
The pre-emphasis signals are covered by the Hamming window and are divided into
frames of block units.
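A one-line time-domain sketch of the pre-emphasis filter of Equation (3); a = 0.97 is a typical value within the stated range and is an assumption here.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """y[n] = x[n] - a*x[n-1], the time-domain form of H(z) = 1 - a*z^-1 (Eq. (3))."""
    signal = signal.astype(float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```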
In the preprocessing for extracting the speech feature vector, the speech is processed
in frame units under the assumption that the signal is stationary over short sections
of 10–30 ms. Generally, the frame size is 20–30 ms, and a frame shift of 10 ms is often
used.
When speech signals are analyzed only in the time domain, it is difficult to sufficiently
analyze the information contained in the signals. Hence, a technique for converting the
signals of the time domain to the frequency domain is used for signal analysis. Among
the methods used to express the power spectrum of the section signals, the MFCC, which
is widely used for speech recognition, was selected to express the characteristics of the
phonemes. Unlike the general cepstrum, the MFCC evenly divides the frequency bands at
the mel scale. It can be used for emotion recognition because even the same phoneme can
have a different form depending on the emotion contained in it.
In this study, the feature vectors were extracted by covering the signals with a 20-ms
Hamming window and shifting them with 50% overlapping (10 ms). Speech signals
are quasi-periodic signals composed of periodic and non-periodic components. Therefore,
a window function is applied to make the framed signal closer to periodic. The
window formula is given in Equation (4).
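Equation (4) is not reproduced in this extraction; the sketch below assumes the standard Hamming window together with the 20-ms window and 10-ms shift (50% overlap) stated above, at a 16 kHz sampling rate.

```python
import numpy as np

def frame_and_window(signal, sr=16000, frame_ms=20, shift_ms=10):
    """Split a signal into 50%-overlapping frames and apply a Hamming window to each."""
    frame_len = int(sr * frame_ms / 1000)   # 320 samples at 16 kHz
    hop = int(sr * shift_ms / 1000)         # 160 samples at 16 kHz
    if len(signal) < frame_len:
        raise ValueError("signal shorter than one frame")
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames
```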
The speech signals of one frame were converted to the frequency domain using fast
Fourier transform (FFT). The frequency band was divided into multiple filter banks and the
energy of each bank was determined. The shape of the filter bank and the method of setting
the center frequency were determined considering the frequency characteristics of the
cochlea. A triangular shape filter was used, and the center frequency was located linearly
until 1 kHz. Above this, it consisted of 20 banks distributed at the mel scale. Equation (6)
is the mel-frequency linear transformation formula. To reduce the discontinuity of the
border between the filters, the triangular filters were generally overlapped. The width of
each filter bank was set from the center frequency of the previous filter bank to the center
frequency of the next filter bank.
$$mel = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \quad (6)$$
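Equation (6) in code form:

```python
import math

def hz_to_mel(f_hz):
    """Mel-frequency mapping of Eq. (6): mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```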
The final MFCC was obtained by performing an inverse discrete cosine transform
(IDCT) after taking the log of the band energies. Twelve MFCC coefficients, c1 to c12, were
used. In addition, the frame log energy was appended, so the feature vector used as the
input for speech recognition has 13 dimensions. To reflect the temporal changes of the speech
signals, derivative features (delta, or SDC) were added, resulting in a total of 39 MFCC-based
feature dimensions. The parameters used to extract the MFCC features are listed in Table 1.
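A sketch of a comparable 39-dimensional MFCC feature using librosa (an assumption; the paper does not state which toolkit was used). Here the 0th coefficient stands in for the frame log energy, and the filter-bank count of 20 and the 20-ms/10-ms framing follow the text above.

```python
import librosa
import numpy as np

def mfcc_39(path):
    """13 MFCCs per frame plus delta and delta-delta, giving a 39-dimensional frame vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=320, hop_length=160, n_mels=20)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T   # shape: (n_frames, 39)
```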
Figure 3. Speech signal feature vector extraction and emotion classification flow chart.
Pitch is the height of the sound. The sound is high when the frequency of the vocal cords is large, and the sound is low when the frequency is small. Commonly used pitch extraction methods include the harmonic product spectrum, the average magnitude difference function (AMDF), and the sub-harmonic to harmonic ratio. In this study, we adopted the AMDF method because it shows a high emotion recognition rate in a noisy environment.

Our pitch extraction process uses a 60-ms Hamming window to extract two to three pitches from each frame and passes speech signals through a low-pass filter with a cutoff frequency of 800 Hz. Next, the average magnitude difference function (AMDF) is used to select the pitch with the minimum value among the extracted pitch candidates, as shown in Equation (7).

$$AMDF_n(j) = \frac{1}{N} \sum_{i=1}^{N} \left| x_n(i) - x_n(i+j) \right|, \qquad 1 \le j \le MAXLAG \quad (7)$$

Here, N is the number of samples and x_n(i) is the extracted n-th frame's i-th sample value. MAXLAG denotes the maximum value of the pitch period that can be extracted.

The extracted pitch candidates are smoothed to prevent the pitch from changing rapidly between frames. If there is a short speechless frame interval (one to two frames) between speech intervals, it is processed as a speech interval with the average pitch value of the adjacent frames.

To calculate energy, we use the common log energy and Teager energy measures. Log energy is calculated as the log of the sum of the absolute values of the sample signal in a frame. Teager energy is a measure proposed by Kaiser. To calculate this measure, a filter bank is applied to a complex sinusoidal signal and the result is divided by a single frequency. The energy value is then calculated as shown in Equation (8) [27].

$$TE_n(i) = f_n^2(i) - f_{n-1}(i)\, f_{n+1}(i), \qquad i = 1 \ldots FB \quad (8)$$

Here, f_n(i) is the nth frame's i-th filter bank coefficient and FB is the number of frequency bands. Teager energy features are robust against noise and dynamically enhance the speech signal.
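A rough sketch of the AMDF of Equation (7) and the Teager energy operator following the across-frame form of Equation (8) as reconstructed above; the search range and the filter-bank inputs are placeholders.

```python
import numpy as np

def amdf(frame, max_lag):
    """Average magnitude difference function, Eq. (7); minima indicate pitch-period candidates."""
    n = len(frame)
    return np.array([np.mean(np.abs(frame[: n - j] - frame[j:])) for j in range(1, max_lag + 1)])

def teager_energy(prev_fb, cur_fb, next_fb):
    """Teager energy per filter bank, Eq. (8): TE_n(i) = f_n(i)^2 - f_{n-1}(i) * f_{n+1}(i)."""
    return cur_fb ** 2 - prev_fb * next_fb
```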
MFCCs include phoneme features and are widely used in the field of speech recog-
nition. They can accurately represent speech characteristics at mel frequencies that are
similar to the characteristics of human hearing.
The features of the speech were extracted in frame units. The mean, standard deviation, and interquartile range (IQR), which are statistical values, were calculated from the sequence of extracted baseline feature vectors and used as features for emotion classification.
A GA begins with a population consisting of a set of individuals generated by a computer in a search space representing all possible solutions to the target problem. It then selects only suitable objects according to the objective function, which measures how well the objects fit in their environment. Ultimately, the algorithm selects an optimal object by repeating a process of evolution toward more optimized objects [28,29].

We used the GAFS to optimize feature vectors, as shown in Figure 4. When the feature set is initially created, the length of the chromosomes is adjusted to match the number of features required by the objective function. Because the objective function consists of 10 features, the length of the chromosomes is set to 10. The first stage consists of creating a predetermined population size N of objects with chromosome lengths of 10. In the second stage, the N objects are analyzed according to the objective function to determine their fitness. The fitness values found in this manner are used to select an elite group. Crossover and mutation are then applied according to preset crossover and mutation rates. Fitness is then rechecked, and steps two to five are repeated until the end conditions are satisfied.

Figure 4. Genetic algorithms as a feature selection method (GAFS) algorithm.

SVM Classifier
The SVM classifier was used to recognize patterns in the emotion information contained in the optimized feature vectors. The SVM classifier finds an optimal hyperplane that minimizes the number of decision errors between two classes. Additionally, an SVM has a very simple structure compared to a neural network and has advantages in terms of generalization. This makes the SVM a popular choice in many application fields [30–33]. To classify the optimized feature vectors of the speech, the data access patterns of the SVM classifier are analyzed.
First, the discriminating equation that forms the basis of classification is defined as Equation (9).
$$f(x) = \sum_{i=1}^{M} a_i^* z_i K(X_i^*, X) + b^* \quad (9)$$
X_i^* is the ith vector among the M support vectors obtained by learning. The optimization bias b* and the Lagrange multiplier a* are the solutions of the quadratic programming problem determined by learning. When the radial basis function (RBF) is used as the kernel function, K(X_i^*, X) can be expressed as Equation (10).

$$K(X_i^*, X) = \exp\left(-\frac{\| X_i^* - X \|^2}{\sigma^2}\right) \quad (10)$$

σ is a parameter related to the width of the RBF. The reason for using the RBF as the kernel function is that it is the preferred kernel function when linear classification of the input signals is impossible. The pseudocode corresponding to the discriminating equations composed of Equations (9) and (10) is shown in Figure 5.

Figure 5. Pseudocode for support vector machine (SVM)-based classifier.
NUM_in, NUM_sv, and NUM_feature represent the number of input vectors, the number of support vectors, and the dimension of the support and input vectors, respectively. SV is a structure that stores the support vectors and is composed of a support vector with a dimension of NUM_feature and the corresponding Lagrange multiplier. IN is a structure that stores the input vector. The vector in each structure is stored as an array, and its name is the feature. Dist is a value corresponding to ||X_i^* − Z||^2 in Equation (10), KF is the resulting value of the kernel function expressed as Equation (10), and F is the f(x) on the left side of Equation (9).
The pseudocode has three loops. The first loop substitutes the input vectors in the
discriminating equation sequentially. The second loop matches each input vector to every
support vector one by one. The last loop performs the vector operation of one input vector
and one support vector. The support vectors in this study are sequentially loaded for each
input vector and used for calculation in units of the vector elements. In other words, the
first element of the first support vector to the last element of the last support vector are
read in turn for one input vector, and they are also read in the same sequence for the next
input vector.
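The Figure 5 pseudocode is not reproduced in this extraction; the following is a minimal NumPy sketch of the same three-loop evaluation of Equations (9) and (10), reusing the Dist, KF, and F names defined above. The support vectors, multipliers, labels, and bias are assumed to come from a previously trained model.

```python
import numpy as np

def svm_rbf_decision(inputs, support_vectors, alphas, labels, bias, sigma):
    """Evaluate f(x) of Eq. (9) with the RBF kernel of Eq. (10) for each input vector.

    inputs:          (NUM_in, NUM_feature) array of input vectors
    support_vectors: (NUM_sv, NUM_feature) array of support vectors
    alphas, labels:  (NUM_sv,) Lagrange multipliers a_i* and class labels z_i
    bias:            scalar b*; sigma: RBF width parameter
    """
    decisions = []
    for x in inputs:                               # first loop: one input vector at a time
        F = bias
        for sv, a, z in zip(support_vectors, alphas, labels):  # second loop: every support vector
            dist = 0.0
            for d in range(len(sv)):               # third loop: element-wise vector operation
                diff = x[d] - sv[d]
                dist += diff * diff                # Dist = ||X_i* - Z||^2
            KF = np.exp(-dist / sigma ** 2)        # kernel value of Eq. (10)
            F += a * z * KF                        # accumulate a_i* z_i K(X_i*, X)
        decisions.append(F)
    return np.array(decisions)
```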
The kernel functions that are generally used in SVMs include linear kernels, Gaussian
radial basis function (RBF) kernels, polynomial kernels, and sigmoid kernels, as shown in
Table 2 [34,35]. We tested linear, Gaussian RBF, polynomial, and sigmoid kernels to classify
emotions and perform recognition.
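As a rough illustration of the kernel comparison (not the authors' exact experimental setup), scikit-learn's SVC can be trained with each candidate kernel on the optimized feature vectors and scored on held-out data; the 50:50 split mirrors the ratio used later in this study, and the feature and label arrays are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_kernels(X, y, kernels=("linear", "rbf", "poly", "sigmoid")):
    # X: utterance-level feature vectors, y: emotion labels (illustrative inputs)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)
    scores = {}
    for kernel in kernels:
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="scale"))
        model.fit(X_tr, y_tr)
        scores[kernel] = model.score(X_te, y_te)   # classification accuracy per kernel
    return scores
```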
representative words and antonyms. Table 4 lists the extracted emotional words based on
the color information.
No  Color Name       R    G    B    Distance
1   Bright Red       255  35   40   260.5
2   Blue             0    93   199  219.7
3   Brown            96   47   25   109.8
4   Bright Yellow    255  214  10   333.0
5   Orange           255  91   24   271.8
6   Purple           140  43   137  200.5
7   Beige            232  203  173  353.5
8   Lime             94   168  34   195.4
9   Lavender         130  101  182  245.4
10  Olive Green      84   82   28   120.7
11  Burgundy         127  37   36   137.1
12  Green            0    130  63   144.5
13  Light Pink       251  188  172  357.7
14  Fuchsia          245  119  158  314.9
15  Light Blue       128  192  217  316.8
16  Navy             0    38   100  107.0
17  Greenish Yellow  199  181  0    269.0
18  Terracotta       172  165  26   239.8
19  Teal Blue        0    177  162  239.9
20  Neutral Gray     128  128  128  211.7
After converting the emotion elements and emotion words into a 2D space, this new
space was used to obtain the coordinates of the emotion words and emotion elements. These
coordinates were used to measure distance and the resulting distances were considered
to represent the relationships between emotion elements correlated with emotion words.
A smaller distance between coordinates indicates that the corresponding relationship is
more significant (inversely proportional). A larger color distribution in an image indicates
that the contained relationships are more significant (directly proportional). The inverse of
distance was calculated to measure distance ratios, as shown in Equation (11).
$$D_{ik} = \frac{d_{ik}^{-1}}{\sum_{j=1}^{20} d_{ij}^{-1}} \quad (11)$$
This equation yields the distance ratio of the emotion word i relative to the emotion element k. The numerator is the inverse of the distance between the emotion word i and emotion element k, and the denominator is the sum of the inverses of the distances between the emotion word i and the 20 color emotion models.
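A small sketch of the distance-ratio computation of Equation (11), given a precomputed matrix of distances between emotion words and the 20 color emotion models.

```python
import numpy as np

def distance_ratios(d):
    """d: (n_words, 20) matrix of distances d_ik in the color-emotion space.

    Returns D_ik = d_ik^-1 / sum_j d_ij^-1, so closer emotion elements get larger ratios.
    """
    inv = 1.0 / d
    return inv / inv.sum(axis=1, keepdims=True)
```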
Tables 5 and 6 show the coordinates of each emotional word and the color emotion
element measured in the color–emotion space.
$$Score_i = \sum_{e=1}^{6} \frac{uEmotion_e}{100} \times hEmotion_{i,e} \quad (12)$$
Equation (12) is used to calculate scores for the songs that a user has listened to based on that user's current emotional state (i.e., uEmotion_e). Here, i is a song in the music emotion database and hEmotion_{i,e} is the emotion information of the i-th song for emotion e. Therefore, Equation (12) can be used to set the priority of all songs in the music emotion database.
To collect a music list up to rank x, emotion information standardization is performed by using Equation (13) to calculate Euclidean distance.

$$nEmotion_{i,k} = \frac{Emotion_{i,k}}{\sum_{m=1}^{8} Emotion_{i,m}} \quad (13)$$

Here, Emotion_{i,k} is the value that corresponds to the i-th song's k-th emotion category. The proposed algorithm calculates Euclidean distances and generates a recommendation list sorted in ascending order of distance.

$$\sqrt{\sum_{m=1}^{8} \left( nEmotion_{1st,m} - nEmotion_{i,m} \right)^2} \quad (14)$$

Equation (14) calculates the similarity between songs based on the song with the highest value according to Equation (12) and the standardized emotion information. Here, i represents all of the songs in the music emotion database.
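A sketch of the song scoring, normalization, and Euclidean ranking of Equations (12)–(14). The array shapes (6 user-emotion values, 8 song-emotion categories) follow the equations above; restricting the score to the first 6 song categories is an assumption, and the data themselves are placeholders.

```python
import numpy as np

def recommend_music(u_emotion, h_emotion):
    """u_emotion: (6,) user emotion values in percent; h_emotion: (n_songs, 8) song emotion data.

    Eq. (12): score each song against the user's current state.
    Eq. (13): normalize each song's emotion vector.
    Eq. (14): rank songs by Euclidean distance to the top-scoring song.
    """
    scores = (h_emotion[:, :6] * (u_emotion / 100.0)).sum(axis=1)      # Eq. (12)
    n_emotion = h_emotion / h_emotion.sum(axis=1, keepdims=True)       # Eq. (13)
    best = n_emotion[np.argmax(scores)]
    dists = np.sqrt(((n_emotion - best) ** 2).sum(axis=1))             # Eq. (14)
    return np.argsort(dists)                                           # ascending distance
```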
2.2. Emotion Collaborative Filtering Module
Collaborative filtering is a method that predicts preferences regarding items by collecting the preferences of other users with similar tastes. It begins with the assumption that there are general trends and patterns in tastes and that people will maintain their past tastes in the future. Based on the principle that the preferred items vary depending on user emotions, content is recommended according to user emotions by incorporating collaborative filtering and static emotion information received from users in real time. Figure 6 presents the structure of the emotion collaborative filtering module.

Figure 6. Emotion collaborative filtering module configuration.
Pearson’s correlation coefficients are calculated using the evaluation values for items
that were evaluated by two users. This allows the proposed algorithm to detect similari-
ties between pairs of users. By including the emotion information calculated previously,
Sensors 2021, 21, 1997 15 of 25
Pearson’s correlation coefficients are calculated using the evaluation values for items
that were evaluated by two users. This allows the proposed algorithm to detect similarities
between pairs of users. By including the emotion information calculated previously, the
Pearson correlation algorithm performs clustering using dynamic emotion information
received from users in real time and user personal information. The evaluation scores of
the created user groups are then used to measure the similarities between users regarding
content according to user emotions. The measured degrees of similarity have values
between −1 and 1. As the value approaches one, users are considered to be more similar.
As the value approaches −1, users are considered to be more dissimilar. When the level
of similarity is zero, it indicates that users have no correlation. Equation (15) is used to
calculate the emotional Pearson correlation coefficients incorporating emotion information.

$$w_{a,u,e} = \frac{\sum_{i=1}^{m} (r_{a,i,e} - \bar{r}_{a,e})(r_{u,i,e} - \bar{r}_{u,e})}{\sqrt{\sum_{i=1}^{m} (r_{a,i,e} - \bar{r}_{a,e})^2}\,\sqrt{\sum_{i=1}^{m} (r_{u,i,e} - \bar{r}_{u,e})^2}} \quad (15)$$
Here, wa,u,e is the similarity between a user and a neighboring user, a is the target
user, u is the neighboring user, e is the emotion, m is the number of items evaluated by
both a and u, r a,i,e is the evaluation score of user a for item i when considering e, ru,i,e is the
evaluation score of user u for item i when considering e, r a,e is the overall evaluation score
of user a for e, and r u,e is the overall evaluation score of user u for e. The formula in the
denominator refers to the standard deviation of user a for e and the standard deviation of
user u for e.
After performing clustering based on the measured levels of similarity between users
and user personal information, the evaluation data from the created groups of users
are used to predict preferences (i.e., evaluation scores) for items that the users have not
seen. By using evaluation scores that were directly provided by the users and emotions
that correspond to the current circumstances, it is possible to recommend personalized
content. Equation (16) defines the prediction algorithm for evaluating scores by considering
emotion information.
Here, p_{a,i,e} is the predicted evaluation score for item i, a is the target user, u is the
neighboring user, e is the emotion, n is the number of neighboring users with evaluation
scores, r a,e is the overall average evaluation score of user a for e, r u,e is the overall average
evaluation score of user u for e, and ru,i,e is the evaluation score of user u for item i
considering e. Finally, wa,u is the level of similarity between users in terms of the emotional
Pearson correlation coefficients (i.e., the level of similarity between a and u).
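Equation (16) itself is missing from this extraction. The sketch below implements the emotional Pearson similarity of Equation (15) and assumes the standard weighted-deviation form for the preference prediction, which matches the variables described above but is not guaranteed to be the authors' exact Equation (16).

```python
import numpy as np

def emotional_pearson(r_a, r_u):
    """Eq. (15): similarity between users a and u over the m items both rated under emotion e."""
    da, du = r_a - r_a.mean(), r_u - r_u.mean()
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((du ** 2).sum())
    return float((da * du).sum() / denom) if denom else 0.0

def predict_rating(r_a_mean, neighbor_means, neighbor_ratings, weights):
    """Assumed Eq. (16)-style prediction: the target user's mean rating for emotion e plus the
    similarity-weighted deviations of the n neighboring users for the same item."""
    weights = np.asarray(weights, dtype=float)
    dev = np.asarray(neighbor_ratings, dtype=float) - np.asarray(neighbor_means, dtype=float)
    return r_a_mean + (weights * dev).sum() / (np.abs(weights).sum() + 1e-12)
```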
recommendation list, and measurement graphs are presented to the user by a content
recommendation mobile application interface.
As shown in Table 7, the emotional speech database consists of five items that represent
the user registration, greetings, living information, commands, and emotions.
The 9000 emotional speech data in the database were evaluated on a five-point scale
by a group of seven emotion evaluators regarding how well the speech expressed the
emotions. A total of 2400 good-quality data that expressed the emotions well with no
background noise were selected, with a 1:1 ratio of male and female data.
The evaluation of emotions was performed according to the standard chart as shown
in Table 8. The final speech database was selected separately for male and female speech
data. For each emotion, a total of 400 data were selected, consisting of 200 male and 200 female
samples. Table 9 shows the standard scores for the database selection and the mean scores
of the final speech data.
Table 9. The standard scores for the database selection and the mean scores of the final speech data.
The emotions of the experimental data were categorized as neutral, happy, sad, angry,
surprised, and bored, which are the emotion categories of the emotional speech database.
For each emotion, 400 data were selected as the experiment data, and the data were
analyzed with a window size of 250, a time step of 78, and a frame unit of 15 ms.
The collected data were divided into learning-stage data and recognition-stage data.
The classification accuracy of the recognition stage’s features was calculated and verified
through classification and comparison using the GAFS and SVM algorithm, which were
trained during the learning stage. If the learning data account for less than 10% of the
total data, accuracy is very low. Therefore, a sufficient amount of learning data must be
provided. When the learning data ratio is 50%, the accuracy reaches 0.975. However,
although accuracy generally increases as the amount of learning data increases, accuracy
tends to decrease as the learning data ratio approaches 100%. Therefore, the ratio of learning
data to recognition data was set to 50:50. The trained model and newly entered recognition
data were used to calculate emotion recognition accuracy rates and the feasibility of the
trained model was thoroughly reviewed.
In this study, performance was evaluated using precision, recall, and F-measure
values, which are the main performance analysis metrics used in automatic classification
and machine learning evaluations, to select an optimal SVM kernel function. In most
cases, precision and recall can be calculated using a 2 × 2 contingency table for each
category [40,41]. Table 10 compares the ground truth classification results to the recognition
system classification results.
In Table 10, a denotes the number of data that are correctly classified into a particular
emotion category, b denotes the number of data that are incorrectly classified into that
category, c denotes the number of data that should be classified into the category but are
missed by the system, and d denotes the number of data that neither belong to the category
nor are assigned to it by the system. Equations (17)–(19)
are used to calculate precision, recall, and F-measure values, respectively.
$$Precision\ (P) = \frac{a}{a+b} \quad (17)$$

$$Recall\ (R) = \frac{a}{a+c} \quad (18)$$

$$F\text{-}measure\ (F) = \frac{2RP}{R+P} \quad (19)$$
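Equations (17)–(19) in code form, with a, b, and c as defined for the contingency table above.

```python
def precision_recall_f(a, b, c):
    """a: correctly classified, b: wrongly assigned to the category, c: missed from the category."""
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return precision, recall, f_measure
```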
The speech emotions contained in the speech emotion database were extracted from
50 types of speech performed by users. To choose a kernel function suitable for the target
user emotions, emotion classification accuracy was verified by testing different SVM kernel
functions, as shown in Table 11. As shown in the results in Table 11, the RBF kernel yields a
recognition accuracy of 86.98%, making it the best-performing kernel function. The lowest
recognition accuracy (77.74%) can be observed for the sigmoid kernel function.
The kernel function recognition results for each specific emotion are discussed below.
Table 12 lists the linear kernel function classification results for each emotion. The
average accuracy is 83.99%. The recall for each emotion is as follows: neutral (82.19%),
happy (86.33%), sad (86.87%), angry (82.37%), surprised (82.67%), and bored (83.50%).
The polynomial kernel function yields an accuracy of 86.17%. The recall for each
emotion is as follows: neutral (85.89%), happy (86.26%), sad (87.61%), angry (86.47%),
surprised (85.43%), and bored (85.34%). These recall results are shown in Table 13.
The RBF kernel yields an emotion classification accuracy of 94.77%. The recall for
each emotion is as follows: neutral (90.84%), happy (100%), sad (95.83%), angry (97.50%),
surprised (93.22%), and bored (91.20%). Table 14 lists the RBF kernel results.
The sigmoid kernel function yields an emotion classification accuracy of 82.97%. The
recall for emotion is as follows: neutral (81.80%), happy (82.60%), sad (86.66%), angry
(81.64%), surprised (82.04%), and bored (83.10%). Table 15 lists these results.
Table 16 lists the precision, recall, and F-measure results, which are the performance evaluation metrics for each kernel function. The nonlinear SVM RBF kernel exhibits the best performance in terms of precision (94.70%), recall (94.77%), and F-measure (94.71%). The sigmoid kernel function exhibits the worst performance in terms of precision (84.00%), recall (84.87%), and F-measure (82.82%).

Table 16. Precision, recall, and F-measure by kernel function.

Kernel       Precision  Recall  F-Measure
Linear       86.23      85.93   83.44
Polynomial   87.00      87.46   84.82
RBF          94.70      94.77   94.71
Sigmoid      84.00      84.87   82.82

Furthermore, when the precision, recall, and F-measure were analyzed as the performance measures for the RBF kernel function, the precision was 94.70%, the recall was 94.77%, and the F-measure was 94.71%.

To evaluate the performance of the recommendation system proposed in this paper, we adopted the mean absolute error (MAE) metric. To determine the accuracy of the recommendation system, predicted preferences and actual preferences were measured and compared for each item. The results indicate how similar the predicted evaluation scores and actual evaluation scores are on average. The dataset used in our experiments contained content data (images and music) for each emotion generated from the speech emotion information.

Experiments were performed by randomly selecting 80% of the dataset for training and predicting the remaining 20%, as shown in Figure 7.

Figure 7. Removal of 20% of the data from the original data.

The performance of the proposed system was evaluated by comparing the predicted 20% of the data to the 20% of the original data that were withheld from training. Figure 8 presents an example of comparing predicted preference data to withheld preference data.

Figure 8. Comparison of original and predicted affinity data.

MAE is the average of the absolute errors between two groups of values that are comparison targets. It is an index that represents how similar predicted evaluation scores are to actual user evaluation scores on average. The performance of the recommendation system is considered to be better when the MAE value is smaller. An MAE value of zero indicates that the recommendation system is perfectly accurate. Equation (20) defines the MAE calculation.

$$MAE = \frac{\sum_{i=1}^{n} |p_i - q_i|}{n} \quad (20)$$

Here, p_i is the actual preference of user p, q_i is the predicted preference of user q, and n is the number of content items used by user p.

In this study, the MAE values were normalized to a range of zero to one and inverted such that zero indicates that none of the values match and one indicates that all of the values match. In Equation (21), the normalization formula is included in the MAE calculation.

$$MAE = 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{|p_i - q_i|}{MAX - MIN} \quad (21)$$

Table 17. Performance evaluation of recommendation system by emotion using mean absolute error (MAE) algorithm.
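Equations (20) and (21) in code form; taking MAX and MIN as the bounds of the preference scale (here a 1–5 scale) is an assumption for illustration.

```python
import numpy as np

def mae(actual, predicted):
    """Eq. (20): mean absolute error between actual and predicted preferences."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def normalized_mae(actual, predicted, max_val=5.0, min_val=1.0):
    """Eq. (21): normalized and inverted MAE, so 1 means a perfect match and 0 no match."""
    err = np.abs(np.asarray(actual) - np.asarray(predicted)) / (max_val - min_val)
    return float(1.0 - err.mean())
```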
In this study, a content recommendation system that uses individual speech emotion information and collaborative filtering was implemented as a mobile application. Recognized emotion information values were used to predict user preferences and recommend content. Figure 9 presents the results achieved by the proposed system. Specifically, Figure 9a presents the recommended emotion content and measurement values, Figure 9b presents the emotion content recommendation list, and Figure 9c presents a measurement graph of emotion content.
each individual. The recommendation method proposed in this paper has high emotion
recognition precision and recall compared to existing methods and has a very simple
structure because it uses existing machine learning algorithms. In the future, the proposed
method may be extended to human-oriented applications in a variety of environments,
such as emotional interactions that occur between people based on human emotions. It
may also be used effectively in intelligent systems that recognize emotional exchanges
during interactions between humans and machines. The proposed system is helpful for
considering user characteristics and increasing user satisfaction by recommending content
matching user emotions.
In future studies, it will be necessary to analyze and study various algorithms for
increasing recognition rates, as indicated by the speech emotion recognition results pre-
sented in this paper. Therefore, emotions that are extracted from facial expressions and
speech will be used to implement systems with more stable recognition rates. Additionally,
it will be necessary to collect various biometric data and analyze their characteristics to
determine whether it is possible to judge emotions based on a small number of objective
features. Data should be collected using a more sophisticated experimental design than
that used in this study and research should focus on the direction of selecting models
and features based on data that can reduce individual differences. Additionally, various
machine learning algorithms other than the SVM should be considered.
Author Contributions: Conceptualization, T.-Y.K. and H.-D.K.; methodology, T.-Y.K. and H.K.; soft-
ware, T.-Y.K. and H.K.; validation, T.-Y.K. and H.-D.K.; formal analysis, T.-Y.K. and H.K.; investigation,
S.-H.K. and H.-D.K.; resources, S.-H.K. and H.-D.K.; data curation, T.-Y.K.; writing—original draft
preparation, T.-Y.K.; writing—review and editing, T.-Y.K. and H.-D.K.; visualization, T.-Y.K. and
H.-D.K.; supervision, H.K. and S.-H.K.; project administration, H.K. and H.-D.K.; funding acquisition,
T.-Y.K. and S.-H.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Basic Science Research Program through the
National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-
2017R1A6A1A03015496). This research was supported by the National Research Foundation of
Korea (NRF) grant funded by the Korean government (MSIT) (No. 2019R1F1A1041186).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Kim, A.Y.; Jang, E.H.; Sohn, J.H. Classification of Negative Emotions based on Arousal Score and Physiological Signals using
Neural Network. Sci. Emot. Sensib. 2018, 21, 177–186. [CrossRef]
2. Esposito, A.; Fortunati, L.; Lugano, G. Modeling Emotion, Behavior and Context in Socially Believable Robots and ICT Interfaces.
Cogn. Comput. 2014, 6, 623–627. [CrossRef]
3. Guo, K.; Chai, R.; Candra, H.; Guo, Y.; Song, R.; Nauyen, H.; Su, S. A Hybrid Fuzzy Cognitive Map/Support Vector Machine
Approach for EEG-Based Emotion Classification Using Compressed Sensing. Int. J. Fuzzy Syst. 2019, 21, 263–273. [CrossRef]
4. Shin, D.M.; Shin, D.I.; Sinn, D.K. Development of emotion recognition interface using complex EEG/ECG bio-signal for interactive
contents. Multimed. Tools Appl. 2016, 76, 11449–11470. [CrossRef]
5. Wang, P.; Dong, L.; Xu, Y.; Liu, W.; Jing, N. Clustering-Based Emotion Recognition Micro-Service Cloud Framework for Mobile
Computing. IEEE Access 2020, 8, 49695–49704. [CrossRef]
6. Kim, T.Y.; Lee, K.S.; An, Y.E. A study on the Recommendation of Contents using Speech Emotion Information and Emotion
Collaborative Filtering. J. Digit. Contents Soc. 2018, 19, 2247–2256. [CrossRef]
7. Liu, Z.T.; Xie, Q.; Wu, M.; Cao, W.H.; Mei, Y.; Mao, J.W. Speech emotion recognition based on an improved brain emotion learning
model. Neurocomputing 2018, 309, 145–156. [CrossRef]
8. Mencattini, A.; Marinelli, E.; Costantini, G.; Todisco, M.; Basile, B.; Bozzali, M.; Natale, C.D. Speech emotion recognition using
amplitude modulation parameters and a combined feature selection procedure. Knowl. Based Syst. 2014, 63, 68–81. [CrossRef]
9. Badshah, A.M.; Rahim, N.; Ullah, N.; Ahmad, J.; Muhammad, K.; Lee, M.Y.; Kwon, S.I.; Baik, S.W. Deep features-based speech
emotion recognition for smart affective services. Multimed. Tools Appl. 2019, 78, 5571–5589. [CrossRef]
Sensors 2021, 21, 1997 24 of 25
10. Hsu, Y.L.; Wang, J.S.; Chiang, W.C.; Hung, C.H. Automatic ECG-Based Emotion Recognition in Music Listening. IEEE Tarns.
Affect. Comput. 2020, 11, 85–99. [CrossRef]
11. Lee, S.Z.; Seong, Y.H.; Kim, H.J. Modeling and Measuring User Sensitivity for Customized Service of Music Contents. J. Korean
Soc. Comput. Game 2013, 26, 163–171. [CrossRef]
12. Zhang, Y.; Wang, Y.; Wang, S. Improvement of Collaborative Filtering Recommendation Algorithm Based on Intuitionistic Fuzzy
Reasoning Under Missing Data. IEEE Access 2020, 8, 51324–51332. [CrossRef]
13. Ku, M.J.; Ahn, H.C. A Hybrid Recommender System based on Collaborative Filtering with Selective Utilization of Content-based
Predicted Ratings. J. Intell. Inf. Syst. 2018, 24, 85–109. [CrossRef]
14. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting
modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [CrossRef]
15. Drakopoulos, G.; Pikramenos, G.; Spyrou, E.D.; Perantonis, S.J. Emotion Recognition from Speech: A Survey. WEBIST 2019, 1,
432–439. [CrossRef]
16. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process.
Control 2020, 59, 101894. [CrossRef]
17. Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE
Access 2020, 8, 79861–79875. [CrossRef]
18. Wang, X.; Chen, X.; Cao, C. Human emotion recognition by optimally fusing facial expression and speech feature. Signal Process.
Image Commun. 2020, 84, 115831. [CrossRef]
19. Posner, J.; Russell, J.A.; Peterson, B.S. The circumplex model of affect: An integrative approach to affective neuroscience, cognitive
development, and psychopathology. Dev. Psychopathol. 2005, 17, 715–749. [CrossRef] [PubMed]
20. Kim, T.Y.; Ko, H.; Kim, S.H. Data Analysis for Emotion Classification Based on Bio-Information in Self-Driving Vehicles. J. Adv.
Transp. 2020, 2020, 8167295. [CrossRef]
21. Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discrimi-
nant Temporal Pyramid Matching. IEEE Trans. Multimed. 2018, 20, 1576–1590. [CrossRef]
22. Zhang, W.; Zhao, D.; Chai, Z.; Yang, L.T.; Liu, X.; Gong, F.; Yang, S. Deep learning and SVM-based emotion recognition from
Chinese speech for smart affective services. Softw. Pract. Exp. 2017, 47, 1127–1138. [CrossRef]
23. Özbay, Y.; Ceylan, M. Effects of window types on classification of carotid artery Doppler signals in the early phase of atherosclerosis
using complex-valued artificial neural network. Comput. Biol. Med. 2006, 37, 287–382. [CrossRef] [PubMed]
24. Tan, J.; Wen, B.; Tian, Y.; Tian, M. Frequency Convolution for Implementing Window Functions in Spectral Analysis. Circuits Syst.
Signal Process. 2016, 36, 2198–2208. [CrossRef]
25. Ho, N.H.; Yang, H.J.; Kim, S.H.; Lee, G.S. Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head
Fusion Attention-Based Recurrent Neural Network. IEEE Access 2020, 8, 61672–61686. [CrossRef]
26. Wankhade, S.B.; Doye, D.D. IKKN Predictor: An EEG Signal Based Emotion Recognition for HCI. Wirel. Pers. Commun. 2019, 107,
1135–1153. [CrossRef]
27. Rosen, S.; Hui, S.N.C. Sine-wave and noise-vocoded sine-wave speech in a tone language: Acoustic details matter. J. Acoust. Soc.
Am. 2015, 138, 3698–4400. [CrossRef] [PubMed]
28. Park, C.H.; Sim, K.B. The Pattern Recognition Methods for Emotion Recognition with Speech Signal. Int. J. Fuzzy Log. Intell. Syst.
2006, 6, 150–154. [CrossRef]
29. Murthy, Y.V.S.; Koolagudi, S.G. Classification of vocal and non-vocal segments in audio clips using genetic algorithm based
feature selection (GAFS). Expert Syst. Appl. 2018, 106, 77–91. [CrossRef]
30. Rauber, T.W.; Assis Bololt, F.; Varejao, F.M. Heterogeneous Feature Models and Feature Selection Applied to Bearing Fault
Diagnosis. IEEE Trans. Ind. Electron. 2015, 62, 637–646. [CrossRef]
31. Venkatesan, S.K.; Lee, M.B.; Park, J.W.; Shin, C.S.; Cho, Y. A Comparative Study based on Random Forest and Support Vector
Machine for Strawberry Production Forecasting. J. Inf. Technol. Appl. Eng. 2019, 9, 45–52.
32. Jan, S.U.; Lee, Y.D.; Koo, I.S. Sensor Fault Classification Based on Support Vector Machine and Statistical Time-Domain Features.
IEEE Access 2017, 5, 8682–8690. [CrossRef]
33. Poudel, S.; Lee, S.W. A Novel Integrated Convolutional Neural Network via Deep Transfer Learning in Colorectal Images. J. Inf.
Technol. Appl. Eng. 2019, 9, 9–22.
34. Amani, R.J.; Josef, H.; Elisabeth, L.; Lina, V.S. Forward deterministic pricing of options using Gaussian radial basis functions. J.
Comput. Sci. 2018, 24, 209–217. [CrossRef]
35. Wei, W.; Jia, Q. Weighted Feature Gaussian Kernel SVM for Emotion Recognition. Comput. Intell. Neurosci. 2016, 2016, 7696035.
[CrossRef]
36. Li, J.; Wang, J.Z. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Pattern Anal. Mach.
Intell. 2003, 25, 1075–1088. [CrossRef]
37. Muszynski, M.; Tian, L.; Lai, C.; Moore, J.; Kostoulas, T.; Lombardo, P.; Pun, T.; Chanel, G. Recognizing induced emotions of
movie audiences from multimodal information. IEEE Trans. Affect. Comput. 2019. [CrossRef]
38. Liu, N.H. Comparison of content-based music recommendation using different distance estimation methods. Appl. Intell. 2012,
38, 160–174. [CrossRef]
Sensors 2021, 21, 1997 25 of 25
39. Xing, B.; Zhang, K.; Zhang, L.; Wu, X.; Dou, J.; Sun, S. Image–Music Synesthesia-Aware Learning Based on Emotional Similarity
Recognition. IEEE Access 2019, 7, 136378–136390. [CrossRef]
40. Veroniki, A.A.; Pavlides, M.; Patsopoulos, N.A.; Salantim, G. Reconstructing 2 × 2 contingency tables from odds ratios using the
Di Pietrantonj method: Difficulties, constraints and impact in meta-analysis results. Res. Synth. Methods 2013, 4, 78–94. [CrossRef]
41. Louis, E.; Heng, S.; Cyril, R. A more powerful unconditional exact test of homogeneity for 2 × c contingency table analysis. J.
Appl. Stat. 2019, 46, 2572–2582.