CNN+LSTM Architecture for Speech Emotion Recognition with Data
Augmentation
Caroline Etienne 1,2,∗, Guillaume Fidanza 2,∗, Andrei Petrovskii 2,∗,
Laurence Devillers 1, Benoit Schmauch 2
1 LIMSI, CNRS, Paris-Sud University, Paris-Saclay University / F-91405 Orsay, France
2 DreamQuark, 29 rue de Courcelles, 75008 Paris, France
∗ Equal contribution
[email protected], [email protected],
[email protected], [email protected],
[email protected]
Abstract

In this work we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones for aggregating long-term dependencies. We examine the techniques of data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% for weighted accuracy and 61.7% for unweighted accuracy on four emotions.

1. Introduction

Providing high-quality interaction between a human and a machine is a very challenging and active field of research with numerous applications. An important part of this domain is the recognition of human speech emotions by computer systems. In recent years, impressive progress has been achieved in speech recognition by means of deep learning [1, 2, 3, 4]. These achievements also include significant results on speech emotion recognition (SER), see e.g. [5, 6, 7]. In this work we build a neural network for SER on the IEMOCAP dataset [8] and achieve results highly competitive with the state of the art¹.

¹ To our knowledge, the present state of the art has been achieved in [6]. However, the cross-validation procedure performed in that paper (as in other works presenting results obtained on the IEMOCAP dataset) includes only five folds of the dataset out of the ten possible. On the other hand, our experiments showed (see section 3) that the performance strongly depends on the part of the data used for measuring the scores. As a consequence, results obtained by 5-fold cross-validation without clarification of what data has been used for the measurement are not possible to compare with. Therefore we propose 10-fold cross-validation as the correct way of measuring scores on the IEMOCAP dataset, and we present our results correspondingly.

When treating a SER problem with deep learning, one either creates hand-crafted acoustic features (MFCC, pitch, energy, ZCR...), which are used as inputs to a neural network, or sends the data, after some preprocessing (e.g. Fourier transform), directly to a neural network. We apply the second strategy by transforming the audio signal into a spectrogram, which is then used as an input to convolutional layers, followed by recurrent ones. Such a choice of architecture, which has recently demonstrated very competitive performance [9, 1, 7], admits two main interpretations. On the one hand, adding a few convolutional layers at the beginning of the network is an efficient way to reduce the dimensionality of the data and can significantly simplify the training procedure. On the other hand, it is also possible to use a deep CNN for extracting high-level features, which are then fed to an RNN for final time aggregation. We test a variety of architectures with different depths for the convolutional (1-6 layers) and recurrent modules (1-4 Bi-LSTM layers), achieving the best scores with a 4+1 scenario².

² Four convolutional and one Bi-LSTM layer.

To address the challenges of class imbalance and data scarcity, we explored vocal tract length perturbation for the purpose of data augmentation and showed that it significantly improves the performance. In line with [10, 1, 11, 12], we examined batch normalization applied to the recurrent layers of the network. Finally, we noticed that the parameters of the convolutional and Bi-LSTM layers are trained at a very different pace. We tried to take advantage of this observation by per-layer adjustment of the update rule parameters, but unfortunately were not able to make a definite conclusion in favor of this idea.
1.1. Dataset description

IEMOCAP (Interactive Emotional Dyadic Motion Capture), collected at the University of Southern California (USC) [8], is one of the standard datasets for emotion recognition. It consists of twelve hours of audio and video recordings performed by 10 professional actors (five women and five men) and organized in 5 sessions of dialogues between two actors of different genders, either playing a script or improvising. Each sample of the audio set is an utterance assigned an emotion label. Labeling was made by six students of USC, three at a time for each utterance. The annotators were allowed to assign multiple labels if necessary. The final true label for each utterance was chosen by majority vote if the emotion category with the highest vote was unique. Since the annotators reached consensus more often when labeling improvised utterances (83.1%) than scripted ones (66.9%) [8], we concentrate only on the improvised part of the dataset. For the sake of comparison with the prior state-of-the-art approaches, we predict four of the most represented emotions: neutral, sadness, anger and happiness, which leaves us with 2280 utterances in total.

2. Data augmentation

The IEMOCAP dataset has two main drawbacks: class imbalance (see Fig. 1) and small size. To cope with both obstacles, we examined data augmentation by means of vocal tract length perturbation (VTLP), at the same time oversampling the least represented classes of the dataset: happiness and anger. VTLP is based on the speaker normalization technique considered in [13], where it was implemented to reduce interspeaker variability. The difference in human vocal tract lengths can be modeled by rescaling the peaks of significant formants along the frequency axis with a factor α taking values in the approximate range (0.9, 1.1). Therefore, in order to get rid of this variability, one should estimate the factor for each speaker and normalize the spectrograms accordingly. Applied inversely, the same idea can be used for data augmentation [14, 15, 16]: in order to generate new samples, one simply has to rescale the original spectrograms along the frequency axis while keeping the scaling factor in the range (0.9, 1.1). Both approaches, normalization and augmentation, pursue the same objective: to enforce the invariance of the model to speaker-dependent features, since they are not relevant to the classification criterion. Augmentation, however, is easier to implement because we do not need to estimate the scaling factor of each speaker, and therefore we stick to this option.

Figure 1: Class distribution of the utterances in the improvised part of the IEMOCAP dataset
Rescaling of the frequencies has been performed as follows [13]:

G(f) = α f                                                    for 0 ≤ f ≤ f_0,
G(f) = (f_max − α f_0)/(f_max − f_0) · (f − f_0) + α f_0      for f_0 ≤ f ≤ f_max,     (1)

where f_max is the upper cut-off frequency and f_0 is defined to be larger than the highest significant formants (we took f_0/f_max = 0.9). Therefore, we rescale the frequencies below f_0 with α ∈ (0.9, 1.1), and then rescale the rest so that the considered frequency range stays constant.
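Schematically, the warping of eq. (1) can be sketched in a few lines of NumPy; the discretization of the frequency axis into bins, the interpolation choice and the function name below are illustrative rather than the exact implementation used:

```python
import numpy as np

def vtlp_warp(spec, alpha, f0_ratio=0.9):
    """Warp a spectrogram along the frequency axis with the piecewise-linear
    map of eq. (1): bins below f_0 are scaled by alpha, the remaining band is
    rescaled so that the overall frequency range stays constant.
    spec: array of shape (time, freq_bins); alpha roughly in (0.9, 1.1)."""
    n_freq = spec.shape[1]
    f = np.arange(n_freq, dtype=float)      # bin index used as a proxy for frequency
    f_max = n_freq - 1.0
    f_0 = f0_ratio * f_max

    # G(f) from eq. (1)
    g = np.where(
        f <= f_0,
        alpha * f,
        (f_max - alpha * f_0) / (f_max - f_0) * (f - f_0) + alpha * f_0,
    )

    # Resample every time frame: the warped frame at bin f takes the value of
    # the original frame at G(f) (linear interpolation between bins).
    warped = np.empty_like(spec)
    for t in range(spec.shape[0]):
        warped[t] = np.interp(g, f, spec[t])
    return warped
```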
We tried two strategies of data augmentation. In the first one, a single uniformly distributed value α ∈ (0.9, 1.1) was sampled at each epoch and used to rescale all training examples, and no rescaling was applied to the validation set. In the second strategy, each spectrogram was rescaled with an individually generated α, for the training as well as for the validation set. For evaluation, we used the majority vote of the model predictions on eleven copies of the test set with α = 0.9, 0.92, 0.94, ..., 1.1. We present the scores obtained with the second augmentation strategy, which provided the best result.
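A minimal sketch of the second strategy and of the test-time majority vote, reusing the vtlp_warp helper above; the model.predict interface returning a class index is a placeholder, not an actual API:

```python
import numpy as np

def augment(specs, rng=None):
    """Second strategy: every spectrogram (training or validation) is rescaled
    with its own factor drawn uniformly from (0.9, 1.1)."""
    rng = np.random.default_rng() if rng is None else rng
    return [vtlp_warp(s, rng.uniform(0.9, 1.1)) for s in specs]

def predict_majority_vote(model, spec):
    """Evaluation: classify eleven warped copies of the utterance
    (alpha = 0.9, 0.92, ..., 1.1) and return the majority class."""
    alphas = np.arange(0.9, 1.101, 0.02)
    votes = [model.predict(vtlp_warp(spec, a)) for a in alphas]  # class indices
    return np.bincount(np.array(votes)).argmax()
```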
3. Model description and experiments

As mentioned above, the IEMOCAP dataset consists of five sessions, each being a conversation between a man and a woman, giving 10 speakers in total. In order to see how well the model can generalize to different speakers, we took the validation and test sets to correspond to two different speakers of one of the sessions. The training set was composed of the four remaining sessions. In the course of the experiments, we observed that the performance strongly depends on which speakers are chosen for the test set (see Tab. 2). Therefore we chose a 10-fold cross-validation strategy, in order to average over all possible choices of the dataset splitting. Interestingly, to the best of our knowledge, all the other results reported on the IEMOCAP dataset were obtained by 5-fold cross-validation. In this case the choice of the validation and test sets is not rigorously defined³ and the scores obtained in this way are not possible to compare with.

³ For instance, one could systematically use female speakers for validation and male speakers for test, or inversely.
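This splitting protocol amounts to ten speaker-exclusive folds, one per (session, speaker) pair; a short sketch, where the fold ordering and the dictionary layout are purely illustrative:

```python
# Sketch of the 10-fold, speaker-exclusive splitting described above.
SESSIONS = [1, 2, 3, 4, 5]
SPEAKERS = ["F", "M"]

def make_folds():
    folds = []
    for session in SESSIONS:
        for test_spk in SPEAKERS:
            val_spk = "M" if test_spk == "F" else "F"
            folds.append({
                "train_sessions": [s for s in SESSIONS if s != session],
                "val": (session, val_spk),    # one speaker of the held-out session
                "test": (session, test_spk),  # the other speaker of the same session
            })
    return folds  # 10 folds in total
```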
For evaluating the model performance, we chose weighted (WA) and unweighted (UA) accuracy. WA is the standard accuracy computed over the whole test set. UA is the average of the accuracies computed for each emotion separately. First, we compute the metrics for each fold, and then present the scores as the average over all the folds. Since for imbalanced datasets UA is the more relevant characteristic, we rather concentrated our efforts on getting a high UA, in line with most of the other works on IEMOCAP.
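Both metrics follow directly from their definitions; a small sketch (the function name is illustrative):

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """WA: plain accuracy over the whole test set.
       UA: accuracy computed for each emotion class separately, then averaged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = np.mean(per_class)
    return wa, ua
```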
We considered architectures with 1-6 convolutional layers, 1-4 Bi-LSTM layers and a dense layer with softmax nonlinearity on top of the network (see Fig. 2). As the optimization procedure, we used stochastic gradient descent with momentum and a batch size of 16⁴. For the regularization of the weights we used L2-regularization.

⁴ We chose the small batch size in order to achieve high variability in the gradient descent directions.

Due to the significant variation of the data samples in time length (from 21 to 909 time steps for a window size N = 64 ms and shift S = 32 ms), we performed zero-padding of the samples along the time axis. In order to prevent the Bi-LSTM from aggregating the artificially added time steps, we put a masking layer between the convolutional and Bi-LSTM modules. The size of the mask has been derived from the temporal size of the corresponding spectrogram and the action of the convolutional strides on it.
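The mask length can be obtained by pushing the true (un-padded) number of time frames through the standard convolution output-size formula; the layer parameters in this sketch are placeholders, since the exact kernel sizes and strides are not listed here:

```python
def valid_length_after_convs(n_frames, conv_params):
    """Propagate the un-padded number of time frames through the convolutional
    stack; the result is the number of Bi-LSTM steps that should stay unmasked.
    conv_params: list of (kernel, stride, padding) tuples along the time axis."""
    length = n_frames
    for kernel, stride, padding in conv_params:
        length = (length + 2 * padding - kernel) // stride + 1
    return length

# e.g. mask = [t < valid_length_after_convs(true_frames, conv_params)
#              for t in range(padded_output_length)]
```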
Finally, we normalized the samples according to the general statistics of the dataset:

x_n = (x − x̂) / √(σ² + ε),     (2)

where x̂ and σ are the average and standard deviation of the spectrogram pixels computed over the whole dataset along both the time and frequency axes. Such normalization significantly improves the convergence time of the model. However, applied to networks of small depth (≤ 2 convolutional layers), it results in strong overfitting.
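A sketch of this normalization; the value of ε and the helper names are assumptions:

```python
import numpy as np

def fit_dataset_stats(spectrograms):
    """Mean and standard deviation over all spectrogram pixels of the dataset
    (time and frequency axes pooled together), as used in eq. (2)."""
    pixels = np.concatenate([s.ravel() for s in spectrograms])
    return pixels.mean(), pixels.std()

def normalize(spec, mean, std, eps=1e-8):
    """Eq. (2): x_n = (x - mean) / sqrt(std**2 + eps)."""
    return (spec - mean) / np.sqrt(std ** 2 + eps)
```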
As we have mentioned above, we conducted a variety of experiments with different depths of the convolutional and Bi-LSTM modules. The presence of pooling layers alternating with the convolutions noticeably decreased the performance, and pooling was discarded at the beginning of the experiments. We examined different scenarios: "shallow CNN + deep Bi-LSTM", "deep CNN + shallow Bi-LSTM" and "deep CNN + deep Bi-LSTM", mostly concentrating on the second option. The best results have been achieved with a choice of 4 convolutional and 1 Bi-LSTM layers.

Figure 2: Network architecture
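A sketch of this best configuration in PyTorch; the framework, channel counts, kernel sizes, strides, hidden size and the utterance-level readout (here simply the last Bi-LSTM output) are illustrative placeholders, and the masking of padded time steps discussed above is omitted for brevity:

```python
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    """4 convolutional layers + 1 Bi-LSTM + dense softmax classifier, matching
    the best reported depth; all layer hyperparameters are placeholders."""

    def __init__(self, n_classes=4, n_freq=64):
        super().__init__()
        chans = [1, 16, 32, 64, 64]
        convs = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        freq_out = n_freq // 2 ** 4                  # frequency bins left after 4 strided convs
        self.lstm = nn.LSTM(input_size=chans[-1] * freq_out, hidden_size=128,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, spec):                         # spec: (batch, 1, time, freq)
        x = self.convs(spec)                         # (batch, 64, time', freq')
        x = x.permute(0, 2, 1, 3).flatten(2)         # (batch, time', 64 * freq')
        out, _ = self.lstm(x)                        # (batch, time', 256)
        return self.classifier(out[:, -1])           # logits; softmax applied in the loss
```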
In Tab. 1 we present the results of the best model, together with the contributions of the applied techniques to the performance. One can see that oversampling allowed us to increase UA by 2.1%, but resulted in a 2.9% decrease of WA. Data augmentation with VTLP led to an increase of both metrics, by 1.1% for UA and 0.7% for WA. Considering a larger frequency range (8 kHz) increased the UA by a further 0.8%. Finally, in Tab. 2 we present the results per fold, the scores obtained by averaging our 5 best folds, and the results obtained in other works by 5-fold cross-validation.

We also tried out batch normalization implemented for the Bi-LSTM layers of the network. During the experiments we observed that the data of interest are sensitive to normalization. Therefore we chose the most conservative normalizing strategy, which implies averaging the samples over all the axes:

π^n_{s,t,f} = (π_{s,t,f} − π̂) / √(σ² + ε),     (3)

where

π̂ = (1/btf) Σ_{s,t,f} π_{s,t,f},     σ² = (1/btf) Σ_{s,t,f} (π_{s,t,f} − π̂)².     (4)

Here, s, t and f are the batch, temporal and frequency indices respectively, π is the preactivation, and btf is the product of the sum of the sample time lengths over the batch and the feature number. Batch normalization is then applied only to the input contribution to the hidden state:

h_t = a(W_h h_{t−1} + BN(W_x x_t)),     (5)

where BN stands for the standard batch normalization operation [17], a(π), h_t, x_t are the activation, hidden state and input, and W_h, W_x are the corresponding weights.
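A NumPy sketch of eqs. (3)-(5) for a single recurrent step; the full Bi-LSTM gating and the learnable scale and shift of standard batch normalization are collapsed into the simplified form of eq. (5), and the exclusion of zero-padded steps from the statistics (the btf count) is only indicated in the comments:

```python
import numpy as np

def batch_normalize_input(x, Wx, eps=1e-8):
    """Eqs. (3)-(4): normalize the input contribution W_x x over the batch,
    time and feature axes together. x: (batch, time, n_in); in the real model
    the padded time steps would be excluded from the mean/variance."""
    pre = x @ Wx.T                       # pre-activations W_x x_t for all t
    mean = pre.mean()
    var = ((pre - mean) ** 2).mean()
    return (pre - mean) / np.sqrt(var + eps)

def recurrent_step(h_prev, bn_x_t, Wh, activation=np.tanh):
    """Eq. (5): h_t = a(W_h h_{t-1} + BN(W_x x_t))."""
    return activation(h_prev @ Wh.T + bn_x_t)
```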
We examined batch normalization applied to the architectures with 4 convolutional and 1-4 Bi-LSTM layers. The experiments with the initial batch size of 16 demonstrated faster overfitting and a degradation of the performance compared to the baseline. Further experiments with larger batch sizes showed that the batch size strongly influences the performance (see Tab. 3), despite the fact that the normalization has been performed along all the axes of the batch. One can see that the scores obtained with a batch size of 64 almost reach the performance of our best model (Tab. 1). Therefore, it is possible that further increasing the batch size would lead to even better results. Unfortunately, due to GPU memory restrictions, we could not verify this.
Figure 3: Per-layer gradient evolution

3.1. Difference in the gradient scaling of the convolutional and recurrent layers

Monitoring the gradients of the network parameters, we observed that the gradient with respect to the weights of the convolutional layers is much larger than that with respect to the weights of the Bi-LSTM (see Fig. 3). This observation admits the interpretation that the loss surface is steeper and deeper with respect to the convolutional weights than with respect to the weights of the Bi-LSTM. It therefore suggested to us that it might be interesting to consider different update parameters, namely learning rate and momentum, for the convolutional and recurrent modules. Apart from varying the conventional update parameters of the momentum optimizer, we also considered its modification by introducing a new parameter β:

v_t = γ v_{t−1} + η ∇_w J(w),     w = w − β v_t,     (6)

where γ, η and v_t stand for the momentum, learning rate and velocity correspondingly. The coefficient β brought into use in this way does not accumulate in the velocity expression and provides better control of the momentum term of the optimizer. Unfortunately, from our experiments we were not able to draw any definite conclusion in favor of layer-wise adjustment of η, γ or β. Nevertheless, we find that this is an interesting direction to pursue, and more thorough experiments might give a more conclusive result.
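A sketch of eq. (6) with separate (γ, η, β) triples per parameter group, e.g. one group for the convolutional module and one for the Bi-LSTM module; the class interface and the hyperparameter values in the usage note are illustrative:

```python
import numpy as np

class PerLayerMomentum:
    """Modified momentum rule of eq. (6): v_t = gamma * v_{t-1} + eta * grad,
    w = w - beta * v_t, with its own (gamma, eta, beta) per parameter group."""

    def __init__(self, params, gamma, eta, beta):
        # params: list of parameter arrays, updated in place
        self.params, self.gamma, self.eta, self.beta = params, gamma, eta, beta
        self.velocity = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for p, v, g in zip(self.params, self.velocity, grads):
            v *= self.gamma              # v_t = gamma * v_{t-1} + eta * grad
            v += self.eta * g
            p -= self.beta * v           # w   = w - beta * v_t

# Usage sketch: one optimizer instance per module, e.g.
# conv_opt = PerLayerMomentum(conv_weights, gamma=0.9, eta=1e-3, beta=1.0)
# lstm_opt = PerLayerMomentum(lstm_weights, gamma=0.9, eta=1e-2, beta=0.5)
```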
It also might be interesting to test the update rule modification introduced in eq. (6) in other settings, in order to see whether it can provide an actual improvement over the standard momentum optimizer.

Table 1: 10-fold cross-validation scores depending on the techniques applied (for each experiment we present the results corresponding to its best run).

                                            Baseline                  Best model
Augmentation during training                   -       -       +       +
Oversampling (×2) of happiness and anger       -       +       +       +
Frequency range (kHz)                          4       4       4       8
Weighted accuracy (%)                        66.4    63.5    64.2    64.5
Unweighted accuracy (%)                      57.7    59.8    60.9    61.7

Table 2: The performance of the best model per fold and comparison to the other works. The gender column indicates which speaker is used as the test set in the fold.

Fold                        Session   Gender   WA (%)   UA (%)
1                           1         F        64.1     66.4
2                           1         M        68.8     67.7
3                           2         F        70.3     71.3
4                           2         M        62.0     67.6
5                           3         F        64.8     52.1
6                           3         M        66.4     56.0
7                           4         F        68.5     59.7
8                           4         M        64.3     67.3
9                           5         F        64.8     64.2
10                          5         M        51.0     44.2
10-fold cross-valid.                           64.5     61.7
5 best folds                                   66.9     65.3
[6] (5-fold cross-valid.)                      62.9     63.9
[7] (5-fold cross-valid.)                      67.3     62.0

Table 3: The performance of the best model equipped with batch normalization (for each experiment we present the results corresponding to its best run).

Minibatch size    16      32      64
Weighted (%)      63.6    65.1    65.4
Unweighted (%)    58.9    59.0    60.8
4. Conclusion

In this work we built a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Unlike the prior works, in order to measure the model performance we performed 10-fold cross-validation, which is more appropriate for this dataset. To address the issues of scarcity and class imbalance, we employed data augmentation by means of VTLP and minor-class oversampling.

Following the modern trends in speech analysis, we used a mixed CNN-LSTM architecture, exploiting the capacity of convolutional layers to extract high-level representations from raw inputs. Interestingly, we noticed that the parameters of the convolutional and LSTM layers are trained at a very different pace. We tried to take advantage of this observation by per-layer adjustment of the update rule parameters, but unfortunately were not able to make a definite conclusion in favor of this idea. Nevertheless, we find that this is an interesting direction to pursue, and more thorough experiments might give a more conclusive result.

We also investigated the effect of batch normalization, an indispensable tool in most image recognition tasks. In order to preserve the signal structure as much as possible, we performed the normalization layer-wise as well as batch-wise. Nevertheless, we did not manage to increase performance compared to the baseline, which might be caused by the small batch size we had to use in order to fit into the available GPU memory.
5. References

[1] D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proceedings of the 33rd International Conference on Machine Learning. New York, NY, USA: JMLR, 2015.
[2] I. Medennikov, A. Prudnikov, and A. Zatvornitskiy, "Improving English conversational telephone speech recognition," in Interspeech 2016. San Francisco, CA, USA: ISCA - International Speech Communication Association, 2016, pp. 2–6.
[3] G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, "The IBM 2016 English conversational telephone speech recognition system," in Interspeech 2016. San Francisco, CA, USA: ISCA - International Speech Communication Association, 2016, pp. 7–11.
[4] V. Liptchinsky, G. Synnaeve, and R. Collobert, "Letter-based speech recognition with gated convnets," CoRR, vol. abs/1712.09444, 2017.
[5] Y. Kim, H. Lee, and E. Mower Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, BC, Canada: IEEE, 2013, pp. 3687–3691.
[6] J. Lee and I. Tashev, "High-level feature representation using recurrent neural network for speech emotion recognition," in Interspeech 2015. Dresden, Germany: ISCA - International Speech Communication Association, 2015.
[7] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Interspeech 2017. Stockholm, Sweden: ISCA - International Speech Communication Association, 2017, pp. 1–5.
[8] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[9] T. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584.
[10] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, "Batch normalized recurrent neural networks," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China: IEEE, 2016.
[11] T. Cooijmans, N. Ballas, C. Laurent, C. Gulcehre, and A. Courville, "Recurrent batch normalization," CoRR, vol. abs/1603.09025, 2016.
[12] J. Ba, R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[13] L. Lee and R. Rose, "A frequency warping approach to speaker normalization," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 49–60, 1998.
[14] N. Jaitly and G. E. Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 2013.
[15] X. Cui, V. Goel, and B. Kingsbury, "Data augmentation for deep neural network acoustic modeling," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE, 2014, pp. 5582–5586.
[16] H. Harutyunyan and E. Sanogh, "Khoskits lezvi chanachum khory usutsman metvodnerov," BS thesis, 2016.
[17] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.