CNN+LSTM Architecture for Speech Emotion Recognition with Data
Augmentation
Caroline Etienne 1,2,∗, Guillaume Fidanza 2,∗, Andrei Petrovskii 2,∗,
Laurence Devillers 1, Benoit Schmauch 2
1 LIMSI, CNRS, Paris-Sud University, Paris-Saclay University / F-91405 Orsay, France
2 DreamQuark, 29 rue de Courcelles, 75008 Paris, France
∗ Equal contribution
[email protected], [email protected],
[email protected], [email protected],
[email protected]
Abstract

In this work we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones for aggregating long-term dependencies. We examine the techniques of data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% for weighted accuracy and 61.7% for unweighted accuracy on four emotions.

1. Introduction

Providing high-quality interaction between a human and a machine is a very challenging and active field of research with numerous applications. An important part of this domain is the recognition of human speech emotions by computer systems. In recent years, impressive progress has been achieved in speech recognition by means of deep learning [1, 2, 3, 4]. These achievements also include significant results on speech emotion recognition (SER), see e.g. [5, 6, 7]. In this work we build a neural network for SER on the IEMOCAP dataset [8] and achieve results highly competitive with the state of the art¹.

¹ To our knowledge, the present state of the art has been achieved in [6]. However, the cross-validation procedure performed in that paper (as in other works presenting results obtained on the IEMOCAP dataset) includes only five folds of the dataset out of the ten possible. On the other hand, our experiments showed (see section 3) that the performance strongly depends on the part of the data used for measuring the scores. As a consequence, results obtained by 5-fold cross-validation without clarification of what data has been used for the measurement are not possible to compare with. Therefore we propose 10-fold cross-validation as the correct way of measuring scores on the IEMOCAP dataset, and we present our results correspondingly.

When treating a SER problem with deep learning, one either creates hand-crafted acoustic features (MFCC, pitch, energy, ZCR...), which are used as inputs to a neural network, or sends the data, after some preprocessing (e.g. Fourier transform), directly to a neural network. We apply the second strategy by transforming the audio signal into a spectrogram, which is then used as an input to convolutional layers, followed by recurrent ones. Such a choice of architecture, which has recently demonstrated very competitive performance [9, 1, 7], admits two main interpretations. On the one hand, adding a few convolutional layers at the beginning of the network is an efficient way to reduce the dimensionality of the data and can significantly simplify the training procedure. On the other hand, it is also possible to use a deep CNN for extracting high-level features, which are then fed to an RNN for final time aggregation. We test a variety of architectures with different depths for the convolutional (1-6 layers) and recurrent modules (1-4 Bi-LSTM layers), achieving the best scores with a 4+1 scenario².

² Four convolutional and one Bi-LSTM layer.

To address the challenges of class imbalance and data scarcity, we explored vocal tract length perturbation for the purpose of data augmentation and showed that it significantly improves the performance. In line with [10, 1, 11, 12], we examined batch normalization applied to the recurrent layers of the network. Finally, we noticed that the parameters of the convolutional and Bi-LSTM layers are trained at a very different pace. We tried to take advantage of this observation by per-layer adjustment of the update rule parameters, but unfortunately were not able to make a definite conclusion in favor of this idea.
1.1. Dataset description

IEMOCAP (Interactive Emotional Dyadic Motion Capture), collected at the University of Southern California (USC) [8], is one of the standard datasets for emotion recognition. It consists of twelve hours of audio and video recordings performed by 10 professional actors (five women and five men) and organized in 5 sessions of dialogues between two actors of different genders, either playing a script or improvising. Each sample of the audio set is an utterance assigned an emotion label. Labeling was made by six students of USC, three at a time for each utterance. The annotators were allowed to assign multiple labels if necessary. The final true label for each utterance was chosen by majority vote if the emotion category with the highest vote was unique. Since the annotators reached consensus more often when labeling improvised utterances (83.1%) than scripted ones (66.9%) [8], we concentrate only on the improvised part of the dataset. For the sake of comparison with the prior state-of-the-art approaches, we predict four of the most represented emotions: neutral, sadness, anger and happiness, which leaves us with 2280 utterances in total.

2. Data augmentation

The IEMOCAP dataset has two main drawbacks: class imbalance (see Fig. 1) and small size. To cope with both obstacles, we examined data augmentation by means of vocal tract length perturbation (VTLP), at the same time oversampling the least represented classes of the dataset: happiness and anger. VTLP is based on the speaker normalization technique considered in [13], where it was implemented to reduce interspeaker variability. The difference in human vocal tract lengths can be modeled by rescaling the peaks of significant formants along the frequency axis with a factor α taking values in the approximate range (0.9, 1.1). Therefore, in order to get rid of this variability, one should estimate the factor for each speaker and normalize the spectrograms accordingly. Applied inversely, the same idea can be used for data augmentation [14, 15, 16]: in order to generate new samples, one simply has to rescale the original spectrograms along the frequency axis while keeping the scaling factor in the range (0.9, 1.1). Both approaches, normalization and augmentation, pursue the same objective: to enforce the invariance of the model to speaker-dependent features, since they are not relevant to the classification criterion. Augmentation, however, is easier to implement because we do not need to estimate the scaling factor of each speaker, and therefore we stick to this option.

Figure 1: Class distribution of the utterances in the improvised part of the IEMOCAP dataset
Rescaling of the frequencies has been performed as follows [13]:

G(f) = α f                                                    for 0 ≤ f ≤ f_0,
G(f) = (f_max − α f_0)/(f_max − f_0) · (f − f_0) + α f_0      for f_0 ≤ f ≤ f_max,     (1)

where f_max is the upper cut-off frequency and f_0 is defined to be larger than the highest significant formants (we took f_0/f_max = 0.9). Therefore, we rescale the frequencies below f_0 with α ∈ (0.9, 1.1), and then rescale the rest so that the considered frequency range stays constant.
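Schematically, the warping of eq. (1) can be sketched in a few lines of NumPy; the discretization of the frequency axis into bins, the interpolation choice and the function name below are illustrative rather than the exact implementation used:

```python
import numpy as np

def vtlp_warp(spec, alpha, f0_ratio=0.9):
    """Warp a spectrogram along the frequency axis with the piecewise-linear
    map of eq. (1): bins below f_0 are scaled by alpha, the remaining band is
    rescaled so that the overall frequency range stays constant.
    spec: array of shape (time, freq_bins); alpha roughly in (0.9, 1.1)."""
    n_freq = spec.shape[1]
    f = np.arange(n_freq, dtype=float)      # bin index used as a proxy for frequency
    f_max = n_freq - 1.0
    f_0 = f0_ratio * f_max

    # G(f) from eq. (1)
    g = np.where(
        f <= f_0,
        alpha * f,
        (f_max - alpha * f_0) / (f_max - f_0) * (f - f_0) + alpha * f_0,
    )

    # Resample every time frame: the warped frame at bin f takes the value of
    # the original frame at G(f) (linear interpolation between bins).
    warped = np.empty_like(spec)
    for t in range(spec.shape[0]):
        warped[t] = np.interp(g, f, spec[t])
    return warped
```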
We tried two strategies of data augmentation. In the first one, a single uniformly distributed value α ∈ (0.9, 1.1) was sampled at each epoch and used to rescale all training examples, and no rescaling was applied to the validation set. In the second strategy, each spectrogram was rescaled with an individually generated α, for the training as well as for the validation set. For evaluation, we used the majority vote of the model predictions on eleven copies of the test set with α = 0.9, 0.92, 0.94, ..., 1.1. We present the scores obtained with the second augmentation strategy, which provided the best result.
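A minimal sketch of the second strategy and of the test-time majority vote, reusing the vtlp_warp helper above; the model.predict interface returning a class index is a placeholder, not an actual API:

```python
import numpy as np

def augment(specs, rng=None):
    """Second strategy: every spectrogram (training or validation) is rescaled
    with its own factor drawn uniformly from (0.9, 1.1)."""
    rng = np.random.default_rng() if rng is None else rng
    return [vtlp_warp(s, rng.uniform(0.9, 1.1)) for s in specs]

def predict_majority_vote(model, spec):
    """Evaluation: classify eleven warped copies of the utterance
    (alpha = 0.9, 0.92, ..., 1.1) and return the majority class."""
    alphas = np.arange(0.9, 1.101, 0.02)
    votes = [model.predict(vtlp_warp(spec, a)) for a in alphas]  # class indices
    return np.bincount(np.array(votes)).argmax()
```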
3. Model description and experiments

As mentioned above, the IEMOCAP dataset consists of five sessions, each being a conversation between a man and a woman, giving 10 speakers in total. In order to see how well the model can generalize to different speakers, we took the validation and test sets to correspond to two different speakers of one of the sessions. The training set was composed of the four remaining sessions. In the course of the experiments, we observed that the performance strongly depends on which speakers are chosen for the test set (see Tab. 2). Therefore we chose a 10-fold cross-validation strategy, in order to average over all possible choices of the dataset splitting. Interestingly, to the best of our knowledge, all the other results reported on the IEMOCAP dataset were obtained by 5-fold cross-validation. In this case the choice of the validation and test sets is not rigorously defined³ and the scores obtained in this way are not possible to compare with.

³ For instance, one could systematically use female speakers for validation and male speakers for test, or inversely.
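This splitting protocol amounts to ten speaker-exclusive folds, one per (session, speaker) pair; a short sketch, where the fold ordering and the dictionary layout are purely illustrative:

```python
# Sketch of the 10-fold, speaker-exclusive splitting described above.
SESSIONS = [1, 2, 3, 4, 5]
SPEAKERS = ["F", "M"]

def make_folds():
    folds = []
    for session in SESSIONS:
        for test_spk in SPEAKERS:
            val_spk = "M" if test_spk == "F" else "F"
            folds.append({
                "train_sessions": [s for s in SESSIONS if s != session],
                "val": (session, val_spk),    # one speaker of the held-out session
                "test": (session, test_spk),  # the other speaker of the same session
            })
    return folds  # 10 folds in total
```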
For evaluating the model performance, we chose weighted (WA) and unweighted (UA) accuracy. WA is the standard accuracy computed over the whole test set. UA is the average of the accuracies computed for each emotion separately. First, we compute the metrics for each fold, and then present the scores as the average over all the folds. Since for imbalanced datasets UA is the more relevant characteristic, we rather concentrated our efforts on getting a high UA, in line with most of the other works on IEMOCAP.
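Both metrics follow directly from their definitions; a small sketch (the function name is illustrative):

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """WA: plain accuracy over the whole test set.
       UA: accuracy computed for each emotion class separately, then averaged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = np.mean(per_class)
    return wa, ua
```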
We considered architectures with 1-6 convolutional layers, 1-4 Bi-LSTM layers and a dense layer with softmax nonlinearity on top of the network (see Fig. 2). As the optimization procedure, we used stochastic gradient descent with momentum and a batch size of 16⁴. For the regularization of the weights we used L2-regularization.

⁴ We chose the small batch size in order to achieve high variability in the gradient descent directions.

Due to the significant variation of the data samples in time length (from 21 to 909 time steps for a window size N = 64 ms and shift S = 32 ms), we performed zero-padding of the samples along the time axis. In order to prevent the Bi-LSTM from aggregating the artificially added time steps, we put a masking layer between the convolutional and Bi-LSTM modules. The size of the mask has been derived from the temporal size of the corresponding spectrogram and the action of the convolutional strides on it.
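The mask length can be obtained by pushing the true (un-padded) number of time frames through the standard convolution output-size formula; the layer parameters in this sketch are placeholders, since the exact kernel sizes and strides are not listed here:

```python
def valid_length_after_convs(n_frames, conv_params):
    """Propagate the un-padded number of time frames through the convolutional
    stack; the result is the number of Bi-LSTM steps that should stay unmasked.
    conv_params: list of (kernel, stride, padding) tuples along the time axis."""
    length = n_frames
    for kernel, stride, padding in conv_params:
        length = (length + 2 * padding - kernel) // stride + 1
    return length

# e.g. mask = [t < valid_length_after_convs(true_frames, conv_params)
#              for t in range(padded_output_length)]
```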
Finally, we normalized the samples according to the general statistics of the dataset:

x_n = (x − x̂) / √(σ² + ε),     (2)

where x̂ and σ are the average and standard deviation of the spectrogram pixels computed over the whole dataset along both the time and frequency axes. Such normalization significantly improves the convergence time of the model. However, applied to networks of small depth (≤ 2 convolutional layers), it results in strong overfitting.
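A sketch of this normalization; the value of ε and the helper names are assumptions:

```python
import numpy as np

def fit_dataset_stats(spectrograms):
    """Mean and standard deviation over all spectrogram pixels of the dataset
    (time and frequency axes pooled together), as used in eq. (2)."""
    pixels = np.concatenate([s.ravel() for s in spectrograms])
    return pixels.mean(), pixels.std()

def normalize(spec, mean, std, eps=1e-8):
    """Eq. (2): x_n = (x - mean) / sqrt(std**2 + eps)."""
    return (spec - mean) / np.sqrt(std ** 2 + eps)
```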
As we have mentioned above, we conducted a variety of experiments with different depths of the convolutional and Bi-LSTM modules. The presence of pooling layers alternating with the convolutions noticeably decreased the performance, and pooling was discarded at the beginning of the experiments. We examined different scenarios: "shallow CNN + deep Bi-LSTM", "deep CNN + shallow Bi-LSTM" and "deep CNN + deep Bi-LSTM", mostly concentrating on the second option. The best results have been achieved with a choice of 4 convolutional and 1 Bi-LSTM layers.

Figure 2: Network architecture
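A sketch of this best configuration in PyTorch; the framework, channel counts, kernel sizes, strides, hidden size and the utterance-level readout (here simply the last Bi-LSTM output) are illustrative placeholders, and the masking of padded time steps discussed above is omitted for brevity:

```python
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    """4 convolutional layers + 1 Bi-LSTM + dense softmax classifier, matching
    the best reported depth; all layer hyperparameters are placeholders."""

    def __init__(self, n_classes=4, n_freq=64):
        super().__init__()
        chans = [1, 16, 32, 64, 64]
        convs = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        freq_out = n_freq // 2 ** 4                  # frequency bins left after 4 strided convs
        self.lstm = nn.LSTM(input_size=chans[-1] * freq_out, hidden_size=128,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, spec):                         # spec: (batch, 1, time, freq)
        x = self.convs(spec)                         # (batch, 64, time', freq')
        x = x.permute(0, 2, 1, 3).flatten(2)         # (batch, time', 64 * freq')
        out, _ = self.lstm(x)                        # (batch, time', 256)
        return self.classifier(out[:, -1])           # logits; softmax applied in the loss
```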
In Tab. 1 we present the results of the best model, together with the contributions of the applied techniques to the performance. One can see that oversampling allowed us to increase UA by 2.1%, but resulted in a 2.9% decrease of WA. Data augmentation with VTLP led to an increase of both metrics, by 1.1% for UA and 0.7% for WA. Considering a larger frequency range (8 kHz) increased the UA by a further 0.8%. Finally, in Tab. 2 we present the results per fold, the scores obtained by averaging our 5 best folds, and the results obtained in other works by 5-fold cross-validation.

We also tried out batch normalization implemented for the Bi-LSTM layers of the network. During the experiments we observed that the data of interest are sensitive to normalization. Therefore we chose the most conservative normalizing strategy, which implies averaging the samples over all the axes:

π^n_{s,t,f} = (π_{s,t,f} − π̂) / √(σ² + ε),     (3)

where

π̂ = (1/btf) Σ_{s,t,f} π_{s,t,f},     σ² = (1/btf) Σ_{s,t,f} (π_{s,t,f} − π̂)².     (4)

Here, s, t and f are the batch, temporal and frequency indices respectively, π is the preactivation, and btf is the product of the sum of the sample time lengths over the batch and the feature number. Batch normalization is then applied only to the input contribution to the hidden state:

h_t = a(W_h h_{t−1} + BN(W_x x_t)),     (5)

where BN stands for the standard batch normalization operation [17], a(π), h_t, x_t are the activation, hidden state and input, and W_h, W_x are the corresponding weights.
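A NumPy sketch of eqs. (3)-(5) for a single recurrent step; the full Bi-LSTM gating and the learnable scale and shift of standard batch normalization are collapsed into the simplified form of eq. (5), and the exclusion of zero-padded steps from the statistics (the btf count) is only indicated in the comments:

```python
import numpy as np

def batch_normalize_input(x, Wx, eps=1e-8):
    """Eqs. (3)-(4): normalize the input contribution W_x x over the batch,
    time and feature axes together. x: (batch, time, n_in); in the real model
    the padded time steps would be excluded from the mean/variance."""
    pre = x @ Wx.T                       # pre-activations W_x x_t for all t
    mean = pre.mean()
    var = ((pre - mean) ** 2).mean()
    return (pre - mean) / np.sqrt(var + eps)

def recurrent_step(h_prev, bn_x_t, Wh, activation=np.tanh):
    """Eq. (5): h_t = a(W_h h_{t-1} + BN(W_x x_t))."""
    return activation(h_prev @ Wh.T + bn_x_t)
```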
We examined batch normalization applied to the architectures with 4 convolutional and 1-4 Bi-LSTM layers. The experiments with the initial batch size of 16 demonstrated faster overfitting and a degradation of the performance compared to the baseline. Further experiments with larger batch sizes showed that the batch size strongly influences the performance (see Tab. 3), despite the fact that the normalization has been performed along all the axes of the batch. One can see that the scores obtained with a batch size of 64 almost reach the performance of our best model (Tab. 1). Therefore, it is possible that further increasing the batch size would lead to even better results. Unfortunately, due to GPU memory restrictions, we could not verify this.
Figure 3: Per-layer gradient evolution

3.1. Difference in the gradient scaling of the convolutional and recurrent layers

Monitoring the gradients of the network parameters, we observed that the gradient with respect to the weights of the convolutional layers is much larger than that with respect to the weights of the Bi-LSTM (see Fig. 3). This observation admits the interpretation that the loss surface is steeper and deeper with respect to the convolutional weights than with respect to the weights of the Bi-LSTM. It therefore suggested to us that it might be interesting to consider different update parameters, namely learning rate and momentum, for the convolutional and recurrent modules. Apart from varying the conventional update parameters of the momentum optimizer, we also considered its modification by introducing a new parameter β:

v_t = γ v_{t−1} + η ∇_w J(w),     w = w − β v_t,     (6)

where γ, η and v_t stand for the momentum, learning rate and velocity correspondingly. The coefficient β brought into use in this way does not accumulate in the velocity expression and provides better control of the momentum term of the optimizer. Unfortunately, from our experiments we were not able to draw any definite conclusion in favor of layer-wise adjustment of η, γ or β. Nevertheless, we find that this is an interesting direction to pursue, and more thorough experiments might give a more conclusive result.
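A sketch of eq. (6) with separate (γ, η, β) triples per parameter group, e.g. one group for the convolutional module and one for the Bi-LSTM module; the class interface and the hyperparameter values in the usage note are illustrative:

```python
import numpy as np

class PerLayerMomentum:
    """Modified momentum rule of eq. (6): v_t = gamma * v_{t-1} + eta * grad,
    w = w - beta * v_t, with its own (gamma, eta, beta) per parameter group."""

    def __init__(self, params, gamma, eta, beta):
        # params: list of parameter arrays, updated in place
        self.params, self.gamma, self.eta, self.beta = params, gamma, eta, beta
        self.velocity = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for p, v, g in zip(self.params, self.velocity, grads):
            v *= self.gamma              # v_t = gamma * v_{t-1} + eta * grad
            v += self.eta * g
            p -= self.beta * v           # w   = w - beta * v_t

# Usage sketch: one optimizer instance per module, e.g.
# conv_opt = PerLayerMomentum(conv_weights, gamma=0.9, eta=1e-3, beta=1.0)
# lstm_opt = PerLayerMomentum(lstm_weights, gamma=0.9, eta=1e-2, beta=0.5)
```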
It also might be interesting to test the update rule modification introduced in eq. (6) in other settings, in order to see whether it can provide an actual improvement over the standard momentum optimizer.

Table 1: 10-fold cross-validation scores depending on the techniques applied (for each experiment we present the results corresponding to its best run).

                                            Baseline                  Best model
Augmentation during training                   -       -       +       +
Oversampling (×2) of happiness and anger       -       +       +       +
Frequency range (kHz)                          4       4       4       8
Weighted accuracy (%)                        66.4    63.5    64.2    64.5
Unweighted accuracy (%)                      57.7    59.8    60.9    61.7

Table 2: The performance of the best model per fold and comparison to the other works. The gender column indicates which speaker is used as the test set in the fold.

Fold                        Session   Gender   WA (%)   UA (%)
1                           1         F        64.1     66.4
2                           1         M        68.8     67.7
3                           2         F        70.3     71.3
4                           2         M        62.0     67.6
5                           3         F        64.8     52.1
6                           3         M        66.4     56.0
7                           4         F        68.5     59.7
8                           4         M        64.3     67.3
9                           5         F        64.8     64.2
10                          5         M        51.0     44.2
10-fold cross-valid.                           64.5     61.7
5 best folds                                   66.9     65.3
[6] (5-fold cross-valid.)                      62.9     63.9
[7] (5-fold cross-valid.)                      67.3     62.0

Table 3: The performance of the best model equipped with batch normalization (for each experiment we present the results corresponding to its best run).

Minibatch size    16      32      64
Weighted (%)      63.6    65.1    65.4
Unweighted (%)    58.9    59.0    60.8
4. Conclusion

In this work we built a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Unlike the prior works, in order to measure the model performance we performed 10-fold cross-validation, which is more appropriate for this dataset. To address the issues of scarcity and class imbalance, we employed data augmentation by means of VTLP and minor-class oversampling.

Following the modern trends in speech analysis, we used a mixed CNN-LSTM architecture, exploiting the capacity of convolutional layers to extract high-level representations from raw inputs. Interestingly, we noticed that the parameters of the convolutional and LSTM layers are trained at a very different pace. We tried to take advantage of this observation by per-layer adjustment of the update rule parameters, but unfortunately were not able to make a definite conclusion in favor of this idea. Nevertheless, we find that this is an interesting direction to pursue, and more thorough experiments might give a more conclusive result.

We also investigated the effect of batch normalization, an indispensable tool in most image recognition tasks. In order to preserve the signal structure as much as possible, we performed the normalization layer-wise as well as batch-wise. Nevertheless, we did not manage to increase performance compared to the baseline, which might be caused by the small batch size we had to use in order to fit into the available GPU memory.
5. References

[1] D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proceedings of the 33rd International Conference on Machine Learning. New York, NY, USA: JMLR, 2015.
[2] I. Medennikov, A. Prudnikov, and A. Zatvornitskiy, "Improving English conversational telephone speech recognition," in Interspeech 2016. San Francisco, CA, USA: ISCA - International Speech Communication Association, 2016, pp. 2–6.
[3] G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, "The IBM 2016 English conversational telephone speech recognition system," in Interspeech 2016. San Francisco, CA, USA: ISCA - International Speech Communication Association, 2016, pp. 7–11.
[4] V. Liptchinsky, G. Synnaeve, and R. Collobert, "Letter-based speech recognition with gated convnets," CoRR, vol. abs/1712.09444, 2017.
[5] Y. Kim, H. Lee, and E. Mower Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, BC, Canada: IEEE, 2013, pp. 3687–3691.
[6] J. Lee and I. Tashev, "High-level feature representation using recurrent neural network for speech emotion recognition," in Interspeech 2015. Dresden, Germany: ISCA - International Speech Communication Association, 2015.
[7] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Interspeech 2017. Stockholm, Sweden: ISCA - International Speech Communication Association, 2017, pp. 1–5.
[8] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[9] T. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580–4584.
[10] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, "Batch normalized recurrent neural networks," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China: IEEE, 2016.
[11] T. Cooijmans, N. Ballas, C. Laurent, C. Gulcehre, and A. Courville, "Recurrent batch normalization," CoRR, vol. abs/1603.09025, 2016.
[12] J. Ba, R. Kiros, and G. E. Hinton, "Layer normalization," CoRR, vol. abs/1607.06450, 2016.
[13] L. Lee and R. Rose, "A frequency warping approach to speaker normalization," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 49–60, 1998.
[14] N. Jaitly and G. E. Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 2013.
[15] X. Cui, V. Goel, and B. Kingsbury, "Data augmentation for deep neural network acoustic modeling," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE, 2014, pp. 5582–5586.
[16] H. Harutyunyan and E. Sanogh, "Khoskits lezvi chanachum khory usutsman metvodnerov," BS thesis, 2016.
[17] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.