
SPEECH EMOTION RECOGNITION WITH DUAL-SEQUENCE LSTM ARCHITECTURE

Jianyou Wang¹, Michael Xue¹, Ryan Culhane¹, Enmao Diao¹, Jie Ding², Vahid Tarokh¹

¹ Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
² School of Statistics, University of Minnesota Twin Cities, Minneapolis, MN, USA

arXiv:1910.08874v4 [eess.AS] 13 Feb 2020

ABSTRACT

Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual information as well as audio signals.

Index Terms: Speech Emotion Recognition, Mel-Spectrogram, LSTM, Dual-Sequence LSTM, Dual-Level Model

1. INTRODUCTION

As the field of Automatic Speech Recognition (ASR) rapidly matures, people are beginning to realize that the information conveyed in speech goes beyond its textual content. Recently, by employing deep learning, researchers have found promising directions within the topic of Speech Emotion Recognition (SER). As one of the most fundamental characteristics that distinguish intelligent life forms from the rest, emotion is an integral part of our daily conversations. From the broad perspective of general-purpose artificial intelligence, the ability to detect the emotional content of human speech has far-reaching applications and benefits. Furthermore, the notion that machines can understand, and perhaps some day produce, emotions could profoundly change the way humans and machines interact.

Previous work on SER models for the benchmark IEMOCAP dataset [1] can generally be divided into two categories: unimodal and multimodal. Research that focuses on unimodal data uses only raw audio signals, whereas research on multimodal data leverages both audio signals and lexical information, and in some cases visual information. Not surprisingly, since they take advantage of more information, multimodal models generally outperform unimodal models by 6-7%. Traditionally, unimodal models extract high-level information from raw audio signals, such as MFCC features, and then pass the output through a recurrent neural network [2]. Recently, researchers have begun transforming raw audio signals into spectrograms or mel-spectrograms [3, 4], which contain low-level information and can be converted back to raw audio. These spectrograms are then mapped into a latent time series through several convolutional layers before going through a recurrent layer.

Some researchers think that audio data alone is not enough to make an accurate prediction [5], and thus many have turned to using textual information as well. However, two utterances with the same textual content can have entirely different meanings when delivered with different emotions, so using textual information too liberally may lead to misleading predictions. It is our opinion that the full potential of audio signals has not been fully explored, and we propose several changes to the existing state-of-the-art framework for unimodal SER [6, 7].

In this paper, we make three major contributions to the existing unimodal SER framework. First, we propose a new dual-level model that contains two independent neural networks that process the MFCC features and mel-spectrograms separately but are trained jointly. Similar to other dual-level architectures [8], we found that our proposed dual-level model provides a significant increase in accuracy. Second, inspired by the time-frequency trade-off [9], from each utterance we calculate two mel-spectrograms of different time-frequency resolutions instead of just one. Since these two spectrograms contain complementary information (one has a better resolution along the time axis and the other along the frequency axis), we propose a novel variant of the LSTM [10], denoted as the Dual-Sequence LSTM (DS-LSTM), that can process these two sequences of data simultaneously and harness their complementary information effectively. It should be noted that previous research on multi-dimensional LSTMs (MD-LSTMs) [11, 12, 13], especially in ASR [14, 15], focused on adapting the LSTM to a multi-dimensional data format. Although similar in concept, our proposed DS-LSTM has a distinct architecture and is designed to process two sequences of one-dimensional data instead of multi-dimensional data. Third, we propose a novel mechanism for data preprocessing that uses nearest-neighbor interpolation to address the problem of variable length between different audio signals. We have found that interpolation works better than the more typical approaches of truncating and padding, which respectively lose information and increase the computational cost.

(This work was supported in part by Office of Naval Research Grant No. N00014-18-1-2244.)
Fig. 1: Dual-level model with DS-LSTM cell

2. RESEARCH METHODOLOGY

2.1. Dataset Description

We used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [1] in this work, a benchmark dataset containing about 12 hours of audio and video data, as well as text transcriptions. The dataset contains five sessions, each of which involves two distinct professional actors conversing with one another in both scripted and improvised manners. In this work, we use data from both the scripted and improvised conversations, and only the audio data, to stay consistent with the vast majority of prior work. We train and evaluate our model on four emotions: happy, neutral, angry, and sad, resulting in a total of 5531 utterances (happy: 29.5%, neutral: 30.8%, angry: 19.9%, sad: 19.5%). We denote these 5531 utterances as the set {u_1, ..., u_5531}.

2.2. Preprocessing

For extracting MFCC features, we used the openSMILE toolkit [12], a software package that automatically extracts features from an audio signal. Using the MFCC12_E_D_A configuration file, we extracted 13 Mel-Frequency Cepstral Coefficients (MFCCs), as well as 13 delta and 13 acceleration coefficients, for a total of 39 acoustic features. These features are extracted from 25 ms frames, resulting in a sequence of 39-dimensional MFCC features per utterance u_i ∈ {u_1, ..., u_5531}.
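To make the shape of these features concrete, the following is a minimal sketch of this step, assuming librosa in place of openSMILE (the toolkit the authors actually used); the 25 ms frame length, the 13 MFCCs, and the delta and acceleration coefficients follow the description above, while the 16 kHz sampling rate and the 10 ms hop are assumptions.

```python
# Hedged sketch: approximate the 39-dimensional MFCC sequence with librosa.
# openSMILE's MFCC12_E_D_A output differs in detail; this only mirrors the shape.
import librosa
import numpy as np

def mfcc_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms analysis frames, as in the paper
    hop = int(0.010 * sr)     # 10 ms hop, an assumed value
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)            # 13 delta coefficients
    accel = librosa.feature.delta(mfcc, order=2)   # 13 acceleration coefficients
    feats = np.concatenate([mfcc, delta, accel], axis=0)  # shape (39, T)
    return feats.T  # sequence of 39-dimensional frames, shape (T, 39)
```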
For each utterance, we also propose to derive two mel-spectrograms of different time-frequency resolutions instead of just one, as done in previous research. One (denoted by S1_i) is a mel-scaled spectrogram with a narrower window and thus a better time resolution, while the other (denoted by S2_i) is a mel-scaled spectrogram with a wider window and thus a better frequency resolution. In our work, S1_i and S2_i are calculated from short-time Fourier transforms with 256 and 512 FFT points, respectively. The hop length and the number of mel channels are 50% and 25% of the number of FFT points, respectively.

The standard way to deal with variable utterance length is padding or truncation. Since there are rises and cadences in human conversation, we cannot assume that the emotional content is uniformly distributed within each utterance; by truncating data, critical information is therefore inevitably lost. Padding, on the other hand, is computationally expensive. We propose a different approach to dealing with variable length between utterances: nearest-neighbor interpolation, in which we interpolate each mel-spectrogram along the time axis to the median number of time steps over all the spectrograms, followed by a logarithmic transformation.
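Below is a hedged sketch of this dual-resolution preprocessing, assuming librosa and SciPy rather than the authors' exact tooling; the 256 and 512 FFT points and the 50%/25% hop and mel-channel ratios follow the text, while the sampling rate and the small constant added inside the logarithm are assumptions.

```python
# Hedged sketch of the dual-resolution mel-spectrograms and the nearest-neighbor
# interpolation to a common (corpus-median) number of time steps.
import librosa
import numpy as np
from scipy.ndimage import zoom

def dual_mel_spectrograms(y, sr=16000):
    specs = []
    for n_fft in (256, 512):
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft,
            hop_length=n_fft // 2,   # hop length = 50% of the FFT points
            n_mels=n_fft // 4)       # mel channels = 25% of the FFT points
        specs.append(mel)
    return specs  # [S1 (finer time resolution), S2 (finer frequency resolution)]

def interpolate_to_median(spec, median_steps):
    # median_steps: median number of time frames over all spectrograms in the corpus.
    factors = (1.0, median_steps / spec.shape[1])
    resized = zoom(spec, factors, order=0)   # order=0 -> nearest-neighbor interpolation
    return np.log(resized + 1e-6)            # logarithmic transformation (epsilon assumed)
```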
2.3. Proposed Model

2.3.1. Dual-Level Architecture

Our proposed dual-level architecture is illustrated in Figure 1. It contains two separate models, M_LSTM and M_DS-LSTM, the first for the MFCC features and the second for the two mel-spectrograms. Each of these two models has a classification layer, and their outputs are averaged to make the final prediction. The loss function is likewise the average of the two cross-entropy losses from the two models.
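A minimal sketch of this combination, assuming PyTorch and treating the two networks as placeholder modules; whether the averaged "outputs" are raw logits or softmax probabilities is not specified above, so averaging logits here is an assumption.

```python
# Hedged sketch of the dual-level combination: two independent classifiers whose
# predictions and cross-entropy losses are averaged, so the two models train jointly.
import torch
import torch.nn.functional as F

def dual_level_step(m_lstm, m_ds_lstm, mfcc, mel_a, mel_b, labels):
    logits_lstm = m_lstm(mfcc)            # classification layer of M_LSTM
    logits_ds = m_ds_lstm(mel_a, mel_b)   # classification layer of M_DS-LSTM
    loss = 0.5 * (F.cross_entropy(logits_lstm, labels) +
                  F.cross_entropy(logits_ds, labels))              # averaged loss
    prediction = (0.5 * (logits_lstm + logits_ds)).argmax(dim=-1)  # averaged outputs
    return loss, prediction
```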
2.3.2. LSTM for MFCC Features

The MFCC features for each utterance are represented by Z = {z_1, ..., z_T}, with each z_i ∈ R^39. Each Z is fed into a standard two-layer single-directional LSTM, whose outputs, H = {h′_1, ..., h′_T}, as specified in Figure 1, are mean-pooled before being fed into the final classification layer [2].
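A possible PyTorch rendering of M_LSTM under these assumptions (the hidden size of 200 is taken from Section 3.1, and the four output classes correspond to the emotions used):

```python
# Hedged sketch of M_LSTM: two-layer unidirectional LSTM over the 39-dim MFCC
# sequence, mean-pooled over time and passed to a linear classification layer.
import torch
import torch.nn as nn

class MfccLSTM(nn.Module):
    def __init__(self, input_dim=39, hidden=200, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, z):               # z: (batch, T, 39)
        h, _ = self.lstm(z)             # h: (batch, T, hidden)
        pooled = h.mean(dim=1)          # mean pooling over time
        return self.classifier(pooled)  # logits over the four emotions
```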

2.3.3. CNN for Mel-Spectrograms

As mentioned earlier, for each utterance u_i we produce two mel-spectrograms with different time-frequency resolutions. We pass these two spectrograms into two independent 2D CNN blocks, each of which consists of two convolution and max-pooling layers. After both spectrograms go through the two convolution and max-pooling layers, they have a different number of time steps, one with T1 and the other with T2, where T1 ≈ 2T2. Before passing both sequences into the DS-LSTM, we use an alignment procedure to ensure they have the same number of time steps, taking the average of adjacent time steps in the sequence of length T1. After alignment, both sequences have the same number of time steps T3, where T3 ≈ T2.
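A sketch of one such CNN block and of the alignment step, assuming PyTorch; the 4×4 kernels, 2×2 max pooling with stride 2, and 64/16 output channels are taken from Section 3.1, while the single input channel and the handling of an odd trailing frame are assumptions about details left open above.

```python
# Hedged sketch of one 2D CNN block and the adjacent-time-step averaging used to
# align the two sequences to a common length T3 before the DS-LSTM.
import torch
import torch.nn as nn

def cnn_block():
    return nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2, stride=2),
        nn.Conv2d(64, 16, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    )

def align_time_steps(x1, x2):
    """x1: (batch, feat, T1), x2: (batch, feat, T2), with T1 roughly 2*T2.
    Average adjacent time steps of the longer sequence, then trim both to T3."""
    t1 = x1.shape[-1] - (x1.shape[-1] % 2)   # drop a trailing frame if T1 is odd
    x1 = x1[..., :t1].reshape(x1.shape[0], x1.shape[1], t1 // 2, 2).mean(dim=-1)
    t3 = min(x1.shape[-1], x2.shape[-1])     # common number of time steps T3
    return x1[..., :t3], x2[..., :t3]
```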
2.3.4. Dual-Sequence LSTM

Following the alignment operation, we obtain two sequences of data, X = {x_1, ..., x_T3} and Y = {y_1, ..., y_T3}, with the same number of time steps. Here, X comes from mel-spectrogram S1_i, which records more information along the time axis, and Y comes from mel-spectrogram S2_i, which records more information along the frequency axis. It is entirely conceivable that the sequences X and Y complement each other, and it is therefore beneficial to process them through a recurrent network simultaneously.

As Figure 2 indicates, we propose a Dual-Sequence LSTM (DS-LSTM) that can process two sequences of data simultaneously. Let ⊙ denote the Hadamard product, [a, b] the concatenation of vectors, σ the sigmoid activation function, tanh the hyperbolic tangent activation function, and rbn the recurrent batch normalization layer, which keeps a separate running mean and variance for each time step [16].

f_t = rbn(σ(W_f [x_t, y_t, h_{t-1}] + b_f))    (1)
i_t^T = rbn(σ(W_{i^T} [x_t, y_t, h_{t-1}] + b_{i^T}))    (2)
i_t^F = rbn(σ(W_{i^F} [x_t, y_t, h_{t-1}] + b_{i^F}))    (3)
o_t = rbn(σ(W_o [x_t, y_t, h_{t-1}] + b_o))    (4)
C̃_t^T = tanh(W_T [x_t, h_{t-1}] + b_T)    (5)
C̃_t^F = tanh(W_F [y_t, h_{t-1}] + b_F)    (6)
C_t = f_t ⊙ C_{t-1} + i_t^T ⊙ C̃_t^T + i_t^F ⊙ C̃_t^F    (7)
h_t = o_t ⊙ tanh(C_t)    (8)

After the execution of (8), h_t is the hidden state for the next time step; h_t also goes through a batch normalization layer to become the input to the next layer of the DS-LSTM at time t.

Fig. 2: The graphical representation of one DS-LSTM cell

While an LSTM is a four-gated RNN, the DS-LSTM is a six-gated RNN, with one extra input gate i_t^F at (3) and one extra intermediate memory cell C̃_t^F at (6). The two intermediate memory cells C̃_t^T and C̃_t^F are derived from X and Y, respectively, with the intuition that C̃_t^T will capture more information along the time axis, while C̃_t^F will capture more information along the frequency axis. Empirical experiments suggest that the forget gate, the two input gates, and the output gate should incorporate the maximum amount of information, which is the concatenation of x_t, y_t, and h_{t-1}.

A recurrent batch normalization layer (rbn) is used to normalize the outputs of the forget gate, input gates, and output gate in order to speed up training and provide the model with a more robust regularization effect.
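The cell can be read off directly from Eqs. (1)-(8). In the sketch below, an ordinary LayerNorm stands in for the recurrent batch normalization of [16] (which keeps separate running statistics per time step), and the hidden size of 200 follows Section 3.1; this is an illustrative reading of the equations, not the authors' implementation.

```python
# Hedged sketch of one DS-LSTM cell implementing Eqs. (1)-(8): four gates computed
# from [x_t, y_t, h_{t-1}] and two intermediate memory cells computed from x_t and
# y_t separately, combined into a single memory cell C_t.
import torch
import torch.nn as nn

class DSLSTMCell(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=200):
        super().__init__()
        self.gates = nn.Linear(x_dim + y_dim + hidden, 4 * hidden)  # f, i^T, i^F, o
        self.gate_norms = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(4))  # rbn stand-in
        self.ct_tilde = nn.Linear(x_dim + hidden, hidden)  # Eq. (5): time-resolution cell
        self.cf_tilde = nn.Linear(y_dim + hidden, hidden)  # Eq. (6): frequency-resolution cell

    def forward(self, x_t, y_t, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x_t, y_t, h_prev], dim=-1))
        # Eqs. (1)-(4): normalization applied after the sigmoid, as written in the paper.
        f, i_time, i_freq, o = [norm(torch.sigmoid(g))
                                for norm, g in zip(self.gate_norms, z.chunk(4, dim=-1))]
        c_time = torch.tanh(self.ct_tilde(torch.cat([x_t, h_prev], dim=-1)))
        c_freq = torch.tanh(self.cf_tilde(torch.cat([y_t, h_prev], dim=-1)))
        c_t = f * c_prev + i_time * c_time + i_freq * c_freq   # Eq. (7)
        h_t = o * torch.tanh(c_t)                              # Eq. (8)
        return h_t, (h_t, c_t)
```

Unrolling this cell over t = 1, ..., T3, batch-normalizing each h_t, and stacking a second layer on top, as described above, would give one way to realize the M_DS-LSTM branch.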


In summary, Section 2.3.2 describes the vanilla model M_LSTM, while Sections 2.3.3 and 2.3.4 describe the architecture of our proposed DS-LSTM model, denoted as M_DS-LSTM. Together, M_LSTM + M_DS-LSTM is our proposed dual-level model, as illustrated in Figure 1.

3. EXPERIMENTAL SETUP AND RESULTS

3.1. Experimental Setup

For the CNN blocks used to process the mel-spectrograms, a 4×4 kernel is used without padding, and the max-pooling kernel is 2×2 with a 2×2 stride. The output channels of the two CNN layers are 64 and 16, respectively. All gate neural networks within the LSTM and DS-LSTM have 200 hidden nodes. Each LSTM is single-directional with two layers. The weight and bias of the recurrent batch normalization are initialized to 0.1 and 0, respectively, as suggested by the original paper [16]. An Adam optimizer is used with the learning rate set to 0.0001.
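As a compact illustration of this setup, the optimizer and normalization initialization could be configured as below; the model argument stands in for the joint dual-level model, and applying the 0.1/0 initialization to ordinary normalization layers (rather than a true recurrent batch normalization) is an assumption carried over from the earlier sketches.

```python
# Hedged sketch of the optimization setup in Section 3.1: normalization scale/shift
# initialized to 0.1/0 as in [16], and Adam with learning rate 1e-4.
import torch
import torch.nn as nn

def configure_training(model: nn.Module) -> torch.optim.Adam:
    for m in model.modules():
        if isinstance(m, (nn.LayerNorm, nn.BatchNorm1d, nn.BatchNorm2d)):
            nn.init.constant_(m.weight, 0.1)   # rbn weight initialization suggested in [16]
            nn.init.constant_(m.bias, 0.0)     # rbn bias initialization
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```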
3.2. Baseline Methods

Since several modifications are proposed, we create six baseline models, each consisting of various parts of the whole model, in order to better evaluate the value of each modification.
Base 1: M_LSTM, the LSTM-based model operating on the MFCC features.

Base 2: CNN+LSTM, whose inputs, {S1_1, ..., S1_n}, are mel-spectrograms with 256 FFT points. The inputs are passed through a CNN followed by an LSTM. Models such as these are developed in [6] and [17].

Base 3: CNN+LSTM, whose inputs, {S2_1, ..., S2_n}, are mel-spectrograms with 512 FFT points. The inputs are passed through a CNN followed by an LSTM. Note that the architecture is the same as Base 2.

Base 4: A combination of Base 2 and Base 3, i.e., 2 × (CNN+LSTM), whose inputs are {S1_1, ..., S1_n} and {S2_1, ..., S2_n}. In this model, two LSTMs process the two sequences of mel-spectrograms separately, and their respective outputs are averaged to make the final classification. Note that this is different from our proposed DS-LSTM, which processes the two sequences within a single DS-LSTM cell.

Base 5: A combination of Base 1 and Base 4.

Base 6: A combination of Base 1 and Base 2.

In addition to the above six baseline models, we propose two models: M_DS-LSTM and the dual-level model M_LSTM + M_DS-LSTM. We compare these models with the baseline models, as well as with four state-of-the-art models, using standard 5-fold cross-validation for evaluation.

3.3. Results and Analysis
Model | Mean WA | Mean UA
Base 1 = M_LSTM | 64.7±1.4 | 65.5±1.7
Base 2 = CNN+LSTM (256 FFT points) | 63.5±1.6 | 64.5±1.5
Base 3 = CNN+LSTM (512 FFT points) | 62.9±1.0 | 64.3±0.9
Base 4 = Base 2 + Base 3 | 64.4±1.8 | 65.2±1.8
Base 5 = Base 1 + Base 4 | 68.3±1.3 | 69.3±1.2
Base 6 = Base 1 + Base 2 | 68.5±0.8 | 68.9±1.2
D. Dai et al. (2019) [18] | 65.4 | 66.9
S. Mao et al. (2019) [19] | 65.9 | 66.9
R. Li et al. (2019) [6] | — | 67.4
S. Yoon et al. (2018) [20] * | 71.8±1.9 | —
Proposed M_DS-LSTM | 69.4±0.6 | 69.5±1.1
Proposed M_LSTM + M_DS-LSTM | 72.7±0.7 | 73.3±0.8

Table 1: Mean WA and Mean UA are the averages of weighted accuracy and unweighted accuracy, respectively, over 5-fold cross-validation. Most results are reported with one standard deviation. * indicates that the model uses textual information.

Table 1 indicates that our proposed model M_LSTM + M_DS-LSTM outperforms all baseline models by at least 4.2% in mean weighted accuracy and by at least 4.0% in mean unweighted accuracy. It also outperforms state-of-the-art unimodal SER models [18, 19, 6] by at least 6.8% in mean weighted accuracy and 5.9% in mean unweighted accuracy. Although multimodal SER models typically achieve higher accuracy due to their access to both audio and textual data, our proposed model achieves comparable performance with [20] in mean weighted accuracy.

Before further investigating the effectiveness of each integrated part of the proposed dual-level model M_LSTM + M_DS-LSTM, we note that Base 1∼3 and Base 6 have fewer parameters than our proposed models. However, we have verified that simply adding more nodes or layers to these models does not make any empirical difference in their predictive power, which suggests that these baseline models have already reached their full potential. Therefore, we can objectively compare these models.

Both Base 2 and Base 3 take a single sequence of mel-spectrograms, and both perform slightly worse than Base 1, which uses only MFCC features. This supports the claim that mel-spectrograms are harder to learn from than MFCC features. Base 4 is a naive combination of Base 2 and Base 3, and because its two LSTMs do not interact with each other, the complementary information between the two sequences of mel-spectrograms is not fully exploited; therefore, Base 4 is also slightly worse than Base 1. Base 5 and Base 6 are both dual-level models that consider both MFCC features and mel-spectrograms, and they both outperform Base 1∼4, demonstrating the effectiveness of the dual-level design.

More importantly, we observe that the proposed M_DS-LSTM significantly outperforms Base 1∼4. Comparing M_DS-LSTM with Base 4, we see that when the two separate LSTMs are replaced by the DS-LSTM, which has only six neural networks in its cell instead of the eight in two LSTMs combined, the weighted accuracy increases by 5% and the number of parameters is reduced by 25%. This shows that the DS-LSTM is a successful upgrade over two separate LSTMs. Finally, the dual-level model M_LSTM + M_DS-LSTM outperforms all baseline methods significantly.

4. CONCLUSION

In this paper, we have demonstrated the effectiveness of combining MFCC features and mel-spectrograms produced from audio signals for emotion recognition. Furthermore, we introduced a novel LSTM architecture, denoted as the DS-LSTM, which can process two mel-spectrograms simultaneously, and we outlined several modifications to the data preprocessing step. Our proposed model significantly outperforms baseline models and current state-of-the-art unimodal models on the IEMOCAP dataset, and it is comparable with multimodal models, showing that unimodal models, which rely only on audio signals, have not yet reached their full potential.

5. ACKNOWLEDGEMENTS

We are grateful for the assistance of Mr. Reza Soleimani and the support of Professor Robert Calderbank from the Rhodes Information Initiative at Duke University.
6. REFERENCES

[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

[2] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2227–2231.

[3] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in INTERSPEECH, 2017.

[4] J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D and 2D CNN LSTM networks," Biomedical Signal Processing and Control, vol. 47, pp. 312–323, 2019.

[5] E. Kim and J. W. Shin, "DNN-based emotion recognition based on bottleneck acoustic features and lexical features," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6720–6724.

[6] R. Li, Z. Wu, J. Jia, S. Zhao, and H. Meng, "Dilated residual network with multi-head self-attention for speech emotion recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6675–6679.

[7] S. Yeh, Y. Lin, and C. Lee, "An interaction-aware attention network for speech emotion recognition in spoken dialogs," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6685–6689.

[8] S. H. Bae, I. Choi, and N. S. Kim, "Acoustic scene classification using parallel combination of LSTM and CNN," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016, pp. 11–15.

[9] D. Donoho and P. Stark, "Uncertainty principles and signal recovery," SIAM Journal on Applied Mathematics, vol. 49, no. 3, pp. 906–931, 1989.

[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, 1997.

[11] A. Graves, S. Fernández, and J. Schmidhuber, "Multi-dimensional recurrent neural networks," in ICANN, 2007.

[12] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," 2009, pp. 545–552.

[13] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, "Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation," in NIPS, 2015.

[14] J. Li, A. Mohamed, G. Zweig, and Y. Gong, "Exploring multidimensional LSTMs for large vocabulary ASR," in ICASSP, March 2016.

[15] B. Li and T. N. Sainath, "Reducing the computational complexity of two-dimensional LSTMs," in INTERSPEECH, 2017.

[16] T. Cooijmans, N. Ballas, C. Laurent, and A. C. Courville, "Recurrent batch normalization," CoRR, vol. abs/1603.09025, 2016.

[17] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, "CNN+LSTM architecture for speech emotion recognition with data augmentation," 2018.

[18] D. Dai, Z. Wu, R. Li, X. Wu, J. Jia, and H. Meng, "Learning discriminative features from spectrograms using center loss for speech emotion recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 7405–7409.

[19] S. Mao, D. Tao, G. Zhang, P. C. Ching, and T. Lee, "Revisiting hidden Markov models for speech emotion recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6715–6719.

[20] S. Yoon, S. Byun, and K. Jung, "Multimodal speech emotion recognition using audio and text," CoRR, vol. abs/1810.04635, 2018.
