
SPEECH EMOTION RECOGNITION WITH DUAL-SEQUENCE LSTM ARCHITECTURE

Jianyou Wang¹, Michael Xue¹, Ryan Culhane¹, Enmao Diao¹, Jie Ding², Vahid Tarokh¹

¹ Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
² School of Statistics, University of Minnesota Twin Cities, Minneapolis, MN, USA

arXiv:1910.08874v4 [eess.AS] 13 Feb 2020

ABSTRACT

Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual information as well as audio signals.

Index Terms: Speech Emotion Recognition, Mel-Spectrogram, LSTM, Dual-Sequence LSTM, Dual-Level Model

1. INTRODUCTION

As the field of Automatic Speech Recognition (ASR) rapidly matures, people are beginning to realize that the information conveyed in speech goes beyond its textual content. Recently, by employing deep learning, researchers have found promising directions within the topic of Speech Emotion Recognition (SER). As one of the most fundamental characteristics that distinguish intelligent life forms from the rest, emotion is an integral part of our daily conversations. From the broad perspective of general-purpose artificial intelligence, the ability to detect the emotional content of human speech has far-reaching applications and benefits. Furthermore, the notion that machines can understand, and perhaps some day produce, emotions could profoundly change the way humans and machines interact.

Previous work on SER models for the benchmark IEMOCAP dataset [1] can generally be divided into two categories: unimodal and multimodal. Research that focuses on unimodal data uses only raw audio signals, whereas research on multimodal data leverages both audio signals and lexical information, and in some cases visual information. Not surprisingly, since they take advantage of more information, multimodal models generally outperform unimodal models by 6-7%. Traditionally, unimodal models extract high-level information from raw audio signals, such as MFCC features, and then pass the output through a recurrent neural network [2]. Recently, researchers have begun transforming raw audio signals into spectrograms or mel-spectrograms [3, 4], which contain low-level information and can be converted back to raw audio. These spectrograms are then mapped into a latent time series through several convolutional layers before going through a recurrent layer.

Some researchers think that audio data alone is not enough to make an accurate prediction [5], and thus many have turned to using textual information as well. However, two utterances with the same textual content can have entirely different meanings when delivered with different emotions, so using textual information too liberally may lead to misleading predictions. It is our opinion that the full potential of audio signals has not been fully explored, and we propose several changes to the existing state-of-the-art framework for unimodal SER [6, 7].

In this paper, we make three major contributions to the existing unimodal SER framework. First, we propose a new dual-level model that contains two independent neural networks that process the MFCC features and mel-spectrograms separately but are trained jointly. Similar to other dual-level architectures [8], we found that our proposed dual-level model provides a significant increase in accuracy. Second, inspired by the time-frequency trade-off [9], from each utterance we calculate two mel-spectrograms of different time-frequency resolutions instead of just one. Since these two spectrograms contain complementary information (one has a better resolution along the time axis and the other along the frequency axis), we propose a novel variant of the LSTM [10], denoted as the Dual-Sequence LSTM (DS-LSTM), that can process these two sequences of data simultaneously and harness their complementary information effectively. It should be noted that previous research on multi-dimensional LSTMs (MD-LSTMs) [11, 12, 13], especially in ASR [14, 15], focused on adapting the LSTM to a multi-dimensional data format. Although similar in concept, our proposed DS-LSTM has a distinct architecture and is designed to process two sequences of one-dimensional data instead of multi-dimensional data. Third, we propose a novel mechanism for data preprocessing that uses nearest-neighbor interpolation to address the problem of variable length between different audio signals. We have found that interpolation works better than the more typical approaches of truncating and padding, which respectively lose information and increase the computational cost.

(This work was supported in part by Office of Naval Research Grant No. N00014-18-1-2244.)
Fig. 1: Dual-level model with DS-LSTM cell

2. RESEARCH METHODOLOGY

2.1. Dataset Description

We used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [1] in this work, a benchmark dataset containing about 12 hours of audio and video data, as well as text transcriptions. The dataset contains five sessions, each of which involves two distinct professional actors conversing with one another in both scripted and improvised manners. In this work, we use data from both the scripted and improvised conversations, and only the audio data, to stay consistent with the vast majority of prior work. We train and evaluate our model on four emotions: happy, neutral, angry, and sad, resulting in a total of 5531 utterances (happy: 29.5%, neutral: 30.8%, angry: 19.9%, sad: 19.5%). We denote these 5531 utterances as the set {u_1, ..., u_5531}.

2.2. Preprocessing

For extracting MFCC features, we used the openSMILE toolkit [12], a software package that automatically extracts features from an audio signal. Using the MFCC12_E_D_A configuration file, we extracted 13 Mel-Frequency Cepstral Coefficients (MFCCs), as well as 13 delta and 13 acceleration coefficients, for a total of 39 acoustic features. These features are extracted from 25 ms frames, resulting in a sequence of 39-dimensional MFCC features per utterance u_i ∈ {u_1, ..., u_5531}.
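To make the shape of these features concrete, the following is a minimal sketch of this step, assuming librosa in place of openSMILE (the toolkit the authors actually used); the 25 ms frame length, the 13 MFCCs, and the delta and acceleration coefficients follow the description above, while the 16 kHz sampling rate and the 10 ms hop are assumptions.

```python
# Hedged sketch: approximate the 39-dimensional MFCC sequence with librosa.
# openSMILE's MFCC12_E_D_A output differs in detail; this only mirrors the shape.
import librosa
import numpy as np

def mfcc_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms analysis frames, as in the paper
    hop = int(0.010 * sr)     # 10 ms hop, an assumed value
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)            # 13 delta coefficients
    accel = librosa.feature.delta(mfcc, order=2)   # 13 acceleration coefficients
    feats = np.concatenate([mfcc, delta, accel], axis=0)  # shape (39, T)
    return feats.T  # sequence of 39-dimensional frames, shape (T, 39)
```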
For each utterance, we also propose to derive two mel-spectrograms of different time-frequency resolutions instead of just one, as done in previous research. One (denoted by S1_i) is a mel-scaled spectrogram with a narrower window and thus a better time resolution, while the other (denoted by S2_i) is a mel-scaled spectrogram with a wider window and thus a better frequency resolution. In our work, S1_i and S2_i are calculated from short-time Fourier transforms with 256 and 512 FFT points, respectively. The hop length and the number of mel channels are 50% and 25% of the number of FFT points, respectively.

The standard way to deal with variable utterance length is padding or truncation. Since there are rises and cadences in human conversation, we cannot assume that the emotional content is uniformly distributed within each utterance; by truncating data, critical information is therefore inevitably lost. Padding, on the other hand, is computationally expensive. We propose a different approach to dealing with variable length between utterances: nearest-neighbor interpolation, in which we interpolate each mel-spectrogram along the time axis to the median number of time steps over all the spectrograms, followed by a logarithmic transformation.
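Below is a hedged sketch of this dual-resolution preprocessing, assuming librosa and SciPy rather than the authors' exact tooling; the 256 and 512 FFT points and the 50%/25% hop and mel-channel ratios follow the text, while the sampling rate and the small constant added inside the logarithm are assumptions.

```python
# Hedged sketch of the dual-resolution mel-spectrograms and the nearest-neighbor
# interpolation to a common (corpus-median) number of time steps.
import librosa
import numpy as np
from scipy.ndimage import zoom

def dual_mel_spectrograms(y, sr=16000):
    specs = []
    for n_fft in (256, 512):
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft,
            hop_length=n_fft // 2,   # hop length = 50% of the FFT points
            n_mels=n_fft // 4)       # mel channels = 25% of the FFT points
        specs.append(mel)
    return specs  # [S1 (finer time resolution), S2 (finer frequency resolution)]

def interpolate_to_median(spec, median_steps):
    # median_steps: median number of time frames over all spectrograms in the corpus.
    factors = (1.0, median_steps / spec.shape[1])
    resized = zoom(spec, factors, order=0)   # order=0 -> nearest-neighbor interpolation
    return np.log(resized + 1e-6)            # logarithmic transformation (epsilon assumed)
```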
2.3. Proposed Model

2.3.1. Dual-Level Architecture

Our proposed dual-level architecture is illustrated in Figure 1. It contains two separate models, M_LSTM and M_DS-LSTM, the first for the MFCC features and the second for the two mel-spectrograms. Each of these two models has a classification layer, and their outputs are averaged to make the final prediction. The loss function is likewise the average of the two cross-entropy losses from the two models.
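A minimal sketch of this combination, assuming PyTorch and treating the two networks as placeholder modules; whether the averaged "outputs" are raw logits or softmax probabilities is not specified above, so averaging logits here is an assumption.

```python
# Hedged sketch of the dual-level combination: two independent classifiers whose
# predictions and cross-entropy losses are averaged, so the two models train jointly.
import torch
import torch.nn.functional as F

def dual_level_step(m_lstm, m_ds_lstm, mfcc, mel_a, mel_b, labels):
    logits_lstm = m_lstm(mfcc)            # classification layer of M_LSTM
    logits_ds = m_ds_lstm(mel_a, mel_b)   # classification layer of M_DS-LSTM
    loss = 0.5 * (F.cross_entropy(logits_lstm, labels) +
                  F.cross_entropy(logits_ds, labels))              # averaged loss
    prediction = (0.5 * (logits_lstm + logits_ds)).argmax(dim=-1)  # averaged outputs
    return loss, prediction
```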
2.3.2. LSTM for MFCC Features

The MFCC features for each utterance are represented by Z = {z_1, ..., z_T}, with each z_i ∈ R^39. Each Z is fed into a standard two-layer single-directional LSTM, whose outputs, H = {h′_1, ..., h′_T}, as specified in Figure 1, are mean-pooled before being fed into the final classification layer [2].
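A possible PyTorch rendering of M_LSTM under these assumptions (the hidden size of 200 is taken from Section 3.1, and the four output classes correspond to the emotions used):

```python
# Hedged sketch of M_LSTM: two-layer unidirectional LSTM over the 39-dim MFCC
# sequence, mean-pooled over time and passed to a linear classification layer.
import torch
import torch.nn as nn

class MfccLSTM(nn.Module):
    def __init__(self, input_dim=39, hidden=200, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, z):               # z: (batch, T, 39)
        h, _ = self.lstm(z)             # h: (batch, T, hidden)
        pooled = h.mean(dim=1)          # mean pooling over time
        return self.classifier(pooled)  # logits over the four emotions
```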

2.3.3. CNN for Mel-Spectrograms

As mentioned earlier, for each utterance u_i we produce two mel-spectrograms with different time-frequency resolutions. We pass these two spectrograms into two independent 2D CNN blocks, each of which consists of two convolution and max-pooling layers. After both spectrograms go through the two convolution and max-pooling layers, they have a different number of time steps, one with T1 and the other with T2, where T1 ≈ 2T2. Before passing both sequences into the DS-LSTM, we use an alignment procedure to ensure they have the same number of time steps, taking the average of adjacent time steps in the sequence of length T1. After alignment, both sequences have the same number of time steps T3, where T3 ≈ T2.
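A sketch of one such CNN block and of the alignment step, assuming PyTorch; the 4×4 kernels, 2×2 max pooling with stride 2, and 64/16 output channels are taken from Section 3.1, while the single input channel and the handling of an odd trailing frame are assumptions about details left open above.

```python
# Hedged sketch of one 2D CNN block and the adjacent-time-step averaging used to
# align the two sequences to a common length T3 before the DS-LSTM.
import torch
import torch.nn as nn

def cnn_block():
    return nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2, stride=2),
        nn.Conv2d(64, 16, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2, stride=2),
    )

def align_time_steps(x1, x2):
    """x1: (batch, feat, T1), x2: (batch, feat, T2), with T1 roughly 2*T2.
    Average adjacent time steps of the longer sequence, then trim both to T3."""
    t1 = x1.shape[-1] - (x1.shape[-1] % 2)   # drop a trailing frame if T1 is odd
    x1 = x1[..., :t1].reshape(x1.shape[0], x1.shape[1], t1 // 2, 2).mean(dim=-1)
    t3 = min(x1.shape[-1], x2.shape[-1])     # common number of time steps T3
    return x1[..., :t3], x2[..., :t3]
```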
2.3.4. Dual-Sequence LSTM

Following the alignment operation, we obtain two sequences of data, X = {x_1, ..., x_T3} and Y = {y_1, ..., y_T3}, with the same number of time steps. Here, X comes from mel-spectrogram S1_i, which records more information along the time axis, and Y comes from mel-spectrogram S2_i, which records more information along the frequency axis. It is entirely conceivable that the sequences X and Y complement each other, and it is therefore beneficial to process them through a recurrent network simultaneously.

As Figure 2 indicates, we propose a Dual-Sequence LSTM (DS-LSTM) that can process two sequences of data simultaneously. Let ⊙ denote the Hadamard product, [a, b] the concatenation of vectors, σ the sigmoid activation function, tanh the hyperbolic tangent activation function, and rbn the recurrent batch normalization layer, which keeps a separate running mean and variance for each time step [16].

f_t = rbn(σ(W_f [x_t, y_t, h_{t-1}] + b_f))    (1)
i_t^T = rbn(σ(W_{i^T} [x_t, y_t, h_{t-1}] + b_{i^T}))    (2)
i_t^F = rbn(σ(W_{i^F} [x_t, y_t, h_{t-1}] + b_{i^F}))    (3)
o_t = rbn(σ(W_o [x_t, y_t, h_{t-1}] + b_o))    (4)
C̃_t^T = tanh(W_T [x_t, h_{t-1}] + b_T)    (5)
C̃_t^F = tanh(W_F [y_t, h_{t-1}] + b_F)    (6)
C_t = f_t ⊙ C_{t-1} + i_t^T ⊙ C̃_t^T + i_t^F ⊙ C̃_t^F    (7)
h_t = o_t ⊙ tanh(C_t)    (8)

After the execution of (8), h_t is the hidden state for the next time step; h_t also goes through a batch normalization layer to become the input to the next layer of the DS-LSTM at time t.

Fig. 2: The graphical representation of one DS-LSTM cell

While an LSTM is a four-gated RNN, the DS-LSTM is a six-gated RNN, with one extra input gate i_t^F at (3) and one extra intermediate memory cell C̃_t^F at (6). The two intermediate memory cells C̃_t^T and C̃_t^F are derived from X and Y, respectively, with the intuition that C̃_t^T will capture more information along the time axis, while C̃_t^F will capture more information along the frequency axis. Empirical experiments suggest that the forget gate, the two input gates, and the output gate should incorporate the maximum amount of information, which is the concatenation of x_t, y_t, and h_{t-1}.

A recurrent batch normalization layer (rbn) is used to normalize the outputs of the forget gate, input gates, and output gate in order to speed up training and provide the model with a more robust regularization effect.
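The cell can be read off directly from Eqs. (1)-(8). In the sketch below, an ordinary LayerNorm stands in for the recurrent batch normalization of [16] (which keeps separate running statistics per time step), and the hidden size of 200 follows Section 3.1; this is an illustrative reading of the equations, not the authors' implementation.

```python
# Hedged sketch of one DS-LSTM cell implementing Eqs. (1)-(8): four gates computed
# from [x_t, y_t, h_{t-1}] and two intermediate memory cells computed from x_t and
# y_t separately, combined into a single memory cell C_t.
import torch
import torch.nn as nn

class DSLSTMCell(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=200):
        super().__init__()
        self.gates = nn.Linear(x_dim + y_dim + hidden, 4 * hidden)  # f, i^T, i^F, o
        self.gate_norms = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(4))  # rbn stand-in
        self.ct_tilde = nn.Linear(x_dim + hidden, hidden)  # Eq. (5): time-resolution cell
        self.cf_tilde = nn.Linear(y_dim + hidden, hidden)  # Eq. (6): frequency-resolution cell

    def forward(self, x_t, y_t, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x_t, y_t, h_prev], dim=-1))
        # Eqs. (1)-(4): normalization applied after the sigmoid, as written in the paper.
        f, i_time, i_freq, o = [norm(torch.sigmoid(g))
                                for norm, g in zip(self.gate_norms, z.chunk(4, dim=-1))]
        c_time = torch.tanh(self.ct_tilde(torch.cat([x_t, h_prev], dim=-1)))
        c_freq = torch.tanh(self.cf_tilde(torch.cat([y_t, h_prev], dim=-1)))
        c_t = f * c_prev + i_time * c_time + i_freq * c_freq   # Eq. (7)
        h_t = o * torch.tanh(c_t)                              # Eq. (8)
        return h_t, (h_t, c_t)
```

Unrolling this cell over t = 1, ..., T3, batch-normalizing each h_t, and stacking a second layer on top, as described above, would give one way to realize the M_DS-LSTM branch.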


In summary, Section 2.3.2 describes the vanilla model M_LSTM, while Sections 2.3.3 and 2.3.4 describe the architecture of our proposed DS-LSTM model, denoted as M_DS-LSTM. Together, M_LSTM + M_DS-LSTM is our proposed dual-level model, as illustrated in Figure 1.

3. EXPERIMENTAL SETUP AND RESULTS

3.1. Experimental Setup

For the CNN blocks used to process the mel-spectrograms, a 4×4 kernel is used without padding, and the max-pooling kernel is 2×2 with a 2×2 stride. The output channels of the two CNN layers are 64 and 16, respectively. All gate neural networks within the LSTM and DS-LSTM have 200 hidden nodes. Each LSTM is single-directional with two layers. The weight and bias of the recurrent batch normalization are initialized to 0.1 and 0, respectively, as suggested by the original paper [16]. An Adam optimizer is used with the learning rate set to 0.0001.
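As a compact illustration of this setup, the optimizer and normalization initialization could be configured as below; the model argument stands in for the joint dual-level model, and applying the 0.1/0 initialization to ordinary normalization layers (rather than a true recurrent batch normalization) is an assumption carried over from the earlier sketches.

```python
# Hedged sketch of the optimization setup in Section 3.1: normalization scale/shift
# initialized to 0.1/0 as in [16], and Adam with learning rate 1e-4.
import torch
import torch.nn as nn

def configure_training(model: nn.Module) -> torch.optim.Adam:
    for m in model.modules():
        if isinstance(m, (nn.LayerNorm, nn.BatchNorm1d, nn.BatchNorm2d)):
            nn.init.constant_(m.weight, 0.1)   # rbn weight initialization suggested in [16]
            nn.init.constant_(m.bias, 0.0)     # rbn bias initialization
    return torch.optim.Adam(model.parameters(), lr=1e-4)
```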
3.2. Baseline Methods

Since several modifications are proposed, we create six baseline models, each consisting of various parts of the whole model, in order to better evaluate the value of each modification.
Base 1: M_LSTM, the LSTM-based model operating on the MFCC features.

Base 2: CNN+LSTM, whose inputs, {S1_1, ..., S1_n}, are mel-spectrograms with 256 FFT points. The inputs are passed through a CNN followed by an LSTM. Models such as these are developed in [6] and [17].

Base 3: CNN+LSTM, whose inputs, {S2_1, ..., S2_n}, are mel-spectrograms with 512 FFT points. The inputs are passed through a CNN followed by an LSTM. Note that the architecture is the same as Base 2.

Base 4: A combination of Base 2 and Base 3, i.e., 2 × (CNN+LSTM), whose inputs are {S1_1, ..., S1_n} and {S2_1, ..., S2_n}. In this model, two LSTMs process the two sequences of mel-spectrograms separately, and their respective outputs are averaged to make the final classification. Note that this is different from our proposed DS-LSTM, which processes the two sequences within a single DS-LSTM cell.

Base 5: A combination of Base 1 and Base 4.

Base 6: A combination of Base 1 and Base 2.

In addition to the above six baseline models, we propose two models: M_DS-LSTM and the dual-level model M_LSTM + M_DS-LSTM. We compare these models with the baseline models, as well as with four state-of-the-art models, using standard 5-fold cross-validation for evaluation.

3.3. Results and Analysis
Model | Mean WA | Mean UA
Base 1 = M_LSTM | 64.7±1.4 | 65.5±1.7
Base 2 = CNN+LSTM (256 FFT points) | 63.5±1.6 | 64.5±1.5
Base 3 = CNN+LSTM (512 FFT points) | 62.9±1.0 | 64.3±0.9
Base 4 = Base 2 + Base 3 | 64.4±1.8 | 65.2±1.8
Base 5 = Base 1 + Base 4 | 68.3±1.3 | 69.3±1.2
Base 6 = Base 1 + Base 2 | 68.5±0.8 | 68.9±1.2
D. Dai et al. (2019) [18] | 65.4 | 66.9
S. Mao et al. (2019) [19] | 65.9 | 66.9
R. Li et al. (2019) [6] | — | 67.4
S. Yoon et al. (2018) [20] * | 71.8±1.9 | —
Proposed M_DS-LSTM | 69.4±0.6 | 69.5±1.1
Proposed M_LSTM + M_DS-LSTM | 72.7±0.7 | 73.3±0.8

Table 1: Mean WA and Mean UA are the averages of weighted accuracy and unweighted accuracy, respectively, over 5-fold cross-validation. Most results are reported with one standard deviation. * indicates that the model uses textual information.

Table 1 indicates that our proposed model M_LSTM + M_DS-LSTM outperforms all baseline models by at least 4.2% in mean weighted accuracy and by at least 4.0% in mean unweighted accuracy. It also outperforms state-of-the-art unimodal SER models [18, 19, 6] by at least 6.8% in mean weighted accuracy and 5.9% in mean unweighted accuracy. Although multimodal SER models typically achieve higher accuracy due to their access to both audio and textual data, our proposed model achieves comparable performance with [20] in mean weighted accuracy.

Before further investigating the effectiveness of each integrated part of the proposed dual-level model M_LSTM + M_DS-LSTM, we note that Base 1∼3 and Base 6 have fewer parameters than our proposed models. However, we have verified that simply adding more nodes or layers to these models does not make any empirical difference in their predictive power, which suggests that these baseline models have already reached their full potential. Therefore, we can objectively compare these models.

Both Base 2 and Base 3 take a single sequence of mel-spectrograms, and both perform slightly worse than Base 1, which uses only MFCC features. This supports the claim that mel-spectrograms are harder to learn from than MFCC features. Base 4 is a naive combination of Base 2 and Base 3, and because its two LSTMs do not interact with each other, the complementary information between the two sequences of mel-spectrograms is not fully exploited; therefore, Base 4 is also slightly worse than Base 1. Base 5 and Base 6 are both dual-level models that consider both MFCC features and mel-spectrograms, and they both outperform Base 1∼4, demonstrating the effectiveness of the dual-level design.

More importantly, we observe that the proposed M_DS-LSTM significantly outperforms Base 1∼4. Comparing M_DS-LSTM with Base 4, we see that when the two separate LSTMs are replaced by the DS-LSTM, which has only six neural networks in its cell instead of the eight in two LSTMs combined, the weighted accuracy increases by 5% and the number of parameters is reduced by 25%. This shows that the DS-LSTM is a successful upgrade over two separate LSTMs. Finally, the dual-level model M_LSTM + M_DS-LSTM outperforms all baseline methods significantly.

4. CONCLUSION

In this paper, we have demonstrated the effectiveness of combining MFCC features and mel-spectrograms produced from audio signals for emotion recognition. Furthermore, we introduced a novel LSTM architecture, denoted as the DS-LSTM, which can process two mel-spectrograms simultaneously, and we outlined several modifications to the data preprocessing step. Our proposed model significantly outperforms baseline models and current state-of-the-art unimodal models on the IEMOCAP dataset, and it is comparable with multimodal models, showing that unimodal models, which rely only on audio signals, have not yet reached their full potential.

5. ACKNOWLEDGEMENTS

We are grateful for the assistance of Mr. Reza Soleimani and the support of Professor Robert Calderbank from the Rhodes Information Initiative at Duke University.
6. REFERENCES

[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

[2] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2227–2231.

[3] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in INTERSPEECH, 2017.

[4] J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D and 2D CNN LSTM networks," Biomedical Signal Processing and Control, vol. 47, pp. 312–323, 2019.

[5] E. Kim and J. W. Shin, "DNN-based emotion recognition based on bottleneck acoustic features and lexical features," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6720–6724.

[6] R. Li, Z. Wu, J. Jia, S. Zhao, and H. Meng, "Dilated residual network with multi-head self-attention for speech emotion recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6675–6679.

[7] S. Yeh, Y. Lin, and C. Lee, "An interaction-aware attention network for speech emotion recognition in spoken dialogs," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6685–6689.

[8] S. H. Bae, I. Choi, and N. S. Kim, "Acoustic scene classification using parallel combination of LSTM and CNN," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016, pp. 11–15.

[9] D. Donoho and P. Stark, "Uncertainty principles and signal recovery," SIAM Journal on Applied Mathematics, vol. 49, no. 3, pp. 906–931, 1989.

[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, 1997.

[11] A. Graves, S. Fernández, and J. Schmidhuber, "Multi-dimensional recurrent neural networks," in ICANN, 2007.

[12] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," 2009, pp. 545–552.

[13] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, "Parallel multi-dimensional LSTM, with application to fast biomedical volumetric image segmentation," in NIPS, 2015.

[14] J. Li, A. Mohamed, G. Zweig, and Y. Gong, "Exploring multidimensional LSTMs for large vocabulary ASR," in ICASSP, March 2016.

[15] B. Li and T. N. Sainath, "Reducing the computational complexity of two-dimensional LSTMs," in INTERSPEECH, 2017.

[16] T. Cooijmans, N. Ballas, C. Laurent, and A. C. Courville, "Recurrent batch normalization," CoRR, vol. abs/1603.09025, 2016.

[17] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, "CNN+LSTM architecture for speech emotion recognition with data augmentation," 2018.

[18] D. Dai, Z. Wu, R. Li, X. Wu, J. Jia, and H. Meng, "Learning discriminative features from spectrograms using center loss for speech emotion recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 7405–7409.

[19] S. Mao, D. Tao, G. Zhang, P. C. Ching, and T. Lee, "Revisiting hidden Markov models for speech emotion recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6715–6719.

[20] S. Yoon, S. Byun, and K. Jung, "Multimodal speech emotion recognition using audio and text," CoRR, vol. abs/1810.04635, 2018.
