


Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning.

Samarth Tripathi
Computer Science Department
Columbia University
New York, New York
Email: [email protected]

Homayoon Beigi
President of Recognition Technologies, Inc.
and Adjunct Professor of Computer Science
Columbia University
New York, New York
Email: [email protected]

arXiv:1804.05788v1 [cs.AI] 16 Apr 2018

Abstract—Emotion recognition has become an important field of research in Human Computer Interaction as we improve the techniques for modelling the various aspects of behaviour. As technology advances, so does our understanding of emotions, and there is a growing need for automatic emotion recognition systems. One of the directions this research is heading is the use of Neural Networks, which are adept at estimating complex functions that depend on a large number of diverse sources of input data. In this paper we attempt to exploit this effectiveness of Neural Networks to perform multimodal Emotion recognition on the IEMOCAP dataset using data from Speech, Text, and Motion capture data from facial expressions, head rotation and hand movements. Prior research has concentrated on Emotion detection from Speech on the IEMOCAP dataset, but our approach is the first that uses the multiple modes of data offered by IEMOCAP for a more robust and accurate emotion detection.

1. Introduction

Emotions are very important in human decision handling, interaction and cognitive processes [3]. As technology advances, so does our understanding of emotions, and there is a growing need for automatic emotion recognition systems. One of the new and exciting directions this research is heading is the use of multimodal data sources to make accurate emotion classifications. Each of the modes has been studied extensively on its own, such as emotion detection from speech, facial expressions, and text transcripts. In this paper we combine these modes to build a stronger and more robust detector for emotions. For our research we particularly explore Neural Networks and advanced machine learning techniques such as Attention and Dropout. A neural network is a machine designed to model the way our brain performs a particular task, imitating the key concepts of the brain as a complex, non-linear and parallel computer [1], and it possesses the ability to model and estimate complex functions that depend on a multitude of factors. Recently, developments in machine learning using Deep Neural Networks have achieved state-of-the-art performance in Text and Sentiment Analysis [2], Automatic Speech Recognition [4], and Image recognition [8].

In our research we perform multimodal emotion detection on the IEMOCAP dataset [12], which consists of 12 hours of audio-visual data of improvisations and scripted scenarios from actors, annotated for emotions. A lot of the prior research in this field has concentrated on detecting emotions using just the speech part of the dataset. Two important papers in this field are the works of [13] and [17]. Both use RNN based architectures on extracted speech features to obtain their best results. [13] use a Markov chain over the input signals to remove segments with no emotional state. [17] use the Connectionist Temporal Classification loss function, which allows them to model states with little or no emotion in the speech window. We replicate their results for speech based emotion recognition and build upon their works.

In our research we seek to detect emotions using the data from many modalities of IEMOCAP. For this we explore various deep learning based architectures to first get the best individual detection accuracy from each of the different modes. We then combine them in an ensemble based architecture to allow for end-to-end training across the different modalities using variations of the better individual models. Our ensemble consists of Long Short Term Memory networks, Convolutional Neural Networks and Fully Connected Multi-Layer Perceptrons, and we complement them with techniques such as Dropout, adaptive optimizers such as Adam, pretrained word-embedding models and Attention based RNN decoders. Comparing our speech based emotion detection with [13], we achieve 62.72% accuracy compared to their 62.85%; comparing with [17], we achieve 55.65% accuracy compared to their CTC based 54% accuracy. After combining the Speech (individually 55.65% accuracy) and Text (individually 64.78% accuracy) modes we achieve an improvement to 68.40% accuracy. When we also account for the MoCap data (individually 51.11% accuracy) we achieve a further improvement to 71.04%.

2. Related Works
Emotion is a psycho-physiological process that can be triggered by conscious and/or unconscious perception of objects and situations, and is associated with a multitude of factors such as mood, temperament, personality, disposition, and motivation [20]. Emotions play an important role in human communication and can be expressed either verbally through emotional vocabulary or through nonverbal cues such as intonation of voice, facial expressions, and gestures [15]. Emotion recognition has been studied widely using speech [13] [17] [16], text [9], facial cues [10], and EEG based brain waves [11]. One of the biggest open-sourced multimodal resources available for emotion detection is the IEMOCAP dataset [12].

The dataset consists of approximately 12 hours of audio-visual data, including facial recordings, speech and text transcriptions. The dataset is acted, has multi-speaker recordings, and has been an important dataset for emotion based research. However, most of the research on it has concentrated specifically on emotion detection using Speech based data. One of the early important papers on this dataset is "Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine" [18], which beat the state of the art by 20% over techniques that used HMMs, SVMs and other shallow learning methods. They perform segment level feature extraction and feed those features to an MLP based architecture, where the input is a 750 dimensional feature vector, followed by 3 hidden layers of 256 neurons each with rectified linear units as the non-linearity. This feeds an utterance level classifier which predicts the final emotion.

"High-level Feature Representation using Recurrent Neural Network for Speech Emotion Recognition" [13] follows the previous research [18]. They trained a long short-term memory (LSTM) based recurrent neural network and achieved about 60% accuracy on the IEMOCAP dataset. First they divide each utterance into small segments with voiced regions, then assume that the label sequence of each segment follows a Markov chain. They extract 32 features for every frame: F0 (pitch), voice probability, zero-crossing rate, 12-dimensional Mel-frequency cepstral coefficients (MFCC) with log energy, and their first time derivatives. The network contains 2 hidden layers with 128 BLSTM cells (64 forward nodes and 64 backward nodes). They achieve 62.85% accuracy with this technique.

Another research we closely follow is "Emotion Recognition From Speech With Recurrent Neural Networks" [17], where the CTC loss function is used to improve upon RNN based emotion prediction. They use 34 features, including 12 MFCC, chromagram-based and spectrum properties like flux and roll-off. For all speech intervals they calculate features in a 0.2 second window, moving it with a 0.1 second step. The use of the CTC loss helps because often almost the whole utterance carries no emotion, and the emotionality is contained only in a few words or phonemes of the utterance, which the CTC loss handles well. Unlike [13], Chernykh et al. use all the session data for the emotion classification. Another important research on Speech based emotion recognition is the work of [14], which uses transfer learning to improve Neural Models for emotion detection. Their model uses 1D convolutions and GRU layers to initialize a neural model for Automatic Speech Recognition inspired by Deep Speech. They use many datasets for ASR based training with the CTC loss, and then fine-tune this model on IEMOCAP. Using only IEMOCAP as the baseline model they achieve 55% accuracy, and with the help of fine-tuning on top of the pretrained ASR model they achieve 61% accuracy.

3. Experimental Setup

We use The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database for the task of multi-modal emotion recognition. It consists of about 12 hours of audio-visual data from 10 actors. The recordings follow dialogues between a male and a female actor on both scripted and improvised topics. After the audio-visual data has been collected it is divided into small utterances of length between 3 to 15 seconds, which are then labelled by evaluators. Each utterance is evaluated by 3-4 assessors. The evaluation form contained 10 options (neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited, other). We consider only 4 of them, anger, excitement (happiness), neutral and sadness, so as to remain consistent with prior research. We consider emotions where at least 2 experts were consistent in their decision, which is more than 70% of the dataset, again consistent with prior research.

Along with the .wav file for each dialogue we also have available the transcript of each utterance. For each session one actor wears the Motion Capture (MoCap) markers, which record the facial expression, head and hand movements of the actor. The MoCap data contains column tuples: for facial expressions the tuples are contained in 165 dimensions, 18 for hand positions and 6 for head rotations. As this MoCap data is very extensive, we use it instead of the video recordings in the dataset. These three modes (Speech, Text, MoCap) of data form the basis of our multi-modal emotion detection pipeline.

Next we preprocess the IEMOCAP data for these modes. For the speech data our preprocessing follows the work of [17]. We use the Fourier frequencies and energy-based features, Mel-frequency cepstral coefficients (MFCC), for a total of 34 features. They include 12 MFCC, chromagram-based and spectrum properties like flux and roll-off. We calculate features in a 0.2 second window, moving it with a 0.1 second step, at a 16 kHz sample rate. We keep a maximum of 100 frames, or approximately 10 seconds of input, zero-pad the shorter signals, and end up with a (100, 34) feature vector for each utterance. For the text transcript of each utterance we use pretrained GloVe embeddings [19] of dimension 300, along with a maximum sequence length of 500, to obtain a (500, 300) vector for each utterance. For the MoCap data, for each different mode such as face, hand and head rotation, we sample all the feature values between the start and finish time values and split them into 200 chronological arrays. We then average each of the 200 arrays along the columns (165 for faces, 18 for hands, and 6 for rotations), and finally concatenate all of them to obtain a (200, 189) dimensional vector for each utterance. We can now proceed to feed our model these processed input modal vectors.
TABLE 1: Speech emotion detection models and accuracy

Model     Accuracy
Model1    50.6%
Model2    51.32%
Model3    54.15%
Model4    55.65%

TABLE 2: Comparison between our Speech emotion detection models and previous research

Model                                Accuracy
Lee and Tashev [13]                  62.85%
Ours (improv only)                   62.72%
Chernykh [17]                        54%
Neumann [16]                         56.10%
Lakomkin [14]                        56%
Ours (all)                           55.65%
Ours (all, Speech + Text + MoCap)    71.04%

Figure 1: Neural Model for Speech based Emotion detection

4. Models and Results

In this section we present the different Neural Network based models we attempted for each of the modes, as well as our final combined model, while also comparing with prior research. We begin with the speech based emotion detection models.

4.1. Speech Based Emotion Detection

Our first model (Model1) is a three layer Fully Connected model with 1024, 512 and 256 hidden units, ReLU activations, and 4 output neurons with Softmax. The model takes the flattened speech vectors as input and trains using cross entropy loss with Adadelta as the optimizer. Model2 uses two stacked LSTM layers with 512 and 256 units followed by a Dense layer with 512 units and ReLU activation. Model3 uses 2 LSTM layers with 128 units each, where the second LSTM layer also has an Attention implementation, followed by a Dense layer of 512 units with ReLU activation. Model4 improves both the encoding LSTM and the Attention based decoding LSTM by making them bidirectional. All of the last 3 models use Adadelta as the optimizer. We divide our dataset with a randomly chosen 20% validation split and report our accuracies based on this set. As we can see, the final Attention based bidirectional LSTM model performs the best. We also tried many variations of the speech data, including using Mel spectrograms and a smaller window (0.08 s) with longer context (200 timestamps), as well as combining these approaches into one big network, but did not achieve improvements.

To compare our results with prior research we use our best model (Model4) and evaluate it under conditions similar to those of the previous researches. We train using Sessions 1-4 and use Session 5 as our test set. Like [13], we use only the improvisation sessions for both training and testing and achieve similar results. Comparing with [17] [16] [14], who use both the scripted and improvisation sessions, we again achieve similar results. One important insight from our results is that with minimal preprocessing and no complex loss functions or noise injection into the training, we can easily match the performance of prior research using Attention based bidirectional LSTMs.
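As a rough illustration of the speech pipeline, the following Keras sketch builds a bidirectional LSTM encoder over the (100, 34) inputs with a simple learned attention pooling on top. It is a simplification of Model4, which uses an attention based decoder LSTM rather than plain attention pooling, so treat it as a sketch of the idea under stated assumptions rather than the authors' exact architecture.

```python
from tensorflow.keras import layers, Model

def speech_emotion_model(input_shape=(100, 34), num_classes=4):
    inp = layers.Input(shape=input_shape)                    # padded speech features
    enc = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inp)
    # attention pooling: score every time step, softmax over time, weighted sum
    scores = layers.Dense(1)(enc)                            # (batch, 100, 1)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Dot(axes=1)([weights, enc])             # (batch, 1, 256)
    context = layers.Flatten()(context)
    hidden = layers.Dense(512, activation="relu")(context)
    out = layers.Dense(num_classes, activation="softmax")(hidden)
    model = Model(inp, out)
    model.compile(optimizer="adadelta", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The Model1 to Model3 variants described above differ only in this encoder block (fully connected layers, plain stacked LSTMs, or a unidirectional attention LSTM).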

4.2. Text based Emotion Recognition

Our next task, performing emotion detection using only the text transcripts of our data, resembles sentiment analysis, a very common and highly researched task in Natural Language Processing. For this approach we again try models similar to those before. Model1 uses four 1D convolutions with kernel size 3 and 256, 128, 64 and 32 filters, with ReLU activations and Dropout with 0.2 probability, followed by a 256 dimensional Fully Connected layer with ReLU, feeding into 4 output neurons with Softmax. Model2 uses two stacked LSTM layers with 512 and 256 units followed by a Dense layer with 512 units and ReLU activation. Both of these models are initialized with GloVe embedding based word vectors. We also try randomized initialization with 128 dimensions in Model3 and obtain performance similar to Model2. The LSTM based models use Adadelta and the convolution based models use Adam as optimizers.
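A minimal sketch of the convolutional text model (Model1) described above is shown below. It assumes a GloVe embedding matrix has already been built for the tokenized transcripts; the paper does not specify how the convolution outputs are pooled before the dense layer or whether the embeddings are fine-tuned, so the flattening step and the frozen embeddings here are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.initializers import Constant

def text_emotion_model(embedding_matrix, max_len=500, num_classes=4):
    vocab_size, emb_dim = embedding_matrix.shape             # e.g. 300-d GloVe vectors
    inp = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=Constant(embedding_matrix),
                         trainable=False)(inp)               # frozen; an assumption
    for filters in (256, 128, 64, 32):                       # stacked 1D convolutions
        x = layers.Conv1D(filters, kernel_size=3, activation="relu")(x)
        x = layers.Dropout(0.2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# embedding_matrix would be a (vocab_size, 300) array filled from the GloVe file.
model = text_emotion_model(np.random.rand(20000, 300))
```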
Figure 2: Neural Model for Text based Emotion detection

TABLE 3: Text emotion detection models and accuracy

Model     Accuracy
Model1    62.55%
Model2    64.68%
Model3    64.78%

Figure 3: Neural Model for MoCap based Emotion detection

TABLE 4: MoCap based emotion detection models and accuracy

Model                    Accuracy
MoCap-head Model1        37.75%
MoCap-head Model2        40.28%
MoCap-hand Model1        33.70%
MoCap-hand Model2        36.94%
MoCap-face Model1        48.99%
MoCap-face Model2        48.58%
MoCap-face Model3        49.39%
MoCap-combined Model3    51.11%

4.3. MoCap based Emotion Detection

For the MoCap based emotion detection we try LSTM and convolution based models. For emotion detection using only the head rotation we try 2 models: the first (Model1) uses an LSTM with 256 units followed by a Dense layer and ReLU activation, while the second (Model2) uses just a 256 hidden unit Dense layer with ReLU, and achieves better performance. We use the same two models again for hand movement based emotion detection, and the Dense layer again achieves better performance. For the facial expression based MoCap data, Model1 uses two stacked LSTM layers with 512 and 256 units followed by a Dense layer with 512 units and ReLU activation. Model2 on the face MoCap data uses five 2D convolutions, each with kernel size 3, stride 2 and 32, 64, 64, 128 and 128 filters, along with ReLU activation and 0.2 Dropout. These layers are then followed by a Dense layer with 256 neurons and ReLU, followed by 4 output neurons with Softmax. We also try Model3, a slight variation of Model2 where we replace the last convolution layer with a Dense layer of 1024 units. We finally use the Model3 based architecture for the concatenated MoCap data with 189 input features. The LSTM based models use Adadelta, and the convolution and Fully Connected based models use Adam as optimizers.
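To illustrate the convolutional MoCap branch (Model2 and Model3 above), here is a hedged Keras sketch that treats the (200, 189) summary as a one-channel image. The "same" padding and the single-channel reshape are assumptions, since the paper only specifies the filter counts, kernel size, stride and dropout.

```python
from tensorflow.keras import layers, Model

def mocap_emotion_model(input_shape=(200, 189, 1), num_classes=4, dense_head=False):
    inp = layers.Input(shape=input_shape)         # (time chunks, MoCap dims, 1 channel)
    x = inp
    filter_sizes = (32, 64, 64, 128) if dense_head else (32, 64, 64, 128, 128)
    for filters in filter_sizes:                  # 2D convolutions, kernel 3, stride 2
        x = layers.Conv2D(filters, kernel_size=3, strides=2,
                          padding="same", activation="relu")(x)
        x = layers.Dropout(0.2)(x)
    x = layers.Flatten()(x)
    if dense_head:                                # Model3: last conv swapped for Dense(1024)
        x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Setting dense_head=True corresponds to the Model3 variant used for the concatenated 189-dimensional MoCap input; the head-rotation and hand branches in Table 4 are far smaller, being a single LSTM or Dense layer over their 6- and 18-dimensional inputs.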
4.4. Combined Multi-modal Emotion Detection

TABLE 5: Combined Multi-Modal emotion detection models and accuracy

Model                             Accuracy
Text + Speech Model1              65.38%
Text + Speech Model2              67.41%
Text + Speech Model3              69.74%
Text + Speech + MoCap Model4      67.94%
Text + Speech + MoCap Model5      68.58%
Text + Speech + MoCap Model6      71.04%

Figure 4: Simulation results for the network.

For the final part of our experiment we train models using all three modes discussed above. We first use the text transcript and speech based vectors for one model. We try an architecture which uses Model1 of the text processing and Model1 of the speech processing architectures, both without their output neurons, with their final hidden layers concatenated into a 512 dimensional hidden layer feeding into 4 output neurons. This architecture does not yield good results. We then try a new model (Model1) which uses 3 Dense layers (1024, 512 and 256 neurons) for each of the text and speech features, concatenated and followed by another Dense layer with 256 neurons using ReLU and Dropout of 0.2, and 4 output Softmax neurons. Our Model2 uses 2 stacked LSTMs of 256 units followed by a Dense layer with 256 neurons for the text data, and 2 Dense layers with 1024 and 256 neurons for the speech data, concatenated and followed by another Dense layer with 256 neurons using ReLU and Dropout of 0.2, and 4 output Softmax neurons. Both Model1 and Model2 use random initializations of 128 dimensional embeddings. For Model3 we replace Model2's embeddings with GloVe embeddings. We then proceed to also include the MoCap data into one complete model. For Model4 we combine the previous Model3 with the MoCap based Model2 and concatenate all three 256 dimensional final outputs. For Model5 we combine the previous Model3 with the MoCap based Model1 and concatenate all three 256 dimensional final outputs. For Model6 we replace the Dense layers in the Speech mode part of the previous Model4 with Attention based LSTM architectures. All the code is available in an open source manner in case the model descriptions aren't very clear.
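The fusion pattern described above can be sketched as three Keras branches whose 256-dimensional outputs are concatenated, roughly following the text and speech branches of Model2. The MoCap branch below is a simplified stand-in (a flatten plus a Dense layer) rather than the convolutional MoCap model the authors actually plug in, the GloVe lookup is replaced by feeding precomputed (500, 300) text vectors directly, and the optimizer choice is an assumption.

```python
from tensorflow.keras import layers, Model

def combined_emotion_model(num_classes=4):
    # per-utterance inputs from the preprocessing step
    speech_in = layers.Input(shape=(100 * 34,), name="speech")  # flattened speech features
    text_in = layers.Input(shape=(500, 300), name="text")       # GloVe-embedded transcript
    mocap_in = layers.Input(shape=(200, 189), name="mocap")     # averaged MoCap summary

    s = layers.Dense(1024, activation="relu")(speech_in)        # speech branch: 2 Dense layers
    s = layers.Dense(256, activation="relu")(s)

    t = layers.LSTM(256, return_sequences=True)(text_in)        # text branch: 2 stacked LSTMs
    t = layers.LSTM(256)(t)
    t = layers.Dense(256, activation="relu")(t)

    m = layers.Flatten()(mocap_in)                               # simplified MoCap branch
    m = layers.Dense(256, activation="relu")(m)

    x = layers.concatenate([s, t, m])                            # fuse the three 256-d outputs
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = Model([speech_in, text_in, mocap_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",  # assumed optimizer
                  metrics=["accuracy"])
    return model
```

The best-performing variant in Table 5 additionally swaps the speech branch's Dense layers for the attention based LSTM encoder described in Section 4.1.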
Figure 5: Accuracy graph of our final model

5. Conclusion and Future Work

As we can see, using multimodal data sources for a complete end-to-end Deep Learning based model can significantly improve accuracy compared to using just one mode such as text, speech or facial expression data. However, which individual architectures we choose for each mode, and how we combine them, have a profound effect on our accuracy. Some interesting approaches for future work could involve more permutations of architectures to obtain better end-to-end learning. We could use transfer learning from an ASR model and fine-tune it for emotion detection. Also, the MoCap data, which is averaged over time periods to obtain 200 dimensional feature vectors, could be made more robust. We release the code publicly, and it could be a good starting point for future research.

References

[1] Haykin, Simon. Neural Networks: A Comprehensive Foundation. 2004.
[2] Radford, Alec, Rafal Jozefowicz, and Ilya Sutskever. "Learning to generate reviews and discovering sentiment." arXiv preprint arXiv:1704.01444 (2017).
[3] Sreeshakthy, M., and J. Preethi. "Classification of human emotion from DEAP EEG signal using hybrid improved neural networks with cuckoo search." BRAIN. Broad Research in Artificial Intelligence and Neuroscience 6.3-4 (2016): 60-73.
[4] Amodei, Dario, et al. "Deep Speech 2: End-to-end speech recognition in English and Mandarin." International Conference on Machine Learning. 2016.
[5] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of Machine Learning Research 15.1 (2014): 1929-1958.
[6] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
[7] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.
[8] Huang, Gao, et al. "Densely connected convolutional networks." arXiv preprint arXiv:1608.06993 (2016).
[9] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
[10] Bassili, John N. "Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face." Journal of Personality and Social Psychology 37.11 (1979): 2049.
[11] Tripathi, Samarth, et al. "Using deep and convolutional neural networks for accurate emotion classification on DEAP dataset." AAAI. 2017.
[12] Busso, Carlos, et al. "IEMOCAP: Interactive emotional dyadic motion capture database." Language Resources and Evaluation 42.4 (2008): 335.
[13] Lee, Jinkyu, and Ivan Tashev. "High-level feature representation using recurrent neural network for speech emotion recognition." INTERSPEECH. 2015.
[14] Lakomkin, Egor, et al. "Reusing neural speech representations for auditory emotion recognition." Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Vol. 1. 2017.
[15] Liu, Yisi, and Olga Sourina. "EEG-based subject-dependent emotion recognition algorithm using fractal dimension." Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on. IEEE, 2014.
[16] Neumann, Michael, and Ngoc Thang Vu. "Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech." arXiv preprint arXiv:1706.00612 (2017).
[17] Chernykh, Vladimir, Grigoriy Sterling, and Pavel Prihodko. "Emotion recognition from speech with recurrent neural networks." arXiv preprint arXiv:1701.08071 (2017).
[18] Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural network and extreme learning machine." Fifteenth Annual Conference of the International Speech Communication Association. 2014.
[19] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[20] Soleymani, Mohammad, et al. "A multimodal database for affect recognition and implicit tagging." IEEE Transactions on Affective Computing 3.1 (2012): 42-55.
