DSP RP 5
Abstract
Speech emotion recognition (SER) is an active topic in speech signal processing. With the growing availability of cheap computing power and the proliferation of data-driven methods, deep learning approaches have become the prominent solutions to SER. Nevertheless, SER remains challenging because datasets are scarce and emotion perception is difficult to model. Moreover, most existing SER networks are adapted from computer vision and natural language processing, so they are not well suited to extracting emotional information. Drawing on findings from brain science on emotion computing, and inspired by the emotion-perception process of the human brain, we propose a perception-based approach that designs a human-like implicit emotional attribute classification and introduces implicit emotional information through multi-task learning. Preliminary experiments show that the proposed method improves unweighted accuracy (UA) by 2.44% and weighted accuracy (WA) by 3.18% (both absolute values) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which verifies the effectiveness of our method.
Keywords Speech emotion recognition, Emotion perception, Implicit emotional attribute, Multi-task learning
solve problems in other fields. How to reasonably use the network in other fields to improve the ability to model emotional information is a major problem in speech emotion recognition. Moreover, the scarcity of datasets and the difficulty of perceiving emotion make the recognition task more challenging. Therefore, the performance of speech emotion recognition is still not ideal. In recent years, brain science has been strengthening its exploration of the structure and function of the various brain areas that produce emotion, thought, and consciousness. For example, emotion perception mainly depends on the limbic system of the human brain [17, 18], and different parts of the limbic system perceive different emotions differently [19, 20]. In this paper, an approach inspired by emotion perception is proposed based on the human brain's perceptive process of emotion, and a human brain-like implicit emotion attribute classification is designed. The implicit emotion attribute information is introduced through multi-task learning to increase the extraction of emotion information. Preliminary experiments show that the unweighted accuracy (UA) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset is improved by 2.44% and the weighted accuracy (WA) by 3.18% (both absolute values), which verifies that the proposed human brain-like implicit emotion attribute classification is beneficial for extracting emotion information.

The paper is organized as follows: Section 2 introduces the characteristics of the human brain's emotion perception. Section 3 introduces the network designed according to these characteristics. Section 4 elaborates the experimental results and conclusions. Section 5 summarizes the paper and looks forward to the future development of speech emotion recognition.

2 Characteristics of the human brain's emotion perception
Brain science has been studying the human brain's emotion perception (the perceptual network, the perceptual process, perceptual characteristics, etc.). Due to the complexity of the human brain's structure and the insufficiency of existing research techniques, it is difficult for brain science to see the full picture of the emotional cognitive mechanism of the human brain. How the human brain processes the information in speech to recognize emotion is still a mystery. However, various imaging technologies and electrophysiological signals are used to establish topological structures of high-order connections of brain networks at various levels, and many experiments have been carried out in related fields such as brain network modeling and emotional computing [21-25]. Existing research results also reveal some potential mechanisms of human emotional cognition. For example, different parts of the brain perceive different emotions differently [21, 25].

Research has shown that emotional perception is linked to a set of structures in the brain called the limbic system, which includes the hypothalamus, the cingulate cortex, the hippocampus, and others. Different parts play different roles in the perception of different emotions. For example, removing the amygdala leads to a reduction in fear, and the posterior hypothalamus may be particularly important for anger and aggression. The frontal cortex is more sensitive to intense emotions such as happiness and anger. The hypothalamus is more active during the experience of sadness. Meanwhile, the hippocampus plays a significant role in the perception of sadness.

Table 1 lists emotions and the parts of the brain associated with them. According to Table 1, the following characteristics of the human brain's emotion perception can be concluded.
structure causes the relevant parts of the human brain to be more sensitive to certain emotions. Could this structure be introduced into speech emotion recognition? Many parts of the human brain are sensitive to the same emotion, but they differ in sensitivity. So, are there similar but not identical internal structures in the various parts of the human brain?
2) Different emotions involve different parts. For example, both anger and sadness are linked to the amygdala, but sadness is linked to the left thalamus, whereas anger is not. This means that the human brain perceives different emotions differently, and it also shows that the differences in perception across parts of the human brain are related to the internal structure of those parts. So what exactly do these parts perceive?
3) One part of the limbic system can be related to the perception of multiple emotions. For example, the amygdala is associated with the perception of happiness, sadness, anger, and other emotions. What common information does the amygdala perceive in these emotions?

According to the above analysis, we propose a conjecture: some parts of the human brain's limbic system can perceive certain attribute information in emotions, and this attribute information is the information shared by the emotions that the part can perceive. The specific attribute is unknown, so this paper calls it implicit attribute information. The human brain's perceptual network for emotion has a certain structure: implicit attribute information is extracted by some parts of the limbic system and then sent to the brain center, together with the underlying information, for emotion recognition.

Based on these assumptions, this paper adopts artificial neural networks to simulate the parts of the human limbic system, drawing on their mechanism of extracting and perceiving emotional information, and proposes a method based on emotional perception. According to the limbic system's perception of different emotions, an implicit attribute classification is defined, and the implicit attribute information is extracted through multi-task learning and then added into the emotion recognition system as auxiliary information.

3 Emotion recognition based on emotion perception
3.1 Implicit emotion attribute classification design
A part of the human limbic system can sense some implicit attribute information of emotions. If a part can sense some emotions, it means that these emotions contain the same implicit attribute information. At the same time, the fact that the same emotion can be sensed by different parts suggests that these parts have some similar structural features. Based on this, this paper designs an implicit emotion attribute classifier to simulate the emotion perception of the human brain. An implicit attribute binary classifier is designed according to whether the perception of different emotions by a certain part is related, as shown in Table 2. For example, because the frontal cortex of the human brain has a strong perception of happiness and anger, it is assumed that happiness and anger share the same implicit attribute (denoted as attribute A), while the other emotions (sadness, neutral, etc.) do not have attribute A. So in the binary classifier for attribute A, the classification label of happiness and anger is set to 1, while the classification label of the other emotions is set to 0. In this paper, four parts with a high degree of distinction are introduced, and four implicit attributes A-D and corresponding classifiers are defined, as shown in Table 2.

Table 2 Implicit attribute classification

Attribute   Part               Label 1         Label 0
A           Frontal cortex     Happy, angry    Sad, neutral
B           Thalamus           Sad             Happy, angry, neutral
C           Hippocampus        Sad, angry      Happy, neutral
D           Anterior neurite   Happy           Angry, sad, neutral
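To make the label construction concrete, the following is a minimal sketch (not the authors' code) of how the four implicit attribute labels of Table 2 can be derived from the emotion label of each utterance; the function name and dictionary layout are illustrative, since the paper only specifies the mapping itself.

```python
# Minimal sketch: derive the binary labels of implicit attributes A-D
# (Table 2) from an utterance's emotion label.
ATTRIBUTE_MAP = {
    "A": {"happy", "angry"},   # frontal cortex
    "B": {"sad"},              # thalamus
    "C": {"sad", "angry"},     # hippocampus
    "D": {"happy"},            # "anterior neurite" in the paper
}
EMOTIONS = {"happy", "angry", "sad", "neutral"}

def implicit_attribute_labels(emotion: str) -> dict:
    """Return the 0/1 label of each implicit attribute for one emotion."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    return {attr: int(emotion in positives)
            for attr, positives in ATTRIBUTE_MAP.items()}

# Example: an "angry" utterance gets A=1, B=0, C=1, D=0, so each training
# sample carries one 4-class emotion target plus four binary attribute targets.
print(implicit_attribute_labels("angry"))
```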
3.2 Multi-task learning based on implicit attribute classification
In order to train the four implicit attribute classifiers and the speech emotion classifier at the same time, this paper adopts multi-task learning: the loss of each implicit emotion attribute binary classification task is added to the total loss of the model with a certain weight. At the same time, referring to the structured network characteristics of the human brain for emotion recognition, the network in this paper also introduces the implicit emotion attribute information, which increases the difference between different emotions and helps the network distinguish them.
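The paper does not report the loss weights, so the following is only a hedged sketch of the weighted multi-task objective described above, using cross-entropy for the four-class emotion task and binary cross-entropy for each attribute task; the weight values in `lambda_attr` are placeholders, not the authors' settings.

```python
import torch.nn.functional as F

def multitask_loss(emotion_logits, attr_logits, emotion_target, attr_targets,
                   lambda_attr=(0.1, 0.1, 0.1, 0.1)):
    """Total loss = emotion cross-entropy + weighted sum of the four
    implicit-attribute binary cross-entropies (weights are illustrative)."""
    loss = F.cross_entropy(emotion_logits, emotion_target)
    for logits, target, weight in zip(attr_logits, attr_targets, lambda_attr):
        loss = loss + weight * F.binary_cross_entropy_with_logits(
            logits.squeeze(-1), target.float())
    return loss
```

Here `attr_logits` and `attr_targets` are lists holding one logit tensor and one 0/1 target tensor per attribute A-D.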
The specific structure of the network is shown in Fig. 1. The network consists of four CNN layers, four implicit emotional attribute classifiers, a gated recurrent unit (GRU), and an attention layer. Firstly, the logMel spectrogram extracted with Librosa [26] is used as the input of the network; the extracted features are then fed into the four consecutive CNN layers.
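The architecture description breaks off at the page boundary here, so the following is only a rough sketch, under stated assumptions, of the pipeline as described up to this point: a logMel front end computed with Librosa and four CNN layers followed by a GRU, a simple attention layer, an emotion head, and four implicit attribute heads. All concrete settings (sampling rate, n_mels, frame sizes, channel counts, hidden size) are assumptions, not values reported in the paper.

```python
import librosa
import torch
import torch.nn as nn

def logmel(path, sr=16000, n_mels=40):
    """logMel spectrogram via Librosa; frame settings are assumed, not the paper's."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel).T                      # (frames, n_mels)

class PerceptionSERNet(nn.Module):
    """Sketch of the described shape: 4 CNN layers -> GRU -> attention ->
    emotion head + four implicit attribute heads (all sizes illustrative)."""
    def __init__(self, n_mels=40, hidden=128, n_emotions=4, n_attrs=4):
        super().__init__()
        chans = [1, 16, 32, 64, 64]
        self.cnn = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.BatchNorm2d(chans[i + 1]), nn.ReLU())
            for i in range(4)])
        self.gru = nn.GRU(chans[-1] * n_mels, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)                    # frame-level attention scores
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.attr_heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_attrs))

    def forward(self, x):                                  # x: (batch, frames, n_mels)
        h = self.cnn(x.unsqueeze(1))                       # (batch, C, frames, n_mels)
        b, c, t, f = h.shape
        h, _ = self.gru(h.permute(0, 2, 1, 3).reshape(b, t, c * f))
        w = torch.softmax(self.att(h), dim=1)              # attention over frames
        utt = (w * h).sum(dim=1)                           # utterance-level embedding
        return self.emotion_head(utt), [head(utt) for head in self.attr_heads]
```

The emotion logits and the list of attribute logits can be fed directly to the weighted multi-task loss sketched in Section 3.2.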
Table 3 Binary classification results of the implicit attributes

Attribute   Accuracy
A           84.59%
B           95.06%
C           77.54%
D           66.39%
4.3 Emotion classification based on multi-task learning
In order to verify the effect of different implicit attributes on speech emotion recognition, different multi-task experiments are designed in this paper; that is, multi-task experiments based on one to four implicit attributes are carried out. The experimental results are shown in Table 4, from which the following points can be inferred.

Table 4 Emotion recognition results based on multi-task learning

Multi-task learning      UA       WA
A                        68.57%   67.79%
B                        68.91%   67.42%
C                        68.31%   67.23%
D                        67.44%   66.67%
A+B                      68.61%   65.36%
A+C                      67.46%   67.23%
A+D                      67.91%   64.23%
B+C                      67.25%   67.42%
B+D                      68.75%   68.16%
C+D                      68.23%   68.16%
A+B+C                    67.84%   65.92%
A+B+D                    68.30%   66.29%
A+C+D                    68.14%   66.29%
B+C+D                    69.38%   67.04%
A+B+C+D                  70.42%   67.79%
Baseline (single task)   67.98%   64.61%

1) The experimental performance when adopting all four attributes is the best: the UA is 2.44% higher and the WA 3.18% higher than the baseline system (single task, absolute values), indicating the effectiveness of our method based on emotion perception. Drawing lessons from brain science's exploration of the structure and function of the emotion-producing brain areas, combined with deep learning to simulate the neural network of the human brain, is conducive to extracting emotional information.
2) In the multi-task experiments with a single attribute, A, B, and C all improve performance, but introducing attribute D reduces it, which is consistent with the binary classification results. It is possible that the emotional information in implicit attribute D is unstable. However, in the experiments that mix attribute D with other attributes, the system performance is generally better than the baseline. This indicates that although attribute D alone is not stable, it can play a positive role in emotion classification with the assistance of other attribute information, which further supports the credibility of the implicit attribute hypothesis in this paper. Meanwhile, the experimental results indirectly suggest that the brain regions involved in recognizing emotions may share some kind of information: when recognizing the same emotion, the parts of the limbic system differ in their level of sensitivity and combine a variety of information to jointly judge the emotional changes of the people around. This also further supports the credibility of the implicit attribute hypothesis.
3) Mixing two or three attributes gives both higher and lower performance, indicating that the effects of different implicit attributes on emotion recognition may be complementary or may cancel each other. The effect of using all four attributes is the best, demonstrating that the positive effect of these implicit attributes requires more attributes to participate. Different parts of the human limbic system carry certain implicit emotional attributes; the experiments show that multiple parts are involved when recognizing a certain emotion, but different parts are in different states of inhibition or activation when recognizing emotions.
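The paper does not define UA and WA explicitly; the sketch below assumes the usual SER convention, where WA is the overall accuracy over all test utterances and UA is the mean of the per-class recalls.

```python
import numpy as np

def wa_ua(y_true, y_pred, n_classes=4):
    """Weighted accuracy (overall accuracy) and unweighted accuracy
    (mean per-class recall), assuming the common SER definitions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float((y_true == y_pred).mean())
    recalls = [float((y_pred[y_true == c] == c).mean())
               for c in range(n_classes) if (y_true == c).any()]
    ua = float(np.mean(recalls))
    return wa, ua

# Toy example over the four emotion classes 0-3
wa, ua = wa_ua([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 3, 0])
print(f"WA={wa:.2%}, UA={ua:.2%}")
```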
5 Conclusion
Brain science is constantly studying the brain structures and underlying mechanisms of emotion. Combining this with artificial intelligence's continuing efforts to simulate the human brain, this paper draws on the mechanism of human emotional perception and designs an implicit emotional attribute classification to imitate the brain structures related to emotion. Implicit emotion information is introduced through multi-task learning as auxiliary information for recognizing emotion, improving speech emotion recognition performance and demonstrating the effectiveness of the proposed network. In the future, we can learn more from the human brain's mechanism of cognizing emotions and add further attribute information. Meanwhile, we can also adopt approaches other than multi-task learning to mine emotional information.

Abbreviations
SER       Speech emotion recognition
UA        Unweighted accuracy
WA        Weighted accuracy
IEMOCAP   The Interactive Emotional Dyadic Motion Capture
RNN       Recurrent neural networks
DNN       Deep neural networks
CNN       Convolutional neural networks
LSTM      Long short-term memory
GRU       Gated recurrent unit

Acknowledgements
The authors thank the editors and the anonymous reviewers for their constructive comments and useful suggestions.

Authors' contributions
Liu proposed and designed the complete experiments for the paper and was a major contributor in writing the manuscript. Cai and Wang worked together to filter and sort the emotional data and wrote the programs for the entire set of experiments. After the experiments, all authors analyzed the results to further corroborate the conjecture. All authors read and approved the final manuscript.

Funding
Not applicable.

Availability of data and materials
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset is the most widely used dataset in SER. The dataset is accessible at https://sail.usc.edu/iemocap/. It consists of 12 h of emotional speech performed by 10 actors from the Drama Department of the University of Southern California. The performances are divided into two parts, improvised and scripted, depending on whether the actors perform according to a fixed script. The dataset is labeled with nine types of emotion: anger, excitement, happiness, sadness, frustration, fear, neutral, surprise, and other. Our experiments use four main emotions: anger, excitement, happiness, and sadness. Furthermore, the experimental code implementation is available at https://github.com/FlowerCai/speech-emotion-recognition. Further SER research can build on these experiments in the future.

Declarations

Ethics approval and consent to participate
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Received: 5 April 2022  Accepted: 29 April 2023
References
1. L.S.A. Low, N.C. Maddage, M. Lech, L.B. Sheeber, N.B. Allen, Detection of clinical depression in adolescents' speech during family interactions. IEEE Trans. Biomed. Eng. 58(3), 574-586 (2010)
2. X. Huahu, G. Jue, Y. Jian, in Proceedings of the 2010 International Conference on Artificial Intelligence and Computational Intelligence, vol. 1. Application of speech emotion recognition in intelligent household robot (IEEE, Sanya, 2010), pp. 537-541
3. W.J. Yoon, Y.H. Cho, K.S. Park, in International Conference on Ubiquitous Intelligence and Computing. A study of speech emotion recognition and its application to mobile services (Springer, Hong Kong China, 2007), pp. 758-766
4. K. Han, D. Yu, I. Tashev, in Proceedings of Interspeech 2014. Speech emotion recognition using deep neural network and extreme learning machine (ISCA, Singapore, 2014)
5. M. Chen, X. He, J. Yang, H. Zhang, 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process. Lett. 25(10), 1440-1444 (2018)
6. X. Wu, S. Liu, Y. Cao, X. Li, J. Yu, D. Dai, X. Ma, S. Hu, Z. Wu, X. Liu, et al., in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speech emotion recognition using capsule networks (IEEE, Brighton UK, 2019), pp. 6695-6699
7. Y. Xu, H. Xu, J. Zou, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). HGFM: a hierarchical grained and feature model for acoustic emotion recognition (IEEE, Barcelona, 2020), pp. 6499-6503
8. D. Priyasad, T. Fernando, S. Denman, S. Sridharan, C. Fookes, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Attention driven fusion for multi-modal emotion recognition (IEEE, Barcelona, 2020), pp. 3227-3231
9. A. Nediyanchath, P. Paramasivam, P. Yenigalla, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition (IEEE, Barcelona, 2020), pp. 7179-7183
10. C.H. Park, D.W. Lee, K.B. Sim, Emotion recognition of speech based on RNN. Nurse Lead. 4, 2210-2213 (2002). https://doi.org/10.1109/ICMLC.2002.1175432
11. J. Niu, Y. Qian, K. Yu, in The 9th International Symposium on Chinese Spoken Language Processing. Acoustic emotion recognition using deep neural network (IEEE, Singapore, 2014), pp. 128-132
12. Q. Mao, M. Dong, Z. Huang, Y. Zhan, Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimedia 16(8), 2203-2213 (2014)
13. J. Lee, I. Tashev, in Proceedings of Interspeech 2015. High-level feature representation using recurrent neural network for speech emotion recognition (ISCA, Dresden Germany, 2015)
14. M.A. Jalal, E. Loweimi, R.K. Moore, T. Hain, in Proceedings of Interspeech 2019. Learning temporal clusters using capsule routing for speech emotion recognition (ISCA, Graz, 2019), pp. 1701-1705
15. R. Shankar, H.W. Hsieh, N. Charon, A. Venkataraman, in Proceedings of Interspeech 2019. Automated emotion morphing in speech based on diffeomorphic curve registration and highway networks (ISCA, Graz, 2019), pp. 4499-4503
16. S. Siriwardhana, T. Kaluarachchi, M. Billinghurst, S. Nanayakkara, Multimodal emotion recognition with transformer-based self supervised feature fusion. IEEE Access 8, 176274-176285 (2020)
17. S. Costantini, G. De Gasperis, P. Migliarini, in 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). Multi-agent system engineering for emphatic human-robot interaction (IEEE, Sardinia Italy, 2019), pp. 36-42
18. H. Okon-Singer, T. Hendler, L. Pessoa, A.J. Shackman, The neurobiology of emotion-cognition interactions: fundamental questions and strategies for future research. Front. Hum. Neurosci. 9, 58 (2015)
19. Q. Ma, D. Guo, Research on brain mechanisms of emotion. Adv. Psychol. Sci. 11(03), 328 (2003)
20. S. Lee, S. Yildirim, A. Kazemzadeh, S. Narayanan, in Ninth European Conference on Speech Communication and Technology. An articulatory study of emotional speech production (ISCA, Lisbon Portugal, 2005)
21. J. LeDoux, Rethinking the emotional brain. Neuron 73(4), 653-676 (2012)
22. V.R. Rao, K.K. Sellers, D.L. Wallace, M.B. Lee, M. Bijanzadeh, O.G. Sani, Y. Yang, M.M. Shanechi, H.E. Dawes, E.F. Chang, Direct electrical stimulation of lateral orbitofrontal cortex acutely improves mood in individuals with symptoms of depression. Curr. Biol. 28(24), 3893-3902 (2018)
23. P. Fusar-Poli, A. Placentino, F. Carletti, P. Landi, P. Allen, S. Surguladze, F. Benedetti, M. Abbamonte, R. Gasparotti, F. Barale et al., Functional atlas of emotional faces processing: a voxel-based meta-analysis of 105 functional magnetic resonance imaging studies. J. Psychiatry Neurosci. 34(6), 418-432 (2009)
24. F. Ahs, C.F. Davis, A.X. Gorka, A.R. Hariri, Feature-based representations of emotional facial expressions in the human amygdala. Soc. Cogn. Affect. Neurosci. 9(9), 1372-1378 (2014)
25. M.D. Pell, Recognition of prosody following unilateral brain lesion: influence of functional and structural attributes of prosodic contours. Neuropsychologia 36(8), 701-715 (1998)
26. B. McFee, C. Raffel, D. Liang, D.P. Ellis, M. McVicar, E. Battenberg, O. Nieto, in Proceedings of the 14th Python in Science Conference, vol. 8. librosa: audio and music signal analysis in Python (SciPy, Texas US, 2015), pp. 18-25
27. C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, S.S. Narayanan, IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335-359 (2008)