Multimodal emotion recognition from audio and video

Nithyasri S *, Hemavarthini B and Bharathi N. Gopalsamy

Computer Science and Engineering, SRM Institute of Science and Technology, Vadapalani, Chennai, India.

International Journal of Science and Research Archive, 2024, 12(01), 142–149

Publication history: Received on 14 March 2024; revised on 26 April 2024; accepted on 29 April 2024

Article DOI: https://doi.org/10.30574/ijsra.2024.12.1.0723

Abstract
As humans, we want to interact with a machine as we would with a person, in a way that it understands us, advises us,
and looks after us without human supervision. Current systems lack empathy and user understanding in spite of very
effective logical reasoning. By predicting users' emotions, we can identify their needs and cater to them as well as
possible. Emotion recognition in video and audio has many potential applications, including conversational agents,
recommendation systems, smart homes, mental illness care, virtual reality games, remote physical training, education
and car-hailing services. The aim of this project is to develop an automatic emotion detection system based on voice
and facial expression. We propose a model that highlights contextual, multimodal information for emotion detection and
recognition. If systems can understand emotions and respond appropriately to behavioral patterns, we can anticipate
artificial agents becoming our cognitive-consulting partners in our daily lives. Additionally, the project can be
expanded to make interactions more natural and better suited to handling complex situations.

Keywords: Multimodal Emotion Recognition; Computer Vision; Deep Learning; Audio and Video; OpenCV; Librosa

1. Introduction
The objective of this project is to develop a method of identifying emotions from both audio and video inputs. This
involves creating an ensemble model that combines data gathered from a microphone (audio) and a camera (video) in a
clear and easily understandable manner. Recognizing emotions is an ongoing process that is sensitive to a person's
current state, meaning that the emotions associated with a particular action may differ from person to person. To ensure
meaningful communication, it is important to interpret emotions accurately and in real time. In everyday life, emotion
recognition is vital for social interactions, and emotions are a significant factor in determining human behavior; it is
something that comes naturally to most humans, and emotions have a large effect on human decision making. Advanced
systems that involve human users, such as conversational agents and recommendation systems, are better off integrating
emotions into the loop. This not only enhances system performance but also helps in the integration of various
components.

2. Literature Review
The literature review below provides an overview of the research on face detection, evaluation metrics for imbalanced
classification, deep learning for multimodal integration, data fusion, and emotion recognition using different modalities.

Face detection is an essential task in computer vision and has numerous applications.

Brownlee [2] provides a comprehensive tour of evaluation metrics that can be used to assess the performance of
classifiers on imbalanced datasets. Imbalanced datasets are datasets where the number of samples in each class is
not equal. This can be problematic for classifier models because they tend to be biased towards the majority class and
perform poorly on the minority class. Therefore, it is important to evaluate the performance of these models on
imbalanced datasets using appropriate metrics. Evaluation metrics for imbalanced classification are an important area
of research that deals with the problem of imbalanced datasets in classification tasks.

Dimitri [3] provides a short survey of the applications, future perspectives, and challenges of deep learning for
multimodal integration. The paper does not provide a detailed technical analysis of the deep learning models used for
multimodal integration. This may limit its usefulness for readers who are interested in the technical aspects of these
models. Deep learning for multimodal integration is an emerging area of research that deals with the integration of
information from multiple modalities such as text, image, and audio.

Haileleol [4] provides an introduction to data fusion, covering the basic concepts, types of data fusion, and applications.
There are two main types of data fusion: "low-level" fusion and "high-level" fusion. Low-level fusion involves combining
raw data from multiple sources, such as sensor data or imagery, while high-level fusion involves combining processed
data or information that has already been analyzed. Data fusion is another area of research that deals with the
integration of data from multiple sources to produce more accurate and reliable results.

Somashekhar and Chiranth [5] propose a real-time human emotion recognition system using facial expressions.
Emotion recognition using different modalities has received significant attention in recent years. In this paper, the
authors focus on facial expressions as a modality for emotion recognition. Emotion recognition systems may exhibit
bias, as they rely on data that may be limited in diversity and representativeness. This can result in inaccurate and unfair
predictions, especially towards underrepresented groups.

Zexu et al. [6] propose a multi-modal attention mechanism for speech emotion recognition. The authors evaluate their
proposed multi-modal attention mechanism on two benchmark datasets: the EmoReact dataset and the IEMOCAP
dataset. The proposed multi-modal attention mechanism consists of two main components: a visual attention module
and an audio attention module. One potential limitation is that the proposed model requires both audio and visual inputs
to perform emotion recognition, which may not always be available or feasible in real-world settings.

Zhang et al. [7] propose a real-time video emotion recognition system based on reinforcement learning and domain
knowledge. The proposed system relies on pre-defined facial feature extraction methods, which may not be optimal for
all types of data. This could limit the generalizability of the system to different datasets and applications.

Zhao et al. [8] review the state-of-the-art in multimodal emotion recognition using deep learning. The authors discuss
the various modalities that can be used for emotion recognition, including facial expressions, speech, text, and
physiological signals such as electroencephalography (EEG) and electrocardiography (ECG). They also discuss the
challenges associated with each modality, such as the need for high-quality sensors for physiological signals and the
difficulty of accurately interpreting text-based data.

Chen et al. [9] propose a multimodal approach to emotion recognition using audio and video information. The hybrid
approach used in the paper combines the benefits of both early and late fusion techniques. In this approach, the features
extracted from both the audio and video modalities are first processed separately using separate models. The authors
use two deep neural networks (DNNs) to extract features from the audio and video data respectively. The DNNs are pre-
trained using a large amount of data and fine-tuned on the emotion recognition task.

Fan et al. [10] propose an attention-based multimodal feature fusion for emotion recognition. The proposed approach
consists of two main components: a multimodal feature extraction network and an attention-based feature fusion
network. The multimodal feature extraction network processes the input data from each modality separately and
extracts a set of features that capture the relevant information. These features are then fed into the attention-based
feature fusion network, which selectively combines the features from different modalities using attention weights. The
resulting fused features are then used to predict the emotional state of the person.

Yang et al. [11] propose a novel deep learning-based framework for multimodal emotion recognition. The framework
consists of three main components: a feature extractor, a modality-specific module, and a fusion module. The framework
was evaluated on several benchmark datasets, including the AffectNet, IEMOCAP, and MSP-IMPROV datasets. The
experimental results showed that the proposed framework outperformed several state-of-the-art methods for
multimodal emotion recognition, demonstrating the effectiveness of the proposed approach.

Khorrami et al. [12] propose a deep multimodal fusion of acoustic and text features for speech emotion recognition. The
authors use a deep neural network architecture that consists of two branches: one for processing acoustic features and
one for processing text features. The acoustic branch processes the speech signal using a convolutional neural network
(CNN) and a long short-term memory (LSTM) network. The text branch processes the transcription of the speech using
a bidirectional LSTM network. The two branches are then fused using a concatenation layer and a fully connected layer.

In conclusion, the literature review above provides an overview of the research on face detection, evaluation metrics
for imbalanced classification, deep learning for multimodal integration, data fusion, and emotion recognition using
different modalities. It highlights the different approaches, challenges, and future directions of these areas of research.

3. Dataset Description
The datasets used for the audio modality are:

● Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - 768 audio files representing 4 emotions
● Surrey Audio-Visual Expressed Emotion (SAVEE) - 240 audio files representing 4 emotions
● Toronto Emotional Speech Set (TESS) - 1600 audio files representing 4 emotions

The dataset used for the video modality is:

● Extended Cohn-Kanade dataset (CK+48) - 501 images representing 4 emotions

The four emotions are 'angry', 'happy', 'sad' and 'fear'.

4. Challenges
This project faces three challenges, which are as follows:

4.1. Multimodality
The modalities being merged in this project may contain contradicting or duplicative information within the same
frame. The task at hand is to pinpoint the context by understanding how to detect the interaction within a single
modality over time while also capturing the interaction across modalities. Additionally, selecting relevant features is
another aspect of the problem that we intend to address.

4.2. Data complexity


The input data for this project comes from various modalities, making it naturally heterogeneous with distinct
structures. Furthermore, the temporal aspect adds high-dimensionality to the data. The data is also prone to noise, and
distinguishing noisy sources from outliers and preprocessing them present another problem that we intend to solve.
One solution is to replace the noisy modality with a proxy feature vector obtained from the other modalities.

4.3. Emotion representation


Two emotion models are being considered: discrete and continuous. In the discrete approach, emotions are categorized
into distinct states, such as anger or sadness, whereas, in the continuous approach, emotions are placed in a continuous
space where the primary dimensions are more ambiguous. Discrete emotions are easier to comprehend, while the
continuous space can represent a broad range of emotions. For the sake of simplicity, we have chosen the discrete model
for the remainder of the work, but there is potential for further research in the continuous approach. Our objective is to
create an algorithm for discrete emotion recognition that can extract relevant features, handle conflicting and
redundant ones, perform dimensionality reduction and multimodal fusion of the extracted features, and predict an
emotional representation that is realistic and easy to describe mathematically.

5. System Design
The system is designed to incorporate the architecture, modules, interfaces, and data needed to satisfy the specified
requirements. At a high level, it is split into two modalities: audio unimodal and video unimodal.

Figure 1 High Level application architecture for multimodal emotion recognition

The video and audio unimodal models are trained individually and then fused using the late fusion method, also known as
decision-level fusion. The predicted output after decision-level fusion is the output of the multimodal system, which
considers both audio and video inputs. Decision-level fusion is chosen because it is considered to work better in the
case of varying input modalities. In addition, fusion in general is used to reach a conclusion when predicting the
emotion in cases where the emotions exhibited by an individual verbally and non-verbally are contradictory (differ slightly).

An overview of the application has also been laid out for a specific use case involving a user and an admin.

Figure 2 Use case diagram for multimodal emotion recognition

A user has access to audio unimodal, video unimodal and multimodal options. Upon choosing the audio unimodal option,
the user is redirected to a page where their microphone is enabled and the audio is recorded in real time and analyzed
automatically within seconds. Internally, this goes through several stages. Once the audio is recorded, features are
extracted from the input and stacked horizontally in the form of an array using the Librosa library; the model
pre-trained by the admin is then consulted and a prediction is returned. Similarly, for the video unimodal option, the
video is recorded in real time by capturing frames from the camera, and the face is detected using the Haar Cascade
methods provided by the Python OpenCV library. This is done by cropping a region of interest and using Haar features to
detect the face. Haar features consist of a series of square-shaped functions that have been rescaled and are comparable
to the convolution kernels used in Convolutional Neural Networks. These features are applied to all pertinent facial
regions to identify a human face. Once detected, relevant features are extracted, and the pre-trained model designated
by the administrator is consulted to make a prediction, which is subsequently returned. When a user chooses the
multimodal option, both of the above-mentioned operations are performed simultaneously and the two outputs undergo data
fusion, where a weighted average is taken. Based on the output of this process, an emotion is
returned to the user.
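
As an illustration of the frame capture and face detection step described above, the following minimal sketch uses OpenCV's bundled frontal-face Haar cascade. The capture device index, the 48x48 crop size and the variable names are assumptions made for illustration, not the exact implementation used in this work.

import cv2

# Load the pre-trained frontal-face Haar cascade shipped with OpenCV (XML model file).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)          # default camera; index is illustrative
ret, frame = cap.read()            # capture a single frame
if ret:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Detect faces at multiple scales; each detection is a bounding box (x, y, w, h).
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]      # crop the region of interest
        roi = cv2.resize(roi, (48, 48))   # assumed input size of the emotion model
        # `roi` would then be passed to the pre-trained video emotion model.
cap.release()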

6. Feature Detection and Extraction


To detect features in the video modality, the Haar Cascade algorithm is used; it can detect objects in images at any
scale and location and runs in real time with low computational complexity. The pre-trained models are saved in XML
files, which OpenCV methods can read. OpenCV also ships ready-made cascade models for detecting faces, eyes, upper and
lower body parts, and license plates. For the audio modality, Librosa, a Python package for analyzing audio and music
files, is employed to extract features. This involves applying techniques such as the zero-crossing rate (ZCR),
Chroma-STFT, and MFCCs to audio input snippets, resulting in a stacked feature array. ZCR computes the zero-crossing
rate of an audio time series. Chroma-STFT leverages the short-term Fourier transform to compute chroma features, which
represent the tonal aspects of audio for pitch and signal structure classification. Librosa provides an API for MFCC
extraction and data processing, with arguments to set the frame length, hop length, and number of MFCCs; based on these
arguments, a 2D array is returned. Finally, a Mel spectrogram, which represents frequencies above a certain threshold on
a logarithmic (Mel) scale, is computed and stacked horizontally onto the feature array.
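
A minimal sketch of this feature-stacking step with Librosa is shown below. The load duration and offset, the number of MFCCs and the time-averaging of each feature are assumptions made for illustration; the exact parameters are not reported here.

import numpy as np
import librosa

def extract_audio_features(path):
    # Load a short snippet; duration and offset are illustrative trimming choices.
    y, sr = librosa.load(path, duration=3, offset=0.5)
    stft = np.abs(librosa.stft(y))  # magnitude spectrogram for Chroma-STFT

    # Each feature is averaged over time to give a fixed-length vector.
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=y).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)

    # Horizontal stacking yields a single 1-D feature array per audio snippet.
    return np.hstack((zcr, chroma, mfcc, mel))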

7. Modeling
After the feature extraction process, Convolutional Neural Networks (CNNs) are used to construct the model, which
includes three types of layers: convolutional layers, max pooling layers, and fully connected layers. The initial layer
is the convolutional layer, responsible for identifying the different features of the input images. Following the
convolutional layer, a pooling layer is applied to decrease the computational burden by reducing the connections between
layers and conducting independent operations on each feature map. In the max pooling method, the most prominent element
of each region of the feature map is chosen. The features extracted by the convolutional layer are thereby generalized,
and the network can recognize them independently, significantly decreasing computation in the network. The fully
connected layer, which is typically located towards the end of the CNN architecture just before the output layer,
comprises neurons, weights and biases, and involves two auxiliary operations: flatten and dropout. In the flatten
operation, the input from the previous layers is compressed into a single vector and fed into the fully connected layer;
the mathematical operations performed at this stage initiate the classification process. Two fully connected layers are
preferred over a single layer, although they are computationally more expensive. To reduce overfitting on the training
dataset, a dropout layer is added, dropping some neurons during training and resulting in a smaller effective model.
Activation functions are also used, adding non-linearity to the network and approximating continuous and complex
relationships between variables. As this is a multi-class classification into four emotions, a softmax activation
function is used in the output layer. The CNN's layers reduce the need for human supervision, and the dropout layer
enhances the model's performance.
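
A minimal Keras sketch of the layer arrangement described above is given below. The filter counts, kernel sizes, dense layer widths and dropout rate are illustrative assumptions, since the exact hyperparameters are not listed here.

from tensorflow.keras import layers, models

def build_emotion_cnn(input_shape=(48, 48, 1), num_classes=4):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),                       # downsample feature maps
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                                  # flatten to a single vector
        layers.Dense(128, activation="relu"),              # first fully connected layer
        layers.Dropout(0.5),                               # drop neurons to curb overfitting
        layers.Dense(64, activation="relu"),               # second fully connected layer
        layers.Dense(num_classes, activation="softmax"),   # 4 emotion classes
    ])
    return model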

Figure 3 Architecture of a CNN

8. Evaluation Metric, Loss Function and Optimizer


The evaluation metric used to assess the model is accuracy, and the loss function chosen is categorical cross-entropy
with the Adam optimizer. By choosing accuracy, we ensure that the features chosen for modeling are the right ones and
complement the predicted output. Categorical cross-entropy is commonly used when dealing with two or more output labels
in multi-class classification models. The output label is assigned a one-hot category encoding value, where categorical
labels are represented as a series of 0s and 1s. The Adam optimizer, an extension of stochastic gradient descent (SGD),
finds applications in deep learning areas such as computer vision and natural language processing. Adam converges
rapidly to a “sharp” minimum, whereas stochastic gradient descent is computationally heavier and converges to a “flat”
minimum but generally performs well on the test data.
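
Assuming the Keras sketch from the previous section, wiring these choices together could look as follows; the learning rate, epoch count and batch size shown are illustrative, and X_train/y_train are placeholders for the extracted features and one-hot labels.

from tensorflow.keras.optimizers import Adam

model = build_emotion_cnn()
model.compile(optimizer=Adam(learning_rate=1e-3),    # learning rate is an assumption
              loss="categorical_crossentropy",       # one-hot, multi-class labels
              metrics=["accuracy"])                  # evaluation metric
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32)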

Figure 4 Formula for calculating categorical cross-entropy
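
The figure itself is not reproduced in this text. For reference, the standard form of categorical cross-entropy over N samples and C classes, which the figure presumably depicts, is

L_CE = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})

where y_{i,c} is the one-hot ground-truth label and \hat{y}_{i,c} is the predicted softmax probability of class c for sample i.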

9. Results and Conclusion


The video unimodal model attained a training accuracy of 99.33%, with an accuracy of 91% on the testing data. The audio
unimodal model attained an accuracy of 98.31%, with an accuracy of 86.9491% on the test data.

Decision-level fusion (weighted average) is used to obtain the final result.
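
A minimal sketch of this weighted-average, decision-level fusion is given below. The fusion weights and the example probability vectors are illustrative assumptions; the weights actually used are not stated here.

import numpy as np

EMOTIONS = ["angry", "happy", "sad", "fear"]

def fuse_predictions(audio_probs, video_probs, w_audio=0.5, w_video=0.5):
    # Combine the per-class softmax outputs of the audio and video models.
    fused = w_audio * np.asarray(audio_probs) + w_video * np.asarray(video_probs)
    return EMOTIONS[int(np.argmax(fused))]

# Example: the audio model leans towards 'sad', the video model towards 'fear';
# the weighted average resolves the disagreement (here it returns 'sad').
print(fuse_predictions([0.1, 0.1, 0.5, 0.3], [0.05, 0.1, 0.35, 0.5]))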

Figure 5 Training and Testing Loss and Training and Testing Accuracy

Figure 6 Confusion matrix of the audio unimodal

Figure 7 Training and Testing Loss and Training and Testing Accuracy

Figure 8 Confusion matrix of the video unimodal

Decision-level fusion of the audio and video modalities is performed with the purpose of predicting the emotion being
conveyed both verbally and non-verbally. Emotion recognition is a dynamic process that keeps evolving with the emotional
state of the person being observed, and the emotions expressed by each individual differ. There is therefore huge
potential for expansion in different directions, such as adding further modalities to detect hand gestures, posture,
movements and so on. The scope of the project can also be extended into the field of IoT by including biological signals
in order to incorporate multi-dimensional inputs.

Compliance with ethical standards

Disclosure of conflict of interest


No conflict of interest to be disclosed.

References
[1] Behera, Girija Shankar. Face Detection with Haar Cascade. Towards Data Science, 24 December 2020.
[2] Brownlee, Jason. Tour of Evaluation Metrics for Imbalanced Classification. Machine Learning Mastery, 8 January 2020.
[3] Dimitri, G. M. A Short Survey on Deep Learning for Multimodal Integration: Applications, Future Perspectives and
Challenges.
[4] Tibebu, Haileleol (Haylat T). Introduction to Data Fusion. Medium, 29 January 2020.
[5] Somashekhar, B. M., and Chiranth, N. L. Real Time Human Emotion Recognition. IJRTI, 2018.
[6] Pan, Zexu, et al. Multi-modal Attention for Speech Emotion Recognition. arXiv:2009.04107, 9 September 2020.
[7] Zhang, K., Li, Y., Wang, J., Cambria, E., and Li, X. (2022). Real-Time Video Emotion Recognition Based on
Reinforcement Learning and Domain Knowledge. IEEE Transactions on Circuits and Systems for Video Technology, 32(3),
1034-1047. doi: 10.1109/TCSVT.2021.3072412
[8] Zhao, Y., Liu, Y., & Feng, G. (2021). Multimodal emotion recognition using deep learning: A review. IEEE
Transactions on Affective Computing, 12(1), 1-1. doi: 10.1109/TAFFC.2021.3102672
[9] Chen, H., Huang, Y., & Liu, J. (2020). A multimodal approach to emotion recognition using audio and video
information. Multimedia Tools and Applications, 79(27), 19387-19406. doi: 10.1007/s11042-020-09419-3
[10] Fan, W., Wang, S., & Li, J. (2020). Multimodal emotion recognition using attention-based multimodal feature
fusion. IEEE Access, 8, 163361-163372. doi: 10.1109/ACCESS.2020.3026525
[11] Yang, H., Liu, Y., & Jiang, Y. (2019). A novel deep learning-based framework for multimodal emotion recognition.
IEEE Transactions on Neural Networks and Learning Systems, 30(5), 1365-1376. doi:
10.1109/TNNLS.2018.2844739
[12] Khorrami, P., Le Paine, T., Brady, D., & Leung, T. (2019). Speech emotion recognition using deep multimodal fusion
of acoustic and text features. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 7370-7374). doi: 10.1109/ICASSP.2019.8683336
