
PERSONALIZED MUSIC THERAPY WITH

EMOTION RECOGNITION AND


LOCALIZATION

A PROJECT REPORT

Submitted By

BALAJI S. 3122226001002

in partial fulfillment for the award of the degree

of

MASTER OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

Department of Computer Science and Engineering


Sri Sivasubramaniya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University)

Rajiv Gandhi Salai (OMR), Kalavakkam - 603110

June 2024
Sri Sivasubramaniya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University)

BONAFIDE CERTIFICATE

Certified that this project report titled “VIDEO CAPTIONING VIA


RETAINING HIGH MEMORABLE VIDEO CLIPS” is the bonafide work of
“DEEPAK SRIRAM. S (3122226001004),” who carried out the project work
under my supervision.
Certified further that to the best of my knowledge the work reported herein does
not form part of any other thesis or dissertation on the basis of which a degree or
award was conferred on an earlier occasion on this or any other candidate.

Dr. T.T. MIRNALINEE Dr. A. BEULAH


Head of the Department Supervisor
Professor, Assistant Professor,
Department of CSE, Department of CSE,
SSN College of Engineering, SSN College of Engineering,
Kalavakkam - 603 110 Kalavakkam - 603 110

Place:
Date:

Submitted for the examination held on. . . . . . . . . . . .

Internal Examiner External Examiner


ACKNOWLEDGEMENTS

I thank GOD, the almighty for giving me strength and knowledge to do this project.

I would like to express my deep sense of gratitude to my guide Dr. A. BEULAH,
Assistant Professor, Department of Computer Science and Engineering, for her
valuable advice and suggestions as well as her continued guidance, patience and
support that helped me to shape and refine my work.

My sincere thanks to Dr. T.T. MIRNALINEE, Professor and Head of the Department of Computer Science and Engineering, for her words of advice and encouragement. I would like to thank our Project Coordinator, Dr. CHITRA BABU, Professor, Department of Computer Science and Engineering, for her valuable suggestions throughout this project. I would also like to thank the Review Committee Members, Dr. D. VENKATA VARA PRASAD, Professor, Department of Computer Science and Engineering, and Dr. Y. V. LOKESWARI, Associate Professor, Department of Computer Science and Engineering, for their valuable feedback.

I express my deep respect to the founder, Dr. SHIV NADAR, Chairman, SSN Institutions. I also express my appreciation to our Principal, Dr. V. E. ANNAMALAI, for all the help he has rendered during this course of study.

I would like to extend my sincere thanks to all the teaching and non-teaching staff of our department who have contributed directly or indirectly during the course of my project work. Finally, I would like to thank my parents and friends for their patience, cooperation, and moral support throughout my life.

DEEPAK SRIRAM S.
ABSTRACT

Video captioning presents a complex challenge when managing online video content. Conventional approaches often neglect the aspects of human attention and memory, concentrating primarily on visual, audio, or text cues. An innovative framework merges video captioning with memorability prediction to tackle this issue. The system forecasts memorability scores for video segments in order to craft more captivating and pertinent summaries. To enhance performance, explainable AI tools such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) were leveraged to pinpoint and prioritize the features most crucial to the model. The video summarization model, which incorporates both audio and video inputs, splits videos into 6-second segments and crafts captions only for segments with high memorability scores. Through experiments conducted on the Memento10K and Video Storytelling datasets, considerable advancements were observed in forecasting video memorability and enriching video summarization, offering a new vantage point on video content analysis and user engagement.
TABLE OF CONTENTS

ABSTRACT iii

LIST OF TABLES viii

LIST OF FIGURES ix

1 INTRODUCTION 1
1.1 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . 2
1.4 OBJECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 ORGANISATION OF THE REPORT . . . . . . . . . . . . . . . 4

2 LITERATURE SURVEY 6

3 PROPOSED SYSTEM 24
3.1 SYSTEM ARCHITECTURE . . . . . . . . . . . . . . . . . . . . 24
3.2 VIDEO MEMORABILITY MODEL . . . . . . . . . . . . . . . . 26
3.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Ensembled Model - Support Vector Regressor, Decision
Tree, RidgeRegressor . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Dual Memorability Prediction . . . . . . . . . . . . . . . 30
3.2.4 Explainable AI(XAI) . . . . . . . . . . . . . . . . . . . . 30
3.3 VIDEO CAPTIONING MODEL . . . . . . . . . . . . . . . . . . 32

3.3.1 Input Frames . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Flattening and Dense Layers . . . . . . . . . . . . . . . . 36
3.3.4 Concatenation of Features . . . . . . . . . . . . . . . . . 37
3.3.5 LSTM Encoders . . . . . . . . . . . . . . . . . . . . . . 37
3.3.6 Decoder Initialization . . . . . . . . . . . . . . . . . . . . 37
3.3.7 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.8 Caption Generation . . . . . . . . . . . . . . . . . . . . . 38
3.3.9 Training and Evaluation . . . . . . . . . . . . . . . . . . 38

4 IMPLEMENTATION 41
4.1 DATASET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 HARDWARE SPECIFICATIONS . . . . . . . . . . . . . . . . . 42
4.3 SOFTWARE SPECIFICATIONS . . . . . . . . . . . . . . . . . . 42
4.4 FEATURE EXTRACTION . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 VGG16 Feature Extraction . . . . . . . . . . . . . . . . . 43
4.4.2 C3D Feature Extraction . . . . . . . . . . . . . . . . . . 44
4.4.3 EfficientNetB0 Feature Extraction . . . . . . . . . . . . . 44
4.5 ENSEMBLED (SVR + DT + RIDGE) MODEL . . . . . . . . . . 45
4.5.1 Feature Extraction and Ensemble Model Integration . . . 45
4.5.2 Memorability Prediction Process . . . . . . . . . . . . . . 45
4.6 EXPLAINABLE AI . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6.1 Local Interpretable Model-Agnostic Explanations . . . . . 47
4.6.2 SHapley Additive exPlanations . . . . . . . . . . . . . . . 48
4.7 VIDEO CAPTIONING MODEL . . . . . . . . . . . . . . . . . . 49
4.7.1 Video Encoding and Audio Encoding . . . . . . . . . . . 49
4.7.2 Caption Encoding . . . . . . . . . . . . . . . . . . . . . 50
4.7.3 Attention Mechanism . . . . . . . . . . . . . . . . . . . . 50
4.7.4 Decoding and Caption Generation . . . . . . . . . . . . . 50

5 EXPERIMENTAL INVESTIGATIONS AND RESULTS 52


5.1 EXPERIMENTAL OUTPUTS . . . . . . . . . . . . . . . . . . . 52
5.1.1 Frame Extraction from the Video Data . . . . . . . . . . . 52
5.1.2 Feature Vector Results . . . . . . . . . . . . . . . . . . . 53
5.2 VIDEO CAPTIONING OUTPUT . . . . . . . . . . . . . . . . . 54
5.3 EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.1 Experiment 1 - VGG16, EfficientNetB0, C3D Features
with 10 percent of the data using ANN . . . . . . . . . . 56
5.3.2 Experiment 2 - VGG16, EfficientNetB0, C3D Features
with 50 percent of data using ANN . . . . . . . . . . . . 56
5.3.3 Experiment 3 - VGG16, EfficientNetB0, C3D Features
with all training data using ANN . . . . . . . . . . . . . . 57
5.3.4 Experiment 4 - VGG16, EfficientNetB0, C3D Features
with 50 percent of data using Support Vector Regressor
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3.5 Experiment 5 - VGG16, EfficientNetB0, C3D Features
with all training data using Support Vector Regressor model 58
5.3.6 Experiment 6 - VGG16, EfficientNetB0, C3D Features
with all training data using Decision Tree Model . . . . . 59
5.3.7 Experiment 7 - Ensemble Model with VGG16,
EfficientNetB0, C3D Features on all training data . . . . . 59

5.3.8 Experiment 8 - Ensemble Model with VGG16,
EfficientNetB0, C3D Features after XAI . . . . . . . . . . 59
5.4 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6 ISSUES PERTAINING TO SUSTAINABILITY 63


6.1 SOCIETY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 SAFETY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 LEGAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.4 ENVIRONMENT . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7 CONCLUSION AND FUTURE WORK 66


7.1 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . 67

LIST OF TABLES

4.1 Sample Memorability Data . . . . . . . . . . . . . . . . . . . . . 41


5.1 Analysis of Spearman Correlation among Experiments . . . . . . 60
5.2 Analysis of Mean Squared Error among Experiments . . . . . . . 60
5.3 Analysis of Mean Absolute Error among Experiments . . . . . . . 61
5.4 Comparison of BLEU Scores with the existing models . . . . . . 61
5.5 Comparison of ROUGE-L Scores with the existing models . . . . 62

LIST OF FIGURES

3.1 Video Captioning Architecture . . . . . . . . . . . . . . . . . . . 25


3.2 Memorability Prediction Model Architecture . . . . . . . . . . . 26
3.3 VGG16 Architecture . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 EfficientNetB0 Feature Extraction . . . . . . . . . . . . . . . . . 28
3.5 C3D Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Ensemble Model . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Video Captioning Model Architecture . . . . . . . . . . . . . . . 33
3.8 ResNetv2 Architecture . . . . . . . . . . . . . . . . . . . . . . . 35
3.9 MobileNetv2 Architecture . . . . . . . . . . . . . . . . . . . . . 36
4.1 LIME Explanations . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 SHAP Explanations . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Frame Generation from Video Data . . . . . . . . . . . . . . . . 52
5.2 Feature Extracted from EfficientNetB0 . . . . . . . . . . . . . . . 53
5.3 Feature Extracted from C3D . . . . . . . . . . . . . . . . . . . . 53
5.4 Feature Extracted from VGG16 . . . . . . . . . . . . . . . . . . . 53
5.5 Memorability Score for a Single Short Video Instance . . . . . . . 54
5.6 Store High Memorable clips in folder . . . . . . . . . . . . . . . 54
5.7 Input Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.8 Folder Containing High Memorable Video Frame for Long Video 55
5.9 Summary of the video . . . . . . . . . . . . . . . . . . . . . . . . 55

CHAPTER 1

INTRODUCTION

1.1 BACKGROUND

Video captioning is a powerful tool for engaging audiences and achieving


marketing goals. With the increasing popularity of videos as a medium for
communication and entertainment, it has become more important than ever to
create content that stands out and resonates with viewers. Conventional video
captioning methodologies have their limitations, which lead to the introduction of
an innovative framework in this research. One way to achieve this is through the
use of video memorability filtering techniques, which can help optimize video
content for maximum memorability by predicting the elements that contribute
most to memorability. Video memorability refers to the extent to which a video is
remembered by its viewers. Video memorability filtering techniques use machine
learning models to analyze various features of a video, such as its visual and
auditory characteristics, and generate a video memorability score that indicates its
likelihood of being remembered by viewers. The breakthrough framework
proposed in this thesis leverages deep learning features and machine learning
models to predict memorability scores for different video segments. By
identifying the elements that contribute most to memorability, video producers
and marketers can produce more effective and engaging content that connects
with their audience. This thesis explores the use of video memorability filtering
techniques to enhance the effectiveness of video captioning. Specifically, it
focuses on the use of an LSTM encoder-decoder for video captioning and an

ensemble model for predicting video memorability scores. By using these


techniques, the thesis aims to demonstrate the potential of video memorability
filtering to create more engaging and effective video content.

1.2 MOTIVATION

Traditional video captioning methods often overlook a vital factor—the human


aspect of attention and memory. These approaches primarily focus on visual,
audio, or text cues, disregarding the intricate relationship between a viewer’s
memory and the memorability of the content. This research aims to bridge this
gap by introducing a novel framework that seamlessly integrates video captioning
with memorability scores. By incorporating memorability scores into the video
captioning process, we can create more engaging and relevant captions that
closely align with human cognitive functions.

1.3 PROBLEM STATEMENT

Professionals across various domains consistently struggle to create captivating


video captions that capture the essence of content and align with human memory
patterns. Existing methods often fail to accurately predict the memorability of
different video segments, resulting in suboptimal captioning outcomes. To bridge
this gap, a comprehensive understanding of memory assessment and the seamless
integration of cutting-edge technologies are crucial. A detailed evaluation of
current video captioning methodologies and memory assessment methods reveals

significant gaps. These gaps primarily arise due to the absence of a unified
framework that seamlessly incorporates memory scores into the video captioning
process. Addressing these gaps is essential to advance the state-of-the-art in video
captioning and enhance the overall user experience. Given a long video, the goal is to find the segments that are memorable and to convert those segments into textual descriptions, which results in video captioning.

1.4 OBJECTIVES

The main objective of this research is to develop a novel video captioning


framework that integrates memory scores using deep learning features and
machine learning models. This framework aims to enhance the quality of video
captions by accurately predicting the memorability of different video segments.
The objectives are:

• To conduct a comprehensive literature review to identify existing video captioning methodologies and memory assessment methods.

• To propose an integrated solution for video captioning that seamlessly incorporates memory scores.

• To evaluate the effectiveness of the proposed framework through rigorous experimentation on the Memento10K dataset.

• To use Explainable AI to optimize the memorability prediction model.

• To generate captions for each segment of a video that has a high memorability score.

• To create a summary of the video by considering both audio and video features.

By achieving these objectives, this research aims to contribute to the advancement


of video captioning techniques and enrich the understanding of the interplay
between human memory and multimedia content.

1.5 ORGANISATION OF THE REPORT

The Report is organised as follows:

• Chapter 1, Introduction, presents the problem statement and goals and explains the importance of this study. It also briefly describes the methodology used in traditional video captioning and its limitations.

• Chapter 2, Literature Survey, reviews various studies on video memorability prediction and captioning and discusses the methodologies and limitations of those studies.

• Chapter 3, Proposed System, discusses the proposed system for video summarisation using memorability scores, with a detailed explanation of the video memorability and summarisation modules along with the methodology.

• Chapter 4, Implementation, elaborates on the feature extraction module and the model training module for video memorability prediction. Multiple models, including Artificial Neural Networks (ANN), Support Vector Regression (SVR), and Decision Trees (DT), were trained to find the best model for the video memorability task; the Explainable AI (XAI) module helps optimize the memorability prediction results, and the Video Captioning module provides meaningful summaries of the highly memorable video clips.

• Chapter 5, Experimental Investigations and Results, presents the experimental results of the models trained to predict memorability scores and of the model created to provide text summaries for the videos. The memorability prediction models were compared based on their Spearman correlation to find the best model, and the summarization model was compared based on BLEU and ROUGE-L scores.

• Chapter 6, Issues Pertaining to Sustainability, discusses the ethical and legal practices that the model should follow while capturing and using the video data provided by the end user.

• Chapter 7, Conclusion and Future Work, summarizes the key findings and contributions of the research work and outlines future enhancements that can further improve the model's effectiveness.
CHAPTER 2

LITERATURE SURVEY

The Gated Video Memorability Filtering paper [12] takes a close look at methods
used to summarize videos. Specifically, it focuses on a framework developed by
Lu and Wu. They use a Convolutional Neural Network (CNN) to extract visual
and temporal features from video segments, and a Recurrent Neural Network
(RNN) with a gating mechanism to select the most memorable parts. This
approach aims to create engaging video summaries by effectively utilizing deep
learning models. While Lu and Wu’s methodology has its strengths in prioritizing
memorability and leveraging deep learning, it also has some limitations. Firstly,
the evaluation only considers a limited dataset of cooking videos, which raises
questions about how well the framework would work for other types of videos.
Additionally, the use of black-box deep learning models in their approach brings
up concerns about interpretability. It would be helpful to have more insights into
the internal decision-making processes of the models. Furthermore, the
discussion doesn’t address the potential computational cost of training and
running these deep learning models, which may limit the practicality of
implementing this framework in resource-constrained environments. To advance
the effectiveness and practicality of video summarization methodologies, it’s
essential to address these limitations. Further research should explore the
generalizability across different video genres and provide more insights into the
decision-making processes of the model. It’s also important to consider the
computational resources required for implementing the framework in real-world
situations with limited resources. By addressing these considerations, we can


enhance the overall performance and applicability of video summarization


techniques.

“Tiered Representations for Video Memorability Prediction” by Dumont et al. [4] introduces a new method called the M3-S framework that aims to predict
video memorability. What’s unique about this approach is that it utilizes tiered
representations at different complexity levels – low, mid, and high. By
considering visual cues, scene properties, and temporal patterns, the framework
provides a more comprehensive understanding of video content. One of the
strengths of this framework is its modular architecture, which consists of four
modules. This allows for targeted processing of different aspects of memorability,
leading to improved interpretability. Compared to traditional methods that focus
on single-level features, the M3-S framework takes a fresh approach by
incorporating tiered representations. Another notable aspect is the consideration
of contextual similarity, which helps enhance memorability predictions when
compared to approaches that ignore context. However, it’s important to note that
the paper acknowledges some limitations. One concern is the computational
complexity associated with extracting tiered representations. This raises questions
about its real-time applicability. The authors also mention that the model might
struggle with complex semantics, extreme pixel intensity, or extreme motion,
which they hope to address in future enhancements. Additionally, the evaluation
focuses on a single dataset of short videos, which might limit its generalizability
across various video genres and lengths. Despite these challenges, the M3-S
framework stands out by introducing tiered representations and explicit modeling
of contextual similarity. This allows for a broader range of factors influencing
memorability to be captured. Looking ahead, future research can focus on
mitigating the computational complexity and exploring other applications of the

M3-S framework, such as video summarization and editing. Overall, Dumont et al.'s
M3-S framework, despite its challenges, makes a contribution to the field by
offering a more nuanced understanding of video memorability.

Li et al. have introduced the Adaptive Multi-Modal Ensemble Network [7]


(AMEN), a revolutionary framework for video memorability prediction that
integrates insights from both visual and audio aspects. By utilizing a 3D
convolutional neural network (CNN) for video features and a recurrent neural
network (RNN) for audio features, AMEN incorporates modality-specific
predictions through a dynamic attention mechanism. This multi-modal strategy
successfully addresses a crucial element of human memory and surpasses
single-modal methods that focus solely on visual cues. Moreover, the dynamic
attention mechanism enhances AMEN’s adaptability by prioritizing important
features, resulting in more accurate and robust video memorability predictions.
One of the key strengths of AMEN lies in its comprehensive fusion of visual and
audio modalities, recognizing the significance of both components in the
memorability prediction process. The dynamic attention mechanism also boosts
the model’s flexibility, allowing it to handle variations in video content and audio
styles effectively. Consequently, AMEN outperforms existing single-modal and
static fusion methods in terms of performance. However, it’s important to note
that Li et al. also highlight several weaknesses such as potential computational
complexity, interpretability challenges, and the need for further evaluation across
a wider range of video datasets. These insights emphasize the importance of
refining AMEN to ensure broader applicability and a deeper understanding.
When compared to other existing works, AMEN distinguishes itself by adopting
a multi-modal approach, incorporating separate models for each modality along
with a dynamic attention mechanism. This departure from static fusion methods

and single-modal approaches significantly improves the accuracy and flexibility


of video memorability prediction, demonstrating a significant advancement in this
field. The authors also propose future research directions aimed at mitigating
identified limitations, including techniques to reduce computational complexity,
enhance interpretability, and evaluate AMEN across diverse video genres and
lengths.

Guinaudeau and Xalabarder's study, “Textual Analysis for Video Memorability Prediction” [5], takes a close look at both visual and textual
representations of videos using two sequential models. While the emphasis is on
visual features, the authors highlight the significant contribution of textual
content. They examine results from different sources, including manual
descriptions and automatic captions, to gain a better understanding of the factors
that influence viewer recall and contribute to high memorability scores. One of
the notable strengths of the study is its holistic approach, which goes beyond
conventional visual features to explore the role of textual representations. By
analyzing both manual descriptions and automatic captions, the authors provide
valuable insights into the effectiveness of various approaches. They find a
correlation between longer, more precise texts and higher memorability scores,
offering practical guidance for optimizing video descriptions and captions.
Additionally, the study acknowledges the black-box nature of the sequential
models used for textual analysis. This highlights the need for further research to
understand how these models extract meaningful information from text to predict
memorability. Addressing these considerations will enhance the robustness and
interpretability of future studies in the realm of textual analysis for video
memorability prediction. Despite these considerations, Guinaudeau and
Xalabarder’s work makes a significant contribution by explicitly examining the

role of textual information, providing complementary insights to the existing


research that primarily focuses on visual features. Moving forward, it is crucial to
explore broader datasets, delve into various textual features, and develop
interpretable textual analysis models to deepen our understanding of the intricate
interplay between visual and textual cues in human memory. In conclusion, this
study propels the field of video memorability forward, opening doors for
continued exploration and refinement in this dynamic domain.

Lin et al.'s exploration of video summarization introduces a robust framework that leverages a Hierarchical Long Short-Term Memory (LSTM) network [9]. This
framework showcases notable strengths in capturing the dynamics of video
content. By incorporating an attention mechanism, it enhances summarization by
focusing on the most important frames and subshots. The use of cost-sensitive
learning prioritizes crucial frame selection, resulting in impressive ROUGE
scores on benchmark datasets. Despite these strengths, there are certain
limitations to the proposed framework. The computational complexity of training
and deploying deep learning models poses challenges for real-time applications.
Additionally, the black-box nature of the model hinders understanding its internal
workings, especially regarding frame and subshot selection. The limited diversity
of the dataset, mainly consisting of short news videos, raises concerns about the
generalizability of findings to different video genres and lengths. Moreover,
attention mechanisms, while powerful, introduce potential subjective biases that
can affect the cultural neutrality of generated summaries. To address these
limitations, Lin et al. suggest insightful future research directions. One
suggestion is to explore lighter-weight deep learning architectures or alternative
approaches to mitigate computational demands. It is crucial to develop
interpretable models that enhance adaptability and address biases in frame and

subshot selection. Testing the framework across a wider range of video lengths
and genres is recommended to assess its generalizability and robustness. Lastly,
techniques to mitigate biases in attention mechanisms are pivotal for ensuring fair
and inclusive video summarization. In conclusion, Lin et al.’s video
summarization framework shows promising results. However, addressing the
identified limitations is vital for further development and responsible application
of the model. By refining the framework to overcome computational constraints,
enhancing interpretability, and ensuring generalizability and fairness, there is
significant potential for advancement in intelligent video analysis and
summarization.

“Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net” by Liu et al. [10] presents a framework that combines deep learning and reinforcement learning to create video summaries. The 3D Spatio-Temporal U-Net is a deep learning architecture specifically designed for analyzing spatial and temporal features in videos. In addition, they use reinforcement learning (RL) to predict actions for each video frame. This
combination results in summaries that highlight the most important events in the
video. One of the strengths of this framework is its versatility. It can be used in
both unsupervised and supervised training modes, which makes it suitable for a
wide range of scenarios and datasets. The summaries generated by the RL agent are impressive because they capture key events while maintaining a smooth flow, combining the 3D Spatio-Temporal U-Net for advanced feature representation with an RL agent for personalized and context-aware summaries.
One challenge with the RL approach is its black-box nature, which makes it
difficult to understand the decision-making process of the RL agent. It would be
great to explore more interpretable RL models to address this issue. Additionally,

the computational complexity, especially for longer videos, is a concern. It would


be smart to investigate lightweight architectures or faster training algorithms to
address this challenge. To make the framework even better, testing it on a wider
range of video genres and lengths is essential to ensure its reliability. We should
also look into potential biases in RL agents inherited from training data and find
ways to mitigate them for fair and unbiased video summarization. ”Video
Summarization through Reinforcement Learning with a 3D Spatio-Temporal
U-Net” by Liu et al. offers an exciting approach to video summarization. By
merging deep learning and reinforcement learning, it provides effective and
user-satisfying summaries. However, to fully unleash its potential, the framework
needs to overcome challenges related to interpretability, computational
complexity, and potential bias. By doing so, we can responsibly and reliably
apply this approach in video analysis and summarization.

A Collaborative Spatial-Temporal Network with Attention Representation for


Video Memorability Prediction [11] introduces VMemNet, an innovative deep
learning architecture designed specifically for predicting video memorability.
This neural network has a wide range of applications, including video editing,
recommendation systems, and educational content creation. It combines two deep
learning subnetworks: a convolutional neural network (CNN) that extracts visual
features and a recurrent neural network (RNN) equipped with an attention
mechanism to capture the temporal dynamics and highlight important video
segments. By taking this collaborative approach, VMemNet is able to gain a
comprehensive understanding of the video content, ultimately making more
accurate predictions about what aspects of the video will be memorable to

viewers. One of the key features of VMemNet is its joint optimization of the
CNN and RNN. By ensuring that these two subnetworks work together
seamlessly, the model is able to produce highly accurate predictions about video
memorability. It outperforms existing methods in the field and achieves
state-of-the-art performance on benchmark datasets. VMemNet’s ability to
integrate spatial and temporal information, coupled with its attention mechanism
for identifying important video segments, sets it apart from other models.
However, there are also some challenges that VMemNet faces. One of these
challenges is its black-box nature, particularly the interpretability of the attention
mechanism. The authors suggest that future research should focus on developing
more transparent models to address this issue. Another challenge is the need to
evaluate VMemNet’s generalizability across different video genres and lengths.
While the model shows promising results on existing datasets, more evaluation is
needed to determine its applicability in various contexts. Additionally, the
computational complexity associated with training and deploying VMemNet may
limit its real-world usability. The authors suggest exploring lighter-weight
architectures or more efficient training methods to address this challenge. Lastly,
there is great potential for personalization with VMemNet, as it can be adapted
based on individual viewer preferences. This intriguing avenue for future research
could provide a deeper understanding of what truly makes videos memorable to
different individuals.

Video Memorability Prediction via Late Fusion of Deep Multi-Modal Features


[7] is a study that introduces an innovative framework that predicts the
memorability of videos by combining various types of features using specialized
deep learning models. The framework incorporates neural networks for analyzing
captions, ResNets for extracting visual features at the frame level, and 3D

ResNets with Fisher Vectors for capturing motion dynamics. By considering


textual, visual, and motion features, Leyva and Sanchez Silva’s approach
effectively takes into account the complex factors that influence video
memorability. One of the strengths of this methodology is its ability to combine
predictions from different modalities through a clever greedy search algorithm,
known as the late fusion mechanism. This flexible approach allows the model to
dynamically adjust the weights assigned to each modality, resulting in improved
accuracy in predicting video memorability. The authors demonstrated the
state-of-the-art performance of their framework on the MediaEval2019 dataset,
highlighting its effectiveness in predicting long-term memorability. However,
despite its merits, Leyva and Sanchez Silva’s research paper does have some
limitations that should be considered. While the late fusion mechanism proves to
be effective, it lacks transparency, making it challenging to fully understand the
contribution and weighting of each modality in the final prediction. Additionally,
the computational complexity of training and running multiple deep learning
models raises concerns about the practicality of real-time applications. It is
crucial to address these limitations in order to improve the interpretability and
practical utility of the framework. Future research could focus on enhancing the
transparency of the late fusion mechanism, allowing researchers to gain deeper
insights into the prediction process. Exploring lighter-weight architectures or
more efficient training strategies could also help alleviate computational
complexity concerns. In conclusion, Leyva and Sanchez Silva’s work represents a
significant advancement in the field of video memorability prediction. However,
it is essential to carefully consider the limitations outlined in their research and
explore potential areas for refinement in future studies. By addressing these
challenges, researchers can unlock the full potential of this framework and further

contribute to the progress of video memorability prediction.

Hasnain Ali et al. propose a novel model to predict episodic video memorability [1], which considers viewers' memory of certain events within the same given video. This research gives a multifaceted understanding of how people engage with and remember various media content; it can be applied to video editing,
advertisement placement and design of educational materials. Strengths of the
suggested framework include focusing on episodic memory, deep features fusion,
and fuzzy-based FastText text analysis. The focus on episodic memory in the
framework distinguishes it from other methods, providing a more detailed
understanding of what viewers recall within a video. This deep features fusion
strategy harnesses numerous such models to extract mechanisms inclusive of
color histograms, object Faster R-CNN detection and Principal Component
Analysis for temporal dynamics. Furthermore, the incorporation of fuzzy-based
FastText text analysis for video annotations brings in semantic consideration
which improves on feature space and helps capture relationships between texts.
Results of experiments on the data set MediaEval 2018 show better results,
reflected in higher Spearman’s rank correlation coefficients for short-term and
long-term memorability of episodic events. Computational complexity and
improved interpretability pose challenges to the proposed framework. The
intricacies of the feature fusion and episodic memory identification process
should be made more transparent. Eliminating these limitations is a crucial step
to refining the model and unlocking its potential applications in personalized
video experiences, optimizing processes of creating videos, and improving
educational materials designed for effective knowledge retention. On the whole,
the work conducted by Ali et al. adds considerable knowledge to our

understanding and prediction of episodic video memorability and constitutes a


promising advancement in this field.

Cohendet et al.'s innovative study “VideoMem” [2] addresses the pressing problem of understanding videos and predicting how memorable they are, which has numerous consequences for anyone involved in editing, recommendation, or the creation of educational content. The paper's key strengths include a large dataset, the investigation of deep neural network models, the use of an attention mechanism, and strong results for predicting both short-term and long-term memorability. The dataset consists of 10,000 videos annotated with memorability scores for short-term and long-term recall, which is one vital strength VideoMem offers to the research field. The use of a 3D convolutional neural network for visual features together with a recurrent neural network (RNN) illustrates the commitment to a multimodal deep learning approach. By adopting a dynamic attention mechanism, the model becomes more accurate at prioritizing important features from both the visual and audio modalities. VideoMem goes beyond providing its dataset and models: it sets a benchmark for state-of-the-art memorability prediction accuracy and serves as an essential source of information for future studies. The dynamic attention mechanism and multi-modal approach indicate the significance of both visual and audio elements for robust predictions, which significantly advances video memorability prediction methods. The paper recognizes the strengths of the system but also points out the computational complexity of large-scale deep learning models, mentioning lightweight architectures and more efficient training methods as possible remedies. It also acknowledges the need for increased interpretability, particularly in relation to the attention mechanism, and suggests that more transparent models should be developed. From this holistic perspective, VideoMem becomes an influential piece of work that confronts open challenges while advancing our knowledge of video memorability, paving the way for more effective optimization of video content and higher levels of user engagement with personalized videos.

Sweeney et al.'s “Memories in the Making: Predicting Video Memorability with Encoding Phase EEG” [14] records neural activity while participants first watch videos, providing critical information about the neurocognitive processes that guide memory formation. The strong points of this study include its innovative approach, the procedure used to generate the EEGMem dataset, a well-designed deep learning framework, and the predictive performance achieved. A key strength of the research is its novel premise that EEG signals obtained during the encoding phase can predict video memorability, opening a path for EEG-based approaches to understanding the interplay between human cognition and video content. Methodologically, a notable contribution is the EEGMem dataset, built from EEG recordings of subjects watching video excerpts, which constitutes an outstanding resource for studying how brain activity relates to memorability. The study uses a deep learning architecture to map EEG signals into the visual domain, extracting features relevant for memorability prediction. This approach highlights the successful incorporation of artificial intelligence into cognitive neuroscience and is promising not only for video memorability research but also for wider applications such as brain-computer interfaces and neural feedback systems. Despite these strengths, the study acknowledges limitations: the EEGMem dataset is relatively small and needs to grow in terms of participant diversity and video content in order to increase the external validity of the results. The authors are also aware of interpretability issues, noting that the current memorability predictions do not make clear which specific EEG features and brain regions contribute most. These considerations shape directions for future research, such as accounting for individual differences in brain activity and memorability, and combining EEG with other physiological measurements for a fuller perspective on how videos are remembered.

The limitations that apply to the “VideoMem” paper, and to the existing research on video memorability more generally, can be described as follows. Current deep learning models for memorability prediction depend heavily on visual features and may overlook the effect of auditory and textual features. Although the VideoMem dataset addresses some of the limitations of earlier datasets by being large, there is a concern that it caters to particular styles or genres of videos, limiting the applicability of the results to real-world videos. In addition, deep learning models are highly complex and nontransparent, which restricts interpretability, so researchers gain very little understanding of which factors make a video memorable according to the proposed models.

Based on this analysis, several central issues emerge for future work. Given the differences in topic and coverage across different kinds of videos, the integration of multimodal features spanning the visual, auditory, and textual modalities has the potential to strengthen memorability models and improve their predictions. Taking additional factors into consideration, such as viewers' watching history and cultural background, can lead to more reliable outputs. Finally, explainable AI techniques would help clarify the structure and components of a memorability model and identify what exactly makes a video memorable to its audience. In light of these issues, the field of video memorability research must continue to progress and adapt, serving as a platform for further research and moving toward practical applications in content production and design, advertising, and other fields concerned with user interest and interaction.

Image captioning, the task of automatically producing a textual description of an image, has recently attracted interest in computer vision and natural language processing. Earlier methodologies were feature-based, with static representations of knowledge, and often required different models for different modalities, which restrained performance. Mao et al. (2015) [13] introduced a novel architecture known as the multimodal recurrent neural network (m-RNN) for image captioning. This framework seamlessly integrates visual and textual information and comprises three key components: a visual encoder consisting of a convolutional neural network (CNN) that encodes the image, a textual encoder based on a recurrent neural network (RNN) that encodes caption-related features, and a fusion layer containing the caption-generating RNN. The network is trained end-to-end, from the input image to the caption it generates, by optimizing the loss function on large-scale datasets. The quantitative analysis of the m-RNN model, using BLEU, METEOR, and CIDEr scores on the Flickr30k and MSCOCO benchmark datasets, provides evidence of improved effectiveness compared to earlier approaches. Although the m-RNN architecture integrates multiple advances in multimodal learning, simply constructing ever more complex architectures still leaves questions of scalability, robustness, and potential bias in AI algorithms that require additional attention. Nevertheless, the proposed m-RNN has greatly contributed to machine learning research, providing innovative directions in multimodal learning and natural language processing.

Li et al. [8] provided a new solution in the form of an application that generates textual summaries of videos. Their work tackles the problem of generating semantically connected and informative textual descriptions of videos. To automate the method, natural language processing and deep learning are adopted to recognize the important events and to synthesize a brief report. By extracting visual characteristics from video frames while utilizing contextual information and language patterns, the model gives an appropriate summary of the events underpinning the videos. Through experimental assessment and comparison with previous work, Li et al. [8] show that the method performs well in generating accurate textual descriptions for different kinds of videos. However, some concerns can be associated with their strategy, including the inability to describe complex scenes or rare events, issues with growing model size when working with massive datasets, and the possibility that biases in the training data or implicit in the model carry over into the generated summaries. It should also be noted that the computational complexity of the proposed method may limit its ability to operate in real time in some instances.

Xu et al. [15] offered a solution to this problem by presenting a unified model that combines deep video representations with compositional text representations, enabling a more seamless integration of video and language information. Their approach employs deep learning to extract high-level features from both the videos and the text. When trained under this hybrid framework across the two modalities, the model is able to capture the relationship between visual and textual data, helping it accomplish tasks such as video captioning and description. As shown in Xu et al. [15], the proposed approach successfully learns the associations between vision and language in experiments and provides results that are better than or comparable to prior art. However, a few potential disadvantages of their approach have to be considered carefully: ambiguities may arise while training deep learning models on large collections of multimodal data, especially when dealing with many different types of visual and textual patterns; large-scale annotated datasets are essential to train a powerful deep learning model and may not always be available; and it remains important to understand the learned representations qualitatively and to critically question the outputs generated by the model.

Yu et al. [16] devised a new solution based on hierarchically-attentive recurrent neural networks to create summaries and story narration from albums of images. Their approach incorporates hierarchical attention to capture both coarse-grained and fine-grained aspects of the input, allowing the model to focus on the right features while summarizing. In this way, the developed approach can create coherent and meaningful text that naturally reflects the content of the original picture albums. Yu et al. [16] present a detailed analysis of the proposed model, and comprehensive experimental evaluations on various datasets and benchmarks prove its effectiveness in generating accurate and coherent summaries and storytelling. However, some drawbacks can be expected: the cost of training hierarchical attention models, the difficulty of handling large-scale image albums or albums containing heterogeneous content, and the amount and expense of the annotated training data required. Therefore, an investigation of the overall applicability of the proposed method, as well as its scalability to real-world problems, is worth exploring.

Cummins et al. [3] have made a similar contribution to this area by examining the procedural crime-drama series CSI. They study features such as the visual presentation of content, the kinds of story arcs used, character development, and emotional interest as components that help determine which factors aid memory retention for the series. By applying multimedia analysis and concepts from psychological memory research, the authors introduce readers to the possibilities offered by combining elements of a well-known TV series. In this study, Cummins et al. [3] endeavor to dissect various aspects of viewer memory by employing psychological methods grounded in real data, exploring the processes by which viewer memory for episodic television content is formed and stored. Possible drawbacks of the proposed approach include the following: the difficulty of transferring findings from one TV series to other series or to a different genre; the difficulty of measuring emotions repeatedly during a program without the measurements losing their value; and the indication that large-sample, low-investment user studies will need to be conducted to validate the proposed paradigms. Finally, any bias in the choice of analysis methods or data samples will limit the broader relevance of the results.

The current literature on video memorability prediction and summarization contains several limitations, as shown in the surveyed work, that our research attempts to mitigate. Ali et al. [1] and Leyva and Sanchez [6] concentrate on deep feature fusion for memorability prediction, but those predictions are not incorporated into a usable video summarisation system. Cohendet et al. [2] and Dumont et al. [4] investigate short-term and long-term memorability but do not integrate explainable AI (XAI) techniques into their models to improve interpretability and accuracy. A primary limitation of the research by Cummins et al. [3] and Guinaudeau and Xalabarder [5] is that they analyze specific genres or focus on textual analysis but do not systematically consider multimodal features that combine audio and video. Further, while Li et al. [8] and Liu et al. [10] study video storytelling and summarization using deep learning techniques such as LSTM networks and reinforcement learning, they do not employ memorability scores as a means of selecting content. Another set of works, by Lu et al. [11] and Lu and Wu [12], proposes attention mechanisms for memorability prediction but does not include XAI techniques for refining and explaining these predictions. Lastly, the studies on deep captioning by Mao et al. [13] and on hierarchical attention mechanisms by Yu et al. [16] do not incorporate the notion of memorability that is inherent in human cognition. We aim to address these shortcomings by integrating memorability prediction with a reliable summarization approach, utilizing XAI tools such as SHAP and LIME for better model accuracy and interpretability, and combining video and audio features for a comprehensive approach.
CHAPTER 3

PROPOSED SYSTEM

3.1 SYSTEM ARCHITECTURE

The video memorability prediction and captioning system is a multi-stage architecture that analyzes the content of videos to predict how likely they are to be remembered and to generate textual captions. The input to the system is a video that is first divided into several segments, each about six seconds long. These segments are the basic units for further processing. The proposed system architecture is shown in Fig. 3.1.

The video summarization framework proposed in this work uses several techniques to recognize the important scenes in a video and to provide captions for those segments. First, the Video Memorability Prediction Module, trained on the Memento10k dataset, employs spatial and temporal features extracted with EfficientNet, VGGNet, and C3D to estimate a memorability score for each video segment. After the model is trained, XAI approaches are used to analyze its decision-making process, locate the features that help most in predicting memorability, and tune the model to enhance performance. The scores are then passed through a thresholding step in which only segments whose scores exceed a predetermined value, chosen through experimentation with the textual captioning output, are selected. The Video Captioning Module then analyzes these memorable segments and extracts textual descriptions for them. Finally, the Textual Captioning step combines all of these captions into one coherent textual summary of the most notable elements of the video. The architecture presented in this work combines machine learning models, deep features, XAI, and threshold-based decision-making to perform memorability-based summarization of video content.

FIGURE 3.1: Video Captioning Architecture
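As an illustration of this flow, the minimal Python sketch below strings the stages together; the helper functions, their names, and the threshold value are placeholders standing in for the modules detailed in Sections 3.2 and 3.3, not part of any released implementation.

    # Minimal sketch of the end-to-end pipeline described above. The helpers
    # split_into_segments, predict_memorability and generate_caption stand in
    # for the modules of Sections 3.2 and 3.3; the threshold is illustrative.
    def summarize_video(video_path, threshold=0.75):
        captions = []
        for segment in split_into_segments(video_path, seconds=6):   # 6-second clips
            score = predict_memorability(segment)                    # ensemble model (Section 3.2)
            if score >= threshold:                                   # keep only memorable clips
                captions.append(generate_caption(segment))           # captioning model (Section 3.3)
        return " ".join(captions)                                    # combined textual summary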

3.2 VIDEO MEMORABILITY MODEL

This module is designed to predict the memorability of video segments, using a model trained on the Memento10k dataset. The dataset contains 10,000 short videos of roughly six seconds each, together with annotations for short-term and long-term memorability. By using this large dataset, our system can learn the intricacies of memorability across a wide variety of content. The module is illustrated in Fig. 3.2.

FIGURE 3.2: Memorability Prediction Model Architecture



3.2.1 Feature Extraction

The core of the Video Memorability Module is a model designed to forecast memorability. This model undergoes rigorous training, beginning with deep learning feature extraction from the video segments. In particular, we use state-of-the-art feature extractors, including EfficientNet, VGGNet, and C3D. These features are key to capturing both spatial and temporal information, which is important for understanding the spatio-temporal dynamics of the content.
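As a concrete illustration of this step, the sketch below samples a fixed number of frames from a segment using OpenCV before the frames are passed to the feature extractors; the frame count and target resolution are illustrative assumptions rather than values fixed by this report.

    # Hedged sketch: uniformly sample RGB frames from a video segment with OpenCV.
    # The frame count (16) and frame size (224x224) are illustrative choices.
    import cv2
    import numpy as np

    def sample_frames(segment_path, n_frames=16, size=(224, 224)):
        cap = cv2.VideoCapture(segment_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB))
        cap.release()
        return np.stack(frames) if frames else np.zeros((0, size[0], size[1], 3), np.uint8)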

3.2.1.1 VGG16 Feature Extraction

VGG16 consists of 16 weight layers, including 13 convolutional layers and 3


fully connected layers. The convolutional layers are characterized by their small
3x3 convolutional filters, with a stride of 1 and zero-padding to maintain spatial
resolution. The depth of the convolutional layers gradually increases, allowing
the network to capture complex hierarchical features.

VGG16, a classic CNN architecture (Fig. 3.3), is used to capture detailed spatial features. Its pre-trained weights provide mid-level visual features that supplement the hierarchical features from the other extractors. VGG16 processes individual video frames, and the activations of its deeper layers are taken as features; these multi-level features encode fine-grained spatial information that contributes to the overall spatio-temporal understanding.
FIGURE 3.3: VGG16 Architecture

3.2.1.2 EfficientNetB0 Feature Extraction

The EfficientNetB0 model, known for its balance between model size and performance, captures high-level visual features from the video segments. Its pre-trained weights yield semantically rich features that support an adequate understanding of the visual content.

The video frames are passed to EfficientNetB0, and features are extracted from its intermediate layer outputs. These features capture hierarchical representations that enable the model to distinguish both global and local visual patterns within each frame.

FIGURE 3.4: EfficientNetB0 Feature Extraction

In the architecture of Fig. 3.4, frames extracted from the video data are fed into the EfficientNetB0 model. Features are taken from the layer just below the classification head, producing a representation of shape (1, 1, 1280) per frame. The per-frame features are then aggregated into a single feature vector for the entire video, which is used for further model training.
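A minimal sketch of this per-frame extraction and aggregation, assuming frames have already been extracted and resized to 224 x 224 (function and variable names are illustrative):

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D

base = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

def video_feature(frames):
    # Average per-frame EfficientNetB0 features into one video-level vector.
    batch = tf.convert_to_tensor(np.stack(frames), dtype=tf.float32)  # (n_frames, 224, 224, 3)
    maps = base(batch, training=False)                                # per-frame spatial feature maps
    per_frame = GlobalAveragePooling2D()(maps)                        # (n_frames, 1280)
    return tf.reduce_mean(per_frame, axis=0).numpy()                  # (1280,)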

3.2.1.3 C3D Feature Extraction

C3D consists of a series of 3D convolutional layers that operate on both spatial


and temporal dimensions. Unlike traditional 2D CNNs that process images frame
by frame, C3D considers the temporal aspect by incorporating 3D convolutions
across consecutive video frames. The architecture typically comprises eight
convolutional layers, five pooling layers, and two fully connected layers.

C3D is designed for spatio-temporal feature extraction, which makes it well suited to our model. By contributing motion dynamics in addition to spatial features, C3D improves the model's ability to capture temporal patterns in video segments.

FIGURE 3.5: C3D Feature Extraction

3.2.2 Ensemble Model - Support Vector Regressor, Decision Tree, Ridge Regressor

The extracted deep learning features are used as inputs to an ensemble model that operates in regression mode. The ensemble combines a Support Vector Regressor, a Ridge Regressor, and a Decision Tree, forming a strong learner that can distinguish patterns and relationships in high-dimensional feature spaces. By supplying the carefully selected features to the ensemble, the model learns to map the intricate correlations between video content and memorability. Once trained, the ensemble predicts a memorability score for each video segment. The ensemble model is illustrated in Fig. 3.6.

FIGURE 3.6: Ensemble Model

3.2.3 Dual Memorability Prediction

Importantly, the model predicts both short-term and long-term memorability. This dual prediction increases the system's flexibility and reveals how content is remembered by viewers at different time scales, which is especially useful in applications such as video editing, content recommendation, and the creation of educational materials.

3.2.4 Explainable AI (XAI)

Explainable AI techniques were integrated with the regression model developed in this research to make its predictions interpretable. Two key techniques were employed: Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). Together they provide a better understanding of how different input features guide the model's individual predictions.

3.2.4.1 Local Interpretable Model-Agnostic Explanations

LIME offers a local view of the regression model's forecasts, which lets us examine the causes of individual predictions. It approximates the model's behaviour in the neighbourhood of a specific input and derives local explanations from that approximation. For each prediction, LIME samples perturbed points around the input and fits an interpretable surrogate, such as a linear regression, to those samples. This reveals how strongly each feature value contributes to the predicted result.

To visualize the LIME explanations, the show_in_notebook function is used inside Jupyter notebooks. The resulting feature-contribution plots are highly interpretable and help explain the reasoning behind the model's individual predictions.

3.2.4.2 SHapley Additive exPlanations

SHAP values quantify each feature's contribution to the model's output. By computing SHAP values over all instances, the overall contribution of each feature to the model's predictions was obtained.

The feature importances derived from SHAP are presented in a summary plot, created with the summary_plot function, which gives an overall overview of feature importance. This made it possible to identify the variables that most strongly shaped the model's predictions. In addition, ICE (individual conditional expectation) plots were examined, since they show how the SHAP values vary with a feature's value for many instances individually; these plots offer a useful understanding of how the model's output changes as a given feature changes.

Combining LIME and SHAP improved the interpretability and reliability of the regression model by providing richer and more detailed information. It allowed the model's decision-making to be examined further and the features most influential on its predictions to be identified, which increases trust in the results and supports better decisions based on the model outputs. Guided by these explanations, a new model was trained using only the 1000 most important features, discarding the features that do not impact the model. This further improves the model in terms of Spearman correlation.

3.3 VIDEO CAPTIONING MODEL

The video captioning model automatically generates captions for videos so that meaningful insights can be obtained without watching them. The architecture follows an Encoder-Decoder paradigm and integrates several components to handle both visual and textual information effectively.

FIGURE 3.7: Video Captioning Model Architecture

3.3.1 Input Frames

The model processes both visual and auditory frames extracted from the video:

• Visual Frames: Resized to a resolution of 224 × 224 with 3 color channels


(RGB).

• Audio Frames: Processed at a resolution of 96 × 64 with 3 channels


representing the Mel-spectrogram features.

3.3.2 Feature Extraction

• Visual CNN (ResNet50): The ResNet50 model pre-trained on ImageNet


was employed to extract the high-level features of the visual images to be
processed. This output feature map has a size of 7 × 7 × 2048.

• Audio CNN (MobileNetV2): The MobileNetV2 model, also pre-trained on ImageNet, extracts auditory features from the Mel-spectrogram frames computed from the audio signal. The output feature map has spatial size 3 × 2 with 1280 channels. A sketch of both backbones is given below.
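A minimal sketch of how the two backbones could be instantiated as frozen feature extractors; the report names both ResNet50 and ResNet50V2, and ResNet50V2 is used here to match Section 3.3.2.1 (everything else is illustrative):

from tensorflow.keras.applications import ResNet50V2, MobileNetV2

# Visual backbone: 224 x 224 RGB frames -> 7 x 7 x 2048 feature maps
visual_cnn = ResNet50V2(weights='imagenet', include_top=False,
                        input_shape=(224, 224, 3))
visual_cnn.trainable = False

# Audio backbone: 96 x 64 Mel-spectrogram "images" -> 3 x 2 x 1280 feature maps
audio_cnn = MobileNetV2(weights='imagenet', include_top=False,
                        input_shape=(96, 64, 3))
audio_cnn.trainable = False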

3.3.2.1 ResNet50v2

ResNet50V2 is an improved variant of ResNet, an architecture developed to overcome the degradation problem in deep networks via residual learning. It uses identity mappings, where skip connections add the input of a block to its output across one or more intermediate layers, making the training of deep networks more efficient. In the V2 variant, batch normalization and ReLU activation are applied before each convolution layer (pre-activation), which stabilizes gradient flow and improves overall performance. The ResNet50V2 network has 50 layers comprising convolutional layers and identity blocks, and it is widely used for image classification and other computer vision tasks because of its robustness and accuracy.

FIGURE 3.8: ResNet50V2 Architecture

3.3.2.2 MobileNetV2

MobileNetV2 is an efficient mobile architecture designed for environments with limited computing resources. It improves on the original MobileNet through inverted residual blocks and linear bottlenecks, which reduce the model size and the required computation with minimal impact on accuracy. The network uses depth-wise separable convolutions, replacing the ordinary convolution with a depth-wise convolution followed by a point-wise convolution. As a result, MobileNetV2 consumes little power and is well suited to mobile devices, trading a small amount of accuracy or latency for a much lighter model. It is typically applied to object detection, segmentation, and classification in low-compute settings.

FIGURE 3.9: MobileNetV2 Architecture

3.3.3 Flattening and Dense Layers

• Visual Features: The ResNet50 network feature map is then flattened and
finally fed through a dense layer to get a 2048-dimensional feature vector.

• Audio Features: Similarly, the feature map extracted from MobileNetV2 is flattened and passed through a dense layer to obtain a 1280-dimensional feature vector.

3.3.4 Concatenation of Features

The visual and audio features are concatenated into a single feature vector of dimension 3328 (2048 visual plus 1280 audio).

3.3.5 LSTM Encoders

• Visual LSTM Encoder: The visual features are given to an LSTM encoder with 512 hidden units, which produces a sequence of hidden states along with its final encoder states.

• Audio LSTM Encoder: Another LSTM encoder, similar to the visual


LSTM encoder, is used for processing the audio features distinctly,
generating its own sequence of hidden states and final encoder states.

3.3.6 Decoder Initialization

The final states of both the LSTM encoders are combined to initialize the decoder
state so that both visual and audio context can be used.
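A minimal sketch of the dual-encoder setup and decoder initialization; the 512 hidden units follow the description above, while combining the encoder states by concatenation is an assumption (the report does not specify the operation):

import tensorflow as tf

units = 512

# One LSTM encoder per modality; both return their sequences and final states.
visual_encoder = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
audio_encoder = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)

def encode(visual_seq, audio_seq):
    # visual_seq, audio_seq: (batch, time, feature_dim) tensors
    v_out, v_h, v_c = visual_encoder(visual_seq)
    a_out, a_h, a_c = audio_encoder(audio_seq)
    # Combine the final states of both encoders to initialize the decoder,
    # so that both visual and audio context are available. With concatenation
    # the decoder cell must use 2 * units hidden units.
    dec_h = tf.concat([v_h, a_h], axis=-1)
    dec_c = tf.concat([v_c, a_c], axis=-1)
    return (v_out, a_out), [dec_h, dec_c]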

3.3.7 Decoder

• Embedding Layer: The captions are then preprocessed using tokenization


and converted into low-dimensional representations.

• Decoder LSTMCell: At each step, the decoder LSTM cell receives the previous word of the sequence (the actual previous word during training, or the previously predicted word during testing) and produces the state used to predict the next word.

• Fully Connected Layer: Transforms the LSTM output to the word space,
generating a word probability distribution, or logits, over the vocabulary
space.

3.3.8 Caption Generation

• Training Phase: Teacher forcing is employed: instead of feeding the model's own prediction back in, the actual next word of the reference caption is given to the LSTM cell at each step. The loss is computed at the word level using Sparse Categorical Crossentropy between the predicted distribution and the actual words.

• Inference Phase: The model generates captions by outputting one word at a time until the end token is produced or a predetermined maximum caption length is reached; a sketch of both decoding modes follows.
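A minimal sketch of one decoder step and of greedy inference; names such as decoder_cell, embedding, and output_layer are illustrative placeholders, not the report's code:

import tensorflow as tf

def decode_step(decoder_cell, embedding, output_layer, word_id, state):
    # One decoder step: embed the input word, update the LSTM state, predict logits.
    x = embedding(word_id)                      # (batch, embed_dim)
    out, state = decoder_cell(x, states=state)  # LSTMCell returns (output, [h, c])
    logits = output_layer(out)                  # (batch, vocab_size)
    return logits, state

def generate_caption(decoder_cell, embedding, output_layer, state,
                     start_id, end_id, max_len=20):
    # Greedy inference: feed the predicted word back in until <end> or max_len.
    word = tf.constant([start_id])
    caption = []
    for _ in range(max_len):
        logits, state = decode_step(decoder_cell, embedding, output_layer, word, state)
        word = tf.argmax(logits, axis=-1)
        if int(word[0]) == end_id:
            break
        caption.append(int(word[0]))
    return caption

During training, the ground-truth previous word would be passed to decode_step instead of the model's own prediction (teacher forcing).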

3.3.9 Training and Evaluation

3.3.9.1 Dataset Preparation

A custom dataset class is developed to load video frames and their corresponding captions from the dataset and preprocess them. The class is also responsible for feeding the data to the model correctly in batches.
39

3.3.9.2 Loss Function

The loss function combines traditional cross-entropy loss with an optional BLEU
score-based loss to balance word-level accuracy and overall caption quality:

• Cross-Entropy Loss: Calculates the dissimilarity between the actual word


sequence and the one predicted by the model.

• BLEU Score Loss: Computes the BLEU similarity between the generated and reference captions and scales it into a penalty term added to the loss; a sketch of the combined objective follows.
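A minimal sketch of such a combined objective, assuming NLTK's sentence-level BLEU for the caption-quality term and a padding id of 0 (both assumptions; the exact weighting in the report is not specified). Note that BLEU itself is not differentiable, so gradients flow only through the cross-entropy part:

import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def combined_loss(real_ids, logits, generated_tokens, reference_tokens, bleu_weight=0.5):
    # Masked word-level cross-entropy (padding id 0 assumed)
    mask = tf.cast(tf.not_equal(real_ids, 0), tf.float32)
    ce = tf.reduce_sum(scce(real_ids, logits) * mask) / tf.reduce_sum(mask)

    # BLEU-based penalty: lower BLEU -> larger added loss
    bleu = sentence_bleu([reference_tokens], generated_tokens,
                         smoothing_function=SmoothingFunction().method1)
    return ce + bleu_weight * (1.0 - bleu)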

3.3.9.3 Training Loop

In each epoch, the training loop processes batches of input consisting of video frames and their captions. For each batch, the train_step function updates the model weights using the computed loss. The Adam optimizer is adopted because it adapts the learning rate during training, as sketched below.
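A minimal sketch of one such training step; the model call signature is an assumption for illustration:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(model, video_frames, captions):
    # One batch update: forward pass with teacher forcing, backprop with Adam.
    with tf.GradientTape() as tape:
        # The model is assumed to return word logits given frames and shifted captions.
        logits = model(video_frames, captions[:, :-1], training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                captions[:, 1:], logits, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss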

Thus a new video captioning model is created that fully utilizes the combined visual and auditory information to produce relevant, semantically linked descriptions of videos. The CNN architectures in the feature extraction phase and the LSTM networks used for sequence modeling allow the model to identify key characteristics and capture the dynamics of the videos. This approach improves caption quality by attending to the most relevant parts of the video, which is essential for video indexing, accessibility, and recommendation applications.
Thus a video memorability prediction model is created and then optimized using Explainable AI. The resulting highly memorable clips are summarized by the video captioning module to produce video summaries.
CHAPTER 4

IMPLEMENTATION

4.1 DATASET

The dataset used for training the video memorability prediction model is the Memento10K dataset. It consists of 10,000 short videos, each six seconds long, together with their corresponding short-term and long-term memorability scores.

Video Short-term Memorability Long-term Memorability


Video1.webm 0.75 0.82
Video2.webm 0.68 0.75
Video3.webm 0.80 0.88
video4.webm 0.40 0.43
video5.webm 0.65 0.60
TABLE 4.1: Sample Memorability Data

The dataset was collected by asking participants to watch 180 videos for short-term memorability and 120 videos for long-term memorability. The participants' task was to press the space bar whenever a video appeared similar to one watched previously; the short-term and long-term memorability scores are calculated from these responses.

The dataset used for video captioning is the same Memento10K dataset, which contains captions for each six-second video; these captions are stored in JSON format inside a text file. To check the correctness of the video captioning model, the Video Storytelling dataset is used, which contains longer videos along with their caption texts.

The Video Storytelling dataset covers several categories of videos, including birthday, camping, Christmas, and vacation. The birthday and camping subsets are used to test the video captioning model's outputs, since the Memento10K dataset contains only short six-second videos, whereas the Video Storytelling videos are longer, with durations ranging from 10 to 30 minutes.

4.2 HARDWARE SPECIFICATIONS

The hardware requirement for video summarization using memorability scores is a minimum of an Intel i3 dual-core processor, which is sufficient for extracting features and training the memorability model. Google Colab's free T4 GPU is used for feature extraction and model training, and since most experiments are conducted in Colab, Google Drive storage is used to hold the video data and the trained models.

4.3 SOFTWARE SPECIFICATIONS

The software environment for video summarization using memorability scores consists of Google Colab and Jupyter Notebook, with Python 3 as the programming language. Packages such as NumPy, pandas, Matplotlib, TensorFlow, scikit-learn, and OpenCV (cv2) are used. NumPy stores the processed frames and speeds up mathematical computations; pandas reads and analyzes the video metadata; matplotlib.pyplot visualizes frame images and video data; TensorFlow extracts features from the video data and trains the ANN models; and scikit-learn builds the machine learning models for video memorability prediction.

4.4 FEATURE EXTRACTION

Feature extraction is a step that involves transforming raw data into a set of relevant
and informative features. These features capture essential characteristics of the
data, enabling more effective representation and analysis.

4.4.1 VGG16 Feature Extraction

The VGG16 model is loaded with weights pre-trained on ImageNet. Twenty frames are extracted per video, and each frame is passed through VGG16. The output of the final convolutional block, which with include_top=False corresponds to the last max-pooling layer, is taken as the feature map and fed into a GlobalAveragePooling2D layer that converts the learned patterns into a feature vector.
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import GlobalAveragePooling2D

# Load VGG16 without the classification head and freeze its weights
base_model = VGG16(weights='imagenet', include_top=False)
base_model.trainable = False

# Convolutional features for one frame, pooled into a single feature vector
features = base_model.predict(tf.expand_dims(frames[0], axis=0))
vgg_feature_vector = GlobalAveragePooling2D()(features)

LISTING 4.1: VGG16 Feature Extraction



4.4.2 C3D Feature Extraction

A sequence of video segments is fed into the C3D model, and its features are
extracted using convolutional layers. Spatial and temporal information is captured
by these features, as they represent the dynamic content of a video more fully.
import tensorflow as tf
from tensorflow.keras.layers import GlobalAveragePooling3D

# Note: C3D is not part of tensorflow.keras.applications; a Keras implementation
# of C3D (here a hypothetical build_c3d() returning the convolutional trunk
# without the fully connected head) is assumed.
base_model = build_c3d(include_top=False)
base_model.trainable = False

# C3D expects a clip of consecutive frames: shape (1, frames, height, width, 3)
features = base_model.predict(tf.expand_dims(clip, axis=0))
C3D_feature_vector = GlobalAveragePooling3D()(features)

LISTING 4.2: C3D Feature Extraction

4.4.3 EfficientNetB0 Feature Extraction

The video segments are given to the EfficientNetB0 model, which extracts features based on hierarchical representations that let the model distinguish both global and local visual patterns within each frame. Image frames of size (224, 224) are fed into the EfficientNetB0 model, which gives an output feature of shape (1, 1, 1280).
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D

model = EfficientNetB0(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3))

features = model.predict(tf.expand_dims(frames_20[0], axis=0))
feature_vector = GlobalAveragePooling2D()(features)

LISTING 4.3: EfficientNetB0 Feature Extraction

4.5 ENSEMBLED (SVR + DT + RIDGE) MODEL

4.5.1 Feature Extraction and Ensemble Model Integration

To feed the ensemble model, features are first extracted from the video segments using three pre-trained models: EfficientNetB0, VGG16, and C3D. EfficientNetB0 contributes hierarchical features, VGG16 provides high-level visual representations, and C3D extracts spatio-temporal features. These features are concatenated into a single feature vector that encapsulates the visual and temporal characteristics and is given to the ensemble model as input.

4.5.2 Memorability Prediction Process

The ensemble is trained on the Memento10k dataset of 10,000 very short videos annotated with both short-term and long-term memorability scores. The ensemble model takes the concatenated feature vectors of the three pre-trained models as a single input. Training involves tuning the SVR to find an optimal hyperplane that minimizes prediction error, selecting a suitable depth for the decision tree, and fitting the ridge regression, so that the model can capture the complex relationships between the input features and the memorability scores.

Once the ensemble is trained, feature vectors of unseen videos are given to the model to test how well it predicts video memorability. Mean Absolute Error (MAE) and Mean Squared Error (MSE) are used as metrics during training, and the Spearman correlation is used for evaluation, as sketched below.
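A minimal sketch of this evaluation, assuming the trained ensemble and a held-out feature matrix and score vector are available (variable names are illustrative):

from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Predict memorability for unseen videos and report the metrics used in this work
y_pred = ensemble_model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rho, _ = spearmanr(y_test, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  Spearman={rho:.3f}")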

Model Creation - Ensemble Model:

from sklearn.ensemble import VotingRegressor

ensemble_model = VotingRegressor([
    ('svr', svm_model),
    ('ridge', ridge_reg),
    ('decision_tree', decision_tree_model)
])
ensemble_model.fit(X_train_scaled, y_train.flatten())

LISTING 4.4: Sample Code - Ensemble Model Creation

4.6 EXPLAINABLE AI

Explainable AI techniques were integrated with the regression model developed in this research to explain its behaviour. Two key techniques were employed: Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). They provide a better understanding of how different input features guide the model's individual predictions.

4.6.1 Local Interpretable Model-Agnostic Explanations

FIGURE 4.1: LIME Explanations

LIME offers a local view of the regression model's forecasts, which lets us examine the causes of individual predictions. It approximates the model's behaviour in the neighbourhood of a specific input and derives local explanations from that approximation: for each prediction, LIME samples perturbed points around the input and fits an interpretable surrogate, such as a linear regression, to them, revealing how strongly each feature value contributes to the predicted result.

From Fig. 4.1, the top 1000 features that contribute to the model's performance can be identified.
lime_weights = lime_exp.as_list()
important_features_lime = [feature[0] for feature in lime_weights[:1000]]
important_features = list(set(important_features_perm + important_features_shap + important_features_lime))

print("Important features:", important_features)

LISTING 4.5: Feature Importance Extraction from LIME

The above code selects the 1000 most important features found by LIME and merges them with those identified by permutation importance and SHAP.
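For context, a minimal sketch of how the lime_exp object used above could be produced with the lime package; the regressor, feature matrix, and feature names are placeholders:

import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train_scaled),
    feature_names=feature_names,
    mode='regression')

# Explain a single video's prediction and keep the top-weighted features
lime_exp = explainer.explain_instance(
    X_test_scaled[0],
    ensemble_model.predict,
    num_features=1000)
lime_exp.show_in_notebook(show_table=True)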

4.6.2 SHapley Additive exPlanations

SHAP values quantify each feature's contribution to the model's output. By computing SHAP values over all instances, the overall contribution of each feature to the model's predictions was obtained.

FIGURE 4.2: SHAP Explanations

Combining LIME and SHAP improved the interpretability and reliability of the regression model by providing richer and more detailed information. It allowed the model's decision-making to be examined further and the features most influential on its predictions to be identified, which increases trust in the results and supports better decisions based on the model outputs.
Guided by the interpretations from the Explainable AI frameworks, a new model was trained using only the 1000 most important features, discarding the features that do not impact the model. This further improves the model in terms of Spearman correlation: after incorporating XAI, the model achieves a Spearman correlation of 0.483.
important_features_perm = [data.feature_names[i] for i in sorted_idx[::-1][:1000]]

shap_summary_vals = np.abs(shap_values).mean(axis=0)
important_features_shap_idx = np.argsort(shap_summary_vals)[::-1][:1000]
important_features_shap = [data.feature_names[i] for i in important_features_shap_idx]

LISTING 4.6: Feature Importance Extraction from SHAP
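For completeness, a minimal sketch of how the shap_values array used above could be obtained with the shap package; the use of KernelExplainer and the sample sizes are assumptions, since the report does not name the explainer type:

import numpy as np
import shap

# Model-agnostic explainer over a small background sample of the training data
background = shap.sample(np.asarray(X_train_scaled), 100)
explainer = shap.KernelExplainer(ensemble_model.predict, background)

# SHAP values for a batch of instances: shape (n_instances, n_features)
shap_values = explainer.shap_values(np.asarray(X_test_scaled[:50]))
shap.summary_plot(shap_values, X_test_scaled[:50], feature_names=feature_names)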

4.7 VIDEO CAPTIONING MODEL

4.7.1 Video Encoding and Audio Encoding

The first step in this framework is video encoding, where the raw video data is processed by a video encoder based on a pre-trained ResNet50V2 model, while the audio features are encoded with a pre-trained MobileNetV2. Meaningful features are extracted from each frame of the video and aggregated with the audio features to produce a compact representation of the entire video, referred to as the video feature map. This feature map, together with the caption encoding, is passed to the attention mechanism and then to the decoding module to generate captions.

4.7.2 Caption Encoding

The captions are processed by a caption encoder module. This module contains a long short-term memory (LSTM) network that processes the words of each caption to create embeddings. These embeddings are later used when generating captions for video data that has no captions.

4.7.3 Attention Mechanism

The attention mechanism plays a role in merging the visual and textual elements
during the caption generation process. It dynamically directs focus towards
different segments of the video feature map while generating each word of the
caption. By evaluating the similarity between the current state of the decoder and
the encoded video features, the attention mechanism ensures that the model
attends to the most important visual information, thereby enhancing the quality
and relevance of the generated captions.
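A minimal sketch of such an attention step in the style of additive (Bahdanau) attention; the exact formulation used in the report is not specified, so this is illustrative:

import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    # Scores each encoder position against the decoder state and returns a context vector.
    def __init__(self, units):
        super().__init__()
        self.W_enc = tf.keras.layers.Dense(units)
        self.W_dec = tf.keras.layers.Dense(units)
        self.v = tf.keras.layers.Dense(1)

    def call(self, encoder_outputs, decoder_state):
        # encoder_outputs: (batch, time, enc_dim); decoder_state: (batch, dec_dim)
        score = self.v(tf.nn.tanh(
            self.W_enc(encoder_outputs) + self.W_dec(decoder_state)[:, tf.newaxis, :]))
        weights = tf.nn.softmax(score, axis=1)                       # (batch, time, 1)
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)   # (batch, enc_dim)
        return context, weights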

4.7.4 Decoding and Caption Generation

The final stage of the architecture is decoding and caption generation. The decoder module contains an LSTM, an RNN-based architecture that uses the attended video features along with the previously generated words to predict the next word in the caption sequence. This process continues until the end-of-sequence token is generated. The generated captions provide a textual description of the content depicted in the input video, offering valuable insights.
Thus the video summaries produced by this summarization module achieve a BLEU-1 score of 69.8 and a ROUGE-L score of 28.95.
CHAPTER 5

EXPERIMENTAL INVESTIGATIONS AND


RESULTS

5.1 EXPERIMENTAL OUTPUTS

5.1.1 Frame Extraction from the Video Data

Feature extraction is the process of learning information from the underlying video data. Pre-trained 2D models such as EfficientNetB0 and VGG16 cannot take a video directly as input, so frames must first be extracted and then fed into these models. Fig. 5.1 shows the output of frame extraction from the video data; a sketch of the extraction step is given after the figure.

FIGURE 5.1: Frame Generation from Video Data
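A minimal sketch of extracting a fixed number of frames per video with OpenCV; the frame count of 20 follows Section 4.4.1, while the function and variable names are illustrative:

import cv2
import numpy as np

def extract_frames(video_path, n_frames=20, size=(224, 224)):
    # Sample n_frames evenly spaced frames from the video and resize them.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.array(frames)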


5.1.2 Feature Vector Results

The next step after frame generation is to extract features using the pre-trained models (EfficientNetB0, VGG16, and C3D). The extracted features are used as inputs to the ANN and machine learning models that form the memorability prediction model. Figures 5.2, 5.3, and 5.4 show the features extracted by each pre-trained model; these features are then given to the ANN and machine learning models.

FIGURE 5.2: Feature Extracted from EfficientNetB0

FIGURE 5.3: Feature Extracted from C3D

FIGURE 5.4: Feature Extracted from VGG16

FIGURE 5.5: Memorability Score for a Single Short Video Instance

FIGURE 5.6: Store High Memorable Clips in Folder

5.2 VIDEO CAPTIONING OUTPUT

The following figures show the input video, the corresponding high-memorability clips, and the generated captions.

FIGURE 5.7: Input Video

FIGURE 5.8: Folder Containing High Memorable Video Frames for a Long Video

FIGURE 5.9: Summary of the Video

5.3 EXPERIMENTS

5.3.1 Experiment 1 - VGG16, EfficientNetB0, C3D Features


with 10 percent of the data using ANN

An artificial neural network model is created, and the extracted VGG16 features (800 samples, 10 percent of the data) are fed into it. The network has 128 neurons in the first layer, 64 neurons in the second layer, and a final output layer for the memorability score. The Adam optimizer with a learning rate of 0.001 is used, and the model is monitored with mean absolute error and mean squared error. The features are supplied through a custom data generator with a batch size of 2. The model is evaluated on 100 training samples held out from the 10 percent used for training, giving a training MAE of 0.194, a training MSE of 0.391, a testing MAE of 0.332, and a testing MSE of 0.122. The Spearman correlation measured on the output is -0.234.
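A minimal sketch of the ANN used in this experiment; the layer sizes and optimizer settings follow the description above, while the input dimension is a placeholder:

import tensorflow as tf

def build_ann(input_dim):
    # Small regression ANN: 128 -> 64 -> 1 memorability score
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='mse',
                  metrics=['mae', 'mse'])
    return model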

5.3.2 Experiment 2 - VGG16, EfficientNetB0, C3D Features


with 50 percent of data using ANN

An artificial neural network model is created, and the extracted VGG16, C3D, and EfficientNetB0 features (4,000 samples, 50 percent of the data) are fed into it. The network has 128 neurons in the first layer, 64 in the second, 32 in the third, followed by layers of 16, 4, and 1 neurons, with the final layer producing the memorability score. The Adam optimizer with a learning rate of 0.001 is used, and the model is monitored with mean absolute error and mean squared error. The features are supplied through a custom data generator with a batch size of 32. The model is evaluated on 100 held-out training samples, giving a training MAE of 0.353, a training MSE of 0.519, a testing MAE of 0.277, and a testing MSE of 0.137. The Spearman correlation measured on the output is -0.027.

5.3.3 Experiment 3 - VGG16, EfficientNetB0, C3D Features


with all training data using ANN

An artificial neural network model is created, and the extracted VGG16, C3D, and EfficientNetB0 features for all the training data are fed into it. The architecture is the same as in Experiment 2: 128, 64, 32, 16, 4, and 1 neurons, with the final layer producing the memorability score. The Adam optimizer with a learning rate of 0.001 is used, with mean absolute error and mean squared error as metrics, and the features are supplied through a custom data generator with a batch size of 32. Evaluated on 100 held-out training samples, the model achieves a training MAE of 0.051, a training MSE of 0.0045, a testing MAE of 0.070, and a testing MSE of 0.008. The Spearman correlation measured on the output is 0.029.

5.3.4 Experiment 4 - VGG16, EfficientNetB0, C3D Features


with 50 percent of data using Support Vector Regressor
model

The VGG16, C3D, and EfficientNetB0 features extracted from the video data are given to a Support Vector Regression model trained on 50 percent of the extracted features. The model is trained with Mean Absolute Error (MAE) and Mean Squared Error (MSE) as metrics, achieving an MSE of 0.008 and an MAE of 0.083. The Spearman correlation measured on the output is 0.233.

5.3.5 Experiment 5 - VGG16, EfficientNetB0, C3D Features


with all training data using Support Vector Regressor
model

The VGG16, C3D, and EfficientNetB0 features extracted from the video data are given to an SVR model trained on all of the data. With MAE and MSE as metrics, the model achieves an MSE of 0.006 and an MAE of 0.063. The Spearman correlation measured on the output is 0.345.

5.3.6 Experiment 6 - VGG16, EfficientNetB0, C3D Features


with all training data using Decision Tree Model

The VGG16, EfficientNetB0, and C3D features extracted from the video data are given to a Decision Tree model trained on all of the data. With MAE and MSE as metrics, the model achieves an MSE of 0.010 and an MAE of 0.079. The Spearman correlation measured on the output is 0.168.

5.3.7 Experiment 7 - Ensemble Model with VGG16,


EfficientNetB0, C3D Features on all training data

The VGG16, EfficientNetB0, and C3D features extracted from the video data are given to the ensemble model, trained on all of the data. The model achieves an MSE of 0.006 and an MAE of 0.064. The Spearman correlation measured on the output is 0.312.

5.3.8 Experiment 8 - Ensemble Model with VGG16,


EfficientNetB0, C3D Features after XAI

The VGG16, EfficientNetB0, and C3D features extracted from the video data are reduced to the top 1000 features identified by XAI and given to the ensemble model, trained on all of the data. The model achieves an MSE of 0.0047 and an MAE of 0.061. The Spearman correlation measured on the output is 0.483.

5.4 RESULTS

The Table 5.1 represents the comparison of spearman correlations for the above
experiments to find the best suitable model for the Memorability Prediction model.

Experiment Spearman Correlation


Experiment 1 - ANN (10% of data) -0.234
Experiment 2 - ANN (50% of data) -0.027
Experiment 3 - ANN (100% of data) 0.029
Experiment 4 - SVR (50% of data) 0.233
Experiment 5 - SVR (100% of data) 0.345
Experiment 6 - DT (100% of data) 0.168
Experiment 7 - Ensembled Model (100 % of data) 0.312
Experiment 8 - Ensembled Model (With XAI) 0.483
TABLE 5.1: Analysis of Spearman Correlation among Experiments

The Table 5.2 represents the comparison of Mean Squared Error for the above
experiments to find the best suitable model for the Memorability Prediction model.

Experiment Mean Squared Error


Experiment 1 - ANN (10% of data) 0.122
Experiment 2 - ANN (50% of data) 0.519
Experiment 3 - ANN (100% of data) 0.008
Experiment 4 - SVR (50% of data) 0.008
Experiment 5 - SVR (100% of data) 0.006
Experiment 6 - DT (100% of data) 0.011
Experiment 7 - Ensembled Model (100 % of data) 0.006
Experiment 8 - Ensembled Model (With XAI) 0.0047
TABLE 5.2: Analysis of Mean Squared Error among Experiments

The Table 5.3 represents the comparison of Mean Absolute Error for the above
experiments to find the best suitable model for the Memorability Prediction model.

Experiment Mean Absolute Error


Experiment 1 - ANN (10% of data) 0.332
Experiment 2 - ANN (50% of data) 0.277
Experiment 3 - ANN (100% of data) 0.051
Experiment 4 - SVR (50% of data) 0.083
Experiment 5 - SVR (100% of data) 0.006
Experiment 6 - DT (100% of data) 0.079
Experiment 7 - Ensembled Model (100 % of data) 0.064
Experiment 8 - Ensembled Model (With XAI) 0.061
TABLE 5.3: Analysis of Mean Absolute Error among Experiments

The Table 5.4 represents the comparison of BLEU Scores of the existing video
captioning models with our model.

Models BLEU-1 Score


GVMF [12] 70.5
H-RNN [16] 61.6
Xu et al. [15] 61.7
m-RNN [13] 61.9
ResBRNN [3] 69.1
Our Model 69.8
TABLE 5.4: Comparison of BLEU Scores with the existing models

The Table 5.5 represents the comparison of ROUGE-L Scores of the existing
video captioning models with our model.

Models ROUGE-L Score


GVMF [12] 30.81
H-RNN [16] 22.10
Xu et al.[15] 23.65
m-RNN [13] 22.91
ResBRNN [3] 25.02
Our Model 28.95
TABLE 5.5: Comparison of ROUGE-L Scores with the existing models
CHAPTER 6

ISSUES PERTAINING TO SUSTAINABILITY

In the areas of video processing and memorability prediction explored here, it is vital to critically review sustainability considerations from several angles: society, safety, legal, and environment.

6.1 SOCIETY

The societal impact of our project concerns changes in perception and memory caused by video content. Ensuring that our summarization and memorability prediction models do not facilitate the spread of dangerous or misleading information is an important societal issue, and ethical considerations that encourage responsible content development and use must be emphasized. Furthermore, the societal implications of personalized content recommendations based on predicted memorability should be considered in order to avoid undesired effects on users' behavior.

6.2 SAFETY

Safety considerations involve both digital and psychological aspects. From a digital perspective, keeping user data secure and private is of utmost importance: the project handles sensitive information, including video content and user interaction activity, and therefore requires reliable data protection mechanisms. Psychological safety also matters, since the system often deals with memorable or emotionally impactful content; content warnings and responsible handling of the video summarization results address these concerns.

6.3 LEGAL

Legal impacts concern compliance with data protection laws and regulations. Following standards such as the GDPR (General Data Protection Regulation) ensures that user privacy is protected. Copyright is also paramount, especially for video content; legal risks can be reduced by obtaining proper licensing or ensuring that the content remains within fair use.

6.4 ENVIRONMENT

The environmental impact of our project concerns the computational resources needed to train and deploy deep learning models. We recognize the duty to make our algorithms as energy-efficient as possible, to seek out green hardware solutions, and to reduce the environmental footprint of our technological pursuits. Using sustainable practices in technology development is consistent with our environmental responsibilities.

This chapter has examined the different aspects of sustainability in our project. The aim is not only to innovate technologically but also to do so in a responsible, balanced, and eco-friendly way, taking into account societal aspects, safety problems, legal issues, and environmental impact.
CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 CONCLUSION

The video memorability prediction module was successfully developed and implemented in this phase of the research. Training was performed on the Memento10K dataset, which contains 10,000 short videos of various subjects, each six seconds long, together with scores indicating their short-term and long-term memorability. The memorability module uses deep features from the EfficientNetB0, VGG16, and C3D pre-trained models, which capture the spatial and temporal information needed to determine whether a clip will be remembered.

Multiple pre-trained models were employed for feature extraction to demonstrate their effectiveness in pulling intricate features out of video frames. In this research phase, diverse architectures, namely EfficientNetB0, VGG16, and C3D, were used. EfficientNetB0 is a lightweight convolutional neural network built from inverted bottleneck blocks with a linear classifier head; for an input image of size (224, 224), its feature vector has a dimensionality determined by the number of filters in its final convolutional layer, which after global pooling yields a 1280-dimensional output. VGG16 and C3D, both deep convolutional neural networks, were likewise employed, each contributing to the extraction of the spatio-temporal information that is critical for predicting video memorability.

The use of Explainable AI played a vital role in improving the performance of the memorability prediction model: using XAI, the top 1000 features were identified and used to further optimize the model. A video captioning model based on an encoder-decoder LSTM technique was then created and used to generate captions for the memorable frames of the input video.

7.2 FUTURE WORK

Potential future advancements of this approach to summarizing videos include: incorporating textual cues for memorability prediction in a multi-modal model, training models that adapt to user preferences and history, and performing on-the-fly summarization of live videos. Interactive features such as letting the user edit a summary, cross-lingual caption generation, emotion detection for marking important events, and attention models informed by eye-tracking data could further improve the system. Throughout, ethical questions will remain central, including transparency, minimization of unconscious bias, and protection of personal data.
REFERENCES

1. Ali, H., Gilani, S.O., Khan, M.J., Waris, A., Khattak, M.K. and Jamil, M.,
2022, May. Predicting Episodic Video Memorability Using Deep Features
Fusion Strategy. In 2022 IEEE/ACIS 20th International Conference on
Software Engineering Research, Management and Applications (SERA) (pp.
39-46).

2. Cohendet, R., Demarty, C.H., Duong, N.Q. and Engilberge, M., 2019.
VideoMem: Constructing, analyzing, predicting short-term and long-
term video memorability. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 2531-2540).

3. Cummins, S., Sweeney, L. and Smeaton, A., 2022. Analysing the Memorability of a Procedural Crime-Drama TV Series, CSI. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 174-180).

4. Dumont, T., Hevia, J.S. and Fosco, C.L., 2023. Modular Memorability: Tiered
Representations for Video Memorability Prediction. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp.
10751-10760).

5. Guinaudeau, C. and Xalabarder, A.G., 2023, January. Textual Analysis


for Video Memorability Prediction. In Working Notes Proceedings of the
MediaEval 2022 Workshop.

6. Leyva, R. and Sanchez, V., 2021, September. Video memorability prediction


via late fusion of deep multi-modal features. In 2021 IEEE international
conference on image processing (ICIP) (pp. 2488-2492).


7. Li, J., Guo, X., Yue, F., Xue, F. and Sun, J., 2022. Adaptive Multi-Modal
Ensemble Network for Video Memorability Prediction. Applied Sciences,
12(17), p.8599.

8. Li, J., Wong, Y., Zhao, Q. and Kankanhalli, M.S., 2019. Video storytelling:
Textual summaries for events. IEEE Transactions on Multimedia, 22(2),
pp.554-565.

9. Lin, J., Zhong, S.H. and Fares, A., 2022. Deep hierarchical LSTM networks
with attention for video summarization. Computers Electrical Engineering,
97, p.107618.

10. Liu, T., Meng, Q., Huang, J.J., Vlontzos, A., Rueckert, D. and Kainz, B.,
2022. Video summarization through reinforcement learning with a 3D spatio-
temporal u-net. IEEE Transactions on Image Processing, 31, pp.1573-1586.

11. Lu, W., Zhai, Y., Han, J., Jing, P., Liu, Y. and Su, Y., 2023. VMemNet: A
Deep Collaborative Spatial-Temporal Network With Attention Representation
for Video Memorability Prediction. IEEE Transactions on Multimedia.

12. Lu, Y. and Wu, X., 2022. Video storytelling based on gated video
memorability filtering. Electronics Letters, 58(15), pp.576-578.

13. Mao, J., Xu, W., Yang, Y., Wang, J. and Yuille, A.L., 2015. Deep captioning
with multimodal recurrent neural networks (m-rnn). ICLR. arXiv preprint
arXiv:1412.6632

14. Sweeney, L., Smeaton, A. and Healy, G., 2023, September. Memories in the
Making: Predicting Video Memorability with Encoding Phase EEG. In 20th
International Conference on Content-based Multimedia Indexing (pp. 183-
187).

15. Xu, R., Xiong, C., Chen, W. and Corso, J., 2015, February. Jointly modeling
deep video and compositional text to bridge vision and language in a unified
framework. In Proceedings of the AAAI conference on artificial intelligence
(Vol. 29, No. 1).

16. Yu, L., Bansal, M. and Berg, T.L., 2017. Hierarchically-attentive rnn for album
summarization and storytelling. arXiv preprint arXiv:1708.02977.
