Deepak Report Phase1
A PROJECT REPORT
Submitted By
BALAJI S. 3122226001002
of
MASTER OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
June 2024
Sri Sivasubramaniya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University)
BONAFIDE CERTIFICATE
Place:
Date:
I thank GOD, the Almighty, for giving me the strength and knowledge to do this project.
I would like to express my deep sense of gratitude to my guide Dr. A. BEULAH,
Assistant Professor, Department of Computer Science and Engineering, for her
valuable advice and suggestions, as well as her continued guidance, patience and
support, which helped me shape and refine my work.
I express my deep respect to the founder, Dr. SHIV NADAR, Chairman, SSN
Institutions. I also express my appreciation to our Principal, Dr. V. E. ANNAMALAI,
for all the help he has rendered during this course of study.
I would like to extend my sincere thanks to all the teaching and non-teaching
staff of our department who have contributed directly and indirectly during the
course of my project work. Finally, I would like to thank my parents and friends
for their patience, cooperation and moral support throughout my life.
DEEPAK SRIRAM S.
ABSTRACT
issue. This system forecasts memory scores for video segments to craft more
model. The video summarization model, which incorporates both audio and video
inputs, splits videos into 6-second intervals, crafting captions solely for segments
offering a new vantage point on video content analysis and user engagement.
TABLE OF CONTENTS
ABSTRACT iii
LIST OF FIGURES ix
1 INTRODUCTION 1
1.1 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . 2
1.4 OBJECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 ORGANISATION OF THE REPORT . . . . . . . . . . . . . . . 4
2 LITERATURE SURVEY 6
3 PROPOSED SYSTEM 24
3.1 SYSTEM ARCHITECTURE . . . . . . . . . . . . . . . . . . . . 24
3.2 VIDEO MEMORABILITY MODEL . . . . . . . . . . . . . . . . 26
3.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Ensembled Model - Support Vector Regressor, Decision
Tree, Ridge Regressor . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Dual Memorability Prediction . . . . . . . . . . . . . . . 30
3.2.4 Explainable AI (XAI) . . . . . . . . . . . . . . . . . . . . 30
3.3 VIDEO CAPTIONING MODEL . . . . . . . . . . . . . . . . . . 32
3.3.1 Input Frames . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Flattening and Dense Layers . . . . . . . . . . . . . . . . 36
3.3.4 Concatenation of Features . . . . . . . . . . . . . . . . . 37
3.3.5 LSTM Encoders . . . . . . . . . . . . . . . . . . . . . . 37
3.3.6 Decoder Initialization . . . . . . . . . . . . . . . . . . . . 37
3.3.7 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.8 Caption Generation . . . . . . . . . . . . . . . . . . . . . 38
3.3.9 Training and Evaluation . . . . . . . . . . . . . . . . . . 38
4 IMPLEMENTATION 41
4.1 DATASET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 HARDWARE SPECIFICATIONS . . . . . . . . . . . . . . . . . 42
4.3 SOFTWARE SPECIFICATIONS . . . . . . . . . . . . . . . . . . 42
4.4 FEATURE EXTRACTION . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 VGG16 Feature Extraction . . . . . . . . . . . . . . . . . 43
4.4.2 C3D Feature Extraction . . . . . . . . . . . . . . . . . . 44
4.4.3 EfficientNetB0 Feature Extraction . . . . . . . . . . . . . 44
4.5 ENSEMBLED (SVR + DT + RIDGE) MODEL . . . . . . . . . . 45
4.5.1 Feature Extraction and Ensemble Model Integration . . . 45
4.5.2 Memorability Prediction Process . . . . . . . . . . . . . . 45
4.6 EXPLAINABLE AI . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6.1 Local Interpretable Model-Agnostic Explanations . . . . . 47
4.6.2 SHapley Additive exPlanations . . . . . . . . . . . . . . . 48
4.7 VIDEO CAPTIONING MODEL . . . . . . . . . . . . . . . . . . 49
4.7.1 Video Encoding and Audio Encoding . . . . . . . . . . . 49
4.7.2 Caption Encoding . . . . . . . . . . . . . . . . . . . . . 50
4.7.3 Attention Mechanism . . . . . . . . . . . . . . . . . . . . 50
4.7.4 Decoding and Caption Generation . . . . . . . . . . . . . 50
5.3.8 Experiment 8 - Ensemble Model with VGG16,
EfficientNetB0, C3D Features after XAI . . . . . . . . . . 59
5.4 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
1.2 MOTIVATION
significant gaps. These gaps primarily arise due to the absence of a unified
framework that seamlessly incorporates memory scores into the video captioning
process. Addressing these gaps is essential to advance the state-of-the-art in video
captioning and enhance the overall user experience. Given a long video, the goal
is to identify the segments that are memorable and to convert those segments into
textual descriptions, resulting in captions for the video.
1.4 OBJECTIVES
• To generate captions for each segment of a video that has a high
memorability score.
• Chapter 3, Proposed System, discusses the proposed system for video
summarisation using memorability scores. A detailed explanation of the
video memorability and summarisation modules, along with the
methodology, is provided.
• Chapter 7, Conclusions and Future Work, presents the key findings and
contributions of the research work, as well as the future enhancements that
can be made to further improve the effectiveness of the model.
CHAPTER 2
LITERATURE SURVEY
The Gated Video Memorability Filtering paper [12] takes a close look at methods
used to summarize videos. Specifically, it focuses on a framework developed by
Lu and Wu. They use a Convolutional Neural Network (CNN) to extract visual
and temporal features from video segments, and a Recurrent Neural Network
(RNN) with a gating mechanism to select the most memorable parts. This
approach aims to create engaging video summaries by effectively utilizing deep
learning models. While Lu and Wu’s methodology has its strengths in prioritizing
memorability and leveraging deep learning, it also has some limitations. Firstly,
the evaluation only considers a limited dataset of cooking videos, which raises
questions about how well the framework would work for other types of videos.
Additionally, the use of black-box deep learning models in their approach brings
up concerns about interpretability. It would be helpful to have more insights into
the internal decision-making processes of the models. Furthermore, the
discussion doesn’t address the potential computational cost of training and
running these deep learning models, which may limit the practicality of
implementing this framework in resource-constrained environments. To advance
the effectiveness and practicality of video summarization methodologies, it’s
essential to address these limitations. Further research should explore the
generalizability across different video genres and provide more insights into the
decision-making processes of the model. It’s also important to consider the
computational resources required for implementing the framework in real-world
situations with limited resources. By addressing these considerations, we can
M3-S framework, such as video summarization and editing. Overall, Dumont et al.'s
M3-S framework, despite its challenges, makes a contribution to the field by
offering a more nuanced understanding of video memorability.
subshot selection. Testing the framework across a wider range of video lengths
and genres is recommended to assess its generalizability and robustness. Lastly,
techniques to mitigate biases in attention mechanisms are pivotal for ensuring fair
and inclusive video summarization. In conclusion, Lin et al.’s video
summarization framework shows promising results. However, addressing the
identified limitations is vital for further development and responsible application
of the model. By refining the framework to overcome computational constraints,
enhancing interpretability, and ensuring generalizability and fairness, there is
significant potential for advancement in intelligent video analysis and
summarization.
viewers. One of the key features of VMemNet is its joint optimization of the
CNN and RNN. By ensuring that these two subnetworks work together
seamlessly, the model is able to produce highly accurate predictions about video
memorability. It outperforms existing methods in the field and achieves
state-of-the-art performance on benchmark datasets. VMemNet’s ability to
integrate spatial and temporal information, coupled with its attention mechanism
for identifying important video segments, sets it apart from other models.
However, there are also some challenges that VMemNet faces. One of these
challenges is its black-box nature, particularly the interpretability of the attention
mechanism. The authors suggest that future research should focus on developing
more transparent models to address this issue. Another challenge is the need to
evaluate VMemNet’s generalizability across different video genres and lengths.
While the model shows promising results on existing datasets, more evaluation is
needed to determine its applicability in various contexts. Additionally, the
computational complexity associated with training and deploying VMemNet may
limit its real-world usability. The authors suggest exploring lighter-weight
architectures or more efficient training methods to address this challenge. Lastly,
there is great potential for personalization with VMemNet, as it can be adapted
based on individual viewer preferences. This intriguing avenue for future research
could provide a deeper understanding of what truly makes videos memorable to
different individuals.
Hasnain Ali et al. propose a novel model to predict episodic video memorability
[1], which considers viewers' memory of specific events within a given video.
This research offers a multifaceted understanding of how people engage with and
remember media content; it can be applied to video editing, advertisement
placement and the design of educational materials. Strengths of the suggested
framework include its focus on episodic memory, deep feature fusion, and
fuzzy-based FastText text analysis. The focus on episodic memory distinguishes
the framework from other methods, providing a more detailed understanding of
what viewers recall within a video. The deep feature fusion strategy harnesses
several models to extract features, including color histograms, Faster R-CNN
object detections, and Principal Component Analysis for temporal dynamics.
Furthermore, the incorporation of fuzzy-based FastText text analysis of video
annotations brings in semantic information, which enriches the feature space and
helps capture relationships between texts. Experiments on the MediaEval 2018
dataset show improved results, reflected in higher Spearman's rank correlation
coefficients for short-term and long-term memorability of episodic events.
Computational complexity and limited interpretability remain challenges for the
proposed framework; the intricacies of the feature fusion and episodic memory
identification process should be made more transparent. Addressing these
limitations is a crucial step towards refining the model and unlocking its
potential applications in personalized video experiences, optimizing video
creation workflows, and improving educational materials designed for effective
knowledge retention. On the whole, the work conducted by Ali et al. adds
considerable knowledge to our
Sweeney et al., in "Memories in the Making: Predicting Video Memorability with
Encoding Phase EEG" [14], present a methodology that records neural activity
while participants first watch videos, providing critical information about the
neurocognitive processes that guide memory formation. The strong points of this
study include its innovative approach, the procedure used to generate the
EEGMem dataset, the deep learning framework used to tackle the problem, and
the predictive performance achieved. A notable strength of this research is its
demonstration that EEG signals obtained during the encoding phase can predict
video memorability, opening a path for EEG-based approaches to understanding
the interplay between human cognition and video content. Methodologically, the
authors contribute the EEGMem dataset, compiled from EEG recordings of
subjects watching video excerpts, which constitutes a rich database of brain
activity for studying how such activity relates to memorability. The study uses a
deep learning architecture to map EEG signals into the visual domain, extracting
features relevant for memorability prediction. This approach highlights the
successful incorporation of artificial intelligence into cognitive neuroscience and
is not only promising for video memorability research but also suggests wider
applications, such as brain-computer interfaces and neural feedback systems.
Despite these strengths, the study acknowledges limitations: the EEGMem dataset
is relatively small, and the authors point out a pressing need to expand
The limitations that apply to the "VideoMem" paper and to the current state of
research on video memorability can be described as follows. The analysis
indicates that current deep learning models for memorability prediction depend
heavily on visual features and may overlook the effect of auditory and textual
features. The VideoMem dataset addresses some of the limitations of the VGG
dataset by being large, yet there is a concern that it caters to a particular style or
genre of videos, which limits how well the results transfer to real-world videos.
Moreover, the deep learning models used are highly complex and
non-transparent, which restricts interpretability, so researchers gain little
understanding of which factors the proposed model relies on to judge the
memorability of a video. Based on this analysis, several central issues emerge for
future work. The integration of multimodal features covering the visual, auditory
and textual modalities has the potential to strengthen memorability models and
improve their predictions. Taking additional factors into account, such as a
viewer's watching history and cultural background, could lead to more reliable
outputs. Finally, techniques for explainable AI would help in understanding the
structure and components of a memorability model and in identifying what
exactly makes a video memorable to its audience. In light of these issues, it is
imperative that the field of video memorability research continues to progress
and adapt, serving as a platform for further research and for practical applications
in content production and design, advertisement, and other fields that concern
user interest and interaction.
Image captioning is a task that has recently attracted interest in computer vision
and natural language processing; it entails automatically providing a textual
description for an image without human intervention. Earlier methodologies
were feature-based, with static representations of knowledge, and often required
different models for different modalities, which restrained performance. Mao et
al. (2015) [13] introduced a powerful architecture for image captioning known as
the multimodal recurrent neural network (m-RNN). This framework seamlessly
integrates visual and textual information and comprises three key components: a
visual encoder containing a convolutional neural network (CNN) to encode the
image, a textual encoder with a recurrent neural network (RNN) to encode
caption-related features, and a fusion layer containing the caption-generating
RNN. The network is trained end-to-end, from the input image to the caption it
generates, by optimizing the loss function on large-scale datasets. In the
quantitative analysis of the m-RNN model, BLEU, METEOR, and CIDEr scores
provide evidence of the improved effectiveness of the proposed model compared
to earlier approaches on the benchmark datasets Flickr30k and MSCOCO. Thus,
although the m-RNN
the compositional text that also enables the integration of video and language
information more seamlessly. Their approach employs deep learning to extract
high-level features from both the videos and the text, using compositional text
representations. When trained under this hybrid framework on the two
modalities, the model is able to capture the relationship between visual and
textual data, supporting tasks such as video captioning and description. As shown
in Xu et al. [15], the proposed approach successfully learns the associations
between vision and language in experiments and provides better or comparable
results than prior art. However, there are a few potential disadvantages of their
approach that have to be considered carefully: ambiguities may arise when
training deep learning models on large volumes of multimodal data, especially
when dealing with many different types of visual and textual patterns;
large-scale annotated datasets are essential to train a powerful deep learning
model, and such datasets may not always be available; and it is also critical to
understand the learned representations qualitatively and to scrutinize the outputs
generated by the model.
Cummins et al. [3] have made a similar contribution to this area by examining the
procedural crime-drama series CSI. They analyze features such as the visual
presentation of content, the kinds of story arcs used, the development of the
characters, and emotional interest, as components that help determine the factors
aiding memory retention for the series. By applying multimedia analysis
methodology together with concepts from the psychology of memory, the authors
introduce readers to the possibilities offered by combining elements of a
well-known TV series. In this study, Cummins et al. [3] endeavor to dissect
various aspects of viewer memory by employing psychological methods applied
to real data. Within the empirical study outlined by Cummins et al. [3], and based
on the analysis of real data, the authors explore the processes of formation and
storage of viewer memory for episodic television content. Possible drawbacks of
the proposed approach include the following: findings from one TV series may
be difficult to transfer to other series or to a different genre; it may be difficult to
measure emotions repeatedly during a program without their value degrading;
and large-sample, low-investment user studies will likely be needed to validate
the proposed paradigms. Finally, if there is bias in how the analysis methods are
chosen or the data samples are selected, the broad relevance of the results will be
affected.
CHAPTER 3
PROPOSED SYSTEM
The video summarization framework proposed in this work combines several
techniques to recognize the important scenes in a video and to provide captions
for those segments. First, the Video Memorability Prediction Module, trained on
the Memento10k dataset, employs spatial and temporal features extracted by
EfficientNet, VGG16 and C3D to estimate the memorability score of each video
segment. After the model is trained, XAI approaches are used to analyze its
decision-making process, locating the features that drive memorability prediction
and guiding model tuning to enhance performance. The predicted scores are then
passed through a thresholding step in which only the segments whose scores
exceed a predetermined value are selected; the threshold itself is fixed based on
the experimentation with the textual captioning output. The Video Captioning
Module then analyzes these memorable segments and produces textual
descriptions for them. Finally, the Textual Captioning step combines all of these
captions into one coherent textual summary of the most notable elements of the
video. The architecture presented in this work brings together machine learning
models, deep features, XAI and threshold-based decision-making to perform
memorability-based video summarization, and in combination they present a
viable solution in concept.
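To make the thresholding step concrete, the following minimal Python sketch keeps only the segments whose predicted memorability exceeds the threshold before they are passed to the captioning module; the segment names, scores and threshold value are illustrative assumptions, not values taken from the report.

# Illustrative thresholding of predicted memorability scores (values assumed).
THRESHOLD = 0.8

def select_memorable_segments(segments, scores, threshold=THRESHOLD):
    """Keep only the video segments whose predicted memorability exceeds the threshold."""
    return [seg for seg, score in zip(segments, scores) if score > threshold]

segments = ['seg_00-06s', 'seg_06-12s', 'seg_12-18s']
scores = [0.91, 0.42, 0.85]
print(select_memorable_segments(segments, scores))   # -> ['seg_00-06s', 'seg_12-18s']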
VGG16, a classic CNN architecture (Fig. 3.3), is used to capture detailed spatial
features. Using the pre-trained weights of VGG16 enables us to acquire mid-level
visual features that supplement our hierarchical feature set. VGG16 processes the
video frames, and the activations from significant layers are taken as features.
VGG16 extracts multi-level features that give subtle meaning to the spatial
information, contributing to the overall spatio-temporal understanding.
The EfficientNetB0 model, known for its efficient balance between model size
and performance, is used to capture high-level visual features from video
segments. By utilizing the pre-trained weights of EfficientNetB0, it is possible to
acquire semantically rich features that are instrumental in achieving an adequate
comprehension of the visual content.
The video frames are passed to EfficientNetB0, and the features are extracted from
intermediate layer outputs. These features capture hierarchical representations that
enable the model to distinguish both global and local visual patterns within each
frame.
In the architecture of Fig. 3.4, the input frames extracted from the video data are
fed into the EfficientNetB0 model. The features are extracted from the layer before
the top layer, which produces a feature representation of shape (1, 1, 1280).
These per-frame features are then combined to obtain the feature for the entire
video, which is included in the feature vector used for model training.
C3D is built for spatio-temporal feature extraction, thus suitable to be used as a part
of our model. By adding not only spatial features but also motion dynamics to the
model, C3D improves its ability to capture temporal patterns in video segments.
The deep learning features are extracted and used as inputs to an ensemble model
that operates in regression mode. An ensemble model combining the
SHAP values are used to quantify each feature's contribution to the model's output
and to rank features by importance. By calculating SHAP values for all instances,
the overall contribution of each feature to the model's predictions was obtained.
A summary of the feature importances from the SHAP results is presented in the
SHAP summary plot, created with the summary plot function, which gives an
overall overview of feature importances. This made it possible to determine
which variables significantly shaped the results predicted by the model.
Additionally, ICE plots were examined alongside the feature importances, since
these plots show the relationship between a feature and the SHAP values for
many individual instances. These plots offered an important understanding of
how the model's output changes with a given feature or change in feature values.
By combining LIME and SHAP, the interpretability and reliability of the
regression model were improved through richer and more detailed information.
This allowed the decision-making of the model to be examined further and the
important features influencing its predictions to be ascertained, which increases
trust in the results of this study and supports sound decisions based on the model
outputs. Therefore, guided by the interpretations from the Explainable AI
frameworks, a new model is trained using only the 1000 most important features
that contribute to the prediction, disregarding the remaining features that do not
impact the model. This actually helps to improve the model further in terms of
Spearman correlation.
The video captioning model automatically generates captions for videos so that
meaningful insights can be obtained without the need to watch the video. The
architectural design follows an Encoder-Decoder paradigm,
The model processes both visual and auditory frames extracted from the video:
3.3.2.1 ResNet50v2
thus making the training of deep networks more efficient. This version applies
batch normalization and ReLU activation before the convolution layer
(pre-activation), which stabilizes the gradients, enhances their flow and
consequently increases the overall performance. The ResNet50V2 network has 50
layers, comprising convolutional layers and identity blocks. Today it is applied
mostly to image classification and other computer vision tasks because of its
relatively high robustness and accuracy.
3.3.2.2 MobileNetV2
• Visual Features: The ResNet50V2 feature map is flattened and then fed
through a dense layer to obtain a 2048-dimensional feature vector.
• Audio Features: In the same way, the feature map extracted from
MobileNetV2 is flattened, producing a vector of 1280 features.
The visual and audio features are concatenated to form the complete feature
vector, giving a combined vector of size 3328.
• Visual LSTM Encoder: The visual features are given to an LSTM encoder
with 512 hidden units, which produces a sequence of hidden states along
with the final states of the encoder.
The final states of both the LSTM encoders are combined to initialize the decoder
state so that both visual and audio context can be used.
3.3.7 Decoder
• Decoder LSTMCell: The LSTM cell in the decoder generates the next word
of the sequence, taking as its input the actual previous word during the
training phase or the predicted previous word during the testing phase.
• Fully Connected Layer: Transforms the LSTM output to the word space,
generating a word probability distribution, or logits, over the vocabulary
space.
A custom dataset class is developed to load video frames and their corresponding
captions from the dataset and to preprocess them. The class is also responsible for
correctly feeding the data to the model in batches.
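The dataset class itself is not reproduced here; a minimal sketch of such a class, assuming a Keras Sequence that yields batches of preloaded frame arrays and tokenized captions (file paths and preprocessing are placeholders, not the report's implementation), is shown below.

import numpy as np
import tensorflow as tf

class VideoCaptionDataset(tf.keras.utils.Sequence):
    """Hypothetical batching class: loads video frames and tokenized captions in increments."""

    def __init__(self, frame_paths, captions, batch_size=32):
        self.frame_paths = frame_paths   # per-video frame arrays saved on disk (assumed .npy files)
        self.captions = captions         # pre-tokenized caption id sequences, padded to equal length
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.frame_paths) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        frames = np.stack([np.load(p) for p in self.frame_paths[sl]])
        caps = np.stack(self.captions[sl])
        # Teacher forcing: shifted captions are the decoder inputs, next words are the targets.
        return (frames, caps[:, :-1]), caps[:, 1:]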
The loss function combines traditional cross-entropy loss with an optional BLEU
score-based loss to balance word-level accuracy and overall caption quality:
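The formula itself is not reproduced in this excerpt. Since BLEU is not differentiable, one plausible sketch (an assumption, not the report's definition) computes sentence-level BLEU with NLTK on the greedily decoded tokens and uses it only to rescale the cross-entropy term:

import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def combined_loss(target_ids, logits, bleu_weight=0.5):
    """Cross-entropy scaled by an optional BLEU-based term (illustrative formulation only)."""
    ce_loss = cross_entropy(target_ids, logits)
    # BLEU is computed on decoded token ids (eager execution assumed) and is not differentiated;
    # it only rescales the loss so that low-BLEU captions are penalized slightly more.
    pred_ids = tf.argmax(logits, axis=-1).numpy()
    reference = [int(t) for t in target_ids.numpy()[0]]
    hypothesis = [int(t) for t in pred_ids[0]]
    bleu = sentence_bleu([reference], hypothesis,
                         smoothing_function=SmoothingFunction().method1)
    return ce_loss * (1.0 + bleu_weight * (1.0 - bleu))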
For each epoch of the training loop, the program processes batches of input,
which consist of video frames and their captions. For each batch, the train step
function updates the model weights using the calculated loss. The Adam
optimizer is adopted because it can adjust the learning rate during training.
Thus a new video captioning model is created that fully utilizes the combined
visual and auditory information to produce relevant and semantically linked
descriptions for videos. The high-level CNN architectures in the feature
extraction phase and LSTM networks for the sequence modeling of the videos
allow the model to identify key characteristics and capture the dynamics involved
in the videos. This approach increases the quality of captions as it considers
different areas and targets them, which is essential for video indexing,
accessibility, and recommendation applications.
CHAPTER 4
IMPLEMENTATION
4.1 DATASET
The dataset used for training the video memorability prediction model is the
Memento10k dataset. It consists of 10,000 short videos, each 6 seconds long,
together with their corresponding short-term and long-term memorability scores.
The dataset was collected by asking participants to watch 180 videos for short-
term memorability and 120 videos for long-term memorability. The participants'
task was to press the space bar whenever they found a video similar to one they
had watched previously. From these responses, the short-term and long-term
memorability scores are calculated.
The dataset used for captioning the videos is the same Memento10k dataset, which
contains captions for each 6-second video. These captions are stored in JSON
format inside a txt file. The dataset used to check the correctness of the video
captioning model is the video storytelling dataset, which contains longer videos
along with their caption texts.
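For reference, a minimal sketch of reading such caption annotations, assuming a JSON object that maps video filenames to caption strings (the record shown and the file name are illustrative assumptions, not the actual Memento10k format), is given below.

import json

# The captions are described as JSON stored in a .txt file; a record might look like this:
example_record = '{"video_001.mp4": ["a dog runs across a grassy field"]}'   # illustrative only
captions = json.loads(example_record)

# In practice the whole annotation file would be loaded instead (file name assumed):
# with open('memento10k_captions.txt') as f:
#     captions = json.load(f)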
The video storytelling dataset contains several categories of videos: birthday,
camping, Christmas and vacation. The birthday and camping subsets are used to
test the outputs of the video captioning model, since the Memento10k dataset
contains only short 6-second videos, whereas the video storytelling dataset
contains longer videos with durations ranging from 10 to 30 minutes.
video data. Matplotlib.pyplot is used for visualising the frame images and video
data. TensorFlow is used to extract features from the video data and to train the
ANN models. Scikit-learn is used to create the machine learning models for video
memorability prediction.
Feature extraction is a step that involves transforming raw data into a set of relevant
and informative features. These features capture essential characteristics of the
data, enabling more effective representation and analysis.
The VGG16 model is loaded with weights pre-trained on ImageNet. Twenty frames
per video are extracted from the video data, and each frame is given to the VGG16
model. The output of the VGG16 model's final convolutional block, after the final
max-pooling stage, is taken as a feature map and passed to a GlobalAveragePooling2D
layer, which converts the learnt patterns into a feature vector.
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import GlobalAveragePooling2D

# Load VGG16 pre-trained on ImageNet, without the classification head, and freeze it.
base_model = VGG16(weights='imagenet', include_top=False)
base_model.trainable = False

# Extract the convolutional feature map for one frame and pool it into a feature vector.
features = base_model.predict(tf.expand_dims(frames[0], axis=0))
vgg_feature_vector = GlobalAveragePooling2D()(features)
A sequence of video segments is fed into the C3D model, and its features are
extracted using convolutional layers. Spatial and temporal information is captured
by these features, as they represent the dynamic content of a video more fully.
# Note: C3D is not part of tensorflow.keras.applications; a Keras-compatible C3D
# implementation (custom or third-party, with pre-trained weights) is assumed to be available as C3D.
import tensorflow as tf
from tensorflow.keras.layers import GlobalAveragePooling3D

base_model = C3D(include_top=False)      # assumed C3D backbone
base_model.trainable = False

# A clip of frames is passed through C3D; 3D pooling is used assuming the output keeps a temporal axis.
features = base_model.predict(tf.expand_dims(frames[0], axis=0))
c3d_feature_vector = GlobalAveragePooling3D()(features)
The video segments are given to the EfficientNetB0 model, which extracts features
based on hierarchical representations that enable the model to distinguish both
global and local visual patterns within each frame. Image frames of size (224, 224)
are fed into the EfficientNetB0 model, which gives an output feature of shape
(1, 1, 1280).
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D

# Load EfficientNetB0 pre-trained on ImageNet, without the classification head.
model = EfficientNetB0(weights='imagenet',
                       include_top=False,
                       input_shape=(224, 224, 3))

# Extract and pool the feature map for one of the 20 sampled frames.
features = model.predict(tf.expand_dims(frames_20[0], axis=0))
feature_vector = GlobalAveragePooling2D()(features)
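The report states that the per-frame features are combined into a single feature for the whole video (Section 3.2.1), but the exact aggregation is not shown; the sketch below assumes simple mean pooling over the sampled frames.

import numpy as np

def video_level_feature(frame_features):
    """Aggregate per-frame feature vectors into one video-level vector (assumed: mean pooling)."""
    return np.mean(np.stack([np.squeeze(np.asarray(f)) for f in frame_features]), axis=0)

# Example with dummy per-frame vectors shaped like the (1, 1280) pooled EfficientNetB0 output.
dummy_frames = [np.random.rand(1, 1280) for _ in range(20)]
print(video_level_feature(dummy_frames).shape)   # -> (1280,)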
To feed the ensemble model, features are initially extracted from the video
segments using three pre-trained models: EfficientNetB0, VGG16 and C3D.
EfficientNetB0 contributes hierarchical features, VGG16 provides high-level
visual representations and C3D extracts spatio-temporal features. These features
are fed as input to the ensemble model, which takes a concatenated feature vector
that encapsulates all of the visual and temporal characteristics.
The ensemble is trained on the Memento10k dataset of 10,000 very short videos,
each annotated with both short-term and long-term memorability scores. The
ensemble model takes the concatenated feature vectors from the three pre-trained
models and unifies them into a single input. This training process consists of
fine-tuning the SVR to develop an optimal hyperplane that minimizes prediction
errors, selecting the optimal number of branches in the decision tree, and choosing
the most
Once the ensemble model is trained, the feature vectors of unseen videos are given
to the model, and it is tested to check how efficiently it predicts the memorability
of the videos. During training, Mean Absolute Error (MAE) and Mean Squared
Error (MSE) are used as the metrics to optimize the hyperplane in the SVR.
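The report does not show how the three regressors are combined; one plausible sketch, assuming an unweighted averaging ensemble built with scikit-learn's VotingRegressor and placeholder data in place of the real concatenated feature vectors (all hyperparameters are assumptions), is:

import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the concatenated VGG16 + C3D + EfficientNetB0 feature
# vectors and their memorability scores (dimensions assumed for illustration).
X, y = np.random.rand(1000, 3328), np.random.rand(1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ensemble = VotingRegressor([
    ('svr', SVR(kernel='rbf', C=1.0)),              # hyperparameters assumed, not from the report
    ('dt', DecisionTreeRegressor(max_depth=10)),
    ('ridge', Ridge(alpha=1.0)),
])
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))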
4.6 EXPLAINABLE AI
From Fig. 4.1, the top 1000 features that contribute to the model's performance
can be identified.
# LIME: take the 1000 highest-weighted features from the explanation.
lime_weights = lime_exp.as_list()
important_features_lime = [feature[0] for feature in lime_weights[:1000]]

# Union of the feature sets selected by permutation importance, SHAP and LIME.
important_features = list(set(important_features_perm + important_features_shap + important_features_lime))
The above code selects the top 1000 features from the full feature set by combining
the feature rankings obtained from permutation importance, SHAP and LIME.
SHAP values are used to quantify each feature's contribution to the model's output
and to rank features by importance. By calculating SHAP values for all instances,
the overall contribution of each feature to the model's predictions was obtained.
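The SHAP computation itself is not listed in this excerpt; a minimal sketch with the shap library's KernelExplainer, reusing the ensemble and feature matrices from the Section 4.5 sketch (sample sizes are arbitrary), could look like the following, yielding the kind of important_features_shap list referenced in the Section 4.6.1 code.

import numpy as np
import shap

# Explain the trained ensemble using a background sample of the training features.
explainer = shap.KernelExplainer(ensemble.predict, shap.sample(X_train, 50))
shap_values = explainer.shap_values(X_test[:100])

# Rank features by mean absolute SHAP value; the top-ranked indices play the role of
# the important_features_shap list used during feature selection.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
important_features_shap = list(np.argsort(mean_abs_shap)[::-1][:1000])

shap.summary_plot(shap_values, X_test[:100])    # overall view of feature importances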
By combining LIME and SHAP, the interpretability and reliability of the
regression model were improved through richer and more detailed information.
This allowed the decision-making of the model to be examined further and the
important features influencing its predictions to be ascertained. As a result, there
is greater trust in the results of this study, together with the possibility of making
sound decisions based on the model outputs.
The initial step in this architectural framework is video encoding, where the raw
video data is processed by a video encoder based on a pre-trained ResNet50V2
model, while the audio features are encoded using a pre-trained MobileNetV2.
Meaningful features are extracted from each frame of the video, and these features
are aggregated with the audio features to produce a compact representation of the
entire video, referred to as the video feature map. This feature map, together with
the caption encoding, is passed to the attention mechanism and then to the decoding
module to generate captions.
The attention mechanism plays a role in merging the visual and textual elements
during the caption generation process. It dynamically directs focus towards
different segments of the video feature map while generating each word of the
caption. By evaluating the similarity between the current state of the decoder and
the encoded video features, the attention mechanism ensures that the model
attends to the most important visual information, thereby enhancing the quality
and relevance of the generated captions.
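As an illustration of this mechanism, a minimal additive (Bahdanau-style) attention layer in TensorFlow is sketched below; the report does not specify the exact scoring function, so the formulation and layer sizes here are assumptions.

import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    """Scores each encoder timestep against the current decoder state and returns a context vector."""

    def __init__(self, units=512):
        super().__init__()
        self.W_enc = tf.keras.layers.Dense(units)
        self.W_dec = tf.keras.layers.Dense(units)
        self.v = tf.keras.layers.Dense(1)

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden); encoder_outputs: (batch, timesteps, features)
        query = tf.expand_dims(decoder_state, 1)
        scores = self.v(tf.nn.tanh(self.W_enc(encoder_outputs) + self.W_dec(query)))
        weights = tf.nn.softmax(scores, axis=1)                  # attention over timesteps
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)
        return context, weights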
The final stage in the architecture is decoding and caption generation. The decoder
module contains an LSTM, an RNN-based architecture, which uses the attended
video features along with the previously generated words to predict the next word
in the caption sequence. This process continues until the end-of-sequence token is
generated. The generated captions represent a textual description of the content
depicted in the input video, providing valuable insights.
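The decoding loop itself is not shown in this excerpt; a greedy-decoding sketch under the same encoder-decoder assumptions (the embedding layer, start/end token ids, decoder cell, output layer and initial state are placeholders) would be:

import tensorflow as tf

def generate_caption(embedding, decoder_cell, output_layer, attention,
                     encoder_outputs, init_state, start_id, end_id, max_len=30):
    """Greedy decoding: feed the previously predicted word back in until <end> or max_len."""
    word_id, state, caption = start_id, init_state, []
    for _ in range(max_len):
        context, _ = attention(state[0], encoder_outputs)            # attend over video features
        step_input = tf.concat([embedding(tf.constant([word_id])), context], axis=-1)
        output, state = decoder_cell(step_input, states=state)
        word_id = int(tf.argmax(output_layer(output), axis=-1)[0])   # most probable next word
        if word_id == end_id:
            break
        caption.append(word_id)
    return caption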
The video summaries produced by this video summarization module achieve a
BLEU-1 score of 69.8 and a ROUGE-L score of 28.95.
CHAPTER 5
The next step after frame generation is to extract features using the pretrained
models (EfficientNetB0, VGG16 and C3D). The features extracted from the
pretrained models are used as input to the ANN and machine learning models
that form the memorability prediction model. Figures 5.2, 5.3 and 5.8 show the
features extracted from each pretrained model. These features are then given to
the ANN and machine learning models.
The following figures show the input video, the corresponding highly memorable
clips and the generated captions.
FIGURE 5.8: Folder Containing High Memorable Video Frame for Long Video
5.3 EXPERIMENTS
An artificial neural network model is created, and the first extracted feature set,
VGG16 (800 samples, 10 percent of the data), is fed into the model. The model
has 128 neurons in the first layer, 64 neurons in the second layer, and a final output
layer for the memorability score. The Adam optimizer is used with a learning rate
of 0.001. The model is built with mean absolute error and mean squared error as
metrics. The features are fed to the model through a custom data generator with
a batch size of 2. The model is evaluated on 100 samples from the training data
outside this 10 percent of trained data; it shows a training MAE of 0.194, a training
MSE of 0.391, a testing MAE of 0.332 and a testing MSE of 0.122. The Spearman
correlation computed from the outputs is -0.234.
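A sketch of this network in Keras, matching the layer sizes and optimizer settings stated above (the input dimensionality, epoch count and the placeholder data standing in for the real VGG16 features are assumptions), might look like:

import numpy as np
import tensorflow as tf

# Placeholder data: 800 VGG16 feature vectors (dimensionality assumed) and their memorability scores.
X, y = np.random.rand(800, 512).astype('float32'), np.random.rand(800).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)                       # memorability score output
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='mse', metrics=['mae', 'mse'])

# The report feeds the features through a custom data generator; epoch count assumed here.
model.fit(X, y, batch_size=2, epochs=10)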
An artificial neural network model is created, and the extracted VGG16, C3D and
EfficientNetB0 features (4,000 samples, 50 percent of the data) are fed into the
model. The model has 128 neurons in the first layer, 64 in the second, 32 in the
third, followed by layers of 16, 4 and 1 neurons, with the final layer producing
the memorability score. The Adam optimizer is used with a learning rate of 0.001.
The model is built with mean absolute error and mean squared error as metrics.
The features are fed to the model through a custom data generator with a batch
size of 32. The model is evaluated on 100 samples from the training data outside
the trained portion; it shows a training MAE of 0.353, a training MSE of 0.519, a
testing MAE of 0.277 and a testing MSE of 0.137. The Spearman correlation
computed from the outputs is -0.027.
An artificial neural network model is created, and the extracted VGG16, C3D and
EfficientNetB0 features are fed into the model. The model has 128 neurons in the
first layer, 64 in the second, 32 in the third, followed by layers of 16, 4 and 1
neurons, with the final layer producing the memorability score. The Adam
optimizer is used with a learning rate of 0.001. The model is built with mean
absolute error and mean squared error as metrics. The features are fed to the
model through a custom data generator with a batch size of 32. The model is
evaluated on 100 samples from the training data outside the trained portion; it
shows a training MAE of 0.051, a training MSE of 0.0045, a testing MAE of
0.070 and a testing MSE of 0.008. The Spearman correlation computed from the
outputs is 0.029.
The VGG16, C3D and EfficientNetB0 features extracted from the video data are
given to a Support Vector Regression (SVR) model. The amount of data given to
the SVR is 50 percent of the extracted features. The model is trained with Mean
Absolute Error (MAE) and Mean Squared Error (MSE) as metrics, giving a Mean
Squared Error of 0.008 and a Mean Absolute Error of 0.083. The Spearman
correlation computed from the outputs is 0.233.
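A minimal sketch of this experiment, assuming scikit-learn's SVR, SciPy's Spearman correlation and placeholder data standing in for the real feature matrix (the SVR settings and the train/test split are assumptions), is:

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

# Placeholder stand-ins for the concatenated VGG16 + C3D + EfficientNetB0 features and scores.
X, y = np.random.rand(1000, 3328), np.random.rand(1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

svr = SVR(kernel='rbf')          # kernel assumed; the report does not state the SVR settings
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('Spearman correlation:', spearmanr(y_test, y_pred).correlation)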
The VGG16, C3D and EfficientNetB0 features extracted from the video data are
given to an SVR model, in this case trained on all of the data. The model is
trained with Mean Absolute Error (MAE) and Mean Squared Error (MSE) as
metrics, giving a Mean Squared Error of 0.006 and a Mean Absolute Error of
0.063. The Spearman correlation computed from the outputs is 0.345.
The VGG16, EfficientNetB0 and C3D features extracted from the video data are
given to a Decision Tree model, trained on all of the data. The model is trained
with Mean Absolute Error (MAE) and Mean Squared Error (MSE) as metrics,
giving a Mean Squared Error of 0.010 and a Mean Absolute Error of 0.079. The
Spearman correlation computed from the outputs is 0.168.
The VGG16, EfficientNetB0 and C3D features extracted from the video data are
given to an Ensemble model, trained on all of the data. The Mean Squared Error
is 0.006 and the Mean Absolute Error is 0.064. The Spearman correlation
computed from the outputs is 0.312.
The VGG16, EfficientNetB0 and C3D features extracted from the video data are
given to the Ensemble model after the XAI-based feature selection, again trained
on all of the data. The Mean Squared Error is 0.0047 and the Mean Absolute
Error is 0.061. The Spearman correlation computed from the outputs is 0.483.
5.4 RESULTS
Table 5.1 presents the comparison of Spearman correlations for the above
experiments, used to find the best-suited model for memorability prediction.
Table 5.2 presents the corresponding comparison of Mean Squared Error, and
Table 5.3 the comparison of Mean Absolute Error.
Table 5.4 presents the comparison of BLEU scores between existing video
captioning models and our model, and Table 5.5 the corresponding comparison
of ROUGE-L scores.
In the areas of video processing and memorability prediction explored here, it is
vital to critically review sustainability considerations from different angles,
namely society, safety, legal and environment.
6.1 SOCIETY
The societal impact of our project concerns changes in perception and memory
caused by video content. Ensuring that our summarisation and memorability
prediction models do not facilitate the spread of dangerous or misleading
information is an important societal issue. There is a need to focus on ethical
considerations, encouraging responsible content development and use.
Furthermore, the possible societal implications of personalized content
recommendations based on predicted memorability should be considered, with a
view to avoiding undesired effects on users' behavior.
6.2 SAFETY
video content and the activity of user interactions; it therefore requires reliable
data protection mechanisms. The psychological safety of users is also a concern,
because the system often deals with memorable or even emotionally impactful
content. Content warnings and responsible handling of the video summarization
results are in line with these safety concerns.
6.3 LEGAL
Legal impacts involve whether laws and regulations for data protection are being
complied with. Following standards such as the GDPR (General Data Protection
Regulation) means that user privacy is protected. Dealing with copyright issues is
also paramount, especially in the case of video content. Legal risks can be reduced
by obtaining proper licensing or by making sure that the content stays within fair
use.
6.4 ENVIRONMENT
The environmental impact of our project concerns the computational resources
necessary for training and deploying deep learning models. We recognize the duty
to make our algorithms as energy-efficient as possible, to seek out green hardware
solutions and to reduce the negative environmental impact of our technological
pursuits. Utilizing sustainable practices in technology development is consistent
with our environmental responsibilities.
7.1 CONCLUSION
REFERENCES
1. Ali, H., Gilani, S.O., Khan, M.J., Waris, A., Khattak, M.K. and Jamil, M.,
2022, May. Predicting Episodic Video Memorability Using Deep Features
Fusion Strategy. In 2022 IEEE/ACIS 20th International Conference on
Software Engineering Research, Management and Applications (SERA) (pp.
39-46).
2. Cohendet, R., Demarty, C.H., Duong, N.Q. and Engilberge, M., 2019.
VideoMem: Constructing, analyzing, predicting short-term and long-
term video memorability. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 2531-2540).
4. Dumont, T., Hevia, J.S. and Fosco, C.L., 2023. Modular Memorability: Tiered
Representations for Video Memorability Prediction. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp.
10751-10760).
7. Li, J., Guo, X., Yue, F., Xue, F. and Sun, J., 2022. Adaptive Multi-Modal
Ensemble Network for Video Memorability Prediction. Applied Sciences,
12(17), p.8599.
8. Li, J., Wong, Y., Zhao, Q. and Kankanhalli, M.S., 2019. Video storytelling:
Textual summaries for events. IEEE Transactions on Multimedia, 22(2),
pp.554-565.
9. Lin, J., Zhong, S.H. and Fares, A., 2022. Deep hierarchical LSTM networks
with attention for video summarization. Computers Electrical Engineering,
97, p.107618.
10. Liu, T., Meng, Q., Huang, J.J., Vlontzos, A., Rueckert, D. and Kainz, B.,
2022. Video summarization through reinforcement learning with a 3D spatio-
temporal u-net. IEEE Transactions on Image Processing, 31, pp.1573-1586.
11. Lu, W., Zhai, Y., Han, J., Jing, P., Liu, Y. and Su, Y., 2023. VMemNet: A
Deep Collaborative Spatial-Temporal Network With Attention Representation
for Video Memorability Prediction. IEEE Transactions on Multimedia.
12. Lu, Y. and Wu, X., 2022. Video storytelling based on gated video
memorability filtering. Electronics Letters, 58(15), pp.576-578.
13. Mao, J., Xu, W., Yang, Y., Wang, J. and Yuille, A.L., 2015. Deep captioning
with multimodal recurrent neural networks (m-rnn). ICLR. arXiv preprint
arXiv:1412.6632
14. Sweeney, L., Smeaton, A. and Healy, G., 2023, September. Memories in the
Making: Predicting Video Memorability with Encoding Phase EEG. In 20th
International Conference on Content-based Multimedia Indexing (pp. 183-
187).
15. Xu, R., Xiong, C., Chen, W. and Corso, J., 2015, February. Jointly modeling
deep video and compositional text to bridge vision and language in a unified
framework. In Proceedings of the AAAI conference on artificial intelligence
(Vol. 29, No. 1).
16. Yu, L., Bansal, M. and Berg, T.L., 2017. Hierarchically-attentive rnn for album
summarization and storytelling. arXiv preprint arXiv:1708.02977.