Video Storytelling Via Retaining High Memorable Video
Clips
S Deepak Sriram 3122226001004
ME CSE, Semester 4
Dr. A. Beulah
Supervisor
Project Review: 2 (Phase 2) (09 April 2024)
Department of Computer Science and Engineering
SSN College of Engineering
1 MOTIVATION
In the current digital age, a vast amount of multimedia content is available. Long
videos contain both useful and unnecessary information, and viewers are often un-
willing to watch them in full; as a result, people now prefer shorter, more precise
videos. Traditional methods of converting long videos into concise short videos
include key-frame-based selection, content-based segment selection, and shot-
boundary-based detection. However, these methods have several drawbacks. In
key-frame-based selection, important actions or events may occur between the
selected key frames. In content-based segment selection, it is difficult to identify
which segments contain relevant information. Shot-boundary-based detection may
focus too heavily on sudden changes between scenes, resulting in a weak under-
standing of the overall story. This motivates generating textual descriptions for long
videos without losing their semantics.
2 PROBLEM STATEMENT
Today's digital world is filled with all kinds of multimedia content. One of the biggest
challenges faced by professionals in advertising, film making, education, and content
retrieval is predicting whether a video will be remembered over both the short term
and the long term. Video summarization using memorability scores refers to the task
of generating a textual description of a video that contains only its most memorable
and representative parts. Given a long video, the goal is to find the memorable
segments, use Explainable AI (XAI) to verify the model's correctness, and convert
those memorable segments into textual descriptions, resulting in a video summary.
3 ABSTRACT
Video summarization is an increasingly challenging task in managing online video
content. Traditional methods focus solely on visual, audio, or text cues and often
overlook the vital role of human attention and memory. Incorporating memorability
scores into the video summarization process offers a way to address this gap. To this
end, a novel framework is introduced that seamlessly integrates video summarization
with memorability prediction. The proposed model predicts the memorability scores
of different video segments, resulting in more engaging and relevant summaries that
align with human cognitive functions. The effectiveness of this approach was
demonstrated through experiments on the Memento10K dataset, revealing significant
improvements in the prediction of video memorability scores. Furthermore,
Explainable AI (XAI) techniques are employed to ensure transparency and
interpretability of the model's predictions, thereby enhancing trustworthiness and
facilitating real-world adoption. This research not only provides a fresh perspective
on the analysis of video content but also enhances the user experience by
recognizing and leveraging human memory.
4 OBJECTIVE
The goal of this project is to create a model that can summarize a given video based
on its memorability scores.
• To build a model that predicts the memorability score of a given video using the
features extracted from the video.
• To use Explainable AI to analyse the memorability model's correctness and fea-
ture importance.
• To generate captions for each segment of a video that has a high memorability
score.
• To create a story from the generated captions.
5 LITERATURE SURVEY
In recent years, research has been carried out on video summarization based on mem-
orability as well as on other conventional methods. A model called Gated Video
Memorability Filtering (GVMF) addresses the issue of story-irrelevant visual content
by filtering out video clips with lower memorability scores. These lower-scoring clips
typically contain objects and characters that appear less frequently and are considered
story-irrelevant noise. By introducing video memorability prediction, the GVMF
model identifies and filters out these irrelevant clips. However, GVMF removes all
frames that do not contain enough objects in the scene, and there may be cases in
which frames with few objects still contribute to the memorability of the video and
to its content [1].
Another model, M3-S (Modular Memorability Model), comprises four modules: raw
perception, scene understanding, event understanding, and contextual similarity. These
modules analyze various aspects of video content, such as low-level perceptual fac-
tors, semantic segmentation for object detection, action recognition for high-level un-
derstanding, and contextual similarity for visual context. The model combines these
module outputs to predict video memorability comprehensively. However, the paper
explicitly acknowledges that the model may struggle with complex semantics, extreme
pixel intensity, or extreme motion, leading to suboptimal memorability scores. [2]
A framework called the Adaptive Multi-modal Ensemble Network (AMEN) combines
three individual base learners (ResNet3D, Deep Random Forest, and Multi-Layer Per-
ceptron) into a weighted ensemble. The model is built upon three kinds of information
from the video: temporal 3D information, spatial information, and semantics derived
from the video, image, and caption. However, ensemble models, especially those
combining multiple modalities and employing complex base learners, can be compu-
tationally expensive, which may limit the scalability of the proposed framework for
large-scale applications or real-time processing. [3]
Another approach, followed in the MediaEval 2022 task, trains two sequential mod-
els for the text and the visuals in a video. The first model extracts textual features
from the video, while the second analyses the memorability of the image frames.
Concatenating the frame features with the textual features that contribute to memo-
rability yields the memorability score of the video. A limitation of this work is that
frames with longer captions tend to be predicted as more memorable, which may not
be true in all cases. [4]
Deep Hierarchical LSTM Networks with Attention for Video Summarization (DHAVS)
is a video summarization method that addresses the challenge of processing longer
frame sequences, which is often an issue for traditional methods. DHAVS utilizes a
3D CNN for extracting spatio-temporal features, an attention-based hierarchical LSTM
module for capturing temporal dependencies, and a cost-sensitive loss function to
handle imbalanced class distributions. However, this approach may have limitations
when applied to videos of different lengths. [5]
The 3DST-UNet-RL framework performs intelligent video summarization by combin-
ing a 3D spatio-temporal U-Net with reinforcement learning (RL), aiming to convey
crucial information efficiently. The RL agent computes a reward for each state (frame)
and decides whether it should be included in the summary. A limitation of this ap-
proach is that future frames are not considered while selecting the current frame, so
the summary may include unimportant content and ignore important content. [6]
A machine learning approach based on a Bayesian Ridge Regressor (BRR) is trained
on features extracted from the video to compute memorability scores, using a
temporal-based frame extraction technique. However, the authors explicitly state that
capturing the most representative frames from the video content needs improvement
and leave this as future work. [7]
6 PROPOSED SYSTEM
Figure 1: System Architecture
The long video is divided into segments. These segments are fed into a memorability
prediction model. The resulting memorability scores are ranked and the clips with the
lowest scores are filtered out. The segments with high memorability scores are com-
bined, and these highly memorable segments are fed into a video captioning model,
which converts them into a textual description, as sketched below.
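A minimal sketch of the ranking and filtering step is given below; predict_memorability and the list of segments are assumed to be provided by the rest of the pipeline, and the keep ratio is illustrative.

def select_memorable_segments(segments, predict_memorability, keep_ratio=0.5):
    # score every segment with the memorability prediction model
    scores = [predict_memorability(seg) for seg in segments]
    # rank segments by score and keep the top fraction
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:max(1, int(len(segments) * keep_ratio))])  # restore temporal order
    return [segments[i] for i in kept]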
Figure 2: Memorability prediction model
The Memento10k dataset, which contains short videos along with their short-term
and long-term memorability scores, is used. Frames are extracted from the videos
and an ensemble model (SVR + Ridge Regressor + Decision Trees) is trained on the
short-term and long-term memorability scores, so that it can predict memorability
scores of the kind shown in Table 1. Then, using Explainable AI (XAI), the outputs
produced by the ensemble model are interpreted, and based on these observations the
ensemble model is fine-tuned to produce more optimized results.
Figure 3: Video Captioning Model
The MSR-VTT dataset is used to train the video captioning model; its videos are of
similar length to those in the Memento10K dataset. Frames are captured from the
videos, and features are extracted using a pretrained CNN (ResNet) to train the video
captioning model. These features are passed to an encoder LSTM, which captures the
temporal order and analyses the sequence of frames. The decoder LSTM utilizes this
contextual information from the encoder to generate the caption sequence for the
input video.
7 DATASET
The dataset used for training the video memorability prediction model is the Me-
mento10K dataset. It consists of videos and their corresponding short-term and
long-term memorability scores. The dataset contains 10,000 short videos, each 6
seconds long.
Video Short-term Memorability Long-term Memorability
Video1.webm 0.75 0.82
Video2.webm 0.68 0.75
Video3.webm 0.80 0.88
video4.webm 0.40 0.43
video5.webm 0.65 0.60
Table 1: Sample Memorability Data
The dataset was collected by asking participants to watch 180 videos for the short-term
memorability task and 120 videos for the long-term memorability task. Participants
were asked to press the space bar whenever they found a video similar to one they
had previously watched. From these responses, the short-term and long-term memo-
rability scores are calculated.
8 EXTRACT FRAMES
Frames are extracted from the video data and fed into the feature extraction model.
Twenty frames are sampled at regular intervals from each video file and passed to the
feature extraction module; a minimal sketch of this step is shown below.
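The sketch below assumes OpenCV is used to read the video files and that frames are resized to the CNN input resolution; the original implementation may differ in these details.

import cv2
import numpy as np

def extract_frames(video_path, num_frames=20, size=(224, 224)):
    # sample num_frames frames at regular intervals across the video
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return np.array(frames)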
9 FEATURE EXTRACTION
9.1 Vggnet16 Feature Extraction
VGG16 consists of 16 weight layers: 13 convolutional layers and 3 fully connected
layers. The convolutional layers use small 3x3 filters with a stride of 1 and zero-
padding to maintain spatial resolution. The depth of the convolutional layers gradually
increases, allowing the network to capture complex hierarchical features.
The VGG16 model is loaded with weights pretrained on ImageNet. Twenty frames
are extracted per video, and each frame is passed to the VGG16 model. The output of
VGG16's final convolutional block is taken as the feature map and passed through a
GlobalAveragePooling2D layer, which converts the learnt patterns into a feature
vector.
Figure 4: VGG16 Architecture
In the architecture shown in Fig. 4, the output of the final MaxPooling layer is extracted
as the feature map. It has shape (1, 7, 7, 512) and is converted into a vector of shape
(1, 512) using GlobalAveragePooling2D.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import GlobalAveragePooling2D
import tensorflow as tf

# load VGG16 pretrained on ImageNet, without the classifier head
base_model = VGG16(include_top=False, weights='imagenet')
base_model.trainable = False

# extract the final convolutional feature map for one frame
features = base_model.predict(tf.expand_dims(frames[0], axis=0))
# pool the (1, 7, 7, 512) map into a (1, 512) feature vector
VGG_feature_vector = GlobalAveragePooling2D()(features)
VGG_feature_vector
Listing 1: VGG16 Feature Extraction
9.2 C3D Feature Extraction
C3D consists of a series of 3D convolutional layers that operate on both spatial and
temporal dimensions. Unlike traditional 2D CNNs that process images frame by
frame, C3D considers the temporal aspect by incorporating 3D convolutions across
consecutive video frames. The architecture typically comprises eight convolutional
layers, five pooling layers, and two fully connected layers.
Figure 5: C3D Architecture
In the architecture shown in Fig. 5, the input video data (a series of frames) is given
to the C3D model and the output of the conv5a block is extracted as a feature map,
which is then pooled into a feature vector.
import tensorflow as tf
from tensorflow.keras.layers import GlobalAveragePooling3D

# Note: tf.keras.applications does not ship a C3D model; a custom or
# third-party Keras C3D implementation (here assumed as build_c3d) is required.
base_model = build_c3d(include_top=False)
base_model.trainable = False

# extract the conv5a feature map for one clip (a stack of frames)
features = base_model.predict(tf.expand_dims(frames, axis=0))
# pool the spatio-temporal feature map into a feature vector
C3D_feature_vector = GlobalAveragePooling3D()(features)
C3D_feature_vector
Listing 2: C3D Feature Extraction
9.3 EfficientNetB0 Feature Extraction
The EfficientNetB0 model is known for its balance between model size and perfor-
mance when capturing high-level visual features from video segments. By utilizing
the pre-trained weights of EfficientNetB0, it is possible to acquire semantically rich
features that are instrumental in achieving an adequate comprehension of the visual
content.
Figure 6: EfficientNetB0 Feature Extraction

The video frames are passed to EfficientNetB0, and features are extracted from in-
termediate layer outputs. These features capture hierarchical representations that en-
able the model to distinguish both global and local visual patterns within each frame.
As shown in Fig. 6, the input frames extracted from the video data are fed into the
EfficientNetB0 model. The features are taken from the layer before the top layer and
pooled into a 1280-dimensional feature vector per frame. These per-frame features
are then combined to obtain a feature vector for the entire video, which is used for
further model training.
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D

# load EfficientNetB0 pretrained on ImageNet, without the classifier head
model = EfficientNetB0(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3))

# extract the feature map for one frame
features = model.predict(tf.expand_dims(frames_20[0], axis=0))
features.shape

# pool the feature map into a 1280-dimensional feature vector
feature_vector = GlobalAveragePooling2D()(features)
feature_vector
Listing 3: EfficientNetB0 Feature Extraction
10 ENSEMBLE MODEL: SUPPORT VECTOR REGRESSOR, DECISION TREE, RIDGE REGRESSOR
The deep learning features are extracted and used as inputs to an ensemble model
that operates in regression mode. The ensemble combines a Support Vector Regressor,
a Ridge Regressor, and Decision Trees into a strong machine learning model that is
well suited to distinguishing patterns and relationships in high-dimensional feature
spaces. By supplying the carefully selected features to the ensemble model, it learns
to map intricate correlations between video content and memorability. The trained
ensemble model is then capable of predicting a memorability score for each video
segment. The ensemble model is illustrated in Figure 7.
Figure 7: Ensemble Model
from sklearn.ensemble import VotingRegressor

# combine the three regressors into a single voting ensemble
ensemble_model = VotingRegressor([
    ('svr', svm_model),
    ('ridge', ridge_reg),
    ('decision_tree', decision_tree_model)
])
ensemble_model.fit(X_train_scaled, y_train.flatten())
Listing 4: Sample code - Ensemble Model Creation
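Once fitted, the ensemble can score unseen segments directly. A minimal usage sketch is shown below, assuming scaler is the same StandardScaler fitted on the training features and X_new holds the features of new segments.

# scale the features of unseen segments with the training-time scaler
X_new_scaled = scaler.transform(X_new)
predicted_scores = ensemble_model.predict(X_new_scaled)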
10.1 Dual Memorability Prediction
Importantly, the model is set up to predict both short-term and long-term memorabil-
ity. This dual prediction increases the system's flexibility and makes it possible to
understand how content is remembered by viewers at different time scales. This
nuanced understanding is especially useful in applications such as video editing,
content recommendation, or the creation of educational materials. A minimal sketch
of the dual set-up is given below.
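The sketch assumes y_train_short and y_train_long hold the two targets; one copy of the ensemble is trained per time scale over the same feature representation.

from sklearn.base import clone

# one ensemble per time scale, sharing the same feature representation
short_term_model = clone(ensemble_model).fit(X_train_scaled, y_train_short)
long_term_model = clone(ensemble_model).fit(X_train_scaled, y_train_long)

short_scores = short_term_model.predict(X_test_scaled)
long_scores = long_term_model.predict(X_test_scaled)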
11 EXPLAINABLE AI
In this research, we incorporated explainable AI methodologies to provide transparency
and interpretability for our regression model. We employed two key techniques: Lo-
cal Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlana-
tions (SHAP).
11.1 LIME Explanation
LIME offers a localized view of our regression model’s predictions, allowing us to
understand the factors contributing to individual predictions. By approximating the
model’s behavior around a specific instance, LIME generates interpretable explana-
tions. For each prediction, LIME fits an interpretable model (such as linear regression)
on perturbed instances near the original input. This approach enables us to discern
the relative importance of different features for a given prediction.
Figure 8: LIME Explanation
We visualized the LIME explanations using the show_in_notebook function within
Jupyter notebooks. This facilitated a detailed examination of feature contributions,
aiding in the interpretation of the model's decision-making process.
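A minimal sketch of this LIME workflow is shown below, assuming X_train_scaled and X_test_scaled are the pooled feature vectors and feature_names labels each feature dimension.

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train_scaled,
    mode='regression',
    feature_names=feature_names)

# explain the prediction for one video's feature vector
explanation = explainer.explain_instance(
    X_test_scaled[0],
    ensemble_model.predict,   # black-box regressor being explained
    num_features=10)
explanation.show_in_notebook(show_table=True)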
11.2 SHAP Explanation
In addition to LIME, we employed SHAP to provide a global perspective on our re-
gression model’s behavior. SHAP values assign each feature an importance score,
indicating its impact on the model’s output across the entire dataset. By computing
SHAP values for all instances, we gained insights into the overall influence of each
feature on the model’s predictions.
The SHAP summary plot, generated using the summary_plot function, offered a
comprehensive overview of feature importances. This visualization allowed us to
identify the most influential features driving the model’s predictions.
Figure 9: SHAP Summary Plot
Furthermore, we explored Individual Conditional Expectation (ICE) plots, which
illustrate the relationship between a single feature and SHAP values across multiple
instances. These plots provided valuable insights into how the model’s predictions
vary with changes in specific feature values.
Figure 10: Individual Conditional Expectation (ICE) Plot
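A minimal sketch of the SHAP analysis is given below; KernelExplainer treats the ensemble as a black box, and the small background sample size is an illustrative choice to keep the computation tractable.

import shap

background = shap.sample(X_train_scaled, 50)
explainer = shap.KernelExplainer(ensemble_model.predict, background)
shap_values = explainer.shap_values(X_test_scaled[:100])

# global view of feature importance across the evaluated instances
shap.summary_plot(shap_values, X_test_scaled[:100], feature_names=feature_names)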
The integration of LIME and SHAP explanations enhanced the transparency and
interpretability of our regression model. These techniques enabled us to gain a deeper
understanding of the model’s decision-making process and identify critical features
driving its predictions. As a result, our research benefits from increased trustworthi-
ness and the ability to make informed decisions based on the model’s outputs.
12 VIDEO CAPTIONING MODEL
The video captioning model aims to generate descriptive textual representations for
videos, facilitating machine understanding of visual content. This section outlines our
proposed video captioning model, which integrates visual feature extraction with se-
quence modeling techniques to achieve accurate and informative caption generation.
12.1 Feature Extraction - ResNet
The initial stage of our model involves extracting rich visual features from input video
frames to capture essential spatial information. We employ a pre-trained convolu-
tional neural network (CNN) as the visual feature extractor. Specifically, we utilize
a ResNet-based CNN architecture to process each frame independently and extract
spatial features that effectively represent the visual content.
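A minimal sketch of this per-frame feature extraction is shown below, assuming 224x224 RGB frames and ResNet-50 with average pooling; the exact ResNet variant is an assumption, not fixed by this work.

import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# 2048-dimensional feature vector per frame, pooled inside the model
resnet = ResNet50(weights='imagenet', include_top=False, pooling='avg')
frame_features = resnet.predict(preprocess_input(frames.astype(np.float32)))
# frame_features has shape (num_frames, 2048) and feeds the encoder LSTM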
12.2 Temporal Encoding with LSTM
Following visual feature extraction, our model incorporates a Long Short-Term Mem-
ory (LSTM) network to encode temporal dependencies and contextual information.
The LSTM sequentially processes the visual features, considering the temporal order
of frames within the video. By capturing temporal dynamics, the LSTM generates
output sequences (or hidden states) that encapsulate the encoded information at each
time step. The final hidden state of the LSTM serves as a summarized representation
of the entire input video, capturing relevant temporal patterns and context.
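A minimal sketch of this encoding step, assuming 20 frames of 2048-dimensional ResNet features and a 512-unit LSTM (both sizes are illustrative):

import tensorflow as tf

encoder = tf.keras.layers.LSTM(512, return_sequences=True, return_state=True)
enc_outputs, final_h, final_c = encoder(frame_features[tf.newaxis, ...])
# enc_outputs: one hidden state per frame (later used by the attention mechanism)
# final_h:     summarized representation of the whole clip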
12.3 Attention Mechanism in the Decoder
In the decoding phase, our model employs an attention mechanism to focus on salient
regions of the input video while generating captions. The decoder LSTM receives the
encoded information from the encoder and utilizes attention weights to dynamically
weight the importance of different visual features at each decoding step. This attention
mechanism allows the model to attend to relevant parts of the video when generating
each word in the caption, enhancing the relevance and coherence of the generated
textual descriptions.
12.4 Caption Generation and Output Layer
The decoder LSTM generates captions word-by-word based on the encoded informa-
tion and attention weights. At each decoding step, the model predicts the next word
in the caption sequence using a softmax layer, which computes the probability dis-
tribution over the vocabulary. During training, the model is trained to minimize the
cross-entropy loss between the predicted word distribution and the ground-truth cap-
tions. During inference, the model generates captions by sampling from the predicted
word distributions, producing coherent and informative textual representations of the
input video.
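The pieces above can be assembled into an end-to-end trainable captioner. The sketch below is illustrative: the layer sizes, vocabulary size, caption length, and the use of Keras' built-in Attention layer are assumptions rather than the exact configuration used in this work.

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_FRAMES, FEAT_DIM = 20, 2048          # assumed frame count / ResNet feature size
VOCAB_SIZE, MAX_CAPTION_LEN = 10000, 30  # assumed vocabulary / caption length

# encoder: per-frame features -> temporal states and clip summary
frame_feats = layers.Input(shape=(MAX_FRAMES, FEAT_DIM))
enc_seq, enc_h, enc_c = layers.LSTM(512, return_sequences=True, return_state=True)(frame_feats)

# decoder: previous words -> next-word distribution, attending over encoder states
caption_in = layers.Input(shape=(MAX_CAPTION_LEN,))
emb = layers.Embedding(VOCAB_SIZE, 256)(caption_in)
dec_seq = layers.LSTM(512, return_sequences=True)(emb, initial_state=[enc_h, enc_c])
context = layers.Attention()([dec_seq, enc_seq])          # query = decoder, value = encoder
word_probs = layers.Dense(VOCAB_SIZE, activation='softmax')(
    layers.Concatenate()([dec_seq, context]))

captioner = Model([frame_feats, caption_in], word_probs)
captioner.compile(optimizer='adam', loss='sparse_categorical_crossentropy')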
13 RESULTS
Table 2 compares the Spearman correlations of the above experiments in order to
identify the most suitable model for memorability prediction.
Table 3 compares the Mean Squared Error of the same experiments.
Experiment Spearman Correlation
Experiment 1 - ANN (10% of data) -0.234
Experiment 2 - ANN (50% of data) -0.027
Experiment 3 - ANN (100% of data) 0.029
Experiment 4 - SVR (50% of data) 0.233
Experiment 5 - SVR (100% of data) 0.345
Experiment 6 - DT (100% of data) 0.168
Experiment 7 - Ensembled Model (100 % of data) 0.312
Table 2: Analysis of Spearman Correlation among Experiments
Experiment Mean Squared Error
Experiment 1 - ANN (10% of data) 0.122
Experiment 2 - ANN (50% of data) 0.519
Experiment 3 - ANN (100% of data) 0.008
Experiment 4 - SVR (50% of data) 0.008
Experiment 5 - SVR (100% of data) 0.006
Experiment 6 - DT (100% of data) 0.011
Experiment 7 - Ensembled Model (100 % of data) 0.006
Table 3: Analysis of Mean Squared Error among Experiments
Table 4 compares the Mean Absolute Error of the above experiments in order to
identify the most suitable model for memorability prediction.
Experiment Mean Absolute Error
Experiment 1 - ANN (10% of data) 0.332
Experiment 2 - ANN (50% of data) 0.277
Experiment 3 - ANN (100% of data) 0.051
Experiment 4 - SVR (50% of data) 0.083
Experiment 5 - SVR (100% of data) 0.006
Experiment 6 - DT (100% of data) 0.079
Experiment 7 - Ensembled Model (100 % of data) 0.064
Table 4: Analysis of Mean Absolute Error among Experiments
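The regression metrics reported in Tables 2-4 can be computed with standard SciPy and scikit-learn calls; a minimal sketch, assuming y_test holds the ground-truth memorability scores:

from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred = ensemble_model.predict(X_test_scaled)
rho, _ = spearmanr(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"Spearman: {rho:.3f}  MSE: {mse:.3f}  MAE: {mae:.3f}")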
Table 5 compares the BLEU scores of existing video captioning models with our model.
Models BLEU-1 Score
GVMF [1] 70.5
H-RNN [8] 61.6
Xu et al. [9] 61.7
m-RNN [11] 61.9
ResBRNN [10] 69.1
Our Model 64.8
Table 5: Comparison of BLEU Scores with the existing models
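BLEU-1 compares the unigrams of a generated caption against a reference caption. A minimal sketch using NLTK, with purely illustrative token lists:

from nltk.translate.bleu_score import sentence_bleu

reference = ['a', 'man', 'is', 'playing', 'a', 'guitar']   # illustrative ground-truth tokens
candidate = ['a', 'man', 'plays', 'the', 'guitar']         # illustrative model output tokens
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0))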
14 TIMELINE
• Start
• First Review: Started with Explainable AI for the prediction models
• Second Review: Video Captioning Model
• Third Review: Completed Video Captioning Model with Explainable AI
• End
References
[1] Lu, Y. and Wu, X., Video storytelling based on gated video memorability filtering,
Electronics Letters 2022, 58(15), pp.576-578.
[2] Dumont, T., Hevia, J.S. and Fosco, C.L., Modular Memorability: Tiered Representa-
tions for Video Memorability Prediction, In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition 2023 (pp. 10751-10760).
[3] Li, J., Guo, X., Yue, F., Xue, F. and Sun, J., Adaptive Multi-Modal Ensemble Network
for Video Memorability Prediction, Applied Sciences 2022, 12(17), p.8599.
[4] Guinaudeau, C. and Xalabarder, A.G., Textual Analysis for Video Memorability Pre-
diction, In Working Notes Proceedings of the MediaEval 2022 Workshop.
[5] Lin, J., Zhong, S.H. and Fares, A., Deep hierarchical LSTM networks with attention
for video summarization, Computers and Electrical Engineering 2022, 97, p.107618.
[6] Liu, T., Meng, Q., Huang, J.J., Vlontzos, A., Rueckert, D. and Kainz, B., Video
summarization through reinforcement learning with a 3D spatio-temporal u-net., IEEE
Transactions on Image Processing 2022, 31, pp.1573-1586.
[7] Cummins, S., Sweeney, L. and Smeaton, A., Analysing the Memorability of a Proce-
dural Crime-Drama TV Series, CSI, In Proceedings of the 19th International Con-
ference on Content-based Multimedia Indexing 2022 (pp. 174-180).
[8] Yu, L., Bansal, M. and Berg, T.L., 2017. Hierarchically-attentive rnn for album
summarization and storytelling. arXiv preprint arXiv:1708.02977.
[9] Xu, R., Xiong, C., Chen, W. and Corso, J., 2015, February. Jointly modeling deep
video and compositional text to bridge vision and language in a unified frame-
work. In Proceedings of the AAAI conference on artificial intelligence (Vol. 29,
No. 1).
[10] Li, J., Wong, Y., Zhao, Q. and Kankanhalli, M.S., 2019. Video storytelling: Textual
summaries for events. IEEE Transactions on Multimedia, 22(2), pp.554-565.
[11] Mao, J., Xu, W., Yang, Y., Wang, J. and Yuille, A.L., 2015. Deep caption-
ing with multimodal recurrent neural networks (m-rnn). ICLR. arXiv preprint
arXiv:1412.6632.