Video Storytelling Via Retaining High Memorable Video
Clips
S Deepak Sriram 3122226001004
ME CSE, Semester 4
Dr. A. Beulah
Supervisor
Project Review: 2 (Phase 2) (09 April 2024)
Department of Computer Science and Engineering
SSN College of Engineering
1 MOTIVATION
In the current digital age, a vast amount of multimedia content is available. Long
videos contain both useful and unnecessary information, and viewers are often un-
willing to watch them in full; as a result, people now prefer shorter, more precise
videos. Traditional methods of converting long videos into concise short videos
include key-frame-based selection, content-based segment selection, and shot-
boundary-based detection. However, these methods have several drawbacks. In
key-frame-based selection, important actions or events may occur between the
selected key frames. In content-based segment selection, it is difficult to identify
which segments contain relevant information. Shot-boundary-based detection may
focus too heavily on sudden changes between scenes, resulting in a weak under-
standing of the overall story. This motivates generating textual descriptions for long
videos without losing their semantics.
2 PROBLEM STATEMENT
Today's digital world is filled with all kinds of multimedia content. One of the biggest
challenges faced by professionals in advertising, film making, education, and content
retrieval is predicting whether a video will be remembered over both the short term
and the long term. Video summarization using memorability scores refers to the task
of generating a textual description of a video that contains only its most memorable
and representative parts. Given a long video, the goal is to find the memorable
segments, use Explainable AI (XAI) to verify the model's correctness, and convert
those memorable segments into textual descriptions, resulting in a video summary.
3 ABSTRACT
Video summarization is an increasingly challenging task in managing online video
content. Traditional methods focus solely on visual, audio, or text cues and often
overlook the vital role of human attention and memory. Incorporating memorability
scores into the video summarization process offers a way to address this gap. To this
end, a novel framework is introduced that seamlessly integrates video summarization
with memorability prediction. The proposed model predicts the memorability scores
of different video segments, resulting in more engaging and relevant summaries that
align with human cognitive functions. The effectiveness of this approach was
demonstrated through experiments on the Memento10K dataset, revealing significant
improvements in the prediction of video memorability scores. Furthermore,
Explainable AI (XAI) techniques are employed to ensure transparency and
interpretability of the model's predictions, thereby enhancing trustworthiness and
facilitating real-world adoption. This research not only provides a fresh perspective
on the analysis of video content but also enhances the user experience by
recognizing and leveraging human memory.
4 OBJECTIVE
The goal of this project is to create a model that can summarize a given video based
on its memorability scores.
• To build a model that predicts the memorability score of a given video using the
features extracted from the video.
• To use Explainable AI to analyse the memorability model's correctness and fea-
ture importance.
• To generate captions for each segment of a video that has a high memorability
score.
• To create a story from the generated captions.
5 LITERATURE SURVEY
In recent years, research has been carried out on video summarization based on mem-
orability as well as on other conventional methods. A model called Gated Video
Memorability Filtering (GVMF) addresses the issue of story-irrelevant visual content
by filtering out video clips with lower memorability scores. These lower-scoring clips
typically contain objects and characters that appear less frequently and are considered
story-irrelevant noise. By introducing video memorability prediction, the GVMF
model identifies and filters out these irrelevant clips. However, GVMF removes all
frames that do not contain enough objects in the scene, and there may be cases in
which frames with few objects still contribute to the memorability of the video and
to its content [1].
Another model, M3-S (Modular Memorability Model), comprises four modules: raw
perception, scene understanding, event understanding, and contextual similarity. These
modules analyze various aspects of video content, such as low-level perceptual fac-
tors, semantic segmentation for object detection, action recognition for high-level un-
derstanding, and contextual similarity for visual context. The model combines these
module outputs to predict video memorability comprehensively. However, the paper
explicitly acknowledges that the model may struggle with complex semantics, extreme
pixel intensity, or extreme motion, leading to suboptimal memorability scores. [2]
A framework called the Adaptive Multi-modal Ensemble Network (AMEN) combines
three individual base learners (ResNet3D, Deep Random Forest, and Multi-Layer Per-
ceptron) into a weighted ensemble. The model is built upon three kinds of information
from the video: temporal 3D information, spatial information, and semantics derived
from the video, image, and caption. However, ensemble models, especially those
combining multiple modalities and employing complex base learners, can be compu-
tationally expensive, which may limit the scalability of the proposed framework for
large-scale applications or real-time processing. [3]
Another approach, followed in the MediaEval 2022 task, trains two sequential mod-
els for the text and the visuals in a video. The first model extracts textual features
from the video, while the second analyses the memorability of the image frames.
Concatenating the frame features with the textual features that contribute to memo-
rability yields the memorability score of the video. A limitation of this work is that
frames with longer captions tend to be predicted as more memorable, which may not
be true in all cases. [4]
Deep Hierarchical LSTM Networks with Attention for Video Summarization (DHAVS)
is a video summarization method that addresses the challenge of processing longer
frame sequences, which is often an issue for traditional methods. DHAVS utilizes a
3D CNN for extracting spatio-temporal features, an attention-based hierarchical LSTM
module for capturing temporal dependencies, and a cost-sensitive loss function to
handle imbalanced class distributions. However, this approach may have limitations
when applied to videos of different lengths. [5]
The 3DST-UNet-RL framework performs intelligent video summarization by combin-
ing a 3D spatio-temporal U-Net with reinforcement learning (RL), aiming to convey
crucial information efficiently. The RL agent computes a reward for each state (frame)
and decides whether it should be included in the summary. A limitation of this ap-
proach is that future frames are not considered while selecting the current frame, so
the summary may include unimportant content and ignore important content. [6]
A machine learning approach based on a Bayesian Ridge Regressor (BRR) is trained
on features extracted from the video to compute memorability scores, using a
temporal-based frame extraction technique. However, the authors explicitly state that
capturing the most representative frames from the video content needs improvement
and leave this as future work. [7]
6 PROPOSED SYSTEM
Figure 1: System Architecture
The long video is divided into segments. These segments are fed into a memorability
prediction model. The resulting memorability scores are ranked and the clips with the
lowest scores are filtered out. The segments with high memorability scores are com-
bined, and these highly memorable segments are fed into a video captioning model,
which converts them into a textual description, as sketched below.
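A minimal sketch of the ranking and filtering step is given below; predict_memorability and the list of segments are assumed to be provided by the rest of the pipeline, and the keep ratio is illustrative.

def select_memorable_segments(segments, predict_memorability, keep_ratio=0.5):
    # score every segment with the memorability prediction model
    scores = [predict_memorability(seg) for seg in segments]
    # rank segments by score and keep the top fraction
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:max(1, int(len(segments) * keep_ratio))])  # restore temporal order
    return [segments[i] for i in kept]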
Figure 2: Memorability prediction model
The Memento10k dataset, which contains short videos along with their short-term
and long-term memorability scores, is used. Frames are extracted from the videos
and an ensemble model (SVR + Ridge Regressor + Decision Trees) is trained on the
short-term and long-term memorability scores, so that it can predict memorability
scores of the kind shown in Table 1. Then, using Explainable AI (XAI), the outputs
produced by the ensemble model are interpreted, and based on these observations the
ensemble model is fine-tuned to produce more optimized results.
Figure 3: Video Captioning Model
The MSR-VTT dataset is used to train the video captioning model; its videos are of
similar length to those in the Memento10K dataset. Frames are captured from the
videos, and features are extracted using a pretrained CNN (ResNet) to train the video
captioning model. These features are passed to an encoder LSTM, which captures the
temporal order and analyses the sequence of frames. The decoder LSTM utilizes this
contextual information from the encoder to generate the caption sequence for the
input video.
7 DATASET
The dataset used for training the video memorability prediction model is the Me-
mento10K dataset. It consists of videos and their corresponding short-term and
long-term memorability scores. The dataset contains 10,000 short videos, each 6
seconds long.
Video Short-term Memorability Long-term Memorability
Video1.webm 0.75 0.82
Video2.webm 0.68 0.75
Video3.webm 0.80 0.88
video4.webm 0.40 0.43
video5.webm 0.65 0.60
Table 1: Sample Memorability Data
The dataset was collected by asking participants to watch 180 videos for the short-term
memorability task and 120 videos for the long-term memorability task. Participants
were asked to press the space bar whenever they found a video similar to one they
had previously watched. From these responses, the short-term and long-term memo-
rability scores are calculated.
8 EXTRACT FRAMES
Frames are extracted from the video data and fed into the feature extraction model.
Twenty frames are sampled at regular intervals from each video file and passed to the
feature extraction module; a minimal sketch of this step is shown below.
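The sketch below assumes OpenCV is used to read the video files and that frames are resized to the CNN input resolution; the original implementation may differ in these details.

import cv2
import numpy as np

def extract_frames(video_path, num_frames=20, size=(224, 224)):
    # sample num_frames frames at regular intervals across the video
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return np.array(frames)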
9 FEATURE EXTRACTION
9.1 Vggnet16 Feature Extraction
VGG16 consists of 16 weight layers: 13 convolutional layers and 3 fully connected
layers. The convolutional layers use small 3x3 filters with a stride of 1 and zero-
padding to maintain spatial resolution. The depth of the convolutional layers gradually
increases, allowing the network to capture complex hierarchical features.
The VGG16 model is loaded with weights pretrained on ImageNet. Twenty frames
are extracted per video, and each frame is passed to the VGG16 model. The output of
VGG16's final convolutional block is taken as the feature map and passed through a
GlobalAveragePooling2D layer, which converts the learnt patterns into a feature
vector.
Figure 4: VGG16 Architecture
In the architecture shown in Fig. 4, the output of the final MaxPooling layer is extracted
as the feature map. It has shape (1, 7, 7, 512) and is converted into a vector of shape
(1, 512) using GlobalAveragePooling2D.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import GlobalAveragePooling2D
import tensorflow as tf

# load VGG16 pretrained on ImageNet, without the classifier head
base_model = VGG16(include_top=False, weights='imagenet')
base_model.trainable = False

# extract the final convolutional feature map for one frame
features = base_model.predict(tf.expand_dims(frames[0], axis=0))
# pool the (1, 7, 7, 512) map into a (1, 512) feature vector
VGG_feature_vector = GlobalAveragePooling2D()(features)
VGG_feature_vector
Listing 1: VGG16 Feature Extraction
9.2 C3D Feature Extraction
C3D consists of a series of 3D convolutional layers that operate on both spatial and
temporal dimensions. Unlike traditional 2D CNNs that process images frame by
frame, C3D considers the temporal aspect by incorporating 3D convolutions across
consecutive video frames. The architecture typically comprises eight convolutional
layers, five pooling layers, and two fully connected layers.
Figure 5: C3D Architecture
In the architecture shown in Fig. 5, the input video data (a series of frames) is given
to the C3D model and the output of the conv5a block is extracted as a feature map,
which is then pooled into a feature vector.
import tensorflow as tf
from tensorflow.keras.layers import GlobalAveragePooling3D

# Note: tf.keras.applications does not ship a C3D model; a custom or
# third-party Keras C3D implementation (here assumed as build_c3d) is required.
base_model = build_c3d(include_top=False)
base_model.trainable = False

# extract the conv5a feature map for one clip (a stack of frames)
features = base_model.predict(tf.expand_dims(frames, axis=0))
# pool the spatio-temporal feature map into a feature vector
C3D_feature_vector = GlobalAveragePooling3D()(features)
C3D_feature_vector
Listing 2: C3D Feature Extraction
9.3 EfficientNetB0 Feature Extraction
The EfficientNetB0 model is known for its balance between model size and perfor-
mance when capturing high-level visual features from video segments. By utilizing
the pre-trained weights of EfficientNetB0, it is possible to acquire semantically rich
features that are instrumental in achieving an adequate comprehension of the visual
content.
Figure 6: EfficientNetB0 Feature Extraction

The video frames are passed to EfficientNetB0, and features are extracted from in-
termediate layer outputs. These features capture hierarchical representations that en-
able the model to distinguish both global and local visual patterns within each frame.
As shown in Fig. 6, the input frames extracted from the video data are fed into the
EfficientNetB0 model. The features are taken from the layer before the top layer and
pooled into a 1280-dimensional feature vector per frame. These per-frame features
are then combined to obtain a feature vector for the entire video, which is used for
further model training.
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D

# load EfficientNetB0 pretrained on ImageNet, without the classifier head
model = EfficientNetB0(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3))

# extract the feature map for one frame
features = model.predict(tf.expand_dims(frames_20[0], axis=0))
features.shape

# pool the feature map into a 1280-dimensional feature vector
feature_vector = GlobalAveragePooling2D()(features)
feature_vector
Listing 3: EfficientNetB0 Feature Extraction
10 ENSEMBLE MODEL: SUPPORT VECTOR REGRESSOR, DECISION TREE, RIDGE REGRESSOR
The deep learning features are extracted and used as inputs to an ensemble model
that operates in regression mode. The ensemble combines a Support Vector Regressor,
a Ridge Regressor, and Decision Trees into a strong machine learning model that is
well suited to distinguishing patterns and relationships in high-dimensional feature
spaces. By supplying the carefully selected features to the ensemble model, it learns
to map intricate correlations between video content and memorability. The trained
ensemble model is then capable of predicting a memorability score for each video
segment. The ensemble model is illustrated in Figure 7.
Figure 7: Ensemble Model
from sklearn.ensemble import VotingRegressor

# combine the three regressors into a single voting ensemble
ensemble_model = VotingRegressor([
    ('svr', svm_model),
    ('ridge', ridge_reg),
    ('decision_tree', decision_tree_model)
])
ensemble_model.fit(X_train_scaled, y_train.flatten())
Listing 4: Sample code - Ensemble Model Creation
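Once fitted, the ensemble can score unseen segments directly. A minimal usage sketch is shown below, assuming scaler is the same StandardScaler fitted on the training features and X_new holds the features of new segments.

# scale the features of unseen segments with the training-time scaler
X_new_scaled = scaler.transform(X_new)
predicted_scores = ensemble_model.predict(X_new_scaled)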
10.1 Dual Memorability Prediction
Importantly, the model is set up to predict both short-term and long-term memorabil-
ity. This dual prediction increases the system's flexibility and makes it possible to
understand how content is remembered by viewers at different time scales. This
nuanced understanding is especially useful in applications such as video editing,
content recommendation, or the creation of educational materials. A minimal sketch
of the dual set-up is given below.
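The sketch assumes y_train_short and y_train_long hold the two targets; one copy of the ensemble is trained per time scale over the same feature representation.

from sklearn.base import clone

# one ensemble per time scale, sharing the same feature representation
short_term_model = clone(ensemble_model).fit(X_train_scaled, y_train_short)
long_term_model = clone(ensemble_model).fit(X_train_scaled, y_train_long)

short_scores = short_term_model.predict(X_test_scaled)
long_scores = long_term_model.predict(X_test_scaled)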
11 EXPLAINABLE AI
In this research, we incorporated explainable AI methodologies to provide transparency
and interpretability for our regression model. We employed two key techniques: Lo-
cal Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlana-
tions (SHAP).
11.1 LIME Explanation
LIME offers a localized view of our regression model’s predictions, allowing us to
understand the factors contributing to individual predictions. By approximating the
model’s behavior around a specific instance, LIME generates interpretable explana-
tions. For each prediction, LIME fits an interpretable model (such as linear regression)
on perturbed instances near the original input. This approach enables us to discern
the relative importance of different features for a given prediction.
Figure 8: LIME Explanation
We visualized the LIME explanations using the show_in_notebook function within
Jupyter notebooks. This facilitated a detailed examination of feature contributions,
aiding in the interpretation of the model's decision-making process.
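A minimal sketch of this LIME workflow is shown below, assuming X_train_scaled and X_test_scaled are the pooled feature vectors and feature_names labels each feature dimension.

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train_scaled,
    mode='regression',
    feature_names=feature_names)

# explain the prediction for one video's feature vector
explanation = explainer.explain_instance(
    X_test_scaled[0],
    ensemble_model.predict,   # black-box regressor being explained
    num_features=10)
explanation.show_in_notebook(show_table=True)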
11.2 SHAP Explanation
In addition to LIME, we employed SHAP to provide a global perspective on our re-
gression model’s behavior. SHAP values assign each feature an importance score,
indicating its impact on the model’s output across the entire dataset. By computing
SHAP values for all instances, we gained insights into the overall influence of each
feature on the model’s predictions.
The SHAP summary plot, generated using the summary_plot function, offered a
comprehensive overview of feature importances. This visualization allowed us to
identify the most influential features driving the model’s predictions.
Figure 9: SHAP Summary Plot
Furthermore, we explored Individual Conditional Expectation (ICE) plots, which
illustrate the relationship between a single feature and SHAP values across multiple
instances. These plots provided valuable insights into how the model’s predictions
vary with changes in specific feature values.
Figure 10: Individual Conditional Expectation (ICE) Plot
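A minimal sketch of the SHAP analysis is given below; KernelExplainer treats the ensemble as a black box, and the small background sample size is an illustrative choice to keep the computation tractable.

import shap

background = shap.sample(X_train_scaled, 50)
explainer = shap.KernelExplainer(ensemble_model.predict, background)
shap_values = explainer.shap_values(X_test_scaled[:100])

# global view of feature importance across the evaluated instances
shap.summary_plot(shap_values, X_test_scaled[:100], feature_names=feature_names)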
The integration of LIME and SHAP explanations enhanced the transparency and
interpretability of our regression model. These techniques enabled us to gain a deeper
understanding of the model’s decision-making process and identify critical features
driving its predictions. As a result, our research benefits from increased trustworthi-
ness and the ability to make informed decisions based on the model’s outputs.
12 VIDEO CAPTIONING MODEL
The video captioning model aims to generate descriptive textual representations for
videos, facilitating machine understanding of visual content. This section outlines our
proposed video captioning model, which integrates visual feature extraction with se-
quence modeling techniques to achieve accurate and informative caption generation.
12.1 Feature Extraction - ResNet
The initial stage of our model involves extracting rich visual features from input video
frames to capture essential spatial information. We employ a pre-trained convolu-
tional neural network (CNN) as the visual feature extractor. Specifically, we utilize
a ResNet-based CNN architecture to process each frame independently and extract
spatial features that effectively represent the visual content.
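A minimal sketch of this per-frame feature extraction is shown below, assuming 224x224 RGB frames and ResNet-50 with average pooling; the exact ResNet variant is an assumption, not fixed by this work.

import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# 2048-dimensional feature vector per frame, pooled inside the model
resnet = ResNet50(weights='imagenet', include_top=False, pooling='avg')
frame_features = resnet.predict(preprocess_input(frames.astype(np.float32)))
# frame_features has shape (num_frames, 2048) and feeds the encoder LSTM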
12.2 Temporal Encoding with LSTM
Following visual feature extraction, our model incorporates a Long Short-Term Mem-
ory (LSTM) network to encode temporal dependencies and contextual information.
The LSTM sequentially processes the visual features, considering the temporal order
of frames within the video. By capturing temporal dynamics, the LSTM generates
output sequences (or hidden states) that encapsulate the encoded information at each
time step. The final hidden state of the LSTM serves as a summarized representation
of the entire input video, capturing relevant temporal patterns and context.
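A minimal sketch of this encoding step, assuming 20 frames of 2048-dimensional ResNet features and a 512-unit LSTM (both sizes are illustrative):

import tensorflow as tf

encoder = tf.keras.layers.LSTM(512, return_sequences=True, return_state=True)
enc_outputs, final_h, final_c = encoder(frame_features[tf.newaxis, ...])
# enc_outputs: one hidden state per frame (later used by the attention mechanism)
# final_h:     summarized representation of the whole clip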
12.3 Attention Mechanism in the Decoder
In the decoding phase, our model employs an attention mechanism to focus on salient
regions of the input video while generating captions. The decoder LSTM receives the
encoded information from the encoder and utilizes attention weights to dynamically
weight the importance of different visual features at each decoding step. This attention
mechanism allows the model to attend to relevant parts of the video when generating
each word in the caption, enhancing the relevance and coherence of the generated
textual descriptions.
12.4 Caption Generation and Output Layer
The decoder LSTM generates captions word-by-word based on the encoded informa-
tion and attention weights. At each decoding step, the model predicts the next word
in the caption sequence using a softmax layer, which computes the probability dis-
tribution over the vocabulary. During training, the model is trained to minimize the
cross-entropy loss between the predicted word distribution and the ground-truth cap-
tions. During inference, the model generates captions by sampling from the predicted
word distributions, producing coherent and informative textual representations of the
input video.
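The pieces above can be assembled into an end-to-end trainable captioner. The sketch below is illustrative: the layer sizes, vocabulary size, caption length, and the use of Keras' built-in Attention layer are assumptions rather than the exact configuration used in this work.

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_FRAMES, FEAT_DIM = 20, 2048          # assumed frame count / ResNet feature size
VOCAB_SIZE, MAX_CAPTION_LEN = 10000, 30  # assumed vocabulary / caption length

# encoder: per-frame features -> temporal states and clip summary
frame_feats = layers.Input(shape=(MAX_FRAMES, FEAT_DIM))
enc_seq, enc_h, enc_c = layers.LSTM(512, return_sequences=True, return_state=True)(frame_feats)

# decoder: previous words -> next-word distribution, attending over encoder states
caption_in = layers.Input(shape=(MAX_CAPTION_LEN,))
emb = layers.Embedding(VOCAB_SIZE, 256)(caption_in)
dec_seq = layers.LSTM(512, return_sequences=True)(emb, initial_state=[enc_h, enc_c])
context = layers.Attention()([dec_seq, enc_seq])          # query = decoder, value = encoder
word_probs = layers.Dense(VOCAB_SIZE, activation='softmax')(
    layers.Concatenate()([dec_seq, context]))

captioner = Model([frame_feats, caption_in], word_probs)
captioner.compile(optimizer='adam', loss='sparse_categorical_crossentropy')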
13 RESULTS
Table 2 compares the Spearman correlations of the above experiments in order to
identify the most suitable model for memorability prediction.
Table 3 compares the Mean Squared Error of the same experiments.
Experiment Spearman Correlation
Experiment 1 - ANN (10% of data) -0.234
Experiment 2 - ANN (50% of data) -0.027
Experiment 3 - ANN (100% of data) 0.029
Experiment 4 - SVR (50% of data) 0.233
Experiment 5 - SVR (100% of data) 0.345
Experiment 6 - DT (100% of data) 0.168
Experiment 7 - Ensembled Model (100 % of data) 0.312
Table 2: Analysis of Spearman Correlation among Experiments
Experiment Mean Squared Error
Experiment 1 - ANN (10% of data) 0.122
Experiment 2 - ANN (50% of data) 0.519
Experiment 3 - ANN (100% of data) 0.008
Experiment 4 - SVR (50% of data) 0.008
Experiment 5 - SVR (100% of data) 0.006
Experiment 6 - DT (100% of data) 0.011
Experiment 7 - Ensembled Model (100 % of data) 0.006
Table 3: Analysis of Mean Squared Error among Experiments
Table 4 compares the Mean Absolute Error of the above experiments in order to
identify the most suitable model for memorability prediction.
Experiment Mean Absolute Error
Experiment 1 - ANN (10% of data) 0.332
Experiment 2 - ANN (50% of data) 0.277
Experiment 3 - ANN (100% of data) 0.051
Experiment 4 - SVR (50% of data) 0.083
Experiment 5 - SVR (100% of data) 0.006
Experiment 6 - DT (100% of data) 0.079
Experiment 7 - Ensembled Model (100 % of data) 0.064
Table 4: Analysis of Mean Absolute Error among Experiments
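The regression metrics reported in Tables 2-4 can be computed with standard SciPy and scikit-learn calls; a minimal sketch, assuming y_test holds the ground-truth memorability scores:

from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_pred = ensemble_model.predict(X_test_scaled)
rho, _ = spearmanr(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"Spearman: {rho:.3f}  MSE: {mse:.3f}  MAE: {mae:.3f}")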
Table 5 compares the BLEU scores of existing video captioning models with our model.
Models BLEU-1 Score
GVMF [1] 70.5
H-RNN [8] 61.6
Xu et al. [9] 61.7
m-RNN [11] 61.9
ResBRNN [10] 69.1
Our Model 64.8
Table 5: Comparison of BLEU Scores with the existing models
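BLEU-1 compares the unigrams of a generated caption against a reference caption. A minimal sketch using NLTK, with purely illustrative token lists:

from nltk.translate.bleu_score import sentence_bleu

reference = ['a', 'man', 'is', 'playing', 'a', 'guitar']   # illustrative ground-truth tokens
candidate = ['a', 'man', 'plays', 'the', 'guitar']         # illustrative model output tokens
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0))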
14 TIMELINE
• Start
• First Review: Started with Explainable AI for the prediction models
• Second Review: Video Captioning Model
• Third Review: Completed Video Captioning Model with Explainable AI
• End
References
[1] Lu, Y. and Wu, X., Video storytelling based on gated video memorability filtering,
Electronics Letters 2022, 58(15), pp.576-578.
[2] Dumont, T., Hevia, J.S. and Fosco, C.L., Modular Memorability: Tiered Representa-
tions for Video Memorability Prediction, In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition 2023 (pp. 10751-10760).
[3] Li, J., Guo, X., Yue, F., Xue, F. and Sun, J., Adaptive Multi-Modal Ensemble Network
for Video Memorability Prediction, Applied Sciences 2022, 12(17), p.8599.
[4] Guinaudeau, C. and Xalabarder, A.G., Textual Analysis for Video Memorability Pre-
diction, In Working Notes Proceedings of the MediaEval 2022 Workshop.
[5] Lin, J., Zhong, S.H. and Fares, A., Deep hierarchical LSTM networks with attention
for video summarization, Computers and Electrical Engineering 2022, 97, p.107618.
[6] Liu, T., Meng, Q., Huang, J.J., Vlontzos, A., Rueckert, D. and Kainz, B., Video
summarization through reinforcement learning with a 3D spatio-temporal u-net., IEEE
Transactions on Image Processing 2022, 31, pp.1573-1586.
[7] Cummins, S., Sweeney, L. and Smeaton, A., Analysing the Memorability of a Proce-
dural Crime-Drama TV Series, CSI, In Proceedings of the 19th International Con-
ference on Content-based Multimedia Indexing 2022 (pp. 174-180).
[8] Yu, L., Bansal, M. and Berg, T.L., 2017. Hierarchically-attentive rnn for album
summarization and storytelling. arXiv preprint arXiv:1708.02977.
[9] Xu, R., Xiong, C., Chen, W. and Corso, J., 2015, February. Jointly modeling deep
video and compositional text to bridge vision and language in a unified frame-
work. In Proceedings of the AAAI conference on artificial intelligence (Vol. 29,
No. 1).
[10] Li, J., Wong, Y., Zhao, Q. and Kankanhalli, M.S., 2019. Video storytelling: Textual
summaries for events. IEEE Transactions on Multimedia, 22(2), pp.554-565.
[11] Mao, J., Xu, W., Yang, Y., Wang, J. and Yuille, A.L., 2015. Deep caption-
ing with multimodal recurrent neural networks (m-rnn). ICLR. arXiv preprint
arXiv:1412.6632.