Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian
With the rapid growth of video content on social media platforms, video summarization has become a crucial task in multimedia processing. However, existing methods struggle to capture global dependencies in video content and to accommodate multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward, and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an inner graph relationship reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones by computing the cosine similarity between nodes, thereby intelligently filtering and aggregating information from adjacent frames. Additionally, our proposed language-guided cross-modal embedding module integrates user-provided language instructions into video sequences, generating personalized video summaries that align with specific textual descriptions. To resolve the one-to-many mapping problem in video summarization, we model the output of the summary generation process as a mixture of Bernoulli distributions and solve it with the EM algorithm, accommodating the diverse annotation strategies employed by different annotators for the same video. Finally, we introduce a bi-threshold cross-entropy loss function to handle the varying annotations provided by different annotators for the same video. Experimental results show that our method outperforms existing approaches across multiple benchmarks and particularly excels at multimodal tasks. Moreover, the proposed LGRLN is well suited to real-world applications, as it reduces inference time and model parameters by 87.8% and 91.7%, respectively.
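To make the dual-threshold graph convolution idea concrete, here is a minimal sketch of how cosine similarity between frame embeddings could gate which neighbors are aggregated. It is an illustration only, not the LGRLN implementation: the function name `dual_threshold_adjacency`, the threshold values, and the down-weighting of the middle band are all assumptions.

```python
import torch
import torch.nn.functional as F

def dual_threshold_adjacency(frame_feats, t_high=0.8, t_low=0.3):
    """Illustrative sketch: build a weighted adjacency matrix from frame features.

    Pairs whose cosine similarity exceeds t_high are treated as semantically
    relevant and kept with their similarity as edge weight; pairs below t_low
    are treated as irrelevant and dropped; the band in between is down-weighted.
    Thresholds and the soft middle band are assumptions, not the paper's values.
    """
    x = F.normalize(frame_feats, dim=-1)          # (N, D) unit-norm embeddings
    sim = x @ x.t()                               # (N, N) cosine similarities
    adj = torch.zeros_like(sim)
    adj[sim >= t_high] = sim[sim >= t_high]       # relevant frames: keep weight
    mid = (sim < t_high) & (sim > t_low)
    adj[mid] = 0.5 * sim[mid]                     # uncertain frames: down-weight
    return adj                                    # frames below t_low stay at 0

# Example: aggregate neighbor information for 8 frames with 16-dim features
feats = torch.randn(8, 16)
adj = dual_threshold_adjacency(feats)
pooled = adj @ feats                              # simple neighborhood aggregation
```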
- Download LGRLN using git: `git clone https://github.com/liwrui/LGRLN.git`
- Download datasets to `./data/`:
  - Download TVSum and SumMe from TVSum & SumMe.
  - Download VideoXum from VideoXum.
- If you download pretrained model parameters, put them in `./results/`.
- The final directory structure should be as follows (a quick sanity check for the annotation files is sketched after the tree):
    |-- LGRLN
        |-- configs
            |-- SumMe
                |-- SPELL_default.yaml
            |-- TVSum
                |-- SPELL_default.yaml
            |-- VideoXum
                |-- SPELL_default.yaml
        |-- data
            |-- annotations
                |-- SumMe
                    |-- eccv16_dataset_summe_google_pool5.h5
                |-- TVSum
                    |-- eccv16_dataset_tvsum_google_pool5.h5
                |-- videoxum
                    |-- blip
                        |-- test_videoxum.json
                        |-- train_videoxum.json
                        |-- val_videoxum.json
                    |-- vt_clipscore
            |-- graphs
            |-- generate_temporal_graphs.py
        |-- gravit
        |-- results
            |-- SPELL_VS_SumMe_default
            |-- SPELL_VS_TVSum_default
            |-- SPELL_VS_VideoXum_default
        |-- tools
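Before generating graphs, you can quickly check that the annotation files are readable. The snippet below is only a sanity check and makes no assumptions beyond the file path shown in the tree above.

```python
import h5py

# Quick sanity check that the SumMe annotation file is in place and readable
# (the path follows the directory layout above)
path = "data/annotations/SumMe/eccv16_dataset_summe_google_pool5.h5"
with h5py.File(path, "r") as f:
    print(list(f.keys()))  # one entry per video in the standard eccv16 files
```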
- Split TVSum and SumMe and generate graphs offline by running (the sketch after these commands illustrates the role of `--tauf` and `--skip_factor`):
  - `python data/generate_temporal_graphs.py --dataset SumMe --features eccv16_dataset_summe_google_pool5 --tauf 10 --skip_factor 0`
  - `python data/generate_temporal_graphs.py --dataset TVSum --features eccv16_dataset_tvsum_google_pool5 --tauf 5 --skip_factor 0`
- VideoXum does not need splitting; its dataloader generates graphs online.
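As an illustration of what the `--tauf` and `--skip_factor` flags control, a temporal graph of this kind connects each frame to its neighbors within a fixed temporal window. The sketch below is not the code in `data/generate_temporal_graphs.py`; the function name and edge format are assumptions.

```python
import numpy as np

def temporal_edges(num_frames, tauf, skip_factor=0):
    """Illustrative sketch: connect each frame to neighbors within `tauf` steps.

    A non-zero skip_factor is shown here only as sparse extra long-range edges
    every `skip_factor` frames. This mirrors the command-line flags above but
    is not the repository's implementation.
    """
    edges = []
    for i in range(num_frames):
        for j in range(i + 1, min(i + tauf + 1, num_frames)):
            edges.append((i, j))                      # local temporal edge
        if skip_factor > 0 and i + skip_factor < num_frames:
            edges.append((i, i + skip_factor))        # optional skip connection
    return np.asarray(edges).T                        # shape (2, num_edges)

print(temporal_edges(num_frames=6, tauf=2).shape)     # (2, 9)
```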
- Training on SumMe: `python tools/train.py --cfg configs/SumMe/SPELL_default.yaml --split 4`
- Training on TVSum: `python tools/train.py --cfg configs/TVSum/SPELL_default.yaml --split 4`
- Training on VideoXum: `python tools/train_videoxum.py --cfg configs/VideoXum/SPELL_default.yaml --split 4`
- Evaluation on SumMe: `python tools/eval.py --exp_name SPELL_VS_SumMe_default --eval_type VS_max --split 4`
- Evaluation on TVSum: `python tools/eval.py --exp_name SPELL_VS_TVSum_default --eval_type VS_avg --split 4`
- Evaluation on VideoXum: `python tools/eval_videoxum.py --exp_name SPELL_VS_VideoXum_default --eval_type VS_avg --split 4`
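The `--eval_type` flag selects how F1 is aggregated over the multiple reference summaries of each video: `VS_max` keeps the best-matching annotator (the usual protocol for SumMe), while `VS_avg` averages over annotators (the usual protocol for TVSum). Below is a minimal sketch of that aggregation, assuming binary keyframe vectors; it is not the repository's evaluation code.

```python
import numpy as np

def f1_score(pred, ref):
    """F1 between two binary keyframe selection vectors of equal length."""
    overlap = np.sum(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / np.sum(pred)
    recall = overlap / np.sum(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(pred, refs, eval_type="VS_avg"):
    """Aggregate F1 over annotators: best match (VS_max) or mean (VS_avg)."""
    scores = [f1_score(pred, r) for r in refs]
    return max(scores) if eval_type == "VS_max" else float(np.mean(scores))

# Toy example: one predicted summary against two annotators
pred = np.array([1, 0, 1, 0, 1], dtype=bool)
refs = [np.array([1, 0, 0, 0, 1], dtype=bool),
        np.array([0, 1, 1, 0, 1], dtype=bool)]
print(evaluate(pred, refs, "VS_max"), evaluate(pred, refs, "VS_avg"))
```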
- The final evaluation results will be close to:

| Dataset  | F1   | tau  | rho  |
|----------|------|------|------|
| SumMe    | 54.7 | 0.14 | 0.19 |
| TVSum    | 58.3 | 0.30 | 0.43 |
| VideoXum | 32.1 | 0.19 | 0.26 |

- A clear advantage of LGRLN is its small parameter count:
| Model   | Parameters (M) | Total (MB) |
|---------|----------------|------------|
| PGL-SUM | 36.02          | 55.17      |
| A2Summ  | 9.60           | 50.56      |
| LGRLN   | 2.97           | 13.96      |
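If you want to reproduce parameter counts like those above for your own checkpoints, a standard PyTorch count works; the toy model below is a placeholder, not the LGRLN architecture.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Placeholder model just to show the call; swap in the actual model instance
toy = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))
print(f"{count_parameters(toy):.2f} M parameters")
```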
- You can download checkpoints from:

| URL | Password |
|-----|----------|
| https://pan.baidu.com/s/1vRphKVBYuIxBzyg5xoGn8w | tagc |