Language-Guided Graph Representation Learning for Video Summarization

🔖 Introduction

With the rapid growth of video content on social media platforms, video summarization has become a crucial task in multimedia processing. However, existing methods struggle to capture global dependencies in video content and to accommodate multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward, and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an inner-graph relationship reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones by computing the cosine similarity between nodes, thereby intelligently filtering and aggregating information from adjacent frames. In addition, our language-guided cross-modal embedding module integrates user-provided language instructions into video sequences, generating personalized video summaries that align with specific textual descriptions. To resolve the one-to-many mapping problem in video summarization, we model the output of the summary generation process as a mixture Bernoulli distribution and solve it with the EM algorithm, accommodating the diverse annotation strategies employed by different annotators for the same video. Finally, we introduce a bi-threshold cross-entropy loss function to handle varying annotations from different annotators. Experimental results show that our method outperforms existing approaches across multiple benchmarks and excels in particular at multimodal tasks. Moreover, LGRLN is well suited to real-world applications, as it reduces inference time and model parameters by 87.8% and 91.7%, respectively.
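To make the dual-threshold graph reasoning concrete, below is a minimal sketch of cosine-similarity edge filtering with two thresholds. The threshold values and the helper names (`dual_threshold_adjacency`, `graph_conv`) are illustrative assumptions, not the repository's implementation:

```python
# Minimal sketch of dual-threshold edge filtering by cosine similarity.
# Threshold values and names are illustrative, not the paper's exact code.
import torch
import torch.nn.functional as F

def dual_threshold_adjacency(frame_feats, tau_low=0.3, tau_high=0.7):
    """frame_feats: (N, D) frame embeddings -> (N, N) filtered adjacency."""
    x = F.normalize(frame_feats, dim=-1)      # unit norm: dot product = cosine similarity
    sim = x @ x.t()                           # (N, N) pairwise cosine similarity
    adj = torch.zeros_like(sim)
    adj[sim >= tau_high] = 1.0                # strongly related frames keep full weight
    mid = (sim >= tau_low) & (sim < tau_high)
    adj[mid] = sim[mid]                       # weakly related frames are similarity-weighted
    return adj                                # frames below tau_low are treated as irrelevant

def graph_conv(frame_feats, adj, weight):
    """One plain graph-convolution step: mean-aggregate neighbours, then project."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    agg = (adj @ frame_feats) / deg           # mean aggregation over kept neighbours
    return agg @ weight                       # linear projection
```

The one-to-many mapping between a video and its annotators can likewise be illustrated with a generic EM algorithm for a mixture of Bernoulli distributions over binary frame labels. This is a textbook Bernoulli-mixture EM sketch under the assumption of binary per-frame annotations, not the paper's exact formulation:

```python
# Generic EM for a mixture of Bernoulli distributions over binary frame labels.
# The per-annotator setup and variable names are illustrative only.
import numpy as np

def bernoulli_mixture_em(labels, n_components=2, n_iter=50, eps=1e-8):
    """labels: (A, N) binary annotations from A annotators over N frames."""
    A, N = labels.shape
    rng = np.random.default_rng(0)
    pi = np.full(n_components, 1.0 / n_components)        # mixing weights
    mu = rng.uniform(0.25, 0.75, size=(n_components, N))   # per-frame Bernoulli means

    for _ in range(n_iter):
        # E-step: responsibility of each component for each annotator (log space)
        log_p = (labels[:, None, :] * np.log(mu + eps)
                 + (1 - labels[:, None, :]) * np.log(1 - mu + eps)).sum(-1)  # (A, K)
        log_r = np.log(pi + eps) + log_p
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixing weights and per-frame means
        nk = resp.sum(axis=0) + eps
        pi = nk / A
        mu = (resp.T @ labels) / nk[:, None]
    return pi, mu
```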

Experiments

📑 Download

  • download LGRLN using git
    git clone https://github.com/liwrui/LGRLN.git
    
  • download datasets to ./data/
  • If you download pretrained model parameters, put them in ./results/
  • the final directory structure should be:
    |-- LGRLN
      |-- configs
        |-- SumMe
          |-- SPELL_default.yaml
        |-- TVSum
          |-- SPELL_default.yaml
        |-- VideoXum
          |-- SPELL_default.yaml
      |-- data
        |-- annotations
          |-- SumMe
            |-- eccv16_dataset_summe_google_pool5.h5
          |-- TVSum
            |-- eccv16_dataset_tvsum_google_pool5.h5
          |-- videoxum
            |-- blip
            |-- test_videoxum.json
            |-- train_videoxum.json
            |-- val_videoxum.json
            |-- vt_clipscore
        |-- graphs
        |-- generate_temporal_graphs.py
      |-- gravit
      |-- results
        |-- SPELL_VS_SumMe_default
        |-- SPELL_VS_TVSum_default
    |-- SPELL_VS_VideoXum_default
      |-- tools
    

🛠️ Preprocessing

  • split TVSum and SumMe and generate graphs offline by running:
      python data/generate_temporal_graphs.py --dataset SumMe --features eccv16_dataset_summe_google_pool5 --tauf 10 --skip_factor 0
    
      python data/generate_temporal_graphs.py --dataset TVSum --features eccv16_dataset_tvsum_google_pool5 --tauf 5 --skip_factor 0
    
  • VideoXum does not need splitting; its dataloader generates graphs online (a rough sketch of the temporal-graph idea follows this list)
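A rough illustration of the temporal-graph idea behind data/generate_temporal_graphs.py: each frame becomes a node, and edges connect frames whose temporal distance is at most `tauf`; the forward and backward variants differ only in edge direction. The function below is an assumption-level sketch, not the script's actual code:

```python
# Illustrative temporal-graph construction: nodes are frames, edges connect
# frames within a temporal distance of tauf. Not the repository's exact logic.
import numpy as np

def temporal_edges(num_frames, tauf):
    src, dst = [], []
    for i in range(num_frames):
        for j in range(max(0, i - tauf), min(num_frames, i + tauf + 1)):
            if i != j:
                src.append(i)
                dst.append(j)
    return np.array([src, dst])      # (2, E) edge index

edge_index = temporal_edges(num_frames=320, tauf=10)
forward_edges = edge_index[:, edge_index[0] < edge_index[1]]    # i -> later frame j
backward_edges = edge_index[:, edge_index[0] > edge_index[1]]   # i -> earlier frame j
```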

🚀 Training

  • training on SumMe
      python tools/train.py --cfg configs/SumMe/SPELL_default.yaml --split 4
    
  • training on TVSum
      python tools/train.py --cfg configs/TVSum/SPELL_default.yaml --split 4
    
  • training on VideoXum
      python tools/train_videoxum.py --cfg configs/VideoXum/SPELL_default.yaml --split 4
    

👀 Evaluation

  • evaluation on SumMe
    python tools/eval.py --exp_name SPELL_VS_SumMe_default --eval_type VS_max --split 4
    
  • evaluation on TVSum
    python tools/eval.py --exp_name SPELL_VS_TVSum_default --eval_type VS_avg --split 4
    
  • evaluation on VideoXum
    python tools/eval_videoxum.py --exp_name SPELL_VS_VideoXum_default --eval_type VS_avg --split 4
    
  • the final evaluation results will be close to the table below (a sketch for computing the rank correlations follows this list):

    | Dataset  | F1   | Kendall's τ | Spearman's ρ |
    |----------|------|-------------|--------------|
    | SumMe    | 54.7 | 0.14        | 0.19         |
    | TVSum    | 58.3 | 0.30        | 0.43         |
    | VideoXum | 32.1 | 0.19        | 0.26         |
  • A key advantage of LGRLN is its small parameter count:

    | Model   | Parameters (M) | Total size (MB) |
    |---------|----------------|-----------------|
    | PGL-SUM | 36.02          | 55.17           |
    | A2Summ  | 9.60           | 50.56           |
    | LGRLN   | 2.97           | 13.96           |
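The τ and ρ columns above are rank correlations between predicted and annotated frame-importance scores. They can be computed with SciPy; the score arrays below are placeholders, not objects from this repository:

```python
# Kendall's tau and Spearman's rho between predicted and reference frame scores.
# pred_scores / gt_scores are placeholder arrays standing in for real outputs.
import numpy as np
from scipy.stats import kendalltau, spearmanr

pred_scores = np.random.rand(320)    # model's predicted per-frame importance (placeholder)
gt_scores = np.random.rand(320)      # annotator's reference importance (placeholder)

tau, _ = kendalltau(pred_scores, gt_scores)
rho, _ = spearmanr(pred_scores, gt_scores)
print(f"tau={tau:.2f}, rho={rho:.2f}")
```

Similarly, the parameter-count and size figures can be checked for any PyTorch model; `param_stats` below is a generic helper, and the Linear layer is only a stand-in for an actual model:

```python
# Count trainable parameters (millions) and approximate size in MB of a PyTorch model.
import torch.nn as nn

def param_stats(model: nn.Module):
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / (1024 ** 2)
    return n_params / 1e6, size_mb

# example with a throwaway module; replace with the model under test
params_m, size_mb = param_stats(nn.Linear(1024, 256))
print(f"{params_m:.2f} M parameters, {size_mb:.2f} MB")
```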

📦 Model Zoo
