BRZ911/ViTCoT


ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models


📷 This is the code repository for the paper: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models. ACM MM 2025.

Overview: (a) Vanilla Text CoT Reasoning; (b) Video-Text Interleaved CoT Reasoning; (c) Video-Text Interleaved Data Construction; (d) Performance Comparison: Vanilla Reasoning Paradigm (Vanilla CoT, Vanilla Desp-CoT, and Vanilla Plan-and-Solve) vs. Video-Text Interleaved Reasoning Paradigm (ViT CoT, ViT Desp-CoT and ViT Plan-and-Solve) on Qwen2.5-VL-7B.

Preparation steps

(1) Environment installation command:

pip install -r requirements.txt

(2) Please fill in the API information in the files src/ViTCoT_stage1 and src/ViTCoT_stage2:

API_KEYS = []
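As a sketch of what the filled-in variable might look like, assuming the list holds one or more key strings (the values below are placeholders, not real credentials; the plural name suggests multiple keys may be listed, e.g. for rotation):

```python
# src/ViTCoT_stage1 / src/ViTCoT_stage2 -- placeholder values only
API_KEYS = [
    "YOUR_GEMINI_API_KEY_1",
    "YOUR_GEMINI_API_KEY_2",  # additional keys, if the scripts rotate them
]
```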

(3) Download the datasets 🤗 all_video.zip and 🤗 key_video.zip and unzip them into the src folder.
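If you prefer to script step (3), a minimal sketch using the standard-library zipfile module (a tiny stand-in archive with a hypothetical clip name is created here so the snippet runs end-to-end; in practice the two zips come from the Hugging Face links above):

```python
import zipfile
from pathlib import Path

# Stand-in archive so this sketch is self-contained; replace with the real
# all_video.zip / key_video.zip downloaded from Hugging Face.
Path("src").mkdir(exist_ok=True)
with zipfile.ZipFile("all_video.zip", "w") as zf:
    zf.writestr("all_video/clip_000.mp4", b"placeholder")

# Extract every downloaded archive into the src folder, as the repo expects.
for archive in ("all_video.zip", "key_video.zip"):
    if Path(archive).exists():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall("src")
```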

💻 To get the performance results for Gemini-2.0-Flash, run the following commands:

cd src
bash run.sh

💯 Model Performance

💬 Contact

Please open a GitHub issue or email Yongheng Zhang or Libo Qin if you have any questions or suggestions.
