ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
This is the code repository for the paper: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models. ACM MM 2025.
Overview: (a) Vanilla Text CoT Reasoning; (b) Video-Text Interleaved CoT Reasoning; (c) Video-Text Interleaved Data Construction; (d) Performance Comparison: Vanilla Reasoning Paradigm (Vanilla CoT, Vanilla Desp-CoT, and Vanilla Plan-and-Solve) vs. Video-Text Interleaved Reasoning Paradigm (ViT CoT, ViT Desp-CoT, and ViT Plan-and-Solve) on Qwen2.5-VL-7B.
(1) Install the environment:

```shell
pip install -r requirements.txt
```

(2) Fill in your API information in the files `src/ViTCoT_stage1` and `src/ViTCoT_stage2`:
```python
API_KEYS = []
```

(3) Download the datasets 🤗 all_video.zip and 🤗 key_video.zip and unzip them into the `src` folder.
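In step (2), `API_KEYS` is a list, which suggests that several keys can be supplied and rotated across requests. The sketch below is a hypothetical illustration of such round-robin rotation, not the repository's actual implementation; the key strings and helper name are placeholders.

```python
import itertools

# Hypothetical example: API_KEYS holds one or more API key strings.
# These are placeholder values, not real keys.
API_KEYS = ["sk-key-1", "sk-key-2"]

# Cycle through the keys so repeated requests spread load across them.
_key_cycle = itertools.cycle(API_KEYS)

def next_key() -> str:
    """Return the next key in round-robin order."""
    return next(_key_cycle)
```

Each call to `next_key()` advances the cycle, wrapping back to the first key after the last.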
(4) Run:

```shell
cd src
bash run.sh
```

Please create GitHub issues here or email Yongheng Zhang or Libo Qin if you have any questions or suggestions.