
Video-LLM

🔥🔥🔥 Do Language Models Understand Time? 🤔

Do language models understand time? 🧐 In the kitchen arena 🧑‍🍳, where burritos are rolled 🌯, rice waits patiently 🍚, and sauce steals the spotlight, LLMs try their best to keep up. Captions flow like a recipe, precise and tempting, but can they truly tell the difference between prepping, cooking, and eating? After all, in cooking, timing isn't just everything; it's the secret sauce! 🥳🥳🥳

👋👋👋 A collection of papers and resources related to Large Language Models in the video domain 🎞️.

📌 For more details, please refer to our paper.

🛠️ Please let us know if you find a mistake or have any suggestions by e-mail: [email protected]

📑 Citation


If you find our work useful for your research, please cite the following paper:

@inproceedings{10.1145/3701716.3717744,
author = {Ding, Xi and Wang, Lei},
title = {Do Language Models Understand Time?},
year = {2025},
isbn = {9798400713316},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3701716.3717744},
doi = {10.1145/3701716.3717744},
pages = {1855--1868},
numpages = {14},
keywords = {interaction, large language models, temporal, videos},
location = {Sydney NSW, Australia},
series = {WWW '25}
}

🚀 News

  • [10/02/2025] 🎁 The GitHub repository for our paper has been released.
  • [27/01/2025] 🎈 Our paper has been accepted for oral presentation in the Companion Proceedings of The Web Conference 2025 (WWW 2025).



Performance comparison of visual encoders. (Left): Image classification accuracy for various image encoders pretrained and fine-tuned on the ImageNet-1K dataset. (Right): Action recognition accuracy for different video encoders pretrained and fine-tuned on the Kinetics-400 and Something-Something V2 datasets.

📸 Models with Image Encoder

✨ The tables below present summaries of the latest multimodal video-LLMs with image encoders and their interaction and fusion mechanisms.
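
To make the recurring "projection layer" mechanism in these tables concrete, the hypothetical PyTorch sketch below shows the common pipeline in which a frozen image encoder produces one feature per sampled frame and a small projector maps those features into the LLM's embedding space (a single linear layer or an MLP, as listed for several models below). All module names, dimensions, and frame counts are illustrative assumptions rather than details taken from any specific paper.

```python
# Minimal sketch of projection-layer fusion (illustrative only; dimensions and
# the frame features are assumptions, not taken from a listed paper).
import torch
import torch.nn as nn

class FrameProjector(nn.Module):
    """Maps per-frame image-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # A single nn.Linear or a small MLP is the typical choice.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, vision_dim), one pooled feature per frame.
        return self.proj(frame_feats)  # -> (batch, num_frames, llm_dim)

# Dummy usage: 8 frames per video, already encoded by a frozen image encoder.
frame_feats = torch.randn(2, 8, 768)           # placeholder for encoder outputs
visual_tokens = FrameProjector()(frame_feats)  # later combined with text tokens
print(visual_tokens.shape)                     # torch.Size([2, 8, 4096])
```

In this style of design, the projected visual tokens are simply prepended to or interleaved with the text embeddings before being fed to the language model.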

Normalizer-Free ResNet

Table 1

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| Flamingo | NeurIPS 2022 | Text: Chinchilla | Perceiver Resampler & Gated XATTN-DENSE | Visual-language model. | GitHub |

CLIP ViT

Table 2

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| mPLUG-2 | ICML 2023 | Text: BERT | Universal layers & cross-attention modules | Modularized multi-modal foundation model. | GitHub |
| Vid2Seq | CVPR 2023 | Text: T5-Base | Cross-modal attention | Sequence-to-sequence video-language model. | GitHub |
| Video-LLaMA | EMNLP 2023 | Text: Vicuna, Audio: ImageBind | Aligned via Q-Formers for video and audio | Instruction-tuned multimodal model. | GitHub |
| Video-ChatGPT | ACL 2023 | Text: Vicuna-v1.1 | Spatiotemporal features projected via linear layer | Integration of vision and language for video understanding. | GitHub |
| Valley | arXiv 2023 | Text: StableVicuna | Projection layer | LLM for video assistant tasks. | GitHub |
| Macaw-LLM | arXiv 2023 | Text: LLAMA-7B, Audio: Whisper | Alignment module unifies multi-modal representations | Multimodal integration using image, audio, and video inputs. | GitHub |
| Auto-AD II | CVPR 2023 | Text: BERT | Cross-attention layers | Movie description using vision and language. | GitHub |
| GPT4Video | ACMMM 2023 | Text: LLaMA 2 | Transformer-based cross-attention layer | Video understanding with LLM-based reasoning. | - |
| LLaMA-VID | ECCV 2024 | Text: Vicuna | Context attention and linear projector | LLaMA-VID for visual-textual alignment in video. | GitHub |
| COSMO | arXiv 2024 | Text: OPT-IML/RedPajama/Mistral | Gated cross-attention | Contrastive-streamlined multimodal model. | - |
| VTimeLLM | CVPR 2024 | Text: Vicuna | Linear layer | Temporal video understanding enhanced with LLMs. | GitHub |
| VILA | CVPR 2024 | Text: LLaMA-2-7B/13B | Linear layer | Vision-language model. | GitHub |
| PLLaVA | arXiv 2024 | Text: LLAMA-7B | MM projector with adaptive pooling | Parameter-free extension for video captioning tasks. | GitHub |
| V2Xum-LLaMA | arXiv 2024 | Text: LLaMA 2 | Vision adapter | Video summarization using temporal prompt tuning. | GitHub |
| VideoGPT+ | arXiv 2024 | Text: Phi-3-Mini-3.8B | MLP | Enhanced video understanding. | GitHub |
| EmoLLM | arXiv 2024 | Text: Vicuna-v1.5, Audio: Whisper | Multi-perspective visual projection | Multimodal emotional understanding with improved reasoning. | GitHub |
| ShareGPT4Video | arXiv 2024 | Text: Mistral-7B-Instruct-v0.2 | MLP | Precise and detailed video captions with hierarchical prompts. | GitHub |
| VideoLLaMA 2 | arXiv 2024 | Text: LLAMA 1.5, Audio: BEATs | Spatial-Temporal Convolution (STC) connector | Advancing spatial-temporal modeling and audio understanding. | GitHub |
| VideoLLM-online | CVPR 2024 | Text: Llama-2-Chat/Llama-3-Instruct | MLP projector | Online video large language model for streaming video. | GitHub |
| LongVA | arXiv 2024 | Text: Qwen2-Extended | MLP | Long-context video understanding. | GitHub |
| InternLM-XComposer-2.5 | arXiv 2024 | Text: InternLM2-7B, Audio: Whisper | MLP | Long-context LVLM supporting ultra-high-resolution video tasks. | GitHub |
| Qwen2-VL | arXiv 2024 | Text: Qwen2-7B | Cross-attention modules | Vision-language model for multimodal tasks. | GitHub |
| Video-XL | arXiv 2024 | Text: Qwen2-7B | Visual-language projector | Long-context video understanding model. | GitHub |
| SlowFocus | NeurIPS 2024 | Text: Vicuna-7B v1.5 | Visual adapter (projector layer) | Fine-grained temporal understanding in video LLM. | GitHub |
| VideoStudio | ECCV 2024 | Text: CLIP ViT-H/14 | Cross-attention modules | Multi-scene video generation. | GitHub |
| VideoINSTA | arXiv 2024 | Text: Llama-3-8B-Instruct | Self-reflective spatial-temporal fusion | Zero-shot long video understanding model. | GitHub |
| TRACE | arXiv 2024 | Text: Mistral-7B | Task-interleaved sequence modeling & Adaptive head-switching | Video temporal grounding via causal event modeling. | GitHub |

EVA-CLIP ViT

Table 3

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| VideoChat | arXiv 2023 | Text: StableVicuna, Audio: Whisper | Q-Former bridges visual features to LLMs for reasoning | Chat-centric model for video analysis. | GitHub |
| VAST | NeurIPS 2023 | Text: BERT, Audio: BEATs | Cross-attention layers | Omni-modality foundational model. | GitHub |
| VTG-LLM | arXiv 2024 | Text: LLaMA-2-7B | Projection layer | Enhanced video temporal grounding. | GitHub |
| AutoAD III | CVPR 2024 | Text: GPT-3.5-turbo | Shared Q-Former | Video description enhancement with LLMs. | GitHub |
| MA-LMM | CVPR 2024 | Text: Vicuna | A trainable Q-Former | Memory-augmented large multimodal model. | GitHub |
| MiniGPT4-Video | arXiv 2024 | Text: LLaMA 2 | Concatenates visual tokens and projects into LLM space | Video understanding with visual-textual token interleaving. | GitHub |
| Vriptor | arXiv 2024 | Text: ST-LLM, Audio: Whisper | Scene-level sequential alignment | Vriptor for dense video captioning. | GitHub |
| Kangaroo | arXiv 2024 | Text: Llama-3-8B-Instruct | Multi-modal projector | Video-language model supporting long-context video input. | - |

BLIP-2 ViT

Table 4

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| LAVAD | CVPR 2024 | Text: Llama-2-13b-chat | Converts video features into textual prompts for LLMs | Training-free video anomaly detection using LLMs. | GitHub |

SigLIP

Table 5

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| Video-CCAM | arXiv 2024 | Text: Phi-3-4k-instruct / Yi-1.5-9B-Chat | Cross-attention-based projector | Causal cross-attention masks for short and long videos. | GitHub |
| Apollo | arXiv 2024 | Text: Qwen2.5-7B | Perceiver Resampler & Token Integration with Timestamps | Video understanding model. | - |

Oryx ViT

Table 6

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| Oryx | arXiv 2024 | Text: Qwen2-7B/32B | Cross-attention | Spatial-temporal model for high-resolution understanding. | GitHub |

🎥 Models with Video Encoder

✨ The tables below present summaries of the latest multimodal video-LLMs with video encoders and their interaction and fusion mechanisms.
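
Compared with per-frame image encoders, a video encoder consumes whole clips and emits a much larger grid of spatio-temporal tokens, so the connector often has to compress them before projection into the LLM space. The snippet below is a minimal, hypothetical sketch of such a pooling connector; the class name, token counts, and dimensions are assumptions for illustration, not the design of any specific model listed here.

```python
# Minimal sketch of a token-compressing connector for a video encoder
# (illustrative only; names, token counts, and dimensions are assumptions).
import torch
import torch.nn as nn

class PoolingConnector(nn.Module):
    """Compresses spatio-temporal tokens, then projects them into the LLM space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, tokens_out: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(tokens_out)  # shrink the token axis
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, clip_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (batch, num_spatiotemporal_tokens, vision_dim) from a video encoder.
        x = self.pool(clip_tokens.transpose(1, 2)).transpose(1, 2)  # (batch, tokens_out, vision_dim)
        return self.proj(x)                                         # (batch, tokens_out, llm_dim)

clip_tokens = torch.randn(2, 1568, 1024)      # e.g., 8 frames x 14x14 patches (assumed)
print(PoolingConnector()(clip_tokens).shape)  # torch.Size([2, 64, 4096])
```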

Traditional (e.g., I3D, SlowFast)

Table 7

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| VideoLLM | arXiv 2023 | Text: e.g., BERT, T5 | Semantic translator aligns visual and text encodings | Video sequence modeling using LLMs. | GitHub |
| Loong | arXiv 2024 | Text: Standard text tokenizer | Decoder-only autoregressive LLM with causal attention | Autoregressive LLM for long video generation. | - |

TimeSformer

Table 8

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| LaViLa | CVPR 2022 | Text: 12-layer Transformer | Cross-attention modules | Large-scale language model. | GitHub |
| Video ReCap | CVPR 2024 | Text: GPT-2 | Cross-attention layers | Recursive hierarchical captioning model. | GitHub |

VideoSwin

Table 9

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| OmniViD | CVPR 2024 | Text: BART | MQ-Former | Generative model for universal video understanding. | GitHub |

UMT

Table 10

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| VideoChat2 | CVPR 2024 | Text: Vicuna | Linear projection | A comprehensive multi-modal video understanding benchmark. | GitHub |

LanguageBind

Table 11

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| Video-LLaVA | arXiv 2023 | Text: Vicuna v1.5 | MLP projection layer | Unified visual representation learning for video. | GitHub |
| MotionLLM | arXiv 2024 | Text: Vicuna | Motion / Video translator | Understanding human behaviors from human motions and videos. | GitHub |
| Holmes-VAD | arXiv 2024 | Text: LLaMA3-Instruct-70B | Temporal sampler | Multimodal LLM for video anomaly detection. | GitHub |

VideoMAE V2

Table 12

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| InternVideo2 | ECCV 2024 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |

InternVL

Table 13

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| InternVideo2 | ECCV 2024 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
| VITA | arXiv 2024 | Text: Mixtral-8x7B, Audio: Mel Filter Bank | MLP | Open-source interactive multimodal LLM. | GitHub |

InternVideo/InternVideo2

Table 14

| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
| --- | --- | --- | --- | --- | --- |
| ChatVideo | arXiv 2023 | Text: ChatGPT, Audio: e.g., Whisper | Tracklet-centric with ChatGPT reasoning | Chat-based video understanding system. | Coming soon |

The distributions of interaction/fusion mechanisms and data modalities in 66 closely related video-LLMs from January 2024 to December 2024. (Left): Fusion mechanisms are classified into five categories: Cross-attention (e.g., cross-attention modules, gated cross-attention), Projection layers (e.g., linear projection, MLP projection), Q-Former-based methods (e.g., Q-Former aligns multi-modal embeddings, trainable Q-Former), Motion/Temporal-specific mechanisms (e.g., temporal samplers, scene-level sequential alignment), and Other methods (e.g., tracklet-centric, Perceiver Resampler, MQ-Former). (Right): The distribution of data modalities used in these video-LLMs, with text modalities appearing across all models. Note that a model may use multiple fusion methods and/or data modalities.
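
For reference, here are minimal, hypothetical sketches of the other two dominant categories above: gated cross-attention in the spirit of Flamingo's Gated XATTN-DENSE, and a learned-query bottleneck in the spirit of Q-Former or the Perceiver Resampler, each reduced to a single attention layer. All names, dimensions, and initializations are illustrative assumptions, not the actual implementations of the listed models.

```python
# Minimal sketches of cross-attention and query-based fusion (illustrative only).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text tokens attend to visual tokens; a tanh gate initialized at 0 eases training."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=visual, value=visual)
        return text + torch.tanh(self.gate) * attended  # gated residual connection

class LearnedQueryResampler(nn.Module):
    """A fixed set of learned queries distills visual tokens into a compact summary."""
    def __init__(self, dim: int = 1024, num_queries: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(visual.size(0), -1, -1)
        out, _ = self.attn(query=q, key=visual, value=visual)
        return out  # (batch, num_queries, dim)

text, visual = torch.randn(2, 16, 1024), torch.randn(2, 256, 1024)
print(GatedCrossAttention()(text, visual).shape)  # torch.Size([2, 16, 1024])
print(LearnedQueryResampler()(visual).shape)      # torch.Size([2, 32, 1024])
```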

💻 Datasets

✨ The tables below provide a comprehensive overview of video datasets across various tasks.

Action Recognition

Table 15

| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HMDB51 | 2011 | YouTube | 6,766 | Video | 3~4 | No | Daily human actions |
| UCF101 | 2012 | YouTube | 13,320 | Video+Audio | 7.21 | No | Human actions (e.g., sports, daily activities) |
| ActivityNet | 2015 | YouTube | 27,801 | Video+Text | 300~1200 | Temporal extent provided | Human-centric activities |
| Charades | 2016 | Crowdsourced | 9,848 | Video+Text | 30.1 | Start and end timestamps provided | Household activities |
| Kinetics-400 | 2017 | YouTube | 306,245 | Video | 10 | No | Human actions (e.g., sports, tasks) |
| AVA | 2018 | Movies | 430 | Video | Variable | Start and end timestamps provided | Action localization in movie scenes |
| Something-Something V2 | 2018 | Crowdsourced | 220,847 | Video | 2~6 | Weak | Human-object interactions |
| COIN | 2019 | YouTube | 11,827 | Video+Text | 141.6 | Start and end timestamps provided | Comprehensive instructional tasks (e.g., cooking, repair) |
| Kinetics-700 | 2019 | YouTube | 650,317 | Video | 10 | No | Expanded version of Kinetics-400 and Kinetics-600 |
| EPIC-KITCHENS | 2020 | Participant kitchens | 432 | Video+Text+Audio | ~458 | Start and end timestamps provided | Largest egocentric video dataset |
| Ego4D | 2021 | Wearable cameras | 3,850 hours | Video+Text+Audio | Variable | Start and end timestamps provided | First-person activities and interactions |
| VidSitu | 2021 | YouTube | 29,000 | Video+Text | ~10 | Temporal extent for events provided | Event-centric and causal activity annotations |

Video QA

Table 16

| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MovieQA | 2016 | Multiple platforms | 408 | Video+Text | 202.7 | Start and end timestamps provided | QA for movie scenes |
| TGIF-QA | 2016 | Tumblr GIFs | 56,720 | Video+Text | 3~5 | Action timestamps provided | QA over social media GIFs |
| MSVD-QA | 2017 | YouTube | 1,970 | Video+Text | 27.5 | Start and end timestamps provided | QA for action descriptions |
| MSRVTT-QA | 2017 | YouTube | 10,000 | Video+Text | 15~30 | Weak | QA across diverse scenes |
| TVQA | 2019 | TV shows | 21,793 | Video+Text | 60~90 | Start and end timestamps provided | QA over medical dramas, sitcoms, crime shows |
| ActivityNet-QA | 2019 | YouTube | 5,800 | Video+Text | 180 | Implicit (derived from ActivityNet) | QA for human-annotated videos |
| How2QA | 2020 | HowTo100M (YouTube) | 22,000 | Video+Text | 60 | Temporal extent provided | QA over instructional videos |
| YouCookQA | 2021 | YouCook2 (YouTube) | 2,000 | Video+Text | 316.2 | Temporal boundaries provided | Cooking-related instructional QA |
| STAR | 2021 | Human activity datasets | 22,000 | Video+Text | Variable | Action-level boundaries provided | QA over human-object interactions |
| MVBench | 2023 | Public datasets | 3,641 | Video+Text | 5~35 | Start and end timestamps provided | Multi-domain QA (e.g., sports, indoor scenes) |
| EgoSchema | 2023 | Ego4D (wearable cameras) | 5,063 | Video+Text | 180 | Timestamped narrations provided | Long-form egocentric activities |

Video Captioning

Table 17

| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YouCook | 2013 | YouTube | 88 | Video+Text | 180~300 | Weak | Cooking instructional videos |
| MSR-VTT | 2016 | YouTube | 7,180 | Video+Text+Audio | 10~30 | Weak | General scenarios (e.g., sports, transport) |
| ActivityNet Captions | 2017 | YouTube | 20,000 | Video+Text | 180 | Start and end timestamps provided | Dense captions for human-centered activities |
| VATEX | 2019 | YouTube | 41,250 | Video+Text | ~10 | Weak | Multilingual descriptions with English-Chinese parallel captions |
| HowTo100M | 2019 | YouTube | 1.22M | Video+Text+Audio | 390 | Subtitle timestamps provided | Instructional video captions |
| TVC | 2020 | TV shows | 108,965 | Video+Text | 76.2 | Start and end timestamps provided | Multimodal video captioning dataset |

Video Retrieval

Table 18

| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LSMDC | 2015 | Movies | 118,114 | Video+Text | 4.8 | Start and end timestamps provided | Large-scale dataset for movie description tasks |
| DiDeMo | 2017 | Flickr (YFCC100M) | 10,464 | Video+Text | 27.5 | Start and end timestamps provided | Moment localization in diverse, unedited personal videos |
| FIVR-200K | 2019 | YouTube | 225,960 | Video | ~120 | Start and end timestamps provided | Large-scale incident video retrieval dataset with diverse news events |
| TVR | 2020 | TV shows | 21,793 | Video+Text | 76.2 | Start and end timestamps provided | Video-subtitle multimodal moment retrieval dataset |
| TextVR | 2023 | YouTube | 10,500 | Video+Text | 15 | Weak | Cross-modal video retrieval with text reading comprehension |
| EgoCVR | 2024 | Ego4D | 2,295 | Video+Text | 3.9~8.1 | Weak | Egocentric dataset for fine-grained composed video retrieval |

Anomaly Detection

Table 19

| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Subway Entrance | 2008 | Surveillance cameras | 1 | Video | 4,800 | No | Crowd monitoring for unusual event detection at subway entrances |
| Subway Exit | 2008 | Surveillance cameras | 1 | Video | 5,400 | No | Crowd monitoring for unusual event detection at subway exits |
| CUHK Avenue | 2013 | Surveillance cameras | 15 | Video | 120 | No | Urban avenue scenes with anomalies such as running and loitering |
| Street Scene | 2020 | Urban street surveillance | 81 | Video | 582 | Spatial and temporal bounding boxes | Urban street anomalies (e.g., jaywalking, loitering, illegal parking) |
| XD-Violence | 2020 | Movies and in-the-wild scenes | 4,754 | Video+Audio | ~180 | Start and end timestamps provided | Multimodal violence detection covering six violence types |
| CUVA | 2024 | YouTube, Bilibili | 1,000 | Video+Text | ~117 | Start and end timestamps provided | Causation-focused anomaly understanding across 42 anomaly types |
| MSAD | 2024 | Online surveillance | 720 | Video | ~20 | Frame-level annotations in test set | Multi-scenario dataset with 14 scenarios |

Multimodal Video Tasks

Table 20

| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VIDAL-10M | 2023 | Multiple platforms | 10M | Video+Infrared+Depth+Audio+Text | ~20 | Weak | Multi-domain retrieval dataset |
| Video-MME | 2024 | YouTube | 900 | Video+Text+Audio | 1017.9 | Temporal ranges via certificate length | Comprehensive evaluation benchmark across many domains |

(Left): Performance (accuracy) comparison of recent video-LLMs on the Video-MME benchmark. (Right): Performance comparison of recent video-LLMs on video QA benchmarks. Models using pretrained video encoders (e.g., Video-LLaVA and VideoChat2) are marked with squares, while models using pretrained image encoders are represented by circles.

Performance comparison of recent video-LLMs on (a) video retrieval and (b) video captioning benchmarks.

❤️‍🔥❤️‍🔥❤️‍🔥 Contribution

We warmly invite everyone to contribute to this repository and help enhance its quality and scope. Feel free to submit pull requests to add new methods, datasets, or other useful resources, as well as to correct any errors you discover. To ensure consistency, please format your pull requests to match the structure of the existing tables. We greatly appreciate your valuable contributions and support!
