Do Language Models Understand Time?
Do language models understand time? In the kitchen arena, where burritos are rolled, rice waits patiently, and sauce steals the spotlight, LLMs try their best to keep up. Captions flow like a recipe, precise and tempting, but can they truly tell the difference between prepping, cooking, and eating? After all, in cooking, timing isn't just everything; it's the secret sauce!
A collection of papers and resources related to Large Language Models in the video domain.
For more details, please refer to our paper.
Please let us know by e-mail if you find a mistake or have any suggestions: [email protected]
If you find our work useful for your research, please cite the following paper:
@inproceedings{10.1145/3701716.3717744,
author = {Ding, Xi and Wang, Lei},
title = {Do Language Models Understand Time?},
year = {2025},
isbn = {9798400713316},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3701716.3717744},
doi = {10.1145/3701716.3717744},
pages = {1855--1868},
numpages = {14},
keywords = {interaction, large language models, temporal, videos},
location = {Sydney NSW, Australia},
series = {WWW '25}
}
- [10/02/2025] The GitHub repository for our paper has been released.
- [27/01/2025] Our paper has been accepted as an oral presentation at The Web Conference 2025 (WWW 2025) and will appear in the Companion Proceedings.
- Video-LLM
Performance comparison of visual encoders. (Left): Image classification accuracy for various image encoders pretrained and fine-tuned on the ImageNet-1K dataset. (Right): Action recognition accuracy for different video encoders pretrained and fine-tuned on the Kinetics-400 and Something-Something V2 datasets.
The tables below summarize the latest multimodal video-LLMs that use image encoders, along with their interaction and fusion mechanisms; minimal code sketches of a few representative fusion mechanisms are included after some of the tables.
Click to expand Table 1
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Flamingo | NeurIPS 2022 | Text: Chinchilla | Perceiver Resampler & Gated XATTN-DENSE | Visual language model for few-shot learning. | GitHub |
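To make the "Perceiver Resampler & Gated XATTN-DENSE" entry above more concrete, here is a minimal, hypothetical PyTorch sketch of gated cross-attention fusion in the Flamingo spirit: text tokens attend to visual tokens, and the result is added back through a tanh gate initialized at zero so the frozen LLM is untouched at the start of training. Module names and dimensions are our own illustrative choices, not Flamingo's actual implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of gated cross-attention fusion (Flamingo-style).

    Text tokens attend to visual tokens; the result is added back through a
    tanh gate initialized at zero, so the block is a no-op before training.
    Names and sizes are illustrative assumptions, not the original code.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Zero-initialized gates keep the frozen LLM's behavior at step 0.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T_text, dim); visual_tokens: (B, T_vis, dim)
        attn_out, _ = self.cross_attn(self.norm_attn(text_tokens), visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(x))
        return x

# Toy usage: 2 clips, 16 text tokens, 64 visual tokens.
block = GatedCrossAttentionBlock()
fused = block(torch.randn(2, 16, 768), torch.randn(2, 64, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```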
Click to expand Table 2
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| mPLUG-2 | ICML 2023 | Text: BERT | Universal layers & cross-attention modules | Modularized multi-modal foundation model. | GitHub |
| Vid2Seq | CVPR 2023 | Text: T5-Base | Cross-modal attention | Sequence-to-sequence video-language model. | GitHub |
| Video-LLaMA | EMNLP 2023 | Text: Vicuna, Audio: ImageBind | Aligned via Q-Formers for video and audio | Instruction-tuned multimodal model. | GitHub |
| Video-ChatGPT | ACL 2024 | Text: Vicuna-v1.1 | Spatiotemporal features projected via linear layer | Integration of vision and language for video understanding. | GitHub |
| Valley | arXiv 2023 | Text: StableVicuna | Projection layer | LLM for video assistant tasks. | GitHub |
| Macaw-LLM | arXiv 2023 | Text: LLAMA-7B, Audio: Whisper | Alignment module unifies multi-modal representations | Multimodal integration using image, audio, and video inputs. | GitHub |
| AutoAD II | ICCV 2023 | Text: BERT | Cross-attention layers | Movie audio description using vision and language. | GitHub |
| GPT4Video | ACMMM 2023 | Text: LLaMA 2 | Transformer-based cross-attention layer | Video understanding with LLM-based reasoning. | - |
| LLaMA-VID | ECCV 2024 | Text: Vicuna | Context attention and linear projector | Visual-textual alignment in video. | GitHub |
| COSMO | arXiv 2024 | Text: OPT-IML/RedPajama/Mistral | Gated cross-attention | Contrastive-streamlined multimodal model. | - |
| VTimeLLM | CVPR 2024 | Text: Vicuna | Linear layer | Temporal video understanding enhanced with LLMs. | GitHub |
| VILA | CVPR 2024 | Text: LLaMA-2-7B/13B | Linear layer | Vision-language model. | GitHub |
| PLLaVA | arXiv 2024 | Text: LLAMA-7B | MM projector with adaptive pooling | Parameter-free extension for video captioning tasks. | GitHub |
| V2Xum-LLaMA | arXiv 2024 | Text: LLaMA 2 | Vision adapter | Video summarization using temporal prompt tuning. | GitHub |
| VideoGPT+ | arXiv 2024 | Text: Phi-3-Mini-3.8B | MLP | Enhanced video understanding. | GitHub |
| EmoLLM | arXiv 2024 | Text: Vicuna-v1.5, Audio: Whisper | Multi-perspective visual projection | Multimodal emotional understanding with improved reasoning. | GitHub |
| ShareGPT4Video | arXiv 2024 | Text: Mistral-7B-Instruct-v0.2 | MLP | Precise and detailed video captions with hierarchical prompts. | GitHub |
| VideoLLaMA 2 | arXiv 2024 | Text: LLAMA 1.5, Audio: BEATs | Spatial-Temporal Convolution (STC) connector | Advancing spatial-temporal modeling and audio understanding. | GitHub |
| VideoLLM-online | CVPR 2024 | Text: Llama-2-Chat/Llama-3-Instruct | MLP projector | Online video large language model for streaming video. | GitHub |
| LongVA | arXiv 2024 | Text: Qwen2-Extended | MLP | Long context video understanding. | GitHub |
| InternLM-XComposer-2.5 | arXiv 2024 | Text: InternLM2-7B, Audio: Whisper | MLP | Long-context LVLM supporting ultra-high-resolution video tasks. | GitHub |
| Qwen2-VL | arXiv 2024 | Text: Qwen2-7B | Cross-attention modules | Vision-language model for multimodal tasks. | GitHub |
| Video-XL | arXiv 2024 | Text: Qwen2-7B | Visual-language projector | Long-context video understanding model. | GitHub |
| SlowFocus | NeurIPS 2024 | Text: Vicuna-7B v1.5 | Visual adapter (projector layer) | Fine-grained temporal understanding in video LLM. | GitHub |
| VideoStudio | ECCV 2024 | Text: CLIP ViT-H/14 | Cross-attention modules | Multi-scene video generation. | GitHub |
| VideoINSTA | arXiv 2024 | Text: Llama-3-8B-Instruct | Self-reflective spatial-temporal fusion | Zero-shot long video understanding model. | GitHub |
| TRACE | arXiv 2024 | Text: Mistral-7B | Task-interleaved sequence modeling & Adaptive head-switching | Video temporal grounding via causal event modeling. | GitHub |
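Many rows in Table 2 list a simple "linear layer" or "MLP" as the fusion mechanism: frame features from the vision encoder are projected into the LLM's token-embedding space and prepended to the text embeddings. The sketch below is a generic, assumed projector in that spirit (the dimensions and two-layer design are illustrative), not the code of any specific model above.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Generic MLP projector: vision features -> LLM token embeddings.

    A simplified sketch of the 'linear layer / MLP' fusion pattern; the
    dimensions and two-layer design are assumptions, not any model's spec.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (B, num_frames * patches, vision_dim)
        return self.proj(frame_features)

# Toy usage: project 8 frames x 32 tokens, then prepend to text embeddings.
projector = VisualProjector()
visual_tokens = projector(torch.randn(2, 8 * 32, 1024))      # (2, 256, 4096)
text_embeds = torch.randn(2, 16, 4096)                       # from the LLM's embedding table
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # (2, 272, 4096)
print(llm_inputs.shape)
```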
Click to expand Table 3
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoChat | arXiv 2023 | Text: StableVicuna, Audio: Whisper | Q-Former bridges visual features to LLMs for reasoning | Chat-centric model for video analysis. | GitHub |
| VAST | NeurIPS 2023 | Text: BERT, Audio: BEATs | Cross-attention layers | Omni-modality foundational model. | GitHub |
| VTG-LLM | arXiv 2024 | Text: LLaMA-2-7B | Projection layer | Enhanced video temporal grounding. | GitHub |
| AutoAD III | CVPR 2024 | Text: GPT-3.5-turbo | Shared Q-Former | Video description enhancement with LLMs. | GitHub |
| MA-LMM | CVPR 2024 | Text: Vicuna | A trainable Q-Former | Memory-augmented large multimodal model. | GitHub |
| MiniGPT4-Video | arXiv 2024 | Text: LLaMA 2 | Concatenates visual tokens and projects into LLM space | Video understanding with visual-textual token interleaving. | GitHub |
| Vriptor | arXiv 2024 | Text: ST-LLM, Audio: Whisper | Scene-level sequential alignment | Vriptor for dense video captioning. | GitHub |
| Kangaroo | arXiv 2024 | Text: Llama-3-8B-Instruct | Multi-modal projector | Video-language model supporting long-context video input. | - |
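Several entries in Table 3 rely on a Q-Former, where a small set of learnable queries cross-attends to the visual features and only the resulting query tokens are handed to the LLM, compressing long videos into a fixed-length representation. The stripped-down sketch below illustrates just that bottleneck with a single cross-attention layer; real Q-Formers are full transformer stacks, so treat this as an assumption-laden stand-in rather than any model's implementation.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Stripped-down Q-Former stand-in: learnable queries attend to frames.

    Real Q-Formers stack several transformer blocks and often interact with a
    text encoder; this single-layer version only illustrates the bottleneck.
    """

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (B, num_frames * patches, dim) -> (B, num_queries, dim)
        q = self.queries.expand(frame_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, frame_features, frame_features)
        return self.norm(out + q)

# Toy usage: 1,024 visual tokens compressed to 32 query tokens.
qformer = TinyQFormer()
compressed = qformer(torch.randn(2, 1024, 768))
print(compressed.shape)  # torch.Size([2, 32, 768])
```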
Click to expand Table 4
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| LAVAD | CVPR 2024 | Text: Llama-2-13b-chat | Converts video features into textual prompts for LLMs | Training-free video anomaly detection using LLMs. | GitHub |
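LAVAD's entry illustrates a different, training-free route: video content is first turned into text (e.g., per-frame captions), and an off-the-shelf LLM is prompted to reason over that text. The sketch below shows only the general prompt-building idea with hypothetical captions and a placeholder `query_llm` callable; it is not LAVAD's actual pipeline, prompts, or scoring scheme.

```python
from typing import Callable, List

def build_anomaly_prompt(frame_captions: List[str]) -> str:
    """Turn per-frame captions into a single textual prompt for an LLM.

    A hypothetical illustration of 'converting video features into textual
    prompts'; the caption source and the wording are assumptions.
    """
    lines = [f"t={i}: {caption}" for i, caption in enumerate(frame_captions)]
    return (
        "You are watching a surveillance video described frame by frame.\n"
        + "\n".join(lines)
        + "\nOn a scale of 0 to 1, how anomalous is the most recent frame? "
          "Answer with a single number."
    )

def score_clip(frame_captions: List[str], query_llm: Callable[[str], str]) -> float:
    """query_llm is a placeholder for any text-completion API."""
    reply = query_llm(build_anomaly_prompt(frame_captions))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # fall back if the LLM does not return a bare number

# Toy usage with a dummy LLM that always answers "0.8".
captions = ["a person walks along the platform", "the person climbs over the barrier"]
print(score_clip(captions, lambda prompt: "0.8"))
```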
Click to expand Table 5
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Video-CCAM | arXiv 2024 | Text: Phi-3-4k-instruct/ Yi-1.5-9B-Chat | Cross-attention-based projector | Causal cross-attention masks for short and long videos. | GitHub |
| Apollo | arXiv 2024 | Text: Qwen2.5-7B | Perceiver Resampler & Token Integration with Timestamps | Video understanding model. | - |
Click to expand Table 6
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Oryx | arXiv 2024 | Text: Qwen2-7B/32B | Cross attention | Spatial-temporal model for high-resolution understanding. | GitHub |
The tables below summarize the latest multimodal video-LLMs that use video encoders, along with their interaction and fusion mechanisms.
Click to expand Table 7
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoLLM | arXiv 2023 | Text: e.g., BERT, T5 | Semantic translator aligns visual and text encodings | Video sequence modeling using LLMs. | GitHub |
| Loong | arXiv 2024 | Text: Standard text tokenizer | Decoder-only autoregressive LLM with causal attention | Autoregressive LLM for minute-level long video generation. | - |
Click to expand Table 8
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| LaViLa | CVPR 2023 | Text: 12-layer Transformer | Cross-attention modules | Video-language representation learning from LLM-generated narrations. | GitHub |
| Video ReCap | CVPR 2024 | Text: GPT-2 | Cross-attention layers | Recursive hierarchical captioning model. | GitHub |
Click to expand Table 9
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| OmniViD | CVPR 2024 | Text: BART | MQ-Former | Generative model for universal video understanding. | GitHub |
Click to expand Table 10
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoChat2 | CVPR 2024 | Text: Vicuna | Linear projection | A comprehensive multi-modal video understanding benchmark. | GitHub |
Click to expand Table 11
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Video-LLaVA | arXiv 2023 | Text: Vicuna v1.5 | MLP projection layer | Unified visual representation learning for video. | GitHub |
| MotionLLM | arXiv 2024 | Text: Vicuna | Motion / Video translator | Understanding human behaviors from human motions and videos. | GitHub |
| Holmes-VAD | arXiv 2024 | Text: LLaMA3-Instruct-70B | Temporal sampler | Multimodal LLM for video anomaly detection. | GitHub |
Click to expand Table 12
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| InternVideo2 | ECCV 2024 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
Click to expand Table 13
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| InternVideo2 | ECCV 2024 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
| VITA | arXiv 2024 | Text: Mixtral-8x7B, Audio: Mel Filter Bank | MLP | Open-source interactive multimodal LLM. | GitHub |
Click to expand Table 14
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| ChatVideo | arXiv 2023 | Text: ChatGPT, Audio: e.g., Whisper | Tracklet-centric with ChatGPT reasoning | Chat-based video understanding system. | Coming soon |
The distributions of interaction/fusion mechanisms and data modalities in 66 closely related video-LLMs from January 2024 to December 2024. (Left): Fusion mechanisms are classified into five categories: Cross-attention (e.g., cross-attention modules, gated cross-attention), Projection layers (e.g., linear projection, MLP projection), Q-Former-based methods (e.g., Q-Former aligns multi-modal embeddings, trainable Q-Former), Motion/Temporal-specific mechanisms (e.g., temporal samplers, scene-level sequential alignment), and Other methods (e.g., tracklet-centric, Perceiver Resampler, MQ-Former). (Right): The distribution of data modalities used in these video-LLMs, with text modalities appearing across all models. Note that a model may use multiple fusion methods and/or data modalities.
The tables below provide a comprehensive overview of video datasets across various tasks; a small, assumed annotation schema is sketched after Table 15 below.
Click to expand Table 15
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| HMDB51 | 2011 | YouTube | 6,766 | Video | 3~4 | No | Daily human actions |
| UCF101 | 2012 | YouTube | 13,320 | Video+Audio | 7.21 | No | Human actions (e.g., sports, daily activities) |
| ActivityNet | 2015 | YouTube | 27,801 | Video+Text | 300~1200 | Temporal extent provided | Human-centric activities |
| Charades | 2016 | Crowdsourced | 9,848 | Video+Text | 30.1 | Start and end timestamps provided | Household activities |
| Kinetics-400 | 2017 | YouTube | 306,245 | Video | 10 | No | Human actions (e.g., sports, tasks) |
| AVA | 2018 | Movies | 430 | Video | Variable | Start and end timestamps provided | Action localization in movie scenes |
| Something-Something V2 | 2018 | Crowdsourced | 220,847 | Video | 2~6 | Weak | Human-object interactions |
| COIN | 2019 | YouTube | 11,827 | Video+Text | 141.6 | Start and end timestamps provided | Comprehensive instructional tasks (e.g., cooking, repair) |
| Kinetics-700 | 2019 | YouTube | 650,317 | Video | 10 | No | Expanded version of Kinetics-400 and Kinetics-600 |
| EPIC-KITCHENS | 2020 | Participant kitchens | 432 | Video+Text+Audio | ~458 | Start and end timestamps provided | Large-scale egocentric kitchen activities |
| Ego4D | 2021 | Wearable Cameras | 3,850 hours | Video+Text+Audio | Variable | Start and end timestamps provided | First-person activities and interactions |
| VidSitu | 2021 | YouTube | 29,000 | Video+Text | ~10 | Temporal extent for events provided | Event-centric and causal activity annotations |
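The "Temporal annotation" column separates datasets that provide explicit start/end timestamps from those with only weak or clip-level labels. As a purely illustrative, assumed schema (none of these datasets ships this exact format), both regimes can be represented as follows:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TemporalSegment:
    label: str
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds

@dataclass
class VideoSample:
    """Assumed schema for the datasets above; not an official format."""
    video_id: str
    duration_s: float
    # Datasets with 'start and end timestamps provided' fill this list;
    # weakly annotated datasets leave it empty and rely on clip_label.
    segments: List[TemporalSegment] = field(default_factory=list)
    clip_label: Optional[str] = None

# Toy examples mirroring the two annotation regimes in Table 15.
charades_like = VideoSample(
    video_id="abc123", duration_s=30.1,
    segments=[TemporalSegment("opening a fridge", 4.2, 9.8)],
)
kinetics_like = VideoSample(video_id="xyz789", duration_s=10.0, clip_label="playing tennis")
print(len(charades_like.segments), kinetics_like.clip_label)
```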
Click to expand Table 16
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| MovieQA | 2016 | Multiple platforms | 408 | Video+Text | 202.7 | Start and end timestamps provided | QA for movie scenes |
| TGIF-QA | 2016 | Tumblr GIFs | 56,720 | Video+Text | 3~5 | Action timestamps provided | QA over social media GIFs |
| MSVD-QA | 2017 | YouTube | 1,970 | Video+Text | 27.5 | Start and end timestamps provided | QA for actions description |
| MSRVTT-QA | 2017 | YouTube | 10,000 | Video+Text | 15~30 | Weak | QA across diverse scenes |
| TVQA | 2019 | TV Shows | 21,793 | Video+Text | 60~90 | Start and end timestamps provided | QA over medical dramas, sitcoms, crime shows |
| ActivityNet-QA | 2019 | YouTube | 5,800 | Video+Text | 180 | Implicit (derived from ActivityNet) | QA for human-annotated videos |
| How2QA | 2020 | HowTo100M (YouTube) | 22,000 | Video+Text | 60 | Temporal extent provided | QA over instructional videos |
| YouCookQA | 2021 | YouCook2 (YouTube) | 2,000 | Video+Text | 316.2 | Temporal boundaries provided | Cooking-related instructional QA |
| STAR | 2021 | Human activity datasets | 22,000 | Video+Text | Variable | Action-level boundaries provided | QA over human-object interactions |
| MVBench | 2023 | Public datasets | 3,641 | Video+Text | 5~35 | Start and end timestamps provided | Multi-domain QA (e.g., sports, indoor scenes) |
| EgoSchema | 2023 | Ego4D (Wearable Cameras) | 5,063 | Video+Text | 180 | Timestamped narrations provided | Long-form egocentric activities |
Click to expand Table 17
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| YouCook | 2013 | YouTube | 88 | Video+Text | 180~300 | Weak | Cooking instructional videos |
| MSR-VTT | 2016 | YouTube | 7,180 | Video+Text+Audio | 10~30 | Weak | General scenarios (e.g., sports, transport) |
| ActivityNet Captions | 2017 | YouTube | 20,000 | Video+Text | 180 | Start and end timestamps provided | Dense captions for human-centered activities |
| VATEX | 2019 | YouTube | 41,250 | Video+Text | ~10 | Weak | Multilingual descriptions with English-Chinese parallel captions |
| HowTo100M | 2019 | YouTube | 1.22M | Video+Text+Audio | 390 | Subtitle timestamps provided | Instructional video captions |
| TVC | 2020 | TV Shows | 108,965 | Video+Text | 76.2 | Start and end timestamps provided | Multimodal video captioning dataset |
Click to expand Table 18
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| LSMDC | 2015 | Movies | 118,114 | Video+Text | 4.8 | Start and end timestamps provided | Large-scale dataset for movie description tasks |
| DiDeMo | 2017 | Flickr (YFCC100M) | 10,464 | Video+Text | 27.5 | Start and end timestamps provided | Moment localization in diverse, unedited personal videos |
| FIVR-200K | 2019 | YouTube | 225,960 | Video | ~120 | Start and end timestamps provided | Large-scale incident video retrieval dataset with diverse news events |
| TVR | 2020 | TV Shows | 21,793 | Video+Text | 76.2 | Start and end timestamps provided | Video-subtitle multimodal moment retrieval dataset |
| TextVR | 2023 | YouTube | 10,500 | Video+Text | 15 | Weak | Cross-modal video retrieval with text reading comprehension |
| EgoCVR | 2024 | Ego4D | 2,295 | Video+Text | 3.9~8.1 | Weak | Egocentric dataset for fine-grained composed video retrieval |
Click to expand Table 19
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| Subway Entrance | 2008 | Surveillance cameras | 1 | Video | 4,800 | No | Crowd monitoring for unusual event detection at subway entrances |
| Subway Exit | 2008 | Surveillance cameras | 1 | Video | 5,400 | No | Crowd monitoring for unusual event detection at subway exits |
| CUHK Avenue | 2013 | Surveillance cameras | 15 | Video | 120 | No | Urban avenue scenes with anomalies like running, loitering, etc. |
| Street Scene | 2020 | Urban street surveillance | 81 | Video | 582 | Spatial and temporal bounding boxes | Urban street anomalies (e.g., jaywalking, loitering, illegal parking, etc.) |
| XD-Violence | 2020 | Movies and in-the-wild scenes | 4,754 | Video+Audio | ~180 | Start and end timestamps provided | Multimodal violence detection covering six violence types |
| CUVA | 2024 | YouTube, Bilibili | 1,000 | Video+Text | ~117 | Start and end timestamps provided | Causation-focused anomaly understanding across 42 anomaly types |
| MSAD | 2024 | Online Surveillance | 720 | Video | ~20 | Frame-level annotations in test set | Multi-scenario dataset with 14 scenarios |
Click to expand Table 20
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| VIDAL-10M | 2023 | Multiple platforms | 10M | Video+Infrared+Depth+Audio+Text | ~20 | Weak | Multi-domain retrieval dataset |
| Video-MME | 2024 | YouTube | 900 | Video+Text+Audio | 1017.9 | Temporal ranges via certificate length | Comprehensive evaluation benchmark across many domains |
(Left): Performance (accuracy) comparison of recent video-LLMs on the Video-MME benchmark. (Right): Performance comparison of recent video-LLMs on video QA benchmarks. Models using pretrained video encoders (e.g., Video-LLaVA and VideoChat2) are marked with squares, while models using pretrained image encoders are represented by circles.
Performance comparison of recent video-LLMs on (a) video retrieval and (b) video captioning benchmarks.
We warmly invite everyone to contribute to this repository and help enhance its quality and scope. Feel free to submit pull requests that add new methods, datasets, or other useful resources, or that correct any errors you find. To keep things consistent, please format your pull requests to match the structure of the existing tables. We greatly appreciate your valuable contributions and support!