Do Language Models Understand Time?
Do language models understand time? In the kitchen arena, where burritos are rolled, rice waits patiently, and sauce steals the spotlight, LLMs try their best to keep up. Captions flow like a recipe, precise and tempting, but can they truly tell the difference between prepping, cooking, and eating? After all, in cooking, timing isn't just everything; it's the secret sauce!
A collection of papers and resources related to Large Language Models in the video domain.
For more details, please refer to our paper.
Please let us know by e-mail if you find a mistake or have any suggestions: [email protected]
If you find our work useful for your research, please cite the following paper:
@inproceedings{10.1145/3701716.3717744,
author = {Ding, Xi and Wang, Lei},
title = {Do Language Models Understand Time?},
year = {2025},
isbn = {9798400713316},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3701716.3717744},
doi = {10.1145/3701716.3717744},
pages = {1855--1868},
numpages = {14},
keywords = {interaction, large language models, temporal, videos},
location = {Sydney NSW, Australia},
series = {WWW '25}
}
- [10/02/2025] The GitHub repository for our paper has been released.
- [27/01/2025] Our paper has been accepted as an oral presentation at The Web Conference 2025 (WWW 2025) and will appear in the Companion Proceedings.
- Video-LLM
Performance comparison of visual encoders. (Left): Image classification accuracy for various image encoders pretrained and fine-tuned on the ImageNet-1K dataset. (Right): Action recognition accuracy for different video encoders pretrained and fine-tuned on the Kinetics-400 and Something-Something V2 datasets.
The tables below summarize the latest multimodal video-LLMs that use image encoders, along with their interaction and fusion mechanisms; minimal code sketches of a few representative fusion mechanisms are included after some of the tables.
Click to expand Table 1
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Flamingo | NeurIPS 2022 | Text: Chinchilla | Perceiver Resampler & Gated XATTN-DENSE | Visual language model for few-shot learning. | GitHub |
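To make the "Perceiver Resampler & Gated XATTN-DENSE" entry above more concrete, here is a minimal, hypothetical PyTorch sketch of gated cross-attention fusion in the Flamingo spirit: text tokens attend to visual tokens, and the result is added back through a tanh gate initialized at zero so the frozen LLM is untouched at the start of training. Module names and dimensions are our own illustrative choices, not Flamingo's actual implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of gated cross-attention fusion (Flamingo-style).

    Text tokens attend to visual tokens; the result is added back through a
    tanh gate initialized at zero, so the block is a no-op before training.
    Names and sizes are illustrative assumptions, not the original code.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Zero-initialized gates keep the frozen LLM's behavior at step 0.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T_text, dim); visual_tokens: (B, T_vis, dim)
        attn_out, _ = self.cross_attn(self.norm_attn(text_tokens), visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(x))
        return x

# Toy usage: 2 clips, 16 text tokens, 64 visual tokens.
block = GatedCrossAttentionBlock()
fused = block(torch.randn(2, 16, 768), torch.randn(2, 64, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```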
Click to expand Table 2
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| mPLUG-2 | ICML 2023 | Text: BERT | Universal layers & cross-attention modules | Modularized multi-modal foundation model. | GitHub |
| Vid2Seq | CVPR 2023 | Text: T5-Base | Cross-modal attention | Sequence-to-sequence video-language model. | GitHub |
| Video-LLaMA | EMNLP 2023 | Text: Vicuna, Audio: ImageBind | Aligned via Q-Formers for video and audio | Instruction-tuned multimodal model. | GitHub |
| Video-ChatGPT | ACL 2024 | Text: Vicuna-v1.1 | Spatiotemporal features projected via linear layer | Integration of vision and language for video understanding. | GitHub |
| Valley | arXiv 2023 | Text: StableVicuna | Projection layer | LLM for video assistant tasks. | GitHub |
| Macaw-LLM | arXiv 2023 | Text: LLAMA-7B, Audio: Whisper | Alignment module unifies multi-modal representations | Multimodal integration using image, audio, and video inputs. | GitHub |
| AutoAD II | ICCV 2023 | Text: BERT | Cross-attention layers | Movie audio description using vision and language. | GitHub |
| GPT4Video | ACMMM 2023 | Text: LLaMA 2 | Transformer-based cross-attention layer | Video understanding with LLM-based reasoning. | - |
| LLaMA-VID | ECCV 2024 | Text: Vicuna | Context attention and linear projector | Visual-textual alignment in video. | GitHub |
| COSMO | arXiv 2024 | Text: OPT-IML/RedPajama/Mistral | Gated cross-attention | Contrastive-streamlined multimodal model. | - |
| VTimeLLM | CVPR 2024 | Text: Vicuna | Linear layer | Temporal video understanding enhanced with LLMs. | GitHub |
| VILA | CVPR 2024 | Text: LLaMA-2-7B/13B | Linear layer | Vision-language model. | GitHub |
| PLLaVA | arXiv 2024 | Text: LLAMA-7B | MM projector with adaptive pooling | Parameter-free extension for video captioning tasks. | GitHub |
| V2Xum-LLaMA | arXiv 2024 | Text: LLaMA 2 | Vision adapter | Video summarization using temporal prompt tuning. | GitHub |
| VideoGPT+ | arXiv 2024 | Text: Phi-3-Mini-3.8B | MLP | Enhanced video understanding. | GitHub |
| EmoLLM | arXiv 2024 | Text: Vicuna-v1.5, Audio: Whisper | Multi-perspective visual projection | Multimodal emotional understanding with improved reasoning. | GitHub |
| ShareGPT4Video | arXiv 2024 | Text: Mistral-7B-Instruct-v0.2 | MLP | Precise and detailed video captions with hierarchical prompts. | GitHub |
| VideoLLaMA 2 | arXiv 2024 | Text: LLAMA 1.5, Audio: BEATs | Spatial-Temporal Convolution (STC) connector | Advancing spatial-temporal modeling and audio understanding. | GitHub |
| VideoLLM-online | CVPR 2024 | Text: Llama-2-Chat/Llama-3-Instruct | MLP projector | Online video large language model for streaming video. | GitHub |
| LongVA | arXiv 2024 | Text: Qwen2-Extended | MLP | Long context video understanding. | GitHub |
| InternLM-XComposer-2.5 | arXiv 2024 | Text: InternLM2-7B, Audio: Whisper | MLP | Long-context LVLM supporting ultra-high-resolution video tasks. | GitHub |
| Qwen2-VL | arXiv 2024 | Text: Qwen2-7B | Cross-attention modules | Vision-language model for multimodal tasks. | GitHub |
| Video-XL | arXiv 2024 | Text: Qwen2-7B | Visual-language projector | Long-context video understanding model. | GitHub |
| SlowFocus | NeurIPS 2024 | Text: Vicuna-7B v1.5 | Visual adapter (projector layer) | Fine-grained temporal understanding in video LLM. | GitHub |
| VideoStudio | ECCV 2024 | Text: CLIP ViT-H/14 | Cross-attention modules | Multi-scene video generation. | GitHub |
| VideoINSTA | arXiv 2024 | Text: Llama-3-8B-Instruct | Self-reflective spatial-temporal fusion | Zero-shot long video understanding model. | GitHub |
| TRACE | arXiv 2024 | Text: Mistral-7B | Task-interleaved sequence modeling & Adaptive head-switching | Video temporal grounding via causal event modeling. | GitHub |
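Many rows in Table 2 list a simple "linear layer" or "MLP" as the fusion mechanism: frame features from the vision encoder are projected into the LLM's token-embedding space and prepended to the text embeddings. The sketch below is a generic, assumed projector in that spirit (the dimensions and two-layer design are illustrative), not the code of any specific model above.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Generic MLP projector: vision features -> LLM token embeddings.

    A simplified sketch of the 'linear layer / MLP' fusion pattern; the
    dimensions and two-layer design are assumptions, not any model's spec.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (B, num_frames * patches, vision_dim)
        return self.proj(frame_features)

# Toy usage: project 8 frames x 32 tokens, then prepend to text embeddings.
projector = VisualProjector()
visual_tokens = projector(torch.randn(2, 8 * 32, 1024))      # (2, 256, 4096)
text_embeds = torch.randn(2, 16, 4096)                       # from the LLM's embedding table
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # (2, 272, 4096)
print(llm_inputs.shape)
```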
Click to expand Table 3
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoChat | arXiv 2023 | Text: StableVicuna, Audio: Whisper | Q-Former bridges visual features to LLMs for reasoning | Chat-centric model for video analysis. | GitHub |
| VAST | NeurIPS 2023 | Text: BERT, Audio: BEATs | Cross-attention layers | Omni-modality foundational model. | GitHub |
| VTG-LLM | arXiv 2024 | Text: LLaMA-2-7B | Projection layer | Enhanced video temporal grounding. | GitHub |
| AutoAD III | CVPR 2024 | Text: GPT-3.5-turbo | Shared Q-Former | Video description enhancement with LLMs. | GitHub |
| MA-LMM | CVPR 2024 | Text: Vicuna | A trainable Q-Former | Memory-augmented large multimodal model. | GitHub |
| MiniGPT4-Video | arXiv 2024 | Text: LLaMA 2 | Concatenates visual tokens and projects into LLM space | Video understanding with visual-textual token interleaving. | GitHub |
| Vriptor | arXiv 2024 | Text: ST-LLM, Audio: Whisper | Scene-level sequential alignment | Vriptor for dense video captioning. | GitHub |
| Kangaroo | arXiv 2024 | Text: Llama-3-8B-Instruct | Multi-modal projector | Video-language model supporting long-context video input. | - |
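Several entries in Table 3 rely on a Q-Former, where a small set of learnable queries cross-attends to the visual features and only the resulting query tokens are handed to the LLM, compressing long videos into a fixed-length representation. The stripped-down sketch below illustrates just that bottleneck with a single cross-attention layer; real Q-Formers are full transformer stacks, so treat this as an assumption-laden stand-in rather than any model's implementation.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Stripped-down Q-Former stand-in: learnable queries attend to frames.

    Real Q-Formers stack several transformer blocks and often interact with a
    text encoder; this single-layer version only illustrates the bottleneck.
    """

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (B, num_frames * patches, dim) -> (B, num_queries, dim)
        q = self.queries.expand(frame_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, frame_features, frame_features)
        return self.norm(out + q)

# Toy usage: 1,024 visual tokens compressed to 32 query tokens.
qformer = TinyQFormer()
compressed = qformer(torch.randn(2, 1024, 768))
print(compressed.shape)  # torch.Size([2, 32, 768])
```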
Click to expand Table 4
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| LAVAD | CVPR 2024 | Text: Llama-2-13b-chat | Converts video features into textual prompts for LLMs | Training-free video anomaly detection using LLMs. | GitHub |
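LAVAD's entry illustrates a different, training-free route: video content is first turned into text (e.g., per-frame captions), and an off-the-shelf LLM is prompted to reason over that text. The sketch below shows only the general prompt-building idea with hypothetical captions and a placeholder `query_llm` callable; it is not LAVAD's actual pipeline, prompts, or scoring scheme.

```python
from typing import Callable, List

def build_anomaly_prompt(frame_captions: List[str]) -> str:
    """Turn per-frame captions into a single textual prompt for an LLM.

    A hypothetical illustration of 'converting video features into textual
    prompts'; the caption source and the wording are assumptions.
    """
    lines = [f"t={i}: {caption}" for i, caption in enumerate(frame_captions)]
    return (
        "You are watching a surveillance video described frame by frame.\n"
        + "\n".join(lines)
        + "\nOn a scale of 0 to 1, how anomalous is the most recent frame? "
          "Answer with a single number."
    )

def score_clip(frame_captions: List[str], query_llm: Callable[[str], str]) -> float:
    """query_llm is a placeholder for any text-completion API."""
    reply = query_llm(build_anomaly_prompt(frame_captions))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # fall back if the LLM does not return a bare number

# Toy usage with a dummy LLM that always answers "0.8".
captions = ["a person walks along the platform", "the person climbs over the barrier"]
print(score_clip(captions, lambda prompt: "0.8"))
```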
Click to expand Table 5
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Video-CCAM | arXiv 2024 | Text: Phi-3-4k-instruct/ Yi-1.5-9B-Chat | Cross-attention-based projector | Causal cross-attention masks for short and long videos. | GitHub |
| Apollo | arXiv 2024 | Text: Qwen2.5-7B | Perceiver Resampler & Token Integration with Timestamps | Video understanding model. | - |
Click to expand Table 6
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Oryx | arXiv 2024 | Text: Qwen2-7B/32B | Cross attention | Spatial-temporal model for high-resolution understanding. | GitHub |
The tables below summarize the latest multimodal video-LLMs that use video encoders, along with their interaction and fusion mechanisms.
Click to expand Table 7
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoLLM | arXiv 2023 | Text: e.g., BERT, T5 | Semantic translator aligns visual and text encodings | Video sequence modeling using LLMs. | GitHub |
| Loong | arXiv 2024 | Text: Standard text tokenizer | Decoder-only autoregressive LLM with causal attention | Autoregressive LLM for minute-level long video generation. | - |
Click to expand Table 8
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| LaViLa | CVPR 2023 | Text: 12-layer Transformer | Cross-attention modules | Video-language representation learning from LLM-generated narrations. | GitHub |
| Video ReCap | CVPR 2024 | Text: GPT-2 | Cross-attention layers | Recursive hierarchical captioning model. | GitHub |
Click to expand Table 9
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| OmniViD | CVPR 2024 | Text: BART | MQ-Former | Generative model for universal video understanding. | GitHub |
Click to expand Table 10
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoChat2 | CVPR 2024 | Text: Vicuna | Linear projection | A comprehensive multi-modal video understanding benchmark. | GitHub |
Click to expand Table 11
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Video-LLaVA | arXiv 2023 | Text: Vicuna v1.5 | MLP projection layer | Unified visual representation learning for video. | GitHub |
| MotionLLM | arXiv 2024 | Text: Vicuna | Motion / Video translator | Understanding human behaviors from human motions and videos. | GitHub |
| Holmes-VAD | arXiv 2024 | Text: LLaMA3-Instruct-70B | Temporal sampler | Multimodal LLM for video anomaly detection. | GitHub |
Click to expand Table 12
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| InternVideo2 | ECCV 2024 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
Click to expand Table 13
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| InternVideo2 | ECCV 2024 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
| VITA | arXiv 2024 | Text: Mixtral-8x7B, Audio: Mel Filter Bank | MLP | Open-source interactive multimodal LLM. | GitHub |
Click to expand Table 14
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| ChatVideo | arXiv 2023 | Text: ChatGPT, Audio: e.g., Whisper | Tracklet-centric with ChatGPT reasoning | Chat-based video understanding system. | Coming soon |
The distributions of interaction/fusion mechanisms and data modalities in 66 closely related video-LLMs from January 2024 to December 2024. (Left): Fusion mechanisms are classified into five categories: Cross-attention (e.g., cross-attention modules, gated cross-attention), Projection layers (e.g., linear projection, MLP projection), Q-Former-based methods (e.g., Q-Former aligns multi-modal embeddings, trainable Q-Former), Motion/Temporal-specific mechanisms (e.g., temporal samplers, scene-level sequential alignment), and Other methods (e.g., tracklet-centric, Perceiver Resampler, MQ-Former). (Right): The distribution of data modalities used in these video-LLMs, with text modalities appearing across all models. Note that a model may use multiple fusion methods and/or data modalities.
The tables below provide a comprehensive overview of video datasets across various tasks; a small, assumed annotation schema is sketched after Table 15 below.
Click to expand Table 15
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| HMDB51 | 2011 | YouTube | 6,766 | Video | 3~4 | No | Daily human actions |
| UCF101 | 2012 | YouTube | 13,320 | Video+Audio | 7.21 | No | Human actions (e.g., sports, daily activities) |
| ActivityNet | 2015 | YouTube | 27,801 | Video+Text | 300~1200 | Temporal extent provided | Human-centric activities |
| Charades | 2016 | Crowdsourced | 9,848 | Video+Text | 30.1 | Start and end timestamps provided | Household activities |
| Kinetics-400 | 2017 | YouTube | 306,245 | Video | 10 | No | Human actions (e.g., sports, tasks) |
| AVA | 2018 | Movies | 430 | Video | Variable | Start and end timestamps provided | Action localization in movie scenes |
| Something-Something V2 | 2018 | Crowdsourced | 220,847 | Video | 2~6 | Weak | Human-object interactions |
| COIN | 2019 | YouTube | 11,827 | Video+Text | 141.6 | Start and end timestamps provided | Comprehensive instructional tasks (e.g., cooking, repair) |
| Kinetics-700 | 2019 | YouTube | 650,317 | Video | 10 | No | Expanded version of Kinetics-400 and Kinetics-600 |
| EPIC-KITCHENS | 2020 | Participant kitchens | 432 | Video+Text+Audio | ~458 | Start and end timestamps provided | Large-scale egocentric kitchen activities |
| Ego4D | 2021 | Wearable Cameras | 3,850 hours | Video+Text+Audio | Variable | Start and end timestamps provided | First-person activities and interactions |
| VidSitu | 2021 | YouTube | 29,000 | Video+Text | ~10 | Temporal extent for events provided | Event-centric and causal activity annotations |
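The "Temporal annotation" column separates datasets that provide explicit start/end timestamps from those with only weak or clip-level labels. As a purely illustrative, assumed schema (none of these datasets ships this exact format), both regimes can be represented as follows:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TemporalSegment:
    label: str
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds

@dataclass
class VideoSample:
    """Assumed schema for the datasets above; not an official format."""
    video_id: str
    duration_s: float
    # Datasets with 'start and end timestamps provided' fill this list;
    # weakly annotated datasets leave it empty and rely on clip_label.
    segments: List[TemporalSegment] = field(default_factory=list)
    clip_label: Optional[str] = None

# Toy examples mirroring the two annotation regimes in Table 15.
charades_like = VideoSample(
    video_id="abc123", duration_s=30.1,
    segments=[TemporalSegment("opening a fridge", 4.2, 9.8)],
)
kinetics_like = VideoSample(video_id="xyz789", duration_s=10.0, clip_label="playing tennis")
print(len(charades_like.segments), kinetics_like.clip_label)
```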
Click to expand Table 16
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| MovieQA | 2016 | Multiple platforms | 408 | Video+Text | 202.7 | Start and end timestamps provided | QA for movie scenes |
| TGIF-QA | 2016 | Tumblr GIFs | 56,720 | Video+Text | 3~5 | Action timestamps provided | QA over social media GIFs |
| MSVD-QA | 2017 | YouTube | 1,970 | Video+Text | 27.5 | Start and end timestamps provided | QA for actions description |
| MSRVTT-QA | 2017 | YouTube | 10,000 | Video+Text | 15~30 | Weak | QA across diverse scenes |
| TVQA | 2019 | TV Shows | 21,793 | Video+Text | 60~90 | Start and end timestamps provided | QA over medical dramas, sitcoms, crime shows |
| ActivityNet-QA | 2019 | YouTube | 5,800 | Video+Text | 180 | Implicit (derived from ActivityNet) | QA for human-annotated videos |
| How2QA | 2020 | HowTo100M (YouTube) | 22,000 | Video+Text | 60 | Temporal extent provided | QA over instructional videos |
| YouCookQA | 2021 | YouCook2 (YouTube) | 2,000 | Video+Text | 316.2 | Temporal boundaries provided | Cooking-related instructional QA |
| STAR | 2021 | Human activity datasets | 22,000 | Video+Text | Variable | Action-level boundaries provided | QA over human-object interactions |
| MVBench | 2023 | Public datasets | 3,641 | Video+Text | 5~35 | Start and end timestamps provided | Multi-domain QA (e.g., sports, indoor scenes) |
| EgoSchema | 2023 | Ego4D (Wearable Cameras) | 5,063 | Video+Text | 180 | Timestamped narrations provided | Long-form egocentric activities |
Click to expand Table 17
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| YouCook | 2013 | YouTube | 88 | Video+Text | 180~300 | Weak | Cooking instructional videos |
| MSR-VTT | 2016 | YouTube | 7,180 | Video+Text+Audio | 10~30 | Weak | General scenarios (e.g., sports, transport) |
| ActivityNet Captions | 2017 | YouTube | 20,000 | Video+Text | 180 | Start and end timestamps provided | Dense captions for human-centered activities |
| VATEX | 2019 | YouTube | 41,250 | Video+Text | ~10 | Weak | Multilingual descriptions with English-Chinese parallel captions |
| HowTo100M | 2019 | YouTube | 1.22M | Video+Text+Audio | 390 | Subtitle timestamps provided | Instructional video captions |
| TVC | 2020 | TV Shows | 108,965 | Video+Text | 76.2 | Start and end timestamps provided | Multimodal video captioning dataset |
Click to expand Table 18
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| LSMDC | 2015 | Movies | 118,114 | Video+Text | 4.8 | Start and end timestamps provided | Large-scale dataset for movie description tasks |
| DiDeMo | 2017 | Flickr (YFCC100M) | 10,464 | Video+Text | 27.5 | Start and end timestamps provided | Moment localization in diverse, unedited personal videos |
| FIVR-200K | 2019 | YouTube | 225,960 | Video | ~120 | Start and end timestamps provided | Large-scale incident video retrieval dataset with diverse news events |
| TVR | 2020 | TV Shows | 21,793 | Video+Text | 76.2 | Start and end timestamps provided | Video-subtitle multimodal moment retrieval dataset |
| TextVR | 2023 | YouTube | 10,500 | Video+Text | 15 | Weak | Cross-modal video retrieval with text reading comprehension |
| EgoCVR | 2024 | Ego4D | 2,295 | Video+Text | 3.9~8.1 | Weak | Egocentric dataset for fine-grained composed video retrieval |
Click to expand Table 19
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| Subway Entrance | 2008 | Surveillance cameras | 1 | Video | 4,800 | No | Crowd monitoring for unusual event detection at subway entrances |
| Subway Exit | 2008 | Surveillance cameras | 1 | Video | 5,400 | No | Crowd monitoring for unusual event detection at subway exits |
| CUHK Avenue | 2013 | Surveillance cameras | 15 | Video | 120 | No | Urban avenue scenes with anomalies like running, loitering, etc. |
| Street Scene | 2020 | Urban street surveillance | 81 | Video | 582 | Spatial and temporal bounding boxes | Urban street anomalies (e.g., jaywalking, loitering, illegal parking, etc.) |
| XD-Violence | 2020 | Movies and in-the-wild scenes | 4,754 | Video+Audio | ~180 | Start and end timestamps provided | Multimodal violence detection covering six violence types |
| CUVA | 2024 | YouTube, Bilibili | 1,000 | Video+Text | ~117 | Start and end timestamps provided | Causation-focused anomaly understanding across 42 anomaly types |
| MSAD | 2024 | Online Surveillance | 720 | Video | ~20 | Frame-level annotations in test set | Multi-scenario dataset with 14 scenarios |
Click to expand Table 20
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| VIDAL-10M | 2023 | Multiple platforms | 10M | Video+Infrared+Depth+Audio+Text | ~20 | Weak | Multi-domain retrieval dataset |
| Video-MME | 2024 | YouTube | 900 | Video+Text+Audio | 1017.9 | Temporal ranges via certificate length | Comprehensive evaluation benchmark across many domains |
(Left): Performance (accuracy) comparison of recent video-LLMs on the Video-MME benchmark. (Right): Performance comparison of recent video-LLMs on video QA benchmarks. Models using pretrained video encoders (e.g., Video-LLaVA and VideoChat2) are marked with squares, while models using pretrained image encoders are represented by circles.
Performance comparison of recent video-LLMs on (a) video retrieval and (b) video captioning benchmarks.
We warmly invite everyone to contribute to this repository and help enhance its quality and scope. Feel free to submit pull requests that add new methods, datasets, or other useful resources, or that correct any errors you find. To keep things consistent, please format your pull requests to match the structure of the existing tables. We greatly appreciate your valuable contributions and support!