Yuhao Dong1*, Shulin Tian1*, Shuai Liu1, Shuangrui Ding2,3, Yuhang Zang2, Xiaoyi Dong2, Yuhang Cao2, Jiaqi Wang2, Ziwei Liu1
1S-Lab, Nanyang Technological University 2Shanghai AI Lab 3CUHK-MMLab
*Equal contribution
- [2026-02-09] 🚀 We release Demo-ICL!
Demo-ICL explores a challenging new frontier: Demo-driven Video In-Context Learning. Whereas existing benchmarks test static, internal knowledge, Demo-ICL evaluates whether Multimodal Large Language Models (MLLMs) can acquire procedural knowledge dynamically from provided demonstrations (text or video) to solve novel tasks.
We introduce:
- Demo-ICL-Bench: A benchmark of 1,200 samples derived from HowTo100M instructional videos, requiring models to predict next steps based on context.
- Demo-ICL Model (7B): Built on Ola-Video, this model utilizes a novel two-stage training strategy to achieve state-of-the-art performance in utilizing video demonstrations.
Demo-ICL-Bench consists of three distinct settings designed to test adaptability:
- Text-demo ICL (500 samples): The model must retrieve relevant procedure steps from textual instructions to predict the next action in a target video.
- Video-demo ICL (500 samples): The model is provided with a reference video of a similar task and must transfer that visual procedural knowledge to the target video.
- Demonstration Selection (200 samples): A realistic setting where the model must select from a candidate pool (containing distractors) to find the correct demonstration before solving the task.
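To make the three settings concrete, here is a minimal sketch of how a benchmark sample could be represented; the field names and schema are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical schema for one Demo-ICL-Bench sample. Field names are
# illustrative only; consult the released benchmark for the real format.
@dataclass
class DemoICLSample:
    setting: str                # "text_demo" | "video_demo" | "demo_selection"
    target_video: str           # path/URL of the target video clip
    question: str               # e.g. "What is the next step?"
    options: List[str]          # multiple-choice candidates
    answer: str                 # ground-truth option
    text_demo: Optional[str] = None    # textual instructions (Text-demo ICL)
    video_demo: Optional[str] = None   # reference video (Video-demo ICL)
    candidate_demos: List[str] = field(default_factory=list)  # pool with distractors

# Split sizes as stated above (500 + 500 + 200 = 1,200 samples).
SETTING_SIZES = {"text_demo": 500, "video_demo": 500, "demo_selection": 200}
```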
We construct Demo-ICL-Bench using a coarse-to-fine pipeline on HowTo100M, ensuring high-quality demonstrations:
- Data Processing: WhisperX provides precise timestamps, while Qwen2.5-72B summarizes transcripts into structured instructions, filtering irrelevant steps.
- Demonstration Selection: Video pairs are identified via search rankings (coarse) and validated by LLMs for semantic similarity (fine) to ensure transferability.
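The coarse-to-fine pairing step above can be sketched as follows; `search_rank`, `llm_similarity`, and the thresholds are hypothetical stand-ins for the search-ranking and LLM-validation components, not the actual pipeline code.

```python
# Coarse-to-fine demonstration pairing (sketch). Both scoring functions
# are assumptions: search_rank(target, cand) returns a rank-style score
# (lower is better), llm_similarity(target, cand) returns a semantic
# similarity in [0, 1] as judged by an LLM.
def pair_demonstrations(target, candidates, search_rank, llm_similarity,
                        top_k=10, min_score=0.8):
    # Coarse stage: keep only the top-k candidates by search ranking.
    coarse = sorted(candidates, key=lambda c: search_rank(target, c))[:top_k]
    # Fine stage: an LLM validates semantic similarity of the procedures,
    # keeping only pairs whose knowledge is plausibly transferable.
    return [c for c in coarse if llm_similarity(target, c) >= min_score]
```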
Our model, Demo-ICL, employs a two-stage strategy:
- Video SFT: Fine-tuning Ola-Video on diverse video/image-text data to establish foundational understanding.
- Information-Assisted DPO: A novel pipeline utilizing assistive information during training to align responses with human preferences, enabling accurate inference without auxiliary aids.
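For reference, the second stage optimizes the standard DPO objective (Rafailov et al., 2023); the sketch below shows that objective for one preference pair. The information-assisted aspect described above would, under our reading, enter through the training prompt (which includes assistive information withheld at inference), not through a change to the loss itself.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward margin between the chosen and rejected responses,
    # measured relative to the frozen reference (SFT) policy.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already prefers
    # the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both responses are scored exactly as the reference scores them, the margin is zero and the loss equals log 2; raising the chosen response's log-probability drives the loss toward zero.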
Demo-ICL (7B) significantly outperforms existing open-source and proprietary models on Demo-ICL-Bench:
- State-of-the-Art Performance: Demo-ICL achieves an average accuracy of 33.1%, surpassing Qwen2.5-VL-72B (29.5%) despite being 10x smaller.
- Positive Transfer: Unlike many baselines which degrade when given video demonstrations (negative $\Delta_{ICL}$), Demo-ICL achieves a +4.4 improvement with video demos and +14.0 with text demos.
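The $\Delta_{ICL}$ metric is simply the accuracy gain from adding a demonstration; a minimal sketch (variable names are ours):

```python
def accuracy(preds, golds):
    # Percentage of exact matches between predictions and ground truth.
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def delta_icl(preds_with_demo, preds_without_demo, golds):
    # Positive delta: the demonstration helped; negative: it hurt.
    return accuracy(preds_with_demo, golds) - accuracy(preds_without_demo, golds)
```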
Demo-ICL maintains robust performance on standard benchmarks, achieving 52.6% on VideoMMMU (Knowledge Acquisition), surpassing Qwen2.5-VL-7B and LLaVA-OneVision-7B.
Figure 2. Visualization of Text-demo In-Context Learning. Text instructions guide the model to identify the next procedural step in cooking tasks.
Figure 3. Visualization of Video-demo In-Context Learning. A video demonstration provides visual procedural knowledge for the model to transfer to the target video.
If you find this work useful, please cite our paper:
@article{dong2025demoicl,
title={Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition},
author={Dong, Yuhao and Tian, Shulin and Liu, Shuai and Ding, Shuangrui and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Wang, Jiaqi and Liu, Ziwei},
journal={arXiv preprint arXiv:2602.08439},
year={2026}
}

Demo-ICL builds on the codebases of the following projects: Ola, LLaVA-NeXT, lmms-eval, and HowTo100M. We thank the open-source community for their contributions.