News
- 2025-04-02: First release of Open-Omni-Nexus, a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.
- 2025-04-01: Meet M4. M4 enables multiplexed modeling capabilities for a video language model at minimal cost.
- 2025-04-01: First release of OmniMMI. We're excited to share the first version of OmniMMI.
We introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, spanning six distinct subtasks.
- Streaming Temporal State Awareness. Streaming video understanding must build an understanding of the current and historical temporal state incrementally, without access to future context. This contrasts with traditional MLLMs, which can leverage the entire multi-modal context, and poses challenges in our tasks of action prediction (AP), state grounding (SG), and multi-turn dependencies (MD).
- Proactive Reasoning and Turn-Taking. Generating responses proactively and appropriately anticipating the turn-taking point with respect to the user's intentions and the dynamic context is a crucial capability for general interactive agents. This typically requires models to identify speakers (SI), distinguish noise from legitimate queries (PT), and proactively initiate a response (PA); this streaming, proactive setting is illustrated in the sketch below.
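To make the setting concrete, the following minimal sketch (hypothetical pseudocode, not the actual OmniMMI evaluation harness; `model.step`, `Frame`, and the query dictionary are placeholders) shows how a streaming model only ever sees frames up to the current timestamp and must also decide whether to respond at all.

```python
# Hypothetical illustration of a streaming, proactive evaluation loop.
# `model`, `Frame`, and `queries` are placeholders, not part of the OmniMMI codebase.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Frame:
    timestamp: float  # seconds from video start
    pixels: bytes     # raw frame data (placeholder)


def streaming_eval(model, frames: List[Frame], queries: Dict[float, str]) -> List[Tuple[float, str]]:
    """Feed frames in temporal order; the model never sees future context."""
    responses: List[Tuple[float, str]] = []
    history: List[Frame] = []
    for frame in frames:
        history.append(frame)  # temporal state grows incrementally
        query: Optional[str] = queries.get(frame.timestamp)  # the user may or may not speak now
        # Proactive setting: with no query the model may still alert (PA),
        # and with a query it must judge whether it is a legitimate turn to answer (PT).
        reply = model.step(history=history, query=query)
        if reply is not None:
            responses.append((frame.timestamp, reply))
    return responses
```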
| Statistic | SG | AP | MD | PT | PA | SI |
|---|---|---|---|---|---|---|
| Videos | 300 | 200 | 300 | 78 | 200 | 200 |
| Queries | 704 | 200 | 786 | 200 | 200 | 200 |
| Avg. Turns | 2.35 | 1.00 | 2.62 | 1.00 | 1.00 | 1.00 |
| Avg. Video Len. (s) | 350.82 | 234.95 | 374.80 | 2004.10 | 149.82 | 549.64 |
| Avg. Query Len. | 16.00 | 25.99 | 26.27 | 8.45 | 17.49 | 60.91 |
You can download OmniMMI from Hugging Face; the release contains both the JSON annotations and the raw videos, organized as follows (a download sketch follows the directory layout):
videohallucer_datasets
├── clips/ # video clips for assessing offline models
├── videos/
├── action_prediction.json
├── dynamic_state_grounding.json
├── multiturn_dependency_reasoning.json
├── proactive_alerting.json
├── proactive_turntaking.json
└── speaker_identification.json
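For convenience, the snippet below sketches one way to fetch the release with `huggingface_hub`; the repository id is a placeholder we have not verified, so substitute the actual OmniMMI dataset id from our Hugging Face page.

```python
# Sketch only: REPO_ID is a placeholder, replace it with the actual OmniMMI dataset id.
from huggingface_hub import snapshot_download

REPO_ID = "<omnimmi-dataset-repo-id>"  # see the project / Hugging Face page for the real id

local_dir = snapshot_download(
    repo_id=REPO_ID,
    repo_type="dataset",    # the benchmark is hosted as a dataset repository
    local_dir="./omnimmi",  # JSON files and videos will be placed here
)
print("Downloaded to", local_dir)
```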
We provide a data sample from the multi-turn dependency reasoning subset for further illustration:
[
{
"video": "jGM8EImhpXg.mp4",
"qa": [
{
"question": "What is the task of the instruction video?",
"answer": "Cook Omelet",
"timestamp": "0.0 -- 36.0"
},
{
"question": "What's the first step of ##ANSWER##?",
"answer": "Pouring the egg into the pot",
"timestamp": "36.0 -- 40.0"
},
{
"question": "What do you do after the step from ##ANSWER##?",
"answer": "fry eggs",
"timestamp": "41.0 -- 70.0"
}
]
},
...
]
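The `##ANSWER##` placeholder refers back to the answer of the previous turn. A minimal loading sketch is shown below; the placeholder-substitution logic is our reading of the format, so double-check it against the released data.

```python
import json

# Load the multi-turn dependency annotations (file name from the layout above).
with open("multiturn_dependency_reasoning.json", "r") as f:
    samples = json.load(f)

for sample in samples:
    prev_answer = None
    for turn in sample["qa"]:
        question = turn["question"]
        # Each ##ANSWER## placeholder refers to the previous turn's answer.
        if prev_answer is not None:
            question = question.replace("##ANSWER##", prev_answer)
        start, end = (t.strip() for t in turn["timestamp"].split("--"))
        print(f"[{sample['video']} {start}-{end}s] Q: {question} | A: {turn['answer']}")
        prev_answer = turn["answer"]
```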
If you want to upload results from your models, feel free to submit a PR following one of these baselines, or email me ([email protected]) to have your results added to our page.
Available Baselines
Large Video Language Model
- VideoChatGPT-7B
- VideoChat2-7B
- Video-LLaVA-7B
- LLaMA-VID-7B/13B
- MiniGPT4-Video-7B
- PLLaVA-7B/13B/34B
- LLaVA-NeXT-Video-DPO-7B/34B
- ShareGPT4Video-8B
- LongVA-7B
- LongVILA-8B
- LongLLaVA-9B
- VideoLLM-online-8B
- VideoLLaMB-7B
- InternLM-XComposer-2.5-OmniLive-1.5B
- Gemini-1.5-pro
- GPT4O (Azure)
Large Omni Language Model
- VideoLLaMA2-7B
- VITA-7B
- MiniOmni2-0.5B
- M4-7B
For detailed instructions on installation and checkpoints of these baselines, please consult the INSTALLATION guide.
Note: To evaluate audio-input models, we use a TTS model to automatically convert text queries into audio. Please install ChatTTS beforehand.
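If you need to regenerate the audio queries yourself, a minimal ChatTTS sketch follows; note that the loading API has changed across ChatTTS releases (`load()` vs. `load_models()`), so adapt it to the version you install.

```python
# Sketch: synthesize an audio query with ChatTTS (exact API may differ slightly by version).
import ChatTTS
import soundfile as sf

chat = ChatTTS.Chat()
chat.load()  # older releases expose chat.load_models() instead

texts = ["What is the first step shown in the video?"]  # example text query
wavs = chat.infer(texts)  # one waveform per input text

wav = wavs[0]
if wav.ndim > 1:  # some versions return shape (1, num_samples)
    wav = wav[0]
sf.write("query_0.wav", wav, 24000)  # ChatTTS generates 24 kHz audio
```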
1. debug the inference pipeline

cd baselines
python ../model_testing_zoo.py --model_name Gemini-1.5-pro # choose from ["VideoChatGPT", "VideoChat2", "VideoLLaVA", "LLaMA-VID", "PLLaVA", "PLLaVA-13B", "PLLaVA-34B", "LLaVA-NeXT-Video", "LLaVA-NeXT-Video-34B", "LongVA", "LongVILA", "LongLLaVA", "VideoLLaMB", "VideoOnline", "VideoLLaMBOneline", "Gemini-1.5-pro", "GPT4O"]

2. run the evaluation

To run everything at once:

cd baselines
bash run_all.sh
bash eval_all.sh

Or configure each per-subtask evaluation script individually, e.g. run_ap.sh:
2.1 set the model and conda environment
model_names=("LongVA" "PLLaVA-13B")
environments=("llongva" "pllava")

There are two groups of models:
(i) common models: "MiniGPT4-Video" "VideoChatGPT" "VideoLLaVA" "VideoChat2" "LLaMA-VID" "PLLaVA" "LLaVA-NeXT-Video" "ShareGPT4Video" "PLLaVA-13B" "PLLaVA-34B" "LLaVA-NeXT-Video-34B" "LongVA" "LongVILA" "LongLLaVA" "VideoLLaMB" "InternLMXCO" "InterSuit" "InterSuitAV"
(ii) online models: "VideoOnline" "VideoLLaMBOnline" "InterSuitOnline" "InterSuitOnlineAV"
2.2 set the conda environment path
source ~/scratch/anaconda3/bin/activate

2.3 run the evaluation script for each subtask
cd baselines
bash run_ap.sh # action prediction
bash eval_ap.sh

| Models | LLM | Num Frames | SG avg. | AP | MD avg. | SI | PA | PT | avg. w/o P |
|---|---|---|---|---|---|---|---|---|---|
| Gemini-1.5-Pro | - | 128 | 16.33 | 43.00 | 12.00 | 38.50 | ✗ | ✗ | 27.46 |
| GPT-4o | - | 50 | 15.00 | 39.50 | 12.33 | 17.00 | ✗ | ✗ | 20.96 |
| IXC2.5-OL | Qwen2-1.5B | 512 | 4.03 | 30.50 | 4.00 | 23.00 | ✗ | 14.50 | 15.04 |
| LongVILA | Llama3-8B | 128 | 4.33 | 39.00 | 3.00 | 10.00 | ✗ | ✗ | 14.21 |
| M4 | Qwen2-7B | 32 / 1 fps | 5.67 | 35.67 | 1.67 | 9.00 | 25.50 | 62.00 | 12.46 |
| LongVA | Qwen2-7B | 32 | 3.33 | 33.33 | 2.33 | 3.00 | ✗ | ✗ | 11.54 |
| LongLLaVA | Jamba-9B | 128 | 3.33 | 29.00 | 3.67 | 10.00 | ✗ | ✗ | 11.50 |
| LLAMA-VID-13B | Vicuna-13B | 128 | 1.33 | 30.50 | 3.33 | 8.50 | ✗ | ✗ | 10.91 |
| VideoChatGPT | LLaMA-7B | 100 | 3.33 | 33.35 | 3.00 | 3.50 | ✗ | ✗ | 10.83 |
| LLAMA-VID | Vicuna-7B | 128 | 2.33 | 29.00 | 2.67 | 7.50 | ✗ | ✗ | 10.38 |
| VideoLLM-online | Llama3-8B | 1 fps | 4.67 | 35.00 | 1.33 | 5.00 | ✗ | ✗ | 10.25 |
| PLLAVA-34B | Yi-34B | 16 | 3.67 | 25.00 | 3.00 | 5.00 | ✗ | ✗ | 10.04 |
| PLLAVA-13B | Vicuna-13B | 16 | 2.67 | 25.00 | 4.33 | 6.50 | ✗ | ✗ | 9.62 |
| LLAVA-NeXT-Video-34B | Yi-34B | 32 | 2.67 | 30.50 | 1.67 | 1.50 | ✗ | ✗ | 9.59 |
| VideoLLAMB | Vicuna-7B | 32 / 1 fps | 2.33 | 29.50 | 3.00 | 3.00 | ✗ | ✗ | 9.46 |
| PLLAVA | Vicuna-7B | 16 | 3.33 | 30.00 | 1.33 | 3.00 | ✗ | ✗ | 9.41 |
| ShareGPT4Video | Llama3-8B | 16 | 2.00 | 29.00 | 2.00 | 4.50 | ✗ | ✗ | 9.38 |
| LLAVA-NeXT-Video | Vicuna-7B | 32 | 3.00 | 30.50 | 3.00 | 1.50 | ✗ | ✗ | 9.25 |
| Video-LLaVA | Vicuna-7B | 8 | 1.67 | 28.00 | 2.00 | 1.50 | ✗ | ✗ | 9.25 |
| VideoChat2 | Vicuna-7B | 8 | 2.33 | 27.50 | 2.67 | 1.00 | ✗ | ✗ | 8.38 |
| MiniGPT4-Video | Mistral-7B | 45 | 4.00 | 23.00 | 1.67 | 1.00 | ✗ | ✗ | 7.92 |
| Models | LLM | Num Frames | SG avg. | AP | MD avg. | SI | PA | PT | avg. w/o P |
|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA2 | Qwen2-7B | 8 | 10.33 | 35.00 | 3.00 | 5.00 | ✗ | ✗ | 13.33 |
| VITA | Mistral-8×7B | 16 | 0.00 | 39.00 | 2.00 | 1.50 | ✗ | 67.00 | 10.62 |
| M4-audio | Qwen2-7B | 32 / 1 fps | 2.00 | 13.00 | 3.00 | 7.50 | 1.50 | 68.50 | 6.38 |
| MiniOmni2 | Qwen2-0.5B | 1 | 4.67 | 14.00 | 1.00 | 1.00 | ✗ | ✗ | 5.17 |
Please check the detailed leaderboard on our project page.
If you find our work helpful, please consider citing it.
@article{omnimmi,
title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
journal={arXiv preprint},
year={2025}
}