Yufan Deng,
Zilin Pan,
Hongyu Zhang,
Xiaojie Li,
Ruoqing Hu,
Yufei Ding,
Yiming Zou,
Yan Zeng,
Daquan Zhou
This repository is the official implementation of our work, consisting of (i) RBench, a fine-grained benchmark tailored for robotics video generation, and (ii) RoVid-X, a million-scale dataset for training robotics video models. We reveal
the limitations of current video foundation models and potential directions for improvement, offering new perspectives for researchers exploring the embodied domain using video world models. Our goal is to establish a solid foundation for the rigorous assessment and scalable training of video generation models in the field of physical AI, accelerating the progress of embodied AI toward general intelligence.
- [Ongoing] 🔥 We are actively training a physically plausible robotic video world model and applying it to real-world deployment in downstream robotic tasks. Stay tuned!
- [2026.1.22] 🔥 Once the internal review is approved, we will release the RoVid-X robotic video dataset and open-source RBench, both on Hugging Face.
- [2026.1.22] 🔥 Our Research Paper is now available, and the Project Page has been created.
- Embodied Execution Evaluation: Measures the action execution success rate of generated videos using an Inverse Dynamics Model (IDM); see the sketch below.
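The snippet below is an illustrative sketch of this idea, not the repo's actual evaluation code: an IDM (any callable with the assumed frame-pair interface) recovers the action implied by each pair of consecutive generated frames, and a rollout counts as successful when every recovered action stays close to the ground-truth one.

```python
# Illustrative sketch only: the actual RBench IDM evaluation lives in the
# repo's eval scripts; the function and argument names below are hypothetical.
import numpy as np

def idm_success_rate(videos, gt_actions, idm, tol=0.05):
    """Estimate the execution success rate of generated robot videos.

    videos     : list of arrays, each (T, H, W, 3) uint8 frames of one rollout
    gt_actions : list of arrays, each (T-1, A) ground-truth action sequence
    idm        : callable mapping a pair of frames -> predicted (A,) action
                 (an Inverse Dynamics Model; interface is an assumption)
    tol        : per-step mean absolute error below which a step counts as correct
    """
    successes = 0
    for frames, actions in zip(videos, gt_actions):
        ok = True
        for t in range(len(frames) - 1):
            pred = idm(frames[t], frames[t + 1])          # action implied by the generated video
            if np.mean(np.abs(pred - actions[t])) > tol:  # compare with the commanded action
                ok = False
                break
        successes += int(ok)
    return successes / max(len(videos), 1)
```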
```bash
# 0. Clone the repo
git clone https://github.com/DAGroup-PKU/ReVidgen.git
cd ReVidgen

# 1. Environment for RBench
conda create -n rbench python=3.10.18
conda activate rbench
pip install --upgrade setuptools
pip install torch==2.5.1 torchvision==0.20.1

# Install Grounded-Segment-Anything module
cd pkgs/Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt

# Install Grounded-SAM-2 module
cd ../Grounded-SAM-2
pip install -e .

# Install Q-Align module
cd ../Q-Align
pip install -e .

# Back to the repo root and install the remaining requirements
cd ../..
pip install -r requirements.txt
```
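After installation, a quick import check (optional, not part of the repo) can catch broken installs before the heavier evaluation runs; the module names below are assumptions based on what the upstream packages usually expose, so adjust them if your local installs differ:

```python
# Quick sanity check after installation. The module names are assumptions
# based on the upstream packages; adjust if your versions expose different names.
for name in ("torch", "segment_anything", "groundingdino", "sam2"):
    try:
        __import__(name)
        print(f"[ok] {name}")
    except ImportError as err:
        print(f"[missing] {name}: {err}")
```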
Please download the checkpoint files from RBench and organize them into the following directory structure before running the evaluation:
```
ReVidgen/
├── checkpoints/
│   ├── BERT
│   │   └── google-bert
│   │       └── bert-base-uncased
│   │           ├── LICENSE
│   │           └── ...
│   ├── GroundingDino
│   │   └── groundingdino_swinb_cogcoor.pth
│   ├── q-future
│   │   └── one-align
│   │       ├── README.md
│   │       └── ...
│   ├── SAM
│   │   └── sam2.1_hiera_large.pt
│   └── Cotracker
│       └── scaled_offline.pth
│
├── eval/
│   ├── 4_embodiments/
│   ├── 5_tasks/
│   └── ...
│
├── pkgs/
│   ├── Grounded-Segment-Anything/
│   └── ...
└── ...
```

RBench evaluates mainstream video generation models and shows a strong alignment with human evaluations, achieving a Spearman correlation of 0.96.
Evaluations across task-oriented and embodiment-specific dimensions for 25 models spanning open-source, commercial, and robotics-specific families.
[Demo video: RoVid-X.mp4]
We present RoVid-X, a large-scale dataset of real-world robotic interactions that provides RGB, depth, and optical-flow videos to facilitate the training of embodied video models.
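As a rough illustration of how the three modalities might be consumed, the sketch below reads one hypothetical sample; the paths `sample_0001/rgb.mp4`, `depth.mp4`, and `flow.mp4` are assumptions and the released layout may differ:

```python
# Illustrative only: file names and layout are assumptions, not the released
# RoVid-X format. Reads the three aligned modalities of one hypothetical sample.
import cv2
import numpy as np

def read_video(path):
    """Return all frames of a video as a (T, H, W, C) uint8 array."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return np.stack(frames) if frames else np.empty((0,))

rgb   = read_video("sample_0001/rgb.mp4")    # appearance
depth = read_video("sample_0001/depth.mp4")  # geometry
flow  = read_video("sample_0001/flow.mp4")   # motion
print(rgb.shape, depth.shape, flow.shape)
```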
```bash
# If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
# pip install -U "huggingface_hub[cli]"
huggingface-cli download DAGroup-PKU/RBench
```
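The same download can also be done from Python via `huggingface_hub`; the `local_dir` below is an assumption, so move the files into the checkpoint layout shown above as needed:

```python
# Equivalent download via the huggingface_hub Python API.
# local_dir is an assumption; rearrange files to match the checkpoint layout above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="DAGroup-PKU/RBench",
    local_dir="checkpoints",
)
```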
Generated videos should be organized following the directory structure below.
```
ReVidgen/
└── data/
    └── {model_name}/
        └── {task_name/embodiment_name}/
            └── videos/
                ├── 0001.mp4
                ├── 0002.mp4
                ├── 0003.mp4
                └── ...
```
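Before launching the evaluation scripts, a small check like the following (illustrative only, not part of the repo) confirms that every model/task folder contains videos in this layout:

```python
# Illustrative layout check (not part of the repo): verify each model/task
# folder under data/ contains a videos/ directory with at least one .mp4 file.
from pathlib import Path

data_root = Path("data")
for model_dir in sorted(data_root.iterdir()):
    if not model_dir.is_dir():
        continue
    for task_dir in sorted(model_dir.iterdir()):
        videos = sorted((task_dir / "videos").glob("*.mp4"))
        status = f"{len(videos)} videos" if videos else "MISSING videos/"
        print(f"{model_dir.name}/{task_dir.name}: {status}")
```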
> **Note:** To enable GPT-based evaluation, please prepare your OpenAI API key in advance and set the `API_KEY` field in the following evaluation scripts accordingly.
```bash
# Run embodiment-oriented evaluation
bash scripts/rbench_eval_4embodiments.sh

# Run task-oriented evaluation
bash scripts/rbench_eval_5tasks.sh
```

The videos used in these demos are sourced from the public domain or generated by models, and are intended solely to showcase the capabilities of this research. If you have any concerns, please contact us at [email protected], and we will promptly remove them.
If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation.
```bibtex
@article{deng2026rethinking,
  title={Rethinking Video Generation Model for the Embodied World},
  author={Deng, Yufan and Pan, Zilin and Zhang, Hongyu and Li, Xiaojie and Hu, Ruoqing and Ding, Yufei and Zou, Yiming and Zeng, Yan and Zhou, Daquan},
  journal={arXiv preprint arXiv:2601.15282},
  year={2026}
}
```