Rethinking Video Generation Model for the Embodied World


hf_space arXiv Home Page Dataset Benchmark Video License

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu,
Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou

πŸ“£ Overview

This repository is the official implementation of our work, consisting of (i) RBench, a fine‑grained benchmark tailored for robotics video generation, and (ii) RoVid-X, a million‑scale dataset for training robotics video models. We reveal the limitations of current video foundation models and point to directions for improvement, offering new perspectives for researchers exploring the embodied domain with video world models. Our goal is to establish a solid foundation for the rigorous assessment and scalable training of video generation models for physical AI, accelerating the progress of embodied AI toward general intelligence.

πŸ”₯ News

  • [Ongoing] πŸ”₯ We are actively training a physically plausible robotic video world model and applying it for real-world deployment in downstream robotic tasks. Stay tuned!
  • [2026.1.22] πŸ”₯ Once our internal review is complete, we will release the RoVid-X robotic video dataset and open-source RBench on Hugging Face.
  • [2026.1.22] πŸ”₯ Our Research Paper is now available, and the Project Page is live.

πŸŽ₯ Demo

small.mp4

πŸ“‘ Todo List

  • Embodied Execution Evaluation: Measure the action execution success rate of generated videos using Inverse Dynamics Model (IDM).

βš™οΈ Installation

Environment

# 0. Clone the repo
git clone https://github.com/DAGroup-PKU/ReVidgen.git
cd ReVidgen

# 1. Environment for RBench
conda create -n rbench python=3.10.18
conda activate rbench

pip install --upgrade setuptools
pip install torch==2.5.1 torchvision==0.20.1

# Install Grounded-Segment-Anything module
cd pkgs/Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt

# Install Grounded-SAM-2 module
cd ../Grounded-SAM-2
pip install -e .

# Install Q-Align module
cd ../Q-Align
pip install -e .

cd ..
pip install -r requirements.txt

Download Checkpoints

Please download the checkpoint files from RBench and organize them under the following directory before running the evaluation:

ReVidgen/
β”œβ”€β”€ checkpoints/
β”‚   β”œβ”€β”€ BERT
β”‚   β”‚   └── google-bert
β”‚   β”‚       └── bert-base-uncased
β”‚   β”‚           β”œβ”€β”€ LICENSE
β”‚   β”‚           └── ...
β”‚   β”œβ”€β”€ GroundingDino
β”‚   β”‚   └── groundingdino_swinb_cogcoor.pth
β”‚   β”œβ”€β”€ q-future
β”‚   β”‚   └── one-align
β”‚   β”‚       β”œβ”€β”€ README.md
β”‚   β”‚       └── ...
β”‚   β”œβ”€β”€ SAM
β”‚   β”‚   └── sam2.1_hiera_large.pt
β”‚   └── Cotracker
β”‚       └── scaled_offline.pth
β”‚
β”œβ”€β”€ eval/
β”‚   β”œβ”€β”€ 4_embodiments/
β”‚   β”œβ”€β”€ 5_tasks/
β”‚   └── ...
β”‚
β”œβ”€β”€ pkgs/
β”‚   β”œβ”€β”€ Grounded-Segment-Anything/
β”‚   └── ...
└── ...
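
Before launching an evaluation, a quick stdlib-only sanity check like the sketch below can confirm the expected checkpoint files are in place. The path list mirrors the tree above; `missing_checkpoints` is a hypothetical helper, not part of the repo — adjust the paths if your layout differs.

```python
from pathlib import Path

# Expected checkpoint paths, mirroring the directory tree above.
EXPECTED = [
    "checkpoints/GroundingDino/groundingdino_swinb_cogcoor.pth",
    "checkpoints/SAM/sam2.1_hiera_large.pt",
    "checkpoints/Cotracker/scaled_offline.pth",
    "checkpoints/BERT/google-bert/bert-base-uncased",
    "checkpoints/q-future/one-align",
]

def missing_checkpoints(root="."):
    """Return the expected checkpoint paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    gaps = missing_checkpoints()
    if gaps:
        print("Missing:", *gaps, sep="\n  ")
    else:
        print("All checkpoints found.")
```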

πŸ“ˆ RBench Results

RBench evaluates mainstream video generation models and shows a strong alignment with human evaluations, achieving a Spearman correlation of 0.96.

πŸ“Š RBench Results Across Tasks and Embodiments

RBench table: evaluations across task-oriented and embodiment-specific dimensions for 25 models, spanning open-source, commercial, and robotics-specific families.

πŸ“¦ Dataset

RoVid-X.mp4

We present RoVid-X, a large-scale dataset of real-world robotic interaction videos, providing RGB, depth, and optical-flow videos to facilitate the training of embodied video models.

πŸ”§ Usage

πŸ“₯ Download RBench Validation Set

# If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
# pip install -U "huggingface_hub[cli]"
huggingface-cli download DAGroup-PKU/RBench

🎬 Video Generation Format

Generated videos should be organized following the directory structure below.

ReVidgen/
└── data/
    └── {model_name}/
        └── {task_name/embodiment_name}/
            └── videos/
                β”œβ”€β”€ 0001.mp4
                β”œβ”€β”€ 0002.mp4
                β”œβ”€β”€ 0003.mp4
                └── ...
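
Assuming the naming convention above (zero-padded, consecutive four-digit `.mp4` indices starting at `0001`), a small helper can verify a `videos/` folder before evaluation. `check_videos_dir` is a hypothetical utility, not part of the repo:

```python
import re
from pathlib import Path

def check_videos_dir(videos_dir):
    """Check that a videos/ folder holds consecutive NNNN.mp4 files from 0001.

    Returns (ok, message).
    """
    names = sorted(p.name for p in Path(videos_dir).glob("*.mp4"))
    if not names:
        return False, "no .mp4 files found"
    pat = re.compile(r"^(\d{4})\.mp4$")
    nums = []
    for n in names:
        m = pat.match(n)
        if not m:
            return False, f"unexpected filename: {n}"
        nums.append(int(m.group(1)))
    if nums != list(range(1, len(nums) + 1)):
        return False, "indices are not consecutive from 0001"
    return True, f"{len(nums)} videos OK"
```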

πŸ€— Quick Start

> **Note:** To enable GPT-based evaluation, please prepare your OpenAI API key in advance and set the `API_KEY` field in the following evaluation scripts accordingly.

# Run embodiment-oriented evaluation
bash scripts/rbench_eval_4embodiments.sh

# Run task-oriented evaluation
bash scripts/rbench_eval_5tasks.sh

πŸ“§ Ethics Concerns

The videos used in these demos are sourced from the public domain or generated by models, and are intended solely to showcase the capabilities of this research. If you have any concerns, please contact us at [email protected], and we will promptly remove them.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation πŸ“.

BibTeX

@article{deng2026rethinking,
  title={Rethinking Video Generation Model for the Embodied World},
  author={Deng, Yufan and Pan, Zilin and Zhang, Hongyu and Li, Xiaojie and Hu, Ruoqing and Ding, Yufei and Zou, Yiming and Zeng, Yan and Zhou, Daquan},
  journal={arXiv preprint arXiv:2601.15282},
  year={2026}
}
