Fangqi Zhu<sup>1,2</sup> · Zhengyang Yan<sup>1</sup> · Zicong Hong<sup>1</sup> · Quanxin Shou<sup>1</sup> · Xiao Ma<sup>2*</sup> · Song Guo<sup>1*</sup>

<sup>1</sup>Hong Kong University of Science and Technology | <sup>2</sup>ByteDance Seed

<sup>*</sup>Corresponding authors
- [2025-11] We release the training code, data and checkpoints for WMPO. Check it out!
- Release the training code and training scripts for WMPO
- Release the checkpoints (VLAs and world models) and training data for WMPO
- Release the training code, training scripts and training data for world models
We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained on web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than the commonly used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
The overall training procedure consists of three components: (1) Imagined Trajectory Generation, where the policy model and the world model interact alternately to generate a full imagined trajectory; (2) Trajectory Sampling, where multiple trajectories are sampled and evaluated by the reward model; and (3) Policy Update, where the policy parameters are optimized.
We recommend using Python 3.11.x and torch 2.5.1 on a Linux server (e.g., Ubuntu), and using pip or conda to manage the environment.
Run the following command to complete environment installation:
pip install -r requirements.txt
bash install.sh
Download the datasets and checkpoints using the following command:
python download_hf.py
The checkpoint files are organized as follows; the name of each checkpoint indicates the task it was trained on.
checkpoint_files
├── reward_models
│ ├── videomae_coffee.pth
│ ├── videomae_square.pth
│ ├── videomae_stack_three.pth
│ └── videomae_three_piece_assembly.pth
├── SFT_models
│ ├── coffee
│ ├── square
│ ├── stack_three
│ └── three_piece_assembly
├── WMPO_models
│ ├── coffee
│ ├── square
│ ├── stack_three
│ └── three_piece_assembly
└── world_models
├── coffee
├── OpenSora-STDiT-v3
├── OpenX_pretrained
├── square
├── stack_three
├── three_piece_assembly
└── vae
reward_models are the VideoMAE models we use to calculate the reward of a world-model rollout.
SFT_models are the base policy VLAs before performing WMPO.
WMPO_models are the final VLAs optimized by performing RL inside the world models with our WMPO framework.
world_models are first fine-tuned from the OpenSora-STDiT-v3 checkpoint into the OpenX_pretrained checkpoint on the Open X-Embodiment (OXE) dataset, and then further trained within the WMPO framework. All world models share the same variational autoencoder vae.
Note
The complete checkpoint files and datasets are relatively large (364 GiB + 530 GiB). You can adjust the directories in download_hf.py to download only the checkpoints and datasets you need.
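If you prefer the command line, a rough alternative sketch is to pull only a subset of files with the Hugging Face CLI; the repo ID below is a placeholder (use the one referenced in download_hf.py), and the include pattern simply mirrors the directory layout shown above, so adjust it to the actual file layout of the repository.

```bash
# Hypothetical selective download via the Hugging Face CLI
# (requires: pip install -U "huggingface_hub[cli]").
# "<hf_repo_id>" is a placeholder for the repository used by download_hf.py;
# adjust --include to the actual file layout in that repo.
huggingface-cli download "<hf_repo_id>" \
  --include "WMPO_models/coffee/*" \
  --local-dir ./checkpoint_files
```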
Run the experiments directly with bash examples/mimicgen/{Task Name}/train_wmpo_{Rollout Budget}.sh; set NUM_NODES=1 and NUM_GPUS_PER_NODE to the number of GPUs you have.
{Task Name} can be coffee, square, stack_three, or three_piece_assembly, and {Rollout Budget} can be 128 or 1280.
Set your WANDB_API_KEY in both the config file and align.json if you want to use wandb to log the experiments.
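For instance, a single-node run on the coffee task with a rollout budget of 128 might look like the minimal sketch below; whether NUM_NODES and NUM_GPUS_PER_NODE are read from the environment or edited inside the script is an assumption here, so adapt it to match the actual script.

```bash
# Minimal single-node sketch (coffee task, rollout budget P=128).
# NUM_NODES / NUM_GPUS_PER_NODE may instead need to be edited inside
# train_wmpo_128.sh, depending on how the script is written.
export NUM_NODES=1
export NUM_GPUS_PER_NODE=8             # number of GPUs on this machine
export WANDB_API_KEY="your_wandb_key"  # optional, only if wandb logging is enabled

bash examples/mimicgen/coffee/train_wmpo_128.sh
```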
Example config file paths are shown below.
| Task Name | Settings | Example Config File Path |
|---|---|---|
| coffee | WMPO with rollout budget P=128 | examples/mimicgen/coffee/train_wmpo_128.sh |
| three_piece_assembly | WMPO with rollout budget P=1280 | examples/mimicgen/three_piece_assembly/train_wmpo_1280.sh |
We use Ray to manage the clusters. Run bash launch_head.sh on the head node and bash launch_worker.sh on worker nodes. Set NUM_NODES and NUM_GPUS_PER_NODE to the number of nodes and GPUs you have, and adjust the placeholder IP address in the scripts accordingly.
After setting up the cluster, simply start the training task on the head node, for example as sketched below.
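A rough two-node example, assuming the script names above; the head-node IP is a placeholder that you set inside the launch scripts, and exporting the variables (rather than editing them in the scripts) is an assumption.

```bash
# On the head node:
export NUM_NODES=2
export NUM_GPUS_PER_NODE=8
bash launch_head.sh

# On each worker node (after replacing the placeholder IP in launch_worker.sh
# with the head node's address):
bash launch_worker.sh

# Back on the head node, start training exactly as in the single-node case:
bash examples/mimicgen/coffee/train_wmpo_1280.sh
```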
We use the OpenSora framework to train our world models on rollout trajectories. Run our training scripts directly via bash examples/opensora/{Task Name}_{Rollout Budget}.sh to train the world model for task {Task Name} with rollout budget {Rollout Budget}.
Set GPUS_PER_NODE and NNODES to the number of GPUs and nodes you have. Set MASTER_ADDR to the address and port of your master node, and node_rank to the distinct rank of each node.
{Task Name} can be coffee, square, stack_three, or three_piece_assembly, and {Rollout Budget} can be 128 or 1280.
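As an illustration, a two-node world-model training run for the square task with rollout budget 1280 might be launched as sketched below; whether these variables are exported or edited inside the script is an assumption, and the master address/port is a placeholder.

```bash
# Multi-node world-model training sketch (square task, rollout budget P=1280).
# Run on every node, changing node_rank per node; the variables may instead
# need to be edited inside examples/opensora/square_1280.sh.
export GPUS_PER_NODE=8
export NNODES=2
export MASTER_ADDR="10.0.0.1:29500"   # placeholder: master node address and port
export node_rank=0                    # 0 on the master, 1..NNODES-1 on the others

bash examples/opensora/square_1280.sh
```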
Example config file paths are shown below.
| Task Name | Settings | Example Config File Path |
|---|---|---|
| coffee | rollout budget P=128 | examples/opensora/coffee_128.sh |
| three_piece_assembly | rollout budget P=1280 | examples/opensora/three_piece_assembly_1280.sh |
Use bash examples/mimicgen/{Task Name}/evaluate.sh to evaluate the pre-trained models. Follow the default training settings to ensure reproducibility.
You can also adjust TARGET_MODEL_PATH to evaluate other checkpoints; make sure those checkpoints include the (un)normalization keys from data_files/statistics.
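For example, evaluating the released WMPO checkpoint on the stack_three task could look like the sketch below; whether TARGET_MODEL_PATH is exported or set inside evaluate.sh is an assumption, and the checkpoint path simply follows the layout shown earlier.

```bash
# Evaluate a specific checkpoint on the stack_three task.
# TARGET_MODEL_PATH may need to be edited inside evaluate.sh instead of exported.
export TARGET_MODEL_PATH=checkpoint_files/WMPO_models/stack_three
bash examples/mimicgen/stack_three/evaluate.sh
```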
Table 1. Comparison of policy optimization methods across four manipulation tasks in the Mimicgen simulation benchmark. Results show that WMPO consistently outperforms both GRPO and DPO baselines under different budgets. As the rollout budget increases from 128 to 1280, WMPO continues to exhibit substantial improvements, highlighting both its data efficiency and scalability.
Figure 3. Behavior analysis of the Square task (inserting the square into the stick) shows that, compared with the base policy, WMPO demonstrates the ability to self-correct.
Figure 5. Relative average trajectory length of successful trials across different policies (Base Policy = 100%).
We thank the following great open-source projects on which our codebase relies: Open-Sora, openvla-oft, VideoMAE, verl, mimicgen, SimpleVLA-RL.
We would like to express our sincere gratitude to Yikun Miao for his valuable assistance in preparing the open-source release.
If you find WMPO useful for your research and applications, please consider starring this repository and citing:
@article{WMPO2025,
title={WMPO: World Model-based Policy Optimization for Vision-Language-Action Models},
author={Zhu, Fangqi and Yan, Zhengyang and Hong, Zicong and Shou, Quanxin and Ma, Xiao and Guo, Song},
journal={ArXiv},
year={2025},
url={https://WM-PO.github.io}
}