Fangqi Zhu<sup>1,2</sup> · Zhengyang Yan<sup>1</sup> · Zicong Hong<sup>1</sup> · Quanxin Shou<sup>1</sup> · Xiao Ma<sup>2*</sup> · Song Guo<sup>1*</sup>

<sup>1</sup>Hong Kong University of Science and Technology | <sup>2</sup>ByteDance Seed

<sup>*</sup>Corresponding authors
- [2025-11] We release the training code, data and checkpoints for WMPO. Check it out!
- Release the training code and training scripts for WMPO
- Release the checkpoints (VLAs and world models) and training data for WMPO
- Release the training code, training scripts and training data for world models
We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained on web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than the commonly used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
The overall training procedure consists of three components: (1) Imagined Trajectory Generation, where the policy model and the world model interact alternately to generate a full imagined trajectory; (2) Trajectory Sampling, where multiple trajectories are sampled and evaluated by the reward model; and (3) Policy Update, where the policy parameters are optimized.
We recommend using Python 3.11.x and torch 2.5.1 on a Linux server (e.g., Ubuntu), and using pip or conda to manage the environment.
Run the following command to complete environment installation:
pip install -r requirements.txt
bash install.sh
Download the datasets and checkpoints using the following command:
python download_hf.py
The checkpoint files are organized as follows; the name of each checkpoint indicates the task it was trained on.
checkpoint_files
├── reward_models
│ ├── videomae_coffee.pth
│ ├── videomae_square.pth
│ ├── videomae_stack_three.pth
│ └── videomae_three_piece_assembly.pth
├── SFT_models
│ ├── coffee
│ ├── square
│ ├── stack_three
│ └── three_piece_assembly
├── WMPO_models
│ ├── coffee
│ ├── square
│ ├── stack_three
│ └── three_piece_assembly
└── world_models
├── coffee
├── OpenSora-STDiT-v3
├── OpenX_pretrained
├── square
├── stack_three
├── three_piece_assembly
└── vae
reward_models are the VideoMAE models we use to calculate the reward of a world-model rollout.
SFT_models are the base policy VLAs before performing WMPO.
WMPO_models are the final VLAs optimized by performing RL inside the world models with our WMPO framework.
world_models are first fine-tuned from the OpenSora-STDiT-v3 checkpoint into the OpenX_pretrained checkpoint on the Open X-Embodiment (OXE) dataset, and then further trained within the WMPO framework. All world models share the same variational autoencoder vae.
Note
The complete checkpoint files and datasets are relatively large (364 GiB + 530 GiB). You can adjust the directories in download_hf.py to download only the checkpoints and datasets you need.
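If you prefer the command line, a rough alternative sketch is to pull only a subset of files with the Hugging Face CLI; the repo ID below is a placeholder (use the one referenced in download_hf.py), and the include pattern simply mirrors the directory layout shown above, so adjust it to the actual file layout of the repository.

```bash
# Hypothetical selective download via the Hugging Face CLI
# (requires: pip install -U "huggingface_hub[cli]").
# "<hf_repo_id>" is a placeholder for the repository used by download_hf.py;
# adjust --include to the actual file layout in that repo.
huggingface-cli download "<hf_repo_id>" \
  --include "WMPO_models/coffee/*" \
  --local-dir ./checkpoint_files
```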
Run the experiments directly with bash examples/mimicgen/{Task Name}/train_wmpo_{Rollout Budget}.sh; set NUM_NODES=1 and NUM_GPUS_PER_NODE to the number of GPUs you have.
{Task Name} can be coffee, square, stack_three, or three_piece_assembly, and {Rollout Budget} can be 128 or 1280.
Set your WANDB_API_KEY in both the config file and align.json if you want to use wandb to log the experiments.
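For instance, a single-node run on the coffee task with a rollout budget of 128 might look like the minimal sketch below; whether NUM_NODES and NUM_GPUS_PER_NODE are read from the environment or edited inside the script is an assumption here, so adapt it to match the actual script.

```bash
# Minimal single-node sketch (coffee task, rollout budget P=128).
# NUM_NODES / NUM_GPUS_PER_NODE may instead need to be edited inside
# train_wmpo_128.sh, depending on how the script is written.
export NUM_NODES=1
export NUM_GPUS_PER_NODE=8             # number of GPUs on this machine
export WANDB_API_KEY="your_wandb_key"  # optional, only if wandb logging is enabled

bash examples/mimicgen/coffee/train_wmpo_128.sh
```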
Example config file paths are shown below.
| Task Name | Settings | Example Config File Path |
|---|---|---|
| coffee | WMPO with rollout budget P=128 | examples/mimicgen/coffee/train_wmpo_128.sh |
| three_piece_assembly | WMPO with rollout budget P=1280 | examples/mimicgen/three_piece_assembly/train_wmpo_1280.sh |
We use Ray to manage the clusters. Run bash launch_head.sh on the head node and bash launch_worker.sh on worker nodes. Set NUM_NODES and NUM_GPUS_PER_NODE to the number of nodes and GPUs you have, and adjust the placeholder IP address in the scripts accordingly.
After setting up the cluster, simply start the training task on the head node, for example as sketched below.
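A rough two-node example, assuming the script names above; the head-node IP is a placeholder that you set inside the launch scripts, and exporting the variables (rather than editing them in the scripts) is an assumption.

```bash
# On the head node:
export NUM_NODES=2
export NUM_GPUS_PER_NODE=8
bash launch_head.sh

# On each worker node (after replacing the placeholder IP in launch_worker.sh
# with the head node's address):
bash launch_worker.sh

# Back on the head node, start training exactly as in the single-node case:
bash examples/mimicgen/coffee/train_wmpo_1280.sh
```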
We use the OpenSora framework to train our world models on rollout trajectories. Run our training scripts directly via bash examples/opensora/{Task Name}_{Rollout Budget}.sh to train the world model for task {Task Name} with rollout budget {Rollout Budget}.
Set GPUS_PER_NODE and NNODES to the number of GPUs and nodes you have. Set MASTER_ADDR to the address and port of your master node, and node_rank to the distinct rank of each node.
{Task Name} can be coffee, square, stack_three, or three_piece_assembly, and {Rollout Budget} can be 128 or 1280.
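As an illustration, a two-node world-model training run for the square task with rollout budget 1280 might be launched as sketched below; whether these variables are exported or edited inside the script is an assumption, and the master address/port is a placeholder.

```bash
# Multi-node world-model training sketch (square task, rollout budget P=1280).
# Run on every node, changing node_rank per node; the variables may instead
# need to be edited inside examples/opensora/square_1280.sh.
export GPUS_PER_NODE=8
export NNODES=2
export MASTER_ADDR="10.0.0.1:29500"   # placeholder: master node address and port
export node_rank=0                    # 0 on the master, 1..NNODES-1 on the others

bash examples/opensora/square_1280.sh
```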
Example config file paths are shown below.
| Task Name | Settings | Example Config File Path |
|---|---|---|
| coffee | rollout budget P=128 | examples/opensora/coffee_128.sh |
| three_piece_assembly | rollout budget P=1280 | examples/opensora/three_piece_assembly_1280.sh |
Use bash examples/mimicgen/{Task Name}/evaluate.sh to evaluate the pre-trained models. Follow the default training settings to ensure reproducibility.
You can also adjust TARGET_MODEL_PATH to evaluate other checkpoints; make sure those checkpoints include the (un)normalization keys from data_files/statistics.
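For example, evaluating the released WMPO checkpoint on the stack_three task could look like the sketch below; whether TARGET_MODEL_PATH is exported or set inside evaluate.sh is an assumption, and the checkpoint path simply follows the layout shown earlier.

```bash
# Evaluate a specific checkpoint on the stack_three task.
# TARGET_MODEL_PATH may need to be edited inside evaluate.sh instead of exported.
export TARGET_MODEL_PATH=checkpoint_files/WMPO_models/stack_three
bash examples/mimicgen/stack_three/evaluate.sh
```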
Table 1. Comparison of policy optimization methods across four manipulation tasks in the Mimicgen simulation benchmark. Results show that WMPO consistently outperforms both GRPO and DPO baselines under different budgets. As the rollout budget increases from 128 to 1280, WMPO continues to exhibit substantial improvements, highlighting both its data efficiency and scalability.
Figure 3. Behavior analysis of the Square task (inserting the square into the stick) shows that, compared with the base policy, WMPO demonstrates the ability to self-correct.
Figure 5. Relative average trajectory length of successful trials across different policies (Base Policy = 100%).
We thank the following great open-source projects on which our codebase relies: Open-Sora, openvla-oft, VideoMAE, verl, mimicgen, SimpleVLA-RL.
We would like to express our sincere gratitude to Yikun Miao for his valuable assistance in preparing the open-source release.
If you find WMPO useful for your research and applications, please consider starring this repository and citing:
@article{WMPO2025,
title={WMPO: World Model-based Policy Optimization for Vision-Language-Action Models},
author={Zhu, Fangqi and Yan, Zhengyang and Hong, Zicong and Shou, Quanxin and Ma, Xiao and Guo, Song},
journal={ArXiv},
year={2025},
url={https://WM-PO.github.io}
}