Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li
(* equal contribution)
We introduce VAGEN, a multi-turn reinforcement learning framework designed specifically for training vision-language model (VLM) agents. Built on this framework, we propose World Modeling RL, a novel reinforcement learning approach that significantly improves the multi-turn performance of VLMs by explicitly supervising their world-model reasoning process, as shown in Figure 1.
We frame multi-turn VLM agentic tasks as a Partially Observable Markov Decision Process (POMDP), shown in Figure 2.
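For reference, a POMDP in standard notation (generic symbols, not necessarily the paper's exact ones): the agent never sees the underlying state directly, only an image observation drawn from it, and must act from the observation history.

```latex
% Generic POMDP tuple: states, actions, observations, transition and reward
% functions, observation function, and discount factor.
\[
\mathcal{M} = \langle \mathcal{S},\, \mathcal{A},\, \mathcal{O},\;
T(s_{t+1} \mid s_t, a_t),\; R(s_t, a_t),\;
\Omega(o_t \mid s_t),\; \gamma \rangle
\]
```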
| Figure 1. Overview of the VAGEN framework. | Figure 2. POMDP formulation of multi-turn VLM agentic tasks. |
[2026/02] We have migrated the main branch to VAGEN-Lite, a lightweight and clean reimplementation built on VERL agent-loop for easy customization and stable performance. For the previous full-featured release, please visit the vagen-legacy branch.
[2025/12] Introducing VAGEN-Lite: a lightweight and clean reimplementation of VAGEN, built on the VERL agent-loop for easy customization and stable performance.
[2025/09] VAGEN is accepted by Neurips 2025
[2025/04] We've introduced a new modular design for environments and services in VAGEN:
- Enhanced environment framework for easier creation of custom environments
- New service architecture for efficient distributed training
- Check out our new guides:
  - Creating Environments: the new environment protocol.
  - Creating Services: we now support hosting environments in a separate process.
[2025/03] We release VAGEN, a multi-turn reinforcement learning framework for training VLM Agents!
conda create -n vagen python=3.12 -y
conda activate vagen
git clone https://github.com/mll-lab-nu/VAGEN.git
cd VAGEN
git submodule update --init --recursive
cd verl
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd ..
pip install -e .
pip install "trl==0.26.2"

VAGEN currently supports PPO / GRPO with two multi-turn training paradigms:
Multi-turn Concatenated Training: All turns in a trajectory are concatenated into a single training instance.
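As an illustrative sketch (not VAGEN's actual implementation), concatenated training can be thought of as flattening all turns of a trajectory into one token sequence, with a loss mask so that only the agent's action tokens are supervised while observation tokens are not:

```python
# Sketch of multi-turn concatenated training data construction.
# Assumption: each turn is a (obs_tokens, action_tokens) pair, already tokenized.

def concat_trajectory(turns):
    """Flatten a trajectory into one training instance with a loss mask."""
    input_ids, loss_mask = [], []
    for obs_tokens, action_tokens in turns:
        input_ids.extend(obs_tokens)
        loss_mask.extend([0] * len(obs_tokens))     # no loss on observations
        input_ids.extend(action_tokens)
        loss_mask.extend([1] * len(action_tokens))  # loss only on agent actions
    return input_ids, loss_mask

trajectory = [([101, 102], [7, 8]), ([103], [9])]
ids, mask = concat_trajectory(trajectory)
# ids  == [101, 102, 7, 8, 103, 9]
# mask == [0, 0, 1, 1, 0, 1]
```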
# Qwen/Qwen2.5-VL-3B-Instruct
cd VAGEN
bash examples/sokoban/train_ppo_qwen25vl3b.sh

# Qwen/Qwen3-VL-4B-Instruct
# pip install "transformers==4.57.1"
# pip install "sglang[all]==0.5.3.post3"
cd VAGEN
bash examples/sokoban/train_grpo_qwen3vl4b.sh

# Enable reward variance based top-p filtering
cd VAGEN
bash examples/frozenlake/train_grpo_qwen25vl3b_filtertopp_vision.sh

Multi-turn Non-Concatenated Training: Each trajectory is split into multiple turn-level training instances.
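By contrast, non-concatenated training can be sketched as splitting a trajectory into one instance per turn, each carrying the earlier turns as context and supervising only the current turn's action (again an illustration, not VAGEN's actual code):

```python
# Sketch of multi-turn non-concatenated training data construction.
# Assumption: a trajectory is a list of (obs, action) pairs.

def split_trajectory(turns):
    """Produce one turn-level training instance per turn."""
    instances = []
    for i in range(len(turns)):
        context = turns[:i]          # all earlier (obs, action) pairs
        obs, action = turns[i]
        instances.append({"context": context, "obs": obs, "target": action})
    return instances

traj = [("o1", "a1"), ("o2", "a2"), ("o3", "a3")]
instances = split_trajectory(traj)
# 3 instances; the last one conditions on the first two turns as context
```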
cd VAGEN
bash examples/sokoban/train_ppo_no_concat_qwen25vl3b.sh

Evaluation (supported by ViewSuite)
VAGEN supports evaluation using different backends (OpenAI, Claude, Gemini, sglang, vLLM). For details, see vagen/evaluate/adapters/README.md.
cd VAGEN
# FrozenLake evaluation with sglang
bash examples/evaluate/frozenlake/eval_qwen25_vl_3b.sh

cd VAGEN
# Sokoban evaluation
bash examples/evaluate/sokoban/run_eval.sh
To train on your own environment, follow the steps below.
- Use `GymImageEnv` as the base class.
- Refer to Sokoban for a full implementation example.
- Add your environment entry to `vagen/configs/env_registry.yaml`.
- Prepare training and validation configs (`train.yaml` and `val.yaml`); you can follow the Sokoban examples as templates.
- Write your training script based on the Sokoban example scripts.
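The steps above can be sketched with a toy environment. This is hypothetical: the method names (`reset`, `step` returning `(obs, reward, done, info)`, a rendered image observation) assume a Gym-style interface; check the Sokoban example for the real `GymImageEnv` contract. The sketch uses a plain class so it runs standalone:

```python
# Hypothetical custom-environment sketch (in VAGEN you would subclass
# GymImageEnv instead of writing a plain class).

class MyLineEnv:
    """Toy 1-D environment: move right along a line to reach the goal."""

    def __init__(self, length=4):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self._render_image()

    def step(self, action):
        # action: "left" or "right"
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self._render_image(), reward, done, {}

    def _render_image(self):
        # Stand-in for an RGB frame: a row with the agent's cell set to 1.
        return [1 if c == self.pos else 0 for c in range(self.length)]

env = MyLineEnv()
obs = env.reset()
obs, reward, done, info = env.step("right")
```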
See the Documentation for more customization options:
- Custom Filter - Preprocess training data (supported by RAGEN)
- Custom Metric - Add W&B logging metrics
- Configuration - Training configuration reference
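To make the custom-filter idea concrete, here is a hypothetical sketch of a reward-variance filter and registry; the names `FILTER_REGISTRY` and `reward_variance` come from the configuration reference below, but this implementation is an assumption, not VAGEN's actual code. The intuition: for group-relative methods like GRPO, a group of rollouts whose rewards are all identical yields zero advantage, so such groups can be dropped:

```python
# Hypothetical filter registry and reward-variance filter sketch.

FILTER_REGISTRY = {}

def register_filter(name):
    """Register a filter function under a strategy name."""
    def decorator(fn):
        FILTER_REGISTRY[name] = fn
        return fn
    return decorator

@register_filter("reward_variance")
def reward_variance_filter(groups, min_variance=1e-8):
    """groups: list of reward lists (one list per prompt/group).
    Keep only groups whose rewards actually vary."""
    kept = []
    for rewards in groups:
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        if var > min_variance:
            kept.append(rewards)
    return kept

groups = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.5]]
kept = FILTER_REGISTRY["reward_variance"](groups)
# only the second group survives: its rewards differ, so it carries signal
```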
Refer to `vagen/configs/vagen_multiturn.yaml`:
# Warning:
# - If you set a training-data rollout dir AND enable image logging, training images will also be dumped to disk.
# This can consume a large amount of storage very quickly. Monitor disk usage and consider cleanup/limits.
trainer:
  log_image:
    enable: false         # set to true to save rollout/validation images to disk
    max_pending: 2        # max concurrent async image dump tasks
    png_compress_level: 0 # PNG compression (0 = fastest, 9 = smallest)

# export HF_TOKEN=xxx
huggingface_hub:
  hf_save_freq: null # upload every N steps (must be a multiple of trainer.save_freq); null = disabled
  repo_id: null      # HuggingFace repo id, e.g. "user/my-model"
  private: false     # whether the repo is private

filter:
  name: reward_variance # filter strategy name (registered in FILTER_REGISTRY)
  filter_kwargs: {}     # extra kwargs passed to the filter function
  enable: false         # set to true to enable filtering

If you find our framework and paper useful, we appreciate it if you could cite our work:
@misc{wang2025vagen,
  title={VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents},
author={Kangrui Wang* and Pingyue Zhang* and Zihan Wang* and Yaning Gao* and Linjie Li* and Qineng Wang and Hanyang Chen and Chi Wan and Yiping Lu and Zhengyuan Yang and Lijuan Wang and Ranjay Krishna and Jiajun Wu and Li Fei-Fei and Yejin Choi and Manling Li},
year={2025},
url={https://arxiv.org/abs/2510.16907}
}