This project is a clean fork of the original veRL project that adds support for vision language models. We thank all the authors for providing such a high-performance RL training framework.
EasyR1 is efficient and scalable thanks to the design of HybridEngine and the latest release of vLLM's SPMD mode.
- Supported models
  - Llama3/Qwen2/Qwen2.5 language models
  - Qwen2/Qwen2.5-VL vision language models
  - DeepSeek-R1 distill models

- Supported algorithms
  - GRPO
  - Reinforce++
  - Remax
  - RLOO

- Supported datasets
  - Any text, vision-text dataset in a specific format

- Supported tricks
  - Padding-free training
  - Resuming from checkpoint
  - Wandb & SwanLab tracking

- Python 3.9+
- transformers>=4.49.0
- flash-attn>=2.4.3
- vllm>=0.7.3 (0.8.0 is recommended)
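
Before building an environment by hand, it can help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch, not part of EasyR1: the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical, the version parsing is deliberately simplistic (it compares only the leading numeric components and ignores pre-release tags), and it assumes the distributions are published under the names `transformers`, `flash-attn` and `vllm`.

```python
import sys
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUMS = {
    "transformers": "4.49.0",
    "flash-attn": "2.4.3",
    "vllm": "0.7.3",
}


def version_tuple(version: str) -> tuple:
    """Turn '0.8.0.post1' or '2.6.0+cu124' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their leading numeric components."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment() -> list:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append(f"Python 3.9+ required, found {sys.version.split()[0]}")
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per unmet requirement and nothing when all checks pass. For strict PEP 440 comparison, a dedicated version parser would be a better choice than this hand-rolled one.
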
We provide a Dockerfile to easily build environments.
We recommend using the pre-built docker image in EasyR1.
docker pull hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.0

\* estimated

| Method | Bits | 1.5B | 3B | 7B | 
|---|---|---|---|---|
| GRPO Full Fine-Tuning | AMP | 2*24GB | 4*40GB | 8*40GB | 
Note
We are working hard to reduce VRAM usage in RL training. LoRA support will be integrated in upcoming updates.
Tutorial: Run Qwen2.5-VL GRPO on Geometry3K Dataset in Just 3 Steps
git clone https://github.com/hiyouga/EasyR1.git
cd EasyR1
pip install -e .
bash examples/qwen2_5_vl_7b_geo3k.sh
python3 scripts/model_merger.py --local_dir path_to_your_last_actor_checkpoint
Tip
If you encounter issues with connecting to Hugging Face, consider using export HF_ENDPOINT=https://hf-mirror.com.
If you want to use the SwanLab logger, consider using bash examples/qwen2_5_vl_7b_geo3k_swanlab.sh.
Please refer to the example datasets to prepare your own dataset.
- Text dataset: https://huggingface.co/datasets/hiyouga/math12k
- Vision-text dataset: https://huggingface.co/datasets/hiyouga/geometry3k
Tip
EasyR1 already supports multi-image datasets.
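
One way to prepare your own dataset is to inspect the fields of an example dataset and check that your records expose the same ones. The sketch below is an illustration rather than an EasyR1 utility: `summarize_fields` and `fields_match` are hypothetical helpers, and the script makes no assumption about the actual column or split names. It prints whatever the Hub returns. The `__main__` part needs the `datasets` package and network access.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted Python type names seen across `records`."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


def fields_match(reference, candidate):
    """True when `candidate` has exactly the same field names as `reference`."""
    return set(reference) == set(candidate)


if __name__ == "__main__":
    # Requires `pip install datasets` and access to the Hugging Face Hub.
    from datasets import load_dataset

    dataset = load_dataset("hiyouga/geometry3k")
    for split_name, split in dataset.items():
        # Look at a handful of rows per split to keep the check cheap.
        sample = split.select(range(min(8, len(split))))
        print(split_name, summarize_fields(sample))
```

Comparing the printed summary with `summarize_fields` applied to a few of your own records gives a quick signal that the field names line up before you launch a training run.
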
- To learn about the GRPO algorithm, you can refer to Hugging Face's blog.
- Different from TRL's GRPO trainer, our trainer supports mini-batch update as described in the original PPO paper.
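
To make the two points above concrete, here is a minimal sketch of the idea behind GRPO's group-relative advantages together with a mini-batch split of one rollout batch. It is not EasyR1's trainer code: the function names are hypothetical, the rewards are made up, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def iter_mini_batches(samples, mini_batch_size):
    """Yield consecutive slices of `samples`; each slice would get its own optimizer step."""
    for start in range(0, len(samples), mini_batch_size):
        yield samples[start:start + mini_batch_size]


if __name__ == "__main__":
    # Two prompts, four sampled responses each, with made-up scalar rewards.
    rewards_per_prompt = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
    rollout = []
    for prompt_id, rewards in enumerate(rewards_per_prompt):
        for advantage in group_relative_advantages(rewards):
            rollout.append((prompt_id, advantage))
    # One rollout batch is consumed as several mini-batches.
    for step, batch in enumerate(iter_mini_batches(rollout, mini_batch_size=4)):
        print(step, batch)
```

Each response is scored relative to the other responses for the same prompt, so no learned value function is needed. Splitting the rollout into mini-batches means a single batch of generations drives several policy updates instead of one.
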
Please see veRL's official documentation for multi-node training and the Ray debugger.
We also reproduced the following two baselines of the R1-V project.
- CLEVR-70k-Counting: Train the Qwen2.5-VL-3B-Instruct model on counting problems.
- GeoQA-8k: Train the Qwen2.5-VL-3B-Instruct model on GeoQA problems.
- MMR1: Advancing the Frontiers of Multimodal Reasoning. 
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. 
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. 
- MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse. 
- Support LoRA (high priority).
- Support ulysses parallelism for VLMs (medium priority).
- Support more VLM architectures.
Note
We will not provide scripts for supervised fine-tuning and inference in this project. If you have such requirements, we recommend using LLaMA-Factory.
These features are temporarily disabled. We plan to fix them one by one in future updates.
- Vision language models are not compatible with ulysses parallelism yet.
👋 Join our WeChat group.
Core contributors: Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang and Yuwen Xiong
We also thank Guangming Sheng and Chi Zhang for helpful discussions.
@misc{zheng2025easyr1,
  title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
  author       = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
  howpublished = {\url{https://github.com/hiyouga/EasyR1}},
  year         = {2025}
}
We also recommend citing the original work.
@article{sheng2024hybridflow,
  title   = {HybridFlow: A Flexible and Efficient RLHF Framework},
  author  = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
  year    = {2024},
  journal = {arXiv preprint arXiv: 2409.19256}
}