RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
- [2025/10] 🔥 RLinf supports reinforcement learning fine-tuning for π₀ and π₀.₅! Doc: RL on π₀ and π₀.₅ Models
- [2025/10] 🔥 RLinf now officially supports online reinforcement learning! Doc: coding_online_rl; Blog post: RLinf-Online, the first open-source framework for online agent RL.
- [2025/10] 🔥 The RLinf Algorithm Technical Report RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training is released.
- [2025/09] 🔥 The Example Gallery has been updated; users can find various off-the-shelf examples!
- [2025/09] The paper RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation is released.
- [2025/09] The report on RLinf by Machine Heart is released.
- [2025/08] RLinf is open-sourced. The formal v0.1 release is coming soon.
Overview of supported components: Simulators, Real-world Robotics, Models, and Algorithms.
RLinf supports mainstream VLA models and mainstream CPU- and GPU-based simulators via standardized Worker interfaces, and enables the first RL fine-tuning of the π₀ and π₀.₅ models.
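The standardized Worker interface is the extension point for plugging simulators into the training loop. Below is a minimal, hypothetical sketch of what such a simulator adapter could conceptually look like; the class and method names (`SimulatorWorker`, `reset`, `step`) and the batching scheme are assumptions for illustration, not RLinf's actual API.

```python
# Hypothetical sketch of a simulator adapter behind a standardized Worker
# interface; names and signatures are illustrative, not RLinf's actual API.
from dataclasses import dataclass
from typing import Dict

import numpy as np


@dataclass
class StepResult:
    observation: Dict[str, np.ndarray]  # e.g., camera images, proprioception
    reward: np.ndarray                  # per-environment scalar rewards
    done: np.ndarray                    # episode-termination flags


class SimulatorWorker:
    """Wraps a CPU- or GPU-based simulator so the RL loop sees one interface."""

    def __init__(self, env_factory, num_envs: int):
        # env_factory is any callable returning an env with reset()/step()
        self.envs = [env_factory() for _ in range(num_envs)]

    def reset(self) -> Dict[str, np.ndarray]:
        obs = [env.reset() for env in self.envs]
        return {"obs": np.stack(obs)}

    def step(self, actions: np.ndarray) -> StepResult:
        obs, rew, done = zip(*(env.step(a) for env, a in zip(self.envs, actions)))
        return StepResult(
            observation={"obs": np.stack(obs)},
            reward=np.asarray(rew, dtype=np.float32),
            done=np.asarray(done, dtype=bool),
        )
```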
Agentic RL covers both RL training that improves LLM reasoning ability (e.g., math reasoning) and RL training of agents (e.g., coding agents). RLinf supports agentic RL well, and we believe embodied intelligence will increasingly integrate agent capabilities to complete complex tasks.
Beyond the rich functionality introduced above, RLinf is flexible enough to support diverse RL training workflows (e.g., simulator-integrated embodied RL, PPO/RLHF) while hiding the complexity of distributed programming. Users can scale RL training to a large number of GPU nodes without modifying code, meeting the growing computational demands of RL training.
This flexibility also lets RLinf explore more efficient scheduling and execution. The hybrid execution mode for embodied RL achieves more than a 100% throughput improvement over baseline solutions.
Multiple Backend Integrations
- FSDP + HuggingFace/SGLang/vLLM: rapid adaptation to new models and algorithms, ideal for beginners and fast prototyping.
- Megatron + SGLang/vLLM: optimized for large-scale training, delivering maximum efficiency for expert users with demanding workloads.
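In practice, the backend combination is a configuration choice rather than a code change. The snippet below is only an illustrative sketch of how such a choice could be expressed; the keys and values are assumptions, not RLinf's actual configuration schema (see the installation guide and example gallery for real configurations).

```python
# Illustrative configuration sketch; keys and values are assumptions,
# not RLinf's actual schema.

# Rapid prototyping: FSDP training with a HuggingFace model and SGLang rollouts.
prototyping_config = {
    "actor": {"training_backend": "fsdp", "model_source": "huggingface"},
    "rollout": {"inference_backend": "sglang"},  # or "vllm"
}

# Large-scale training: Megatron training with vLLM rollouts and parallelism knobs.
large_scale_config = {
    "actor": {"training_backend": "megatron", "tensor_parallel_size": 8},
    "rollout": {"inference_backend": "vllm", "tensor_parallel_size": 4},
}
```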
Installation: Users can refer to our installation guide to install RLinf. We recommend using our provided Docker image (i.e., Installation Method 1), as the environment and dependencies for embodied RL are complex.
Run a simple example: After setting up the environment, users can run a simple embodied RL example with the ManiSkill3 simulator by following this document.
For more RLinf tutorials and application examples, check out our documentation and example gallery.
- RLinf supports both PPO and GRPO algorithms, enabling state-of-the-art training for Vision-Language-Action models; a brief sketch of both objectives follows this list.
- The framework provides seamless integration with mainstream embodied intelligence benchmarks, including ManiSkill3 and LIBERO, and achieves strong performance across diverse evaluation metrics.
- Training curves on ManiSkill “PutOnPlateInScene25Mani-v3” with OpenVLA and OpenVLA-OFT models, using PPO and GRPO algorithms. PPO consistently outperforms GRPO and exhibits greater stability.
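For readers unfamiliar with the two algorithms, here is a brief sketch in their standard formulations (generic notation, not RLinf-specific): PPO maximizes a clipped surrogate objective with a learned value baseline, while GRPO replaces that baseline with a group-relative advantage computed over G rollouts of the same task.

```latex
% Standard PPO clipped surrogate objective
\mathcal{J}_{\mathrm{PPO}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
  \qquad
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% GRPO group-relative advantage over G rollouts with rewards r_1, \dots, r_G
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}
```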
Evaluation results on ManiSkill (values denote success rates):

| Model | In-Distribution | OOD: Vision | OOD: Semantic | OOD: Execution | OOD: Avg. |
|---|---|---|---|---|---|
| OpenVLA (Base) | 53.91% | 38.75% | 35.94% | 42.11% | 39.10% |
| | 93.75% | 80.47% | 75.00% | 81.77% | 79.15% |
| | 84.38% | 74.69% | 72.99% | 77.86% | 75.15% |
| | 96.09% | 82.03% | 78.35% | 85.42% | 81.93% |
| OpenVLA-OFT (Base) | 28.13% | 27.73% | 12.95% | 11.72% | 18.29% |
| | 94.14% | 84.69% | 45.54% | 44.66% | 60.64% |
| | 97.66% | 92.11% | 64.84% | 73.57% | 77.05% |
Evaluation results of the unified model on the five LIBERO task groups:

| Model | Spatial | Object | Goal | LIBERO-10 | LIBERO-90 | Avg. |
|---|---|---|---|---|---|---|
| | 72.18% | 71.48% | 64.06% | 48.44% | 70.97% | 65.43% |
| | 99.40% | 99.80% | 98.79% | 93.95% | 98.59% | 98.11% |
| Δ Improvement | +27.22 | +28.32 | +34.73 | +45.51 | +27.62 | +32.68 |
Evaluation results on the four LIBERO task groups:

| Model | Spatial | Object | Goal | LIBERO-10 | Avg. |
|---|---|---|---|---|---|
| Full Dataset SFT | | | | | |
| Octo | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% |
| OpenVLA | 84.7% | 88.4% | 79.2% | 53.7% | 76.5% |
| π₀-FAST | 96.4% | 96.8% | 88.6% | 60.2% | 85.5% |
| OpenVLA-OFT | 91.6% | 95.3% | 90.6% | 86.5% | 91.0% |
| Few-shot Dataset SFT + RL | | | | | |
| π₀ | 65.3% | 64.4% | 49.8% | 51.2% | 57.6% |
| + GRPO | 97.8% | 97.8% | 83.2% | 81.4% | 90.0% |
| + PPO | 98.4% | 99.4% | 96.2% | 90.2% | 96.0% |
| Few-shot Dataset SFT + RL | | | | | |
| π₀.₅ | 84.6% | 95.4% | 84.6% | 44.2% | 77.2% |
| + GRPO | 97.4% | 99.8% | 91.2% | 77.6% | 91.5% |
| + PPO | 99.6% | 100% | 97.4% | 90.6% | 96.9% |
1.5B model results:

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| | 28.33 | 24.90 | 27.45 | 26.89 |
| | 37.80 | 30.42 | 32.11 | 33.44 |
| | 40.41 | 30.93 | 27.54 | 32.96 |
| | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| | 43.65 | 32.49 | 35.00 | 37.05 |
| | 48.44 | 35.63 | 38.46 | 40.84 |
* We retrain the model using the default settings for 600 steps.
7B model results:

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
|---|---|---|---|---|
| | 54.90 | 40.20 | 45.48 | 46.86 |
| | 61.66 | 49.38 | 46.93 | 52.66 |
| | 66.87 | 52.49 | 44.43 | 54.60 |
| | 68.55 | 51.24 | 43.88 | 54.56 |
| | 67.30 | 55.00 | 45.57 | 55.96 |
| | 68.33 | 52.19 | 48.18 | 56.23 |
- RLinf achieves state-of-the-art performance on math reasoning tasks, consistently outperforming existing models across multiple benchmarks (AIME 24, AIME 25, GPQA-diamond) for both 1.5B and 7B model sizes.
- Support for heterogeneous GPUs
- Support for asynchronous pipeline execution
- Support for Mixture of Experts (MoE)
- Support for vLLM inference backend
- Support for Vision-Language Models (VLMs) training
- Support for deep searcher agent training
- Support for multi-agent training
- Support for integration with more embodied simulators (e.g., Meta-World, GENESIS, RoboTwin)
- Support for more Vision Language Action models (VLAs), such as GR00T, WALL-OSS
- Support for world models
- Support for real-world embodied RL
RLinf has comprehensive CI tests covering both the core components (via unit tests) and end-to-end RL training workflows for the embodied, agent, and reasoning scenarios. The CI test suites run on the main branch are:
- unit-tests
- agent-reason-e2e-tests
- embodied-e2e-tests
- scheduler-tests
We welcome contributions to RLinf. Please read the contribution guide before taking action. We thank the following contributors and welcome more developers to join us in this open-source project.
If you find RLinf helpful, please cite the paper:
@misc{yu2025rlinfflexibleefficientlargescale,
title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
author={Chao Yu and Yuanqing Wang and Zhen Guo and Hao Lin and Si Xu and Hongzhi Zang and Quanlu Zhang and Yongji Wu and Chunyang Zhu and Junhao Hu and Zixiao Huang and Mingjie Wei and Yuqing Xie and Ke Yang and Bo Dai and Zhexuan Xu and Xiangyuan Wang and Xu Fu and Zhihao Liu and Kang Chen and Weilin Liu and Gang Liu and Boxun Li and Jianlei Yang and Zhi Yang and Guohao Dai and Yu Wang},
year={2025},
eprint={2509.15965},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.15965},
}If you use RL+VLA in RLinf, you can also cite our technical report and empirical study paper:
@misc{zang2025rlinfvlaunifiedefficientframework,
title={RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training},
author={Hongzhi Zang and Mingjie Wei and Si Xu and Yongji Wu and Zhen Guo and Yuanqing Wang and Hao Lin and Liangzhi Shi and Yuqing Xie and Zhexuan Xu and Zhihao Liu and Kang Chen and Wenhao Tang and Quanlu Zhang and Weinan Zhang and Chao Yu and Yu Wang},
year={2025},
eprint={2510.06710},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2510.06710},
}@misc{liu2025rlbringvlageneralization,
title={What Can RL Bring to VLA Generalization? An Empirical Study},
author={Jijia Liu and Feng Gao and Bingwen Wei and Xinlei Chen and Qingmin Liao and Yi Wu and Chao Yu and Yu Wang},
year={2025},
eprint={2505.19789},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.19789},
}Acknowledgements RLinf has been inspired by, and benefits from, the ideas and tooling of the broader open-source community. In particular, we would like to thank the teams and contributors behind VeRL, AReaL, Megatron-LM, SGLang, and PyTorch Fully Sharded Data Parallel (FSDP), and if we have inadvertently missed your project or contribution, please open an issue or a pull request so we can properly credit you.
Contact: We welcome applications from Postdocs, PhD/Master's students, and interns. Join us in shaping the future of RL infrastructure and embodied AI!
- Chao Yu: [email protected]
- Yu Wang: [email protected]