RL2: Ray Less Reinforcement Learning

A concise library for post-training large language models.

This is the right library for you if you want to learn reinforcement learning for large language models or quickly test your own algorithm. We deliver a clear implementation without complicated abstractions.

Despite its simplicity, RL2 should scale to moderate-sized language models (e.g., 72B) with:

Training engine partition via Fully Sharded Data Parallelism and Tensor Parallelism
Sequence partition via Llama Context Parallelism
Inference engine and KV cache partition via Tensor Parallelism

We also support:

Balanced sequence packing for higher throughput
Multi-turn rollout with SGLang async inference engine

RL2 is a production-ready library! Check our wandb report on OpenThoughts, SkyworkRM, UltraFeedback, OpenReasonerZero, and SearchR1.

Getting Started

Installation

git clone https://github.com/ChenmienTan/RL2.git
cd RL2
pip install .

Data Preparation [Examples]

Hugging Face datasets and several file types (JSON, JSONL, CSV, Parquet, and Arrow) are accepted. All trainers support both the raw-text format and the messages format. The former is more flexible but may be model-specific.

SFT

[
    {
        "prompt": "The capital of China is",
        "response": "Beijing."
    }
]

[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"},
            {"role": "assistant", "content": "Beijing."}
        ]
    }
]

Multi-turn data is supported only by the messages format.

RM and DPO

[
    {
        "prompt": "The capital of China is",
        "chosen": "Beijing.",
        "rejected": "Shanghai."
    }
]

[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "chosen": "Beijing.",
        "rejected": "Shanghai."
    }
]

PPO

[
    {
        "prompt": "The capital of China is",
        "answer": "Beijing"
    }
]

[
    {
        "messages": [
            {"role": "user", "content": "What is the capital of China?"}
        ],
        "answer": "Beijing"
    }
]
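The record layouts above can be checked before training with a small validator. This is a minimal sketch and not part of RL2: the function `detect_format` and the `REQUIRED_KEYS` table are hypothetical, and the key names are read off the examples in this section only.

```python
import json

# Keys each record needs besides "prompt" or "messages", taken from the
# examples above (an assumption, not RL2's own schema check).
REQUIRED_KEYS = {
    "sft": {"raw": {"response"}, "messages": set()},
    "rm": {"raw": {"chosen", "rejected"}, "messages": {"chosen", "rejected"}},
    "dpo": {"raw": {"chosen", "rejected"}, "messages": {"chosen", "rejected"}},
    "ppo": {"raw": {"answer"}, "messages": {"answer"}},
}


def detect_format(record: dict, trainer: str) -> str:
    """Return "raw" or "messages", or raise ValueError if keys are missing."""
    if "messages" in record:
        fmt = "messages"
    elif "prompt" in record:
        fmt = "raw"
    else:
        raise ValueError("record needs either 'prompt' or 'messages'")
    missing = REQUIRED_KEYS[trainer][fmt] - set(record)
    if missing:
        raise ValueError(f"{trainer} record is missing keys: {sorted(missing)}")
    return fmt


records = json.loads('[{"prompt": "The capital of China is", "answer": "Beijing"}]')
print([detect_format(r, "ppo") for r in records])  # ['raw']
```

Running such a check over a whole file before launching catches a misspelled key early instead of partway through a job.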

Environments [Examples]

In PPO, the language model interacts with the environment through a user-defined function step in the following format.

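from typing import Optional, Tuple

# parse_action_type, parse_query, search_result, parse_pred, and is_equivalent
# are placeholders for helpers that you define yourself.
async def step(
    state: str, action: str, answer
) -> Tuple[Optional[str], float, bool]:
    action_type = parse_action_type(action)
    if action_type == "search":
        # Intermediate turn: append the retrieved passage and keep going.
        query = parse_query(action)
        passage = await search_result(query)
        next_state = state + action + passage
        reward = 0.0
        done = False
    elif action_type == "answer":
        # Final turn: score the prediction against the reference answer.
        next_state = None
        pred = parse_pred(action)
        reward = float(is_equivalent(pred, answer))
        done = True
    else:
        # Unrecognized action: end the episode with zero reward.
        next_state, reward, done = None, 0.0, True
    return next_state, reward, done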

state and action are the language model's input and output in the last turn, and next_state is its input in the next turn. When state + action is a prefix of next_state, the two turns are processed as one sequence. reward is the reward for the last turn, and done indicates whether the episode terminates after it. Define the function in a Python script whose path is specified by actor.rollout.env_path.
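As a concrete single-turn case, the sketch below follows the step signature shown above and rewards an exact-match answer. The `<answer>...</answer>` tag convention and the helper `extract_answer` are assumptions made for illustration; RL2 does not prescribe them.

```python
import asyncio
import re
from typing import Optional, Tuple


def extract_answer(action: str) -> Optional[str]:
    """Hypothetical helper: return the text inside <answer>...</answer>."""
    match = re.search(r"<answer>(.*?)</answer>", action, re.DOTALL)
    return match.group(1).strip() if match else None


async def step(
    state: str, action: str, answer: str
) -> Tuple[Optional[str], float, bool]:
    pred = extract_answer(action)
    # Case-insensitive exact match; a missing answer tag scores zero.
    correct = pred is not None and pred.lower() == answer.strip().lower()
    # Single-turn task: every action ends the episode, so there is no next state.
    return None, float(correct), True


if __name__ == "__main__":
    result = asyncio.run(
        step("What is the capital of China?", "<answer>Beijing</answer>", "Beijing")
    )
    print(result)  # (None, 1.0, True)
```

Because the function is a coroutine, an environment that calls a slow tool (a search API, a code sandbox) can await it without blocking other rollouts.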

Launch [Examples]

Use torchrun to launch the trainer. For example, on a single node:

torchrun \
    --nproc_per_node=<number of GPUs> \
    -m RL2.trainer.ppo \
    <args>

On multiple nodes:

torchrun \
    --nnodes=<number of nodes> \
    --node_rank=<rank of node> \
    --nproc_per_node=<number of GPUs on a node> \
    --master_addr=<address of master node> \
    --master_port=<port of master node> \
    -m RL2.trainer.ppo \
    <args>

Hyper-Parameters

Training Engine Partition

By default (ddp_size=1, tp_size=1), the model is partitioned via ZeRO stage 3. ddp_size specifies the number of copies of the model parameters: a larger ddp_size increases memory consumption and reduces communication cost. For large models, set tp_size > 1 to enable tensor parallelism. The product of ddp_size and tp_size must divide the total number of GPUs.
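The divisibility constraint can be checked before launching. The sketch below is illustrative only: `shard_group_size` is a hypothetical helper, and the assumption that each parameter copy is sharded over world_size / (ddp_size * tp_size) GPUs is inferred from the description above rather than taken from RL2's code.

```python
def shard_group_size(world_size: int, ddp_size: int = 1, tp_size: int = 1) -> int:
    """Return the number of GPUs that jointly shard one parameter copy.

    Assumption: the GPUs form a (ddp_size, shard, tp_size) grid, so the
    shard dimension has world_size / (ddp_size * tp_size) entries.
    """
    if ddp_size < 1 or tp_size < 1:
        raise ValueError("ddp_size and tp_size must be positive")
    if world_size % (ddp_size * tp_size) != 0:
        raise ValueError(
            f"ddp_size * tp_size = {ddp_size * tp_size} "
            f"does not divide the number of GPUs ({world_size})"
        )
    return world_size // (ddp_size * tp_size)


print(shard_group_size(8))                         # 8
print(shard_group_size(8, ddp_size=2, tp_size=2))  # 2
```

With 8 GPUs, ddp_size=3 is rejected because 3 does not divide 8.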

Sequence Length

For SFT, RM, and DPO, max_length is used to truncate sequences. In RM and DPO, the chosen and rejected sequences are packed together, so the actual sequence length can be up to twice max_length. For PPO, max_new_tokens is used to terminate generations. The length of any sequence cannot exceed sp_size * tp_size * max_length_per_device.
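The length rules above reduce to simple arithmetic. In this sketch the function names are hypothetical and the numbers are arbitrary; the formulas only restate the two rules in this section for the SFT, RM, and DPO trainers.

```python
def max_sequence_length(sp_size: int, tp_size: int, max_length_per_device: int) -> int:
    """Upper bound on the length of any single sequence."""
    return sp_size * tp_size * max_length_per_device


def fits(trainer: str, max_length: int, budget: int) -> bool:
    """Check the worst-case packed length of one sample against the budget."""
    if trainer not in ("sft", "rm", "dpo"):
        raise ValueError("this sketch only covers sft, rm, and dpo")
    # RM and DPO pack chosen and rejected together, so the worst case doubles.
    worst_case = 2 * max_length if trainer in ("rm", "dpo") else max_length
    return worst_case <= budget


budget = max_sequence_length(sp_size=2, tp_size=2, max_length_per_device=8192)
print(budget)                      # 32768
print(fits("dpo", 16384, budget))  # True
print(fits("dpo", 20000, budget))  # False
```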

Algorithm

The default algorithm is Dr. GRPO, where the loss is averaged at the token level and the advantage is not divided by the standard deviation.
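The two design choices can be sketched in plain Python. The functions below are illustrative and not RL2's implementation; they assume group-relative advantages (each reward minus the mean reward of the responses to the same prompt), and the use of the population standard deviation and the `eps` constant are choices made for this sketch.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(
    rewards: List[float], norm_var: bool = False, eps: float = 1e-6
) -> List[float]:
    """Advantages for the responses sampled for one prompt."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if norm_var:
        # GRPO-style: also divide by the group's standard deviation.
        std = pstdev(rewards)
        return [a / (std + eps) for a in centered]
    # Dr. GRPO-style: keep the centered rewards unscaled.
    return centered


def average_loss(token_losses: List[List[float]], level: str = "token") -> float:
    """Average per-token losses over a batch of sequences."""
    if level == "token":
        # Every token has equal weight, so long sequences count for more.
        return mean([loss for seq in token_losses for loss in seq])
    if level == "sequence":
        # Every sequence has equal weight, regardless of its length.
        return mean([mean(seq) for seq in token_losses])
    raise ValueError("level must be 'token' or 'sequence'")


print(group_advantages([1.0, 0.0, 0.0, 1.0]))         # [0.5, -0.5, -0.5, 0.5]
losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]
print(average_loss(losses, "token"))                  # 1.4
print(average_loss(losses, "sequence"))               # 2.0
```

The last two lines show why the averaging level matters: the single-token sequence contributes one fifth of the token-level loss but one half of the sequence-level loss.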

To use OpenAI PPO, set kl.type=reward, kl.reward_estimator=k1, and adv.estimator=gae
To use DeepSeek GRPO, set actor.avg_level=sequence, kl.type=loss, kl.loss_estimator=k3, and adv.norm_var=true
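The names k1 and k3 presumably refer to the per-sample KL estimators described in John Schulman's note "Approximating KL Divergence"; that reading is an assumption, since this section does not define them. Under that assumption, for a token sampled from the current policy with log-probability `logp` and reference log-probability `ref_logp`, the two estimators can be sketched as follows (the function names are hypothetical).

```python
import math


def kl_k1(logp: float, ref_logp: float) -> float:
    """k1: the log-ratio log p(x) - log p_ref(x). Unbiased, may be negative."""
    return logp - ref_logp


def kl_k3(logp: float, ref_logp: float) -> float:
    """k3: (r - 1) - log r with r = p_ref(x) / p(x). Unbiased, never negative."""
    log_ratio = ref_logp - logp
    # math.expm1(x) computes exp(x) - 1 accurately for small x.
    return math.expm1(log_ratio) - log_ratio


logp, ref_logp = math.log(0.5), math.log(0.25)
print(kl_k1(logp, ref_logp))
print(kl_k3(logp, ref_logp))
```

Both estimators average to the KL divergence between the policy and the reference when tokens are sampled from the policy; k3 trades a more complicated expression for lower variance and a guaranteed non-negative value per sample.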

Acknowledgement

This project builds on many remarkable projects, including but not limited to:

DeepSpeedChat for the proposal of hybrid engine
RingFlashAttention for the support of Llama context parallelism
SGLang for the support of async inference engine

We also thank OpenRLHF and veRL for their pioneering work.

Citation

If you find this library useful, please cite in the following format

@misc{Tan2025RL2,
    author={Chenmien Tan and Simon Yu and Lanbo Lin and Ze Zhang and Yuanwu Xu and Chenhao Jiang and Tianyuan Yang and Sicong Xie and Guannan Zhang},
    title={RL2: Ray Less Reinforcement Learning},
    note={GitHub repository},
    howpublished={\url{https://github.com/ChenmienTan/RL2}},
    year={2025}
}
