
ToolRL: Reward is All Tool Learning Needs

🤗 Model | 📊 Dataset | 📖 Paper

Figure: ToolRL data pipeline overview.

ToolRL is the code repository for the paper "ToolRL: Reward is All Tool Learning Needs".

Our code is built upon veRL and TinyZero.

🔍 Installation

Please install torch, vllm, and ray according to your own environment configuration. A setup example adapted from TinyZero is shown below:

# install torch
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip install vllm==0.6.3
pip install ray

Next, install verl from the current project directory, along with Flash Attention 2.

# verl
pip install -e .

# flash attention 2
pip install flash-attn --no-build-isolation
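
To confirm the environment is set up correctly, a quick sanity check such as the following can help (this check is not part of the original setup steps; it simply verifies that the packages import and a GPU is visible):

# verify that the core dependencies import and CUDA is available
python -c "import torch, vllm, ray, flash_attn; print(torch.__version__, torch.cuda.is_available())"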

📊 Dataset

We provide the raw dataset in ./dataset/rlla_4k_raw, which consists of 2K ToolACE data, 1K Hammer (Masked) data, and 1K xLAM data.

The SFT data includes thought content distilled from DeepSeek-R1, whereas the RL data contains only placeholders in the thought field.

For training purposes, the raw data must be further processed. The processed RL training data is available at ./dataset/rlla_4k.
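
If you want a quick look at the processed split before training, something like the following works. This is a sketch only: it assumes the processed data is stored as parquet files under ./dataset/rlla_4k, and the file name train.parquet is a guess — adjust it to whatever is actually present.

# list the processed files and peek at the first record (adjust the file name as needed)
ls ./dataset/rlla_4k
python -c "import pandas as pd; df = pd.read_parquet('./dataset/rlla_4k/train.parquet'); print(df.columns.tolist()); print(df.iloc[0])"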

🧪 Training

For GRPO and PPO training, specify the configuration in ./train_grpo.sh or ./train_ppo.sh respectively, including the BASE_MODEL and EXPERIMENT_NAME variables. The dataset path defaults to ./dataset/rlla_4k.
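
For instance, the relevant variables near the top of ./train_grpo.sh could be set as follows (the model path and experiment name below are placeholders, not values shipped with the repository):

# inside ./train_grpo.sh -- example values only
BASE_MODEL=Qwen/Qwen2.5-7B-Instruct      # placeholder: any local or Hugging Face model path
EXPERIMENT_NAME=toolrl-grpo-qwen2.5-7b   # placeholder: a name for this run

Once configured, launch training with: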

bash train_grpo.sh  # For GRPO Training
bash train_ppo.sh  # For PPO Training

Reward variants

The training script uses the reward function introduced in Section 3.3 (Reward Design) of the paper by default. Below we list several reward variants that can be activated via environment variables. All reward-specific environment variables are handled in the core reward module located at ./verl/utils/reward_score/rlla.py.

export WITHLENGTH=1 # Add the settled length reward function (Section 5.1)
export SCHEDULELENGTH=1 # Add the dynamic length reward function (Section 5.1)
export CORRECTMAX1=1 # Change to equal max (Section 5.2)
export MAX1STEP30MAX3=1 # Change to two-stage scale (Section 5.2)
export SCHEDULEREWARD=1 # Change to smooth dynamic scale (Section 5.2)
export REFINEDREWARD=1 # Change to fine-grained reward (Section 5.3)
export INTERMEDIATEREWARD=1 # Change to intermediate reward (Section 5.3)
export COARSEREWARD=1 # Change to coarse reward (Section 5.3)

To train the model with a specific reward variant, please enable one of the environment variables described above.
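
For example, to train with the dynamic length reward variant (Section 5.1), the run would look like:

export SCHEDULELENGTH=1   # enable the dynamic length reward
bash train_grpo.sh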

🖊️ Citation

@article{qian2025toolrl,
  title={ToolRL: Reward is All Tool Learning Needs},
  author={Qian, Cheng and Acikgoz, Emre Can and He, Qi and Wang, Hongru and Chen, Xiusi and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Ji, Heng},
  journal={arXiv preprint arXiv:2504.13958},
  year={2025}
}
