SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Zhenghai Xue* · Longtao Zheng* · Qian Liu · Yingru Li · Zejun Ma · Bo An

This repository trains LLMs to perform multi-turn Tool-Integrated Reasoning (TIR) with RL: the model iteratively generates code, executes it, and reasons over the execution results. This capability lets models tackle complex mathematical problems, carry out data analysis, and perform multi-step reasoning that resembles human problem solving.
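The generate-execute-reason loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's rollout code: `tir_rollout`, `run_code`, and the `generate` callable are hypothetical names, the fenced `python` / `output` turn format is assumed, and `run_code` executes in-process rather than in a real sandbox.

```python
import contextlib
import io
import re
from typing import Callable, List

FENCE = "`" * 3  # three backticks, built indirectly to keep this listing readable
# Assumed turn format: the model wraps code in a fenced "python" block.
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_code(code: str) -> str:
    """Run code in-process and return its stdout. NOT a sandbox; illustration only."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:  # report errors back to the model as plain text
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


def tir_rollout(generate: Callable[[str], str], prompt: str, max_turns: int = 5) -> List[str]:
    """Alternate model turns and code execution until a turn contains no code."""
    context, turns = prompt, []
    for _ in range(max_turns):
        turn = generate(context)
        turns.append(turn)
        match = CODE_BLOCK.search(turn)
        if match is None:  # nothing to execute: treat this turn as the last one
            break
        output = run_code(match.group(1))
        # Append the turn and the tool output so the next turn can reason over it.
        context += turn + "\n" + FENCE + "output\n" + output + FENCE + "\n"
    return turns
```

Here `max_turns` plays the same role as the `--max_turns` flag in the training and evaluation commands; any callable that maps the running context to the next model turn can be passed as `generate`.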

Key takeaways

Instability in multi-turn training. While Reinforcement Learning (RL) on a base model is stable for single-turn Tool-Integrated Reasoning (TIR), extending it to multi-turn TIR leads to unstable training and erratic performance.
Stabilizing training by filtering "void" turns. SimpleTIR attributes the instability to distributional drift from external tool outputs and to errors that compound across turns. By filtering out trajectories that contain void turns, i.e. turns that yield neither a code block nor a final answer, SimpleTIR achieves stable multi-turn training and outperforms other approaches.
Diverse reasoning patterns with end-to-end multi-turn RL. Unlike the biased reasoning pattern imposed by Supervised Fine-Tuning (SFT), our end-to-end RL approach yields more diverse reasoning patterns, such as inductive reasoning, self-correction, cross-validation, and progressive reasoning.
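The void-turn criterion from the second takeaway can be sketched as a small filter. This is an illustrative sketch rather than the repository's implementation: the function names are hypothetical, and it assumes that code arrives in a complete fenced `python` block and that final answers are written with `\boxed{...}`. It only computes a keep/drop mask; how the mask is applied during training (for example by the `--mask_void_turns` option) is not shown.

```python
import re
from typing import List

FENCE = "`" * 3  # three backticks, built indirectly to keep this listing readable
# Assumption: a complete code block has both an opening and a closing fence.
CODE_BLOCK = re.compile(FENCE + r"python\n.*?" + FENCE, re.DOTALL)
# Assumption: final answers are marked with \boxed{...}.
FINAL_ANSWER = re.compile(r"\\boxed\{")


def is_void_turn(turn: str) -> bool:
    """A turn is void if it has neither a complete code block nor a final answer."""
    return CODE_BLOCK.search(turn) is None and FINAL_ANSWER.search(turn) is None


def keep_mask(trajectories: List[List[str]]) -> List[bool]:
    """Return True for trajectories with no void turn, False for those to filter out."""
    return [not any(is_void_turn(turn) for turn in traj) for traj in trajectories]
```

A trajectory whose code block is cut off before the closing fence (for instance by the response length limit) has a void turn under this definition and is marked for filtering, even if a later turn produces an answer.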

For implementation details and our experimental findings, please see the accompanying blog post. A technical paper is in preparation and will be released soon.

Quickstart

We trained SimpleTIR on multiple H100 nodes and tested the code with vllm==0.8.5. For training or evaluation across multiple nodes, we recommend submitting jobs through Ray (cf. the DAPO setup).

We also recommend using a highly parallel sandbox for code execution. Training uses an internal sandbox by default, but we also provide an example that uses a local firejail sandbox: see sandbox/ and set the environment variable SANDBOX_ENDPOINT to the sandbox endpoint. Note that we have only tested the local sandbox for evaluation on a single node.
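Because a missing SANDBOX_ENDPOINT would otherwise only surface once the first code block is executed, it can help to check the variable before launching a job. The helper below is a hypothetical sketch, not part of the repository; it assumes the endpoint is an HTTP(S) URL, which the text above does not state.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def get_sandbox_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SANDBOX_ENDPOINT, failing early if it is unset or malformed."""
    env = os.environ if env is None else env
    endpoint = env.get("SANDBOX_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("SANDBOX_ENDPOINT is not set; see sandbox/ for setup.")
    parsed = urlparse(endpoint)
    # Assumption: the sandbox is reached over HTTP(S).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"SANDBOX_ENDPOINT does not look like an HTTP URL: {endpoint!r}")
    return endpoint
```

Calling `get_sandbox_endpoint()` with no arguments reads the process environment; passing a dictionary makes the check easy to exercise in isolation.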

Training

Example command to run 7B training on an 8xH100 node:

# MODEL_PATH: the parent dir of the checkpoint
# DATA_PATH: the dir containing data like deepscaler/train (see datasets/)
# CHECKPOINT_PATH: the dir to save the checkpoint
# LOG_PATH: the dir to save the log
MODEL_PATH=... \
DATA_PATH=... \
CHECKPOINT_PATH=... \
LOG_PATH=... \
NNODES=... \
GPUS_PER_NODE=... \
RESUME=False \
CONFIG_NAME=simpletir_trainer \
bash train.sh \
  --max_response_length 8000 \
  --max_prompt_length 16000 \
  --model_name Qwen2.5-7B \
  --max_turns 5 \
  --train_batch_size 512 \
  --val_sample_size 50 \
  --n_val 16 \
  --train_dataset "simplelr_math_35/train deepscaler/train"

Example command to run 7B single-turn training on an 8xH100 node:

# MODEL_PATH: the parent dir of the checkpoint
# DATA_PATH: the dir containing data like deepscaler/train (see datasets/)
# CHECKPOINT_PATH: the dir to save the checkpoint
# LOG_PATH: the dir to save the log
MODEL_PATH=... \
DATA_PATH=... \
CHECKPOINT_PATH=... \
LOG_PATH=... \
NNODES=... \
GPUS_PER_NODE=... \
RESUME=False \
CONFIG_NAME=single_turn_math \
bash train.sh \
  --max_response_length 8000 \
  --max_prompt_length 4096 \
  --model_name Qwen2.5-7B \
  --train_dataset "simplelr_math_35/train deepscaler/train" \
  --tool_use False \
  --mask_void_turns False \
  --train_batch_size 512 \
  --val_sample_size 50 \
  --n_val 16

To resume a previous training run, simply set RESUME to True.

Inference

Example command to run 7B evaluation on an 8xH100 node:

# MODEL_PATH: the parent dir of the checkpoint
# DATA_PATH: the dir containing data like deepscaler/aime (see datasets/)
# CHECKPOINT_PATH: the dir to save the checkpoint
# LOG_PATH: the dir to save the log
# <MODEL_NAME>: the name of the checkpoint
MODEL_PATH=... \
DATA_PATH=... \
CHECKPOINT_PATH=... \
LOG_PATH=... \
NNODES=... \
GPUS_PER_NODE=... \
RESUME=False \
CONFIG_NAME=simpletir_trainer \
bash train.sh \
  --max_response_length 12000 \
  --max_prompt_length 36000 \
  --model_name <MODEL_NAME> \
  --max_turns 10 \
  --valid_dataset "deepscaler/aime" \
  --val_only True \
  --n_val 32 \
  --output_acc_to_file True \
  --val_sample_size 500 \
  --sp_size 2

export RAY_memory_usage_threshold=1.0
pip install word2number
pip install math_verify

To evaluate a trained checkpoint, first convert it to Hugging Face format using scripts/model_merger.sh.

Acknowledgement

We thank verl and Search-R1 for their open-source code.

Citation

If you find this codebase useful, please consider starring the repository and citing our paper:

@article{xue2025simpletir,
  title={SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning},
  author={Xue, Zhenghai and Zheng, Longtao and Liu, Qian and Li, Yingru and Zheng, Xiaosen and Ma, Zejun and An, Bo},
  journal={arXiv preprint arXiv:2509.02479},
  year={2025}
}
