You can install ReLIFT dependencies by running the following commands:
conda create -n relift python=3.10
conda activate relift
cd relift
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .

Using vllm==0.8.3 is also fine. Please check the official vLLM documentation for instructions on how to upgrade.
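To confirm which version you ended up with, vLLM exposes its version string:

# Print the installed vLLM version (0.8.3 also works with this repo).
import vllm
print(vllm.__version__)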
If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use the following version:
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
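If the prebuilt wheel above does not match your environment, note that the wheel filename encodes what you need: CUDA major version (cu12), torch version (torch2.4), C++ ABI (cxx11abiFALSE), and CPython version (cp310). A quick sketch to print the values to match when picking a wheel from the releases page:

# Print the toolchain details encoded in flash-attn wheel filenames.
import sys
import torch
print("python tag:", f"cp{sys.version_info.major}{sys.version_info.minor}")  # cp310
print("torch:", torch.__version__)                                           # torch2.4
print("cuda:", torch.version.cuda)                                           # cu12
print("cxx11 abi:", torch._C._GLIBCXX_USE_CXX11_ABI)                         # cxx11abiFALSE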
This repository includes:
- ReLIFT: Code for training ReLIFT, which interleaves RL with fine-tuning on the hardest questions. Our main code changes are in ReLIFT/verl/verl/relift.
- dataset: Datasets for training and evaluating ReLIFT.
- examples: Example scripts to train ReLIFT.
- eval_scripts: Evaluation scripts.
ReLIFT is a training method that interleaves RL with online fine-tuning (FT), achieving superior performance and efficiency compared to using RL or SFT alone.
- RL Interleaved with Fine-Tuning: Combines RL with online fine-tuning, enabling the model to learn aspects that RL alone cannot capture (see the conceptual sketch after this list).
- Efficient Online Fine-Tuning: Requires only 13% of demonstration data, focusing exclusively on areas where RL falls short.
- Superior Performance: Achieves better performance and efficiency compared to using RL or SFT alone.
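Below is a conceptual sketch of one ReLIFT training step. This is illustrative only; the actual implementation lives in ReLIFT/verl/verl/relift, and every helper here is a hypothetical stand-in rather than the repo's API.

# Conceptual sketch of one ReLIFT training step (illustrative only).
from typing import Callable, Dict, List

def relift_step(
    policy,                    # current policy model
    questions: List[str],      # RL prompt batch
    demos: Dict[str, str],     # expert demonstrations keyed by question
    sample: Callable,          # sample(policy, q) -> list of responses
    reward: Callable,          # reward(q, response) -> 0 or 1 (verifier)
    rl_update: Callable,       # one RL policy update on scored rollouts
    ft_update: Callable,       # one SFT update on demonstration pairs
):
    # 1) Standard RL: sample rollouts and update on verifiable rewards.
    rollouts = {q: sample(policy, q) for q in questions}
    rewards = {q: [reward(q, r) for r in rollouts[q]] for q in questions}
    policy = rl_update(policy, rollouts, rewards)

    # 2) Hardest questions: every sampled rollout failed, so RL receives
    #    no useful learning signal on them.
    hardest = [q for q in questions if max(rewards[q]) == 0]

    # 3) Online FT on demonstrations for only those questions, which is how
    #    a small demonstration budget covers what RL alone can't learn.
    policy = ft_update(policy, [(q, demos[q]) for q in hardest if q in demos])
    return policy

The key point is step 2: questions where every rollout fails yield zero gradient signal for RL, so exactly those questions are routed to online fine-tuning.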
You need to first run the data preparation script to get the training data in parquet format.
cd dataset
python prepare_train_luffy_format.py
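To sanity-check the prepared data, you can inspect the parquet output (a minimal sketch; the file name below is an assumption, so use whatever path the script actually writes):

# Peek at the prepared training data.
import pandas as pd
df = pd.read_parquet("train.parquet")  # hypothetical output name
print(df.columns.tolist())  # inspect the schema
print(df.head(2))           # look at the first rows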
Before training, you need to download RoadQAQ/Qwen2.5-Math-1.5B-16k-think, RoadQAQ/Qwen2.5-Math-7B-think, and RoadQAQ/Qwen2.5-7B-think. If you find downloading too difficult, you can modify the configuration files instead. For Qwen2.5-Math-1.5B-16k-think and Qwen2.5-Math-7B-think, update config.json to enable longer responses.
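For example, one way to allow longer generations is to raise the maximum context length in the downloaded checkpoint's config.json (a sketch; max_position_embeddings is the standard Hugging Face config field, but the 16384 target below is our assumption based on the "-16k" model name, so check the released configs for the exact values):

# Sketch: enable longer responses for a downloaded checkpoint.
import json
path = "Qwen2.5-Math-1.5B/config.json"   # hypothetical local path
with open(path) as f:
    cfg = json.load(f)
cfg["max_position_embeddings"] = 16384   # assumed target from the "-16k" name
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)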
We provide three example scripts for training. You can run the following command to train ReLIFT for different base models:
sh ./examples/math-7b/train.sh

If you want to train on multiple nodes, you can run the following commands:
source ./examples/ray_start.sh # on master node
source ./examples/ray_connect.sh # on client nodes
sh ./examples/math-7b/train_multi_nodes.sh # on master node
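You can confirm that every client node actually joined the cluster with a quick check using Ray's public API:

# Verify the Ray cluster from the master node.
import ray
ray.init(address="auto")        # attach to the cluster started by ray_start.sh
print(len(ray.nodes()), "nodes connected")
print(ray.cluster_resources())  # total CPUs/GPUs visible to Ray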
We provide evaluation scripts. You can evaluate using the following command:

sh ./eval_scripts/inference.sh

Here's an example of using ReLIFT for inference:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero"
question = "which number is larger? 9.11 or 9.9?"

# Build the prompt with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
| Model | Base Model |
|---|---|
| RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero | Qwen2.5-Math-7B |
| RoadQAQ/ReLIFT-Qwen2.5-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| RoadQAQ/ReLIFT-Qwen2.5-7B-Zero | Qwen2.5-7B |
Future work:
- Extending to 32B models.
- Finding a more stable way to interleave RL with FT.
- More results on multi-task and cross-task learning.
- Theoretical proof.
If you find this work helpful, please cite our paper:
@article{ma2025learning,
  title={Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions},
  author={Ma, Lu and Liang, Hao and Qiang, Meiyi and Tang, Lexiang and Ma, Xiaochen and Wong, Zhen Hao and Niu, Junbo and Shen, Chengyu and He, Runming and Cui, Bin and others},
  journal={arXiv preprint arXiv:2506.07527},
  year={2025}
}

ReLIFT builds upon LUFFY, veRL, and deepscaler, and uses vLLM for inference and Math-Verify for math reasoning evaluation. We thank the open-source community for code, datasets, and backbones, including LUFFY, veRL, deepscaler, NuminaMath, OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.