You can install ReLIFT dependencies by running the following commands:
conda create -n relift python=3.10
conda activate relift
cd relift
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .

Using vllm==0.8.3 is also fine. Please check the official vLLM documentation for instructions on how to upgrade.
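To confirm which version you ended up with, vLLM exposes its version string:

# Print the installed vLLM version (0.8.3 also works with this repo).
import vllm
print(vllm.__version__)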
If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use the following version:
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
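If the prebuilt wheel above does not match your environment, note that the wheel filename encodes what you need: CUDA major version (cu12), torch version (torch2.4), C++ ABI (cxx11abiFALSE), and CPython version (cp310). A quick sketch to print the values to match when picking a wheel from the releases page:

# Print the toolchain details encoded in flash-attn wheel filenames.
import sys
import torch
print("python tag:", f"cp{sys.version_info.major}{sys.version_info.minor}")  # cp310
print("torch:", torch.__version__)                                           # torch2.4
print("cuda:", torch.version.cuda)                                           # cu12
print("cxx11 abi:", torch._C._GLIBCXX_USE_CXX11_ABI)                         # cxx11abiFALSE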
This repository includes:
- ReLIFT: Code for training ReLIFT, which interleaves RL with fine-tuning on the hardest questions. Our main code changes are in ReLIFT/verl/verl/relift.
- dataset: Datasets for training and evaluating ReLIFT.
- examples: Example scripts to train ReLIFT.
- eval_scripts: Evaluation scripts.
ReLIFT is a training method that interleaves RL with online fine-tuning (FT), achieving superior performance and efficiency compared to using RL or SFT alone.
- RL Interleaved with Fine-Tuning: Combines RL with online fine-tuning, enabling the model to learn aspects that RL alone cannot capture (see the conceptual sketch after this list).
- Efficient Online Fine-Tuning: Requires only 13% of demonstration data, focusing exclusively on areas where RL falls short.
- Superior Performance: Achieves better performance and efficiency compared to using RL or SFT alone.
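Below is a conceptual sketch of one ReLIFT training step. This is illustrative only; the actual implementation lives in ReLIFT/verl/verl/relift, and every helper here is a hypothetical stand-in rather than the repo's API.

# Conceptual sketch of one ReLIFT training step (illustrative only).
from typing import Callable, Dict, List

def relift_step(
    policy,                    # current policy model
    questions: List[str],      # RL prompt batch
    demos: Dict[str, str],     # expert demonstrations keyed by question
    sample: Callable,          # sample(policy, q) -> list of responses
    reward: Callable,          # reward(q, response) -> 0 or 1 (verifier)
    rl_update: Callable,       # one RL policy update on scored rollouts
    ft_update: Callable,       # one SFT update on demonstration pairs
):
    # 1) Standard RL: sample rollouts and update on verifiable rewards.
    rollouts = {q: sample(policy, q) for q in questions}
    rewards = {q: [reward(q, r) for r in rollouts[q]] for q in questions}
    policy = rl_update(policy, rollouts, rewards)

    # 2) Hardest questions: every sampled rollout failed, so RL receives
    #    no useful learning signal on them.
    hardest = [q for q in questions if max(rewards[q]) == 0]

    # 3) Online FT on demonstrations for only those questions, which is how
    #    a small demonstration budget covers what RL alone can't learn.
    policy = ft_update(policy, [(q, demos[q]) for q in hardest if q in demos])
    return policy

The key point is step 2: questions where every rollout fails yield zero gradient signal for RL, so exactly those questions are routed to online fine-tuning.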
You need to first run the data preparation script to get the training data in parquet format.
cd dataset
python prepare_train_luffy_format.py
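To sanity-check the prepared data, you can inspect the parquet output (a minimal sketch; the file name below is an assumption, so use whatever path the script actually writes):

# Peek at the prepared training data.
import pandas as pd
df = pd.read_parquet("train.parquet")  # hypothetical output name
print(df.columns.tolist())  # inspect the schema
print(df.head(2))           # look at the first rows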
Before training, you need to download RoadQAQ/Qwen2.5-Math-1.5B-16k-think, RoadQAQ/Qwen2.5-Math-7B-think, and RoadQAQ/Qwen2.5-7B-think. If you find downloading too difficult, you can modify the configuration files instead. For Qwen2.5-Math-1.5B-16k-think and Qwen2.5-Math-7B-think, update config.json to enable longer responses.
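For example, one way to allow longer generations is to raise the maximum context length in the downloaded checkpoint's config.json (a sketch; max_position_embeddings is the standard Hugging Face config field, but the 16384 target below is our assumption based on the "-16k" model name, so check the released configs for the exact values):

# Sketch: enable longer responses for a downloaded checkpoint.
import json
path = "Qwen2.5-Math-1.5B/config.json"   # hypothetical local path
with open(path) as f:
    cfg = json.load(f)
cfg["max_position_embeddings"] = 16384   # assumed target from the "-16k" name
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)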
We provide three example scripts for training. You can run the following command to train ReLIFT for different base models:
sh ./examples/math-7b/train.sh

If you want to train on multiple nodes, you can run the following commands:
source ./examples/ray_start.sh # on master node
source ./examples/ray_connect.sh # on client nodes
sh ./examples/math-7b/train_multi_nodes.sh # on master node
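You can confirm that every client node actually joined the cluster with a quick check using Ray's public API:

# Verify the Ray cluster from the master node.
import ray
ray.init(address="auto")        # attach to the cluster started by ray_start.sh
print(len(ray.nodes()), "nodes connected")
print(ray.cluster_resources())  # total CPUs/GPUs visible to Ray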
We provide evaluation scripts. You can evaluate using the following command:

sh ./eval_scripts/inference.sh

Here's an example of using ReLIFT for inference:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero"
question = "which number is larger? 9.11 or 9.9?"

# Build the prompt with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
| Model | Base Model |
|---|---|
| RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero | Qwen2.5-Math-7B |
| RoadQAQ/ReLIFT-Qwen2.5-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| RoadQAQ/ReLIFT-Qwen2.5-7B-Zero | Qwen2.5-7B |
Future work:
- Extending to 32B models.
- Finding a more stable way to interleave RL with FT.
- More results on multi-task and cross-task learning.
- Theoretical proof.
If you find this work helpful, please cite our paper:
@article{ma2025learning,
  title={Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions},
  author={Ma, Lu and Liang, Hao and Qiang, Meiyi and Tang, Lexiang and Ma, Xiaochen and Wong, Zhen Hao and Niu, Junbo and Shen, Chengyu and He, Runming and Cui, Bin and others},
  journal={arXiv preprint arXiv:2506.07527},
  year={2025}
}

ReLIFT builds upon LUFFY, veRL, and deepscaler, and uses vLLM for inference and Math-Verify for math reasoning evaluation. We thank the open-source community for code, datasets, and backbones, including LUFFY, veRL, deepscaler, NuminaMath, OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.