
Learning What Reinforcement Learning Can't

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

Getting Started

Installation

You can install ReLIFT dependencies by running the following commands:

conda create -n relift python=3.10
conda activate relift
cd relift
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .

Using vllm==0.8.3 is also fine. Please check the official vLLM documentation for instructions on how to upgrade.
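
To confirm which vLLM build ended up in the environment, a quick check like the following can help (a minimal sketch; any version compatible with the pinned requirements should work):

# Sanity-check the installed vLLM version.
import vllm

print("vLLM version:", vllm.__version__)   # e.g. 0.8.3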

If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use the following version:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
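
After installing the wheel, a short import check (a sketch, not part of the repo) can confirm that flash-attn matches your PyTorch/CUDA build:

# Verify that flash-attn imports cleanly against the installed torch/CUDA build.
import torch
import flash_attn

print("flash-attn version:", flash_attn.__version__)   # expected 2.7.3 for the wheel above
print("torch CUDA version:", torch.version.cuda)
assert torch.cuda.is_available(), "flash-attn requires a CUDA-capable GPU at runtime"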

Repo Structure

This repository includes:

  • ReLIFT: Code for training ReLIFT, which interleaves RL with fine-tuning on the hardest questions. Our main code changes are in ReLIFT/verl/verl/relift.
  • dataset: Dataset for training and evaluating ReLIFT.
  • examples: Example script to train ReLIFT.
  • eval_scripts: Evaluation scripts.

Introduction

ReLIFT is a training method that interleaves RL with online fine-tuning (FT), achieving superior performance and efficiency compared to using RL or SFT alone. A conceptual sketch of the interleaving schedule follows the highlights below.

[Figure: ReLIFT overview]

Key Highlights:

  • RL Interleaved with Fine-Tuning: Combines RL with online fine-tuning, enabling the model to learn aspects that RL alone cannot capture.
  • Efficient Online Fine-Tuning: Requires only 13% of demonstration data, focusing exclusively on areas where RL falls short.
  • Superior Performance: Achieves better performance and efficiency compared to using RL or SFT alone.
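
The actual training logic lives in ReLIFT/verl/verl/relift; the toy loop below only illustrates the interleaving idea under simplifying assumptions (hypothetical function names and a binary "no rewarded rollout" criterion for the hardest questions, neither of which is the repo's API):

# Toy sketch of a ReLIFT-style step: RL on prompts the policy can already
# solve, online fine-tuning (FT) on the hardest ones. Illustrative only.
from typing import Callable, List, Tuple

def relift_step(
    batch: List[str],
    rollout: Callable[[str], List[Tuple[str, float]]],     # prompt -> [(response, reward), ...]
    rl_update: Callable[[List[Tuple[str, str, float]]], None],
    ft_update: Callable[[List[Tuple[str, str]]], None],
    get_demo: Callable[[str], str],                         # demonstration for a hard prompt
) -> None:
    rl_samples, hard_prompts = [], []
    for prompt in batch:
        samples = rollout(prompt)
        if all(reward == 0.0 for _, reward in samples):
            # No rollout earns a reward: RL has no learning signal here,
            # so defer this prompt to the fine-tuning step.
            hard_prompts.append(prompt)
        else:
            rl_samples.extend((prompt, resp, r) for resp, r in samples)
    rl_update(rl_samples)                                   # policy-gradient step on RL samples
    if hard_prompts:
        ft_update([(p, get_demo(p)) for p in hard_prompts]) # SFT step on demonstrations

Because only prompts where every rollout fails receive demonstrations, the schedule keeps demonstration usage to a small fraction of the data, consistent with the 13% figure above.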

Usage

Data Preparation

You need to first run the data preparation script to get the training data in parquet format.

cd dataset
python prepare_train_luffy_format.py
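
To sanity-check the output, you can load the generated parquet with pandas (a sketch; the output filename and columns below are placeholders, use whatever the preparation script actually writes under dataset/):

# Inspect the prepared training data (filename is a placeholder).
import pandas as pd

df = pd.read_parquet("train.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.head(2))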

Model Preparation

You need to first download RoadQAQ/Qwen2.5-Math-1.5B-16k-think, RoadQAQ/Qwen2.5-Math-7B-think, and RoadQAQ/Qwen2.5-7B-think.

If downloading the checkpoints is inconvenient, you can modify the base models' configuration files instead. For Qwen2.5-Math-1.5B-16k-think and Qwen2.5-Math-7B-think, update config.json to enable longer responses.
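
If you do download the checkpoints, huggingface_hub can fetch them; if you edit a local base model's config.json instead, the snippet below shows the kind of change meant by "enabling longer responses". The path, field, and value are assumptions for illustration; compare against the released *-think configs for the exact settings.

# Option 1: fetch a prepared checkpoint from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("RoadQAQ/Qwen2.5-Math-7B-think")

# Option 2: extend the context window of a local base model's config.json.
# The field and value below are assumptions about what "longer responses" means.
import json, pathlib

cfg_path = pathlib.Path("path/to/Qwen2.5-Math-1.5B/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["max_position_embeddings"] = 16384   # assumed target, matching the "16k" naming
cfg_path.write_text(json.dumps(cfg, indent=2))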

Training

We provide three example training scripts. You can run the following command to train ReLIFT for different base models:

  sh ./examples/math-7b/train.sh

If you want to train on multi nodes, you can run the following command:

  source ./examples/ray_start.sh # on master node
  source ./examples/ray_connect.sh # on client nodes
  sh ./examples/math-7b/train_multi_nodes.sh # on master node

Evaluation

We provide evaluation scripts. You can evaluate using the following command:

  sh ./eval_scripts/inference.sh

Inference

Here’s an example of using ReLIFT for inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero"

question = "which number is larger? 9.11 or 9.9?"

# Format the question with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the response with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
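
Since vLLM batches prompts natively, the same call extends to several questions at once (a minimal sketch reusing the llm, tokenizer, and params objects defined above):

# Batched inference over multiple questions with the same model and settings.
questions = [
    "which number is larger? 9.11 or 9.9?",
    "Compute 17 * 24.",
]
chats = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]
outputs = llm.generate(chats, params)   # one generation per prompt, batched by vLLM
for out in outputs:
    print(out.outputs[0].text)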

Models

Model                                    Base Model
RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero      Qwen2.5-Math-7B
RoadQAQ/ReLIFT-Qwen2.5-Math-1.5B-Zero    Qwen2.5-Math-1.5B
RoadQAQ/ReLIFT-Qwen2.5-7B-Zero           Qwen2.5-7B

Todo List

  • Extending to 32B model.
  • Finding a more stable way to interleave RL with FT.
  • More results on multi-task and cross-task learning.
  • Proof.

📖 Citation

If you find this work helpful, please cite our paper:

@article{ma2025learning,
  title={Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions},
  author={Ma, Lu and Liang, Hao and Qiang, Meiyi and Tang, Lexiang and Ma, Xiaochen and Wong, Zhen Hao and Niu, Junbo and Shen, Chengyu and He, Runming and Cui, Bin and others},
  journal={arXiv preprint arXiv:2506.07527},
  year={2025}
}

Acknowledgement

ReLIFT builds upon LUFFY, veRL, and deepscaler, and uses vLLM for inference. We use Math-Verify for math reasoning evaluation. We thank the open-source community for code, datasets, and backbones, including LUFFY, veRL, deepscaler, NuminaMath, OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.
