GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

📖 Paper | Work in progress.
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). It achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in the figure below, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks.

Figure: Comparison of various RL methods and an overview of the GPG method.
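In code, the update GPG performs is just a plain policy-gradient step on group-relative advantages. The following is a minimal sketch, not the repository's actual training code: tensor names and shapes are illustrative, and the placement of the normalization factor $F_{norm}$ and the correction factor $\alpha=\frac{B}{B-M}$ (both appear in the ablation table under the 7B results) reflects our reading of the description above.

```python
import torch

def gpg_loss(log_probs: torch.Tensor, rewards: torch.Tensor,
             use_std_norm: bool = False, eps: float = 1e-6) -> torch.Tensor:
    """Hedged sketch of a GPG-style loss (illustrative, not the official code).

    log_probs: (B, G) summed log-probability of each sampled response under the
               current policy, for B prompts with G responses per prompt (G > 1).
    rewards:   (B, G) scalar reward of each response.
    """
    # Group-relative advantage: center each response's reward within its group.
    adv = rewards - rewards.mean(dim=1, keepdim=True)
    if use_std_norm:
        # Optional normalization factor F_norm = std{R(o)} from the ablation table.
        adv = adv / (rewards.std(dim=1, keepdim=True) + eps)

    # alpha = B / (B - M): groups whose responses all receive the same reward have
    # zero advantage and contribute no gradient, so the loss is rescaled by the
    # fraction of groups that actually carry a learning signal.
    B = rewards.shape[0]
    M = int((rewards.std(dim=1) < eps).sum())
    alpha = B / max(B - M, 1)

    # Vanilla policy gradient: no critic, no reference model, no KL penalty,
    # and no clipped surrogate ratio.
    return -(alpha * adv.detach() * log_probs).mean()
```

Compared with GRPO, the only moving parts are the within-group statistics and the $\alpha$ correction; there is no value network, no reference model, and no KL term to tune.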

Resources

🤗 Models

  1. GPG-Open-RS1: an RL model trained with GPG on the Open-r1 dataset, using DeepSeek-R1-Distill-Qwen-1.5B as the base model.
  2. GPG-7B: an RL model trained with GPG on the simplelr_qwen_level3to5 dataset, using Qwen2.5-Math-7B as the base model (see the loading sketch below).
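Both checkpoints are ordinary causal language models, so they can be loaded with the standard transformers API. The repository id below is a placeholder, not a confirmed Hub path; substitute the actual location of the release you want.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: replace with the actual Hugging Face path of the GPG release.
model_id = "GPG-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 10 positive integers? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```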

Usage

Environment Installation

Clone this repository.

git clone git@github.com:AMAP-ML/GPG.git
cd GPG

Then follow the setup instructions of the sub-repositories you need and install their required packages.

Experiments on unimodal tasks

Please refer to the training script ./open-rs/train.sh and the recipes under ./open-rs/recipes.

The results are as follows:

Table: The zero-shot pass@1 performance of the 1.5B models distilled by DeepSeek-R1 across five mathematical reasoning benchmarks. $\dagger$: reproduced results using the released code. $\ddagger$: results from open-rs.

| Distilled 1.5B Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 48.9 | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 |
| Still-3-1.5B-Preview | 51.6 | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 |
| Open-RS1 $^\dagger$ | 53.1 | 33.3 | 83.8 | 67.5 | 29.8 | 50.9 |
| Open-RS3 $^\dagger$ | 52.0 | 26.7 | 85.4 | 70.0 | 27.9 | 50.2 |
| GPG-RS1 | 55.7 | 33.3 | 87.6 | 77.5 | 29.4 | 50.5 |
| GPG-RS3 | 55.5 | 33.3 | 85.0 | 80.0 | 26.8 | 52.4 |

Please refer to the training script: ./open-r1/train.sh

Table: The zero-shot pass@1 performance of the 7B models across five mathematical reasoning benchmarks. $\dagger$: reproduced results using the released code. $\ddagger$: results from open-rs. $^\star$: results from Dr.GRPO.

| 7B Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| Qwen-2.5-Math-7B-Instruct $^\ddagger$ | 43.8 | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 |
| Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
| Qwen2.5-Math-7B (no template) $^\star$ | 38.2 | 0.2 | 69.0 | 45.8 | 21.3 | 34.7 |
| rStar-Math-7B | - | 26.7 | 78.4 | 47.5 | - | 47.1 |
| Eurus-2-7B-PRIME | 48.9 | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 |
| Oat-Zero-7B | 51.4 | 43.3 | 80.0 | 62.7 | 30.1 | 41.0 |
| Oat-Zero-7B $^\dagger$ | 47.8 | 30.0 | 80.6 | 55.4 | 29.0 | 44.0 |
| OpenReasoner-Zero-7B @ 8k | 45.9 | 13.3 | 82.4 | 54.2 | 31.6 | 47.9 |
| SimpleRL-Zero-7B $^\star$ | 46.6 | 26.7 | 78.2 | 60.2 | 27.6 | 40.3 |
| GPG-7B | 57.7 | 36.7 | 84.6 | 82.5 | 39.0 | 45.8 |

Table: Math reasoning results on the Qwen2.5-Math-7B model. $\dagger$: reproduced using the released code.

| Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
| GRPO | 43.7 | 16.7 | 73.4 | 62.5 | 30.2 | 35.7 |
| GPG ($F_{norm}=1, \alpha=1$) | 43.9 | 23.3 | 76.3 | 52.5 | 30.1 | 37.4 |
| GPG ($F_{norm}=\mathrm{std}\{R(o)\}, \alpha=1$) | 45.3 | 23.3 | 73.6 | 60.0 | 30.5 | 39.3 |
| GPG ($F_{norm}=1, \alpha=\frac{B}{B-M}$) | 47.8 | 30.0 | 75.0 | 62.5 | 33.1 | 38.2 |
| GPG ($F_{norm}=1, \alpha=\frac{B}{B-M}, \beta_{th}=0.6$) | 48.3 | 30.0 | 76.2 | 62.5 | 34.2 | 39.0 |
| Dr. GRPO $^\dagger$ | 43.7 | 26.7 | 74.6 | 50.0 | 30.1 | 37.3 |
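For intuition on the $\alpha$ column: with $B$ prompt groups per batch, of which $M$ yield identical rewards for every sampled response (and hence zero advantage), $\alpha=\frac{B}{B-M}$ rescales the loss as if only the informative groups were in the batch. A toy computation with illustrative numbers (not taken from the paper):

```python
# Toy illustration of the alpha = B / (B - M) correction (illustrative numbers).
B = 8                    # prompt groups in the batch
M = 2                    # groups where every sampled response got the same reward
alpha = B / (B - M)
print(alpha)             # 1.333... -> the 6 informative groups are up-weighted
```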

Experiments on multimodal tasks

Experiments on VisualThinker-R1-Zero

Please refer to the training script: ./VisualThinker-R1-Zero/src/open-r1-multimodal/run_grpo_SAT.sh

The results are as follows:

Table: Visual reasoning results on CV-Bench, showing that GPG training on the base model achieves better overall performance than both GRPO training and the base model itself.

| Models | Total | Count | Relation | Depth | Distance |
|---|---|---|---|---|---|
| Qwen2-VL-2B | 31.38 | 54.69 | 22.46 | 0.16 | 31.66 |
| + SFT | 57.84 | 60.02 | 68.92 | 55.00 | 45.83 |
| + GRPO | 59.47 | 59.64 | 66.76 | 54.16 | 56.66 |
| + GPG | 76.15 | 66.62 | 83.23 | 81.66 | 75.50 |

Experiments on Visual-RFT

Please refer to the training script: ./Visual-RFT/src/scripts/

The results are as follows:

Table: Reasoning grounding results on LISA. GPG surpasses GRPO in reasoning grounding with 239 training images.

| Models | mIoU (test) | mIoU (val) | gIoU (test) |
|---|---|---|---|
| Qwen2-VL-2B | 26.9 | 30.1 | 25.3 |
| + SFT | 28.3 | 29.7 | 25.3 |
| + GRPO | 37.6 | 34.4 | 34.4 |
| + GPG | 51.8 | 51.3 | 50.4 |

Table: 4-shot results on four fine-grained classification datasets. GPG shows consistently better results than GRPO on all four datasets.

| Models | Average | Flower102 | Pets37 | FGVC | Cars196 |
|---|---|---|---|---|---|
| Qwen2-VL-2B | 56.0 | 54.8 | 66.4 | 45.9 | 56.8 |
| + SFT | 55.6 | 58.5 | 55.5 | 67.9 | 40.5 |
| + GRPO | 81.9 | 71.4 | 86.1 | 74.8 | 95.3 |
| + GPG | 89.0 | 79.3 | 90.8 | 88.5 | 97.5 |

Experiments on R1-V

Please refer to the training script: ./R1-V/src/scripts/run_grpo_GEOQA_qwen2.5_3b.sh

Table: Geometry reasoning results on GEOQA. GPG is better than GRPO.

| Models | GEOQA (test) |
|---|---|
| Qwen2.5-VL-3B-Instruct | 35.41 |
| + GRPO | 47.48 |
| + GPG | 51.33 |

Q&A

If you have any questions, please submit an issue or contact huanghailang.hhl<AT>alibaba-inc.com.

Citation

If you find GPG or this code useful, please cite:

@misc{chu2025GPG,
      title={GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning}, 
      author={Xiangxiang Chu and Hailang Huang and Xiao Zhang and Fei Wei and Yong Wang},
      year={2025},
      eprint={2504.02546},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.02546}, 
}

Acknowledgement

We sincerely thank the projects open-rs, VisualThinker-R1-Zero, Visual-RFT, R1-V, Open-R1, understand-r1-zero (Dr.GRPO), and Open-r1-multimodal for providing their open-source resources.
