GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

📖 Paper | Work in progress.
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). It achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in the figure below, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks.

Figure: Comparison of various RL methods and an overview of the GPG method.
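In code, the update GPG performs is just a plain policy-gradient step on group-relative advantages. The following is a minimal sketch, not the repository's actual training code: tensor names and shapes are illustrative, and the placement of the normalization factor $F_{norm}$ and the correction factor $\alpha=\frac{B}{B-M}$ (both appear in the ablation table under the 7B results) reflects our reading of the description above.

```python
import torch

def gpg_loss(log_probs: torch.Tensor, rewards: torch.Tensor,
             use_std_norm: bool = False, eps: float = 1e-6) -> torch.Tensor:
    """Hedged sketch of a GPG-style loss (illustrative, not the official code).

    log_probs: (B, G) summed log-probability of each sampled response under the
               current policy, for B prompts with G responses per prompt (G > 1).
    rewards:   (B, G) scalar reward of each response.
    """
    # Group-relative advantage: center each response's reward within its group.
    adv = rewards - rewards.mean(dim=1, keepdim=True)
    if use_std_norm:
        # Optional normalization factor F_norm = std{R(o)} from the ablation table.
        adv = adv / (rewards.std(dim=1, keepdim=True) + eps)

    # alpha = B / (B - M): groups whose responses all receive the same reward have
    # zero advantage and contribute no gradient, so the loss is rescaled by the
    # fraction of groups that actually carry a learning signal.
    B = rewards.shape[0]
    M = int((rewards.std(dim=1) < eps).sum())
    alpha = B / max(B - M, 1)

    # Vanilla policy gradient: no critic, no reference model, no KL penalty,
    # and no clipped surrogate ratio.
    return -(alpha * adv.detach() * log_probs).mean()
```

Compared with GRPO, the only moving parts are the within-group statistics and the $\alpha$ correction; there is no value network, no reference model, and no KL term to tune.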

Resources

🤗 Models

  1. GPG-Open-RS1: an RL model trained with GPG on the Open-r1 dataset, using DeepSeek-R1-Distill-Qwen-1.5B as the base model.
  2. GPG-7B: an RL model trained with GPG on the simplelr_qwen_level3to5 dataset, using Qwen2.5-Math-7B as the base model (see the loading sketch below).
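Both checkpoints are ordinary causal language models, so they can be loaded with the standard transformers API. The repository id below is a placeholder, not a confirmed Hub path; substitute the actual location of the release you want.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: replace with the actual Hugging Face path of the GPG release.
model_id = "GPG-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 10 positive integers? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```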

Usage

Environment Installation

Clone this repository.

git clone git@github.com:AMAP-ML/GPG.git
cd GPG

Then follow the setup instructions of the sub-repositories you need and install their required packages.

Experiments on unimodal tasks

Please refer to the training script ./open-rs/train.sh and the recipes under ./open-rs/recipes.

The results are as follows:

Table: The zero-shot pass@1 performance of the 1.5B models distilled by DeepSeek-R1 across five mathematical reasoning benchmarks. $\dagger$: reproduced results using the released code. $\ddagger$: results from open-rs.

| Distilled 1.5B Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 48.9 | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 |
| Still-3-1.5B-Preview | 51.6 | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 |
| Open-RS1 $^\dagger$ | 53.1 | 33.3 | 83.8 | 67.5 | 29.8 | 50.9 |
| Open-RS3 $^\dagger$ | 52.0 | 26.7 | 85.4 | 70.0 | 27.9 | 50.2 |
| GPG-RS1 | 55.7 | 33.3 | 87.6 | 77.5 | 29.4 | 50.5 |
| GPG-RS3 | 55.5 | 33.3 | 85.0 | 80.0 | 26.8 | 52.4 |

Please refer to the training script: ./open-r1/train.sh

Table: The zero-shot pass@1 performance of the 7B models across five mathematical reasoning benchmarks. $\dagger$: reproduced results using the released code. $\ddagger$: results from open-rs. $^\star$: results from Dr.GRPO.

| 7B Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| Qwen-2.5-Math-7B-Instruct $^\ddagger$ | 43.8 | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 |
| Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
| Qwen2.5-Math-7B (no template) $^\star$ | 38.2 | 0.2 | 69.0 | 45.8 | 21.3 | 34.7 |
| rStar-Math-7B | - | 26.7 | 78.4 | 47.5 | - | 47.1 |
| Eurus-2-7B-PRIME | 48.9 | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 |
| Oat-Zero-7B | 51.4 | 43.3 | 80.0 | 62.7 | 30.1 | 41.0 |
| Oat-Zero-7B $^\dagger$ | 47.8 | 30.0 | 80.6 | 55.4 | 29.0 | 44.0 |
| OpenReasoner-Zero-7B @ 8k | 45.9 | 13.3 | 82.4 | 54.2 | 31.6 | 47.9 |
| SimpleRL-Zero-7B $^\star$ | 46.6 | 26.7 | 78.2 | 60.2 | 27.6 | 40.3 |
| GPG-7B | 57.7 | 36.7 | 84.6 | 82.5 | 39.0 | 45.8 |

Table: Math reasoning results on the Qwen2.5-Math-7B model. $\dagger$: reproduced using the released code.

| Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
| GRPO | 43.7 | 16.7 | 73.4 | 62.5 | 30.2 | 35.7 |
| GPG ($F_{norm}=1, \alpha=1$) | 43.9 | 23.3 | 76.3 | 52.5 | 30.1 | 37.4 |
| GPG ($F_{norm}=\mathrm{std}\{R(o)\}, \alpha=1$) | 45.3 | 23.3 | 73.6 | 60.0 | 30.5 | 39.3 |
| GPG ($F_{norm}=1, \alpha=\frac{B}{B-M}$) | 47.8 | 30.0 | 75.0 | 62.5 | 33.1 | 38.2 |
| GPG ($F_{norm}=1, \alpha=\frac{B}{B-M}, \beta_{th}=0.6$) | 48.3 | 30.0 | 76.2 | 62.5 | 34.2 | 39.0 |
| Dr. GRPO $^\dagger$ | 43.7 | 26.7 | 74.6 | 50.0 | 30.1 | 37.3 |
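For intuition on the $\alpha$ column: with $B$ prompt groups per batch, of which $M$ yield identical rewards for every sampled response (and hence zero advantage), $\alpha=\frac{B}{B-M}$ rescales the loss as if only the informative groups were in the batch. A toy computation with illustrative numbers (not taken from the paper):

```python
# Toy illustration of the alpha = B / (B - M) correction (illustrative numbers).
B = 8                    # prompt groups in the batch
M = 2                    # groups where every sampled response got the same reward
alpha = B / (B - M)
print(alpha)             # 1.333... -> the 6 informative groups are up-weighted
```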

Experiments on multimodal tasks

Experiments on VisualThinker-R1-Zero

Please refer to the training script: ./VisualThinker-R1-Zero/src/open-r1-multimodal/run_grpo_SAT.sh

The results are as follows:

Table: Visual reasoning results on CV-Bench, showing that GPG training on the base model achieves better overall performance than both GRPO training and the base model itself.

| Models | Total | Count | Relation | Depth | Distance |
|---|---|---|---|---|---|
| Qwen2-VL-2B | 31.38 | 54.69 | 22.46 | 0.16 | 31.66 |
| + SFT | 57.84 | 60.02 | 68.92 | 55.00 | 45.83 |
| + GRPO | 59.47 | 59.64 | 66.76 | 54.16 | 56.66 |
| + GPG | 76.15 | 66.62 | 83.23 | 81.66 | 75.50 |

Experiments on Visual-RFT

Please refer to the training script: ./Visual-RFT/src/scripts/

The results are as follows:

Table: Reasoning grounding results on LISA. GPG surpasses GRPO in reasoning grounding with 239 training images.

| Models | mIoU (test) | mIoU (val) | gIoU (test) |
|---|---|---|---|
| Qwen2-VL-2B | 26.9 | 30.1 | 25.3 |
| + SFT | 28.3 | 29.7 | 25.3 |
| + GRPO | 37.6 | 34.4 | 34.4 |
| + GPG | 51.8 | 51.3 | 50.4 |

Table: 4-shot results on four fine-grained classification datasets. GPG shows consistently better results than GRPO on all four datasets.

| Models | Average | Flower102 | Pets37 | FGVC | Cars196 |
|---|---|---|---|---|---|
| Qwen2-VL-2B | 56.0 | 54.8 | 66.4 | 45.9 | 56.8 |
| + SFT | 55.6 | 58.5 | 55.5 | 67.9 | 40.5 |
| + GRPO | 81.9 | 71.4 | 86.1 | 74.8 | 95.3 |
| + GPG | 89.0 | 79.3 | 90.8 | 88.5 | 97.5 |

Experiments on R1-V

Please refer to the training script: ./R1-V/src/scripts/run_grpo_GEOQA_qwen2.5_3b.sh

Table: Geometry reasoning results on GEOQA. GPG is better than GRPO.

| Models | GEOQA (test) |
|---|---|
| Qwen2.5-VL-3B-Instruct | 35.41 |
| + GRPO | 47.48 |
| + GPG | 51.33 |

Q&A

If you have any questions, please submit an issue or contact huanghailang.hhl<AT>alibaba-inc.com.

Citation

If you find GPG or this code useful, please cite:

@misc{chu2025GPG,
      title={GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning}, 
      author={Xiangxiang Chu and Hailang Huang and Xiao Zhang and Fei Wei and Yong Wang},
      year={2025},
      eprint={2504.02546},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.02546}, 
}

Acknowledgement

We sincerely thank the projects open-rs, VisualThinker-R1-Zero, Visual-RFT, R1-V, Open-R1, understand-r1-zero (Dr.GRPO), and Open-r1-multimodal for providing their open-source resources.
