- GPG-Open-RS1: an RL model trained with GPG on the Open-r1 dataset, using DeepSeek-R1-Distill-Qwen-1.5B as the base model.
- GPG-7B: an RL model trained with GPG on the simplelr_qwen_level3to5 dataset, using Qwen2.5-Math-7B as the base model.
Clone this repository:

```bash
git clone git@github.com:AMAP-ML/GPG.git
cd GPG
```

Then set up the sub-repositories you need and install their required packages.
Please refer to the training script `./open-rs/train.sh` and the recipes in `./open-rs/recipes`.
The results are as follows:
Table: The zero-shot pass@1 performance of the 1.5B models distilled from DeepSeek-R1 across five mathematical reasoning benchmarks.
$\dagger$: reproduced results using the released code. $\ddagger$: results from open-rs.

| Distilled 1.5B Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 48.9 | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 |
| Still-3-1.5B-Preview | 51.6 | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 |
| Open-RS1 $^\dagger$ | 53.1 | 33.3 | 83.8 | 67.5 | 29.8 | 50.9 |
| Open-RS3 $^\dagger$ | 52.0 | 26.7 | 85.4 | 70.0 | 27.9 | 50.2 |
| GPG-RS1 | 55.7 | 33.3 | 87.6 | 77.5 | 29.4 | 50.5 |
| GPG-RS3 | 55.5 | 33.3 | 85.0 | 80.0 | 26.8 | 52.4 |
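As a reference for the metric, below is a minimal sketch of zero-shot pass@1: each problem receives a single completion, and the score is the percentage answered correctly. The function names (`pass_at_1`, `generate_answer`, `is_correct`) are illustrative placeholders, not this repo's evaluator.

```python
# Minimal sketch of zero-shot pass@1 (illustrative; not this repo's evaluator).
# One completion per problem; the score is the percentage answered correctly.
def pass_at_1(problems, generate_answer, is_correct):
    solved = sum(is_correct(p, generate_answer(p)) for p in problems)
    return 100.0 * solved / len(problems)
```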
Please refer to the training script `./open-r1/train.sh`.
Table: The zero-shot pass@1 performance of the 7B models across five mathematical reasoning benchmarks.
$\dagger$: reproduced results using the released code. $\ddagger$: results from open-rs. $^\star$: results from Dr.GRPO.

| 7B Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| Qwen-2.5-Math-7B-Instruct $^\ddagger$ | 43.8 | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 |
| Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
| Qwen2.5-Math-7B (no template) $^\star$ | 38.2 | 0.2 | 69.0 | 45.8 | 21.3 | 34.7 |
| rStar-Math-7B | - | 26.7 | 78.4 | 47.5 | - | 47.1 |
| Eurus-2-7B-PRIME | 48.9 | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 |
| Oat-Zero-7B | 51.4 | 43.3 | 80.0 | 62.7 | 30.1 | 41.0 |
| Oat-Zero-7B $^\dagger$ | 47.8 | 30.0 | 80.6 | 55.4 | 29.0 | 44.0 |
| OpenReasoner-Zero-7B @ 8k | 45.9 | 13.3 | 82.4 | 54.2 | 31.6 | 47.9 |
| SimpleRL-Zero-7B $^\star$ | 46.6 | 26.7 | 78.2 | 60.2 | 27.6 | 40.3 |
| GPG-7B | 57.7 | 36.7 | 84.6 | 82.5 | 39.0 | 45.8 |
Table: Math reasoning results on the Qwen2.5-Math-7B model.
$\dagger$: reproduced results using the released code.

| Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
| GRPO | 43.7 | 16.7 | 73.4 | 62.5 | 30.2 | 35.7 |
| GPG ($F_{norm}=1$, $\alpha=1$) | 43.9 | 23.3 | 76.3 | 52.5 | 30.1 | 37.4 |
| GPG ($F_{norm}=\operatorname{std}\{R(o)\}$, $\alpha=1$) | 45.3 | 23.3 | 73.6 | 60.0 | 30.5 | 39.3 |
| GPG ($F_{norm}=1$, $\alpha=\frac{B}{B-M}$) | 47.8 | 30.0 | 75.0 | 62.5 | 33.1 | 38.2 |
| GPG ($F_{norm}=1$, $\alpha=\frac{B}{B-M}$, $\beta_{th}=0.6$) | 48.3 | 30.0 | 76.2 | 62.5 | 34.2 | 39.0 |
| Dr. GRPO $^\dagger$ | 43.7 | 26.7 | 74.6 | 50.0 | 30.1 | 37.3 |
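To make the ablation knobs concrete, below is a minimal sketch of a group-advantage computation, assuming $F_{norm}$ divides the mean-centered group reward and $\alpha = \frac{B}{B-M}$ rescales the batch to compensate for the $M$ prompts whose response groups carry no learning signal (all rewards in the group equal). This is an illustrative reading of the table, not the repo's implementation, and the extra threshold in the $\beta_{th}=0.6$ variant is omitted.

```python
import torch

def gpg_advantages(rewards: torch.Tensor, use_std_norm: bool, rescale: bool):
    """Illustrative group-advantage sketch (not the repo's implementation).

    rewards: (B, G) tensor -- B prompts, G sampled responses per prompt.
    use_std_norm: if True, F_norm = std{R(o)} per group; otherwise F_norm = 1.
    rescale: if True, apply alpha = B / (B - M), where M counts groups whose
             rewards are all equal and hence contribute zero gradient.
    """
    adv = rewards - rewards.mean(dim=1, keepdim=True)  # mean-centered reward
    if use_std_norm:
        adv = adv / (rewards.std(dim=1, keepdim=True) + 1e-8)
    if rescale:
        B = rewards.shape[0]
        M = int((rewards.std(dim=1) == 0).sum())       # degenerate groups
        if B > M:
            adv = adv * (B / (B - M))                  # alpha = B / (B - M)
    return adv
```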
Please refer to the training script `./VisualThinker-R1-Zero/src/open-r1-multimodal/run_grpo_SAT.sh`.
The results are as follows:
Table: Visual reasoning results on CV-Bench. GPG training on the base model yields overall better performance than both GRPO training and the base model.

| Models | Total | Count | Relation | Depth | Distance |
|---|---|---|---|---|---|
| Qwen2-VL-2B | 31.38 | 54.69 | 22.46 | 0.16 | 31.66 |
| + SFT | 57.84 | 60.02 | 68.92 | 55.00 | 45.83 |
| + GRPO | 59.47 | 59.64 | 66.76 | 54.16 | 56.66 |
| + GPG | 76.15 | 66.62 | 83.23 | 81.66 | 75.50 |
Please refer to the training scripts in `./Visual-RFT/src/scripts/`.
The results are as follows:
Table: Reasoning grounding results on LISA. GPG surpasses GRPO in reasoning grounding with 239 training images.
| Models | mIoU (test) | mIoU (val) | gIoU (test) |
|---|---|---|---|
| Qwen2-VL-2B | 26.9 | 30.1 | 25.3 |
| + SFT | 28.3 | 29.7 | 25.3 |
| + GRPO | 37.6 | 34.4 | 34.4 |
| + GPG | 51.8 | 51.3 | 50.4 |
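For context on the headers, mIoU and gIoU are intersection-over-union scores aggregated over the test/val splits; the exact aggregation follows the LISA benchmark. Below is a minimal sketch of the per-box IoU underlying them and a per-sample mean, assuming an (x1, y1, x2, y2) box format; names are illustrative.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-8)

def mean_iou(preds, gts):
    """Per-sample IoU averaged over a split, reported as a percentage."""
    return 100.0 * sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```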
Table: 4-shot results on four fine-grained classification datasets. GPG shows consistently better results than GRPO on all four datasets.

| Models | Average | Flower102 | Pets37 | FGVC | Cars196 |
|---|---|---|---|---|---|
| Qwen2-VL-2B | 56.0 | 54.8 | 66.4 | 45.9 | 56.8 |
| + SFT | 55.6 | 58.5 | 55.5 | 67.9 | 40.5 |
| + GRPO | 81.9 | 71.4 | 86.1 | 74.8 | 95.3 |
| + GPG | 89.0 | 79.3 | 90.8 | 88.5 | 97.5 |
Please refer to the training script `./R1-V/src/scripts/run_grpo_GEOQA_qwen2.5_3b.sh`.
Table: Geometry reasoning results on GEOQA. GPG outperforms GRPO.

| Models | GEOQA (Test) |
|---|---|
| Qwen2.5-VL-3B-Instruct | 35.41 |
| + GRPO | 47.48 |
| + GPG | 51.33 |
If you have any questions, please submit an issue or contact huanghailang.hhl<AT>alibaba-inc.com.
If you find GPG or this code useful, please cite:
```bibtex
@misc{chu2025GPG,
  title={GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning},
  author={Xiangxiang Chu and Hailang Huang and Xiao Zhang and Fei Wei and Yong Wang},
  year={2025},
  eprint={2504.02546},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.02546},
}
```

We sincerely thank the projects open-rs, VisualThinker-R1-Zero, Visual-RFT, R1-V, Open-R1, understand-r1-zero (Dr.GRPO), and Open-r1-multimodal for providing their open-source resources.