This is the official implementation of the paper "Boosting MLLM Reasoning with Text-Debiased Hint-GRPO" [arXiv], which proposes two methods to improve the original GRPO algorithm for MLLMs.

⭐⭐ If you find the resources of this project (e.g., dataset, model weights, training code) helpful, please kindly leave a star here 😊.
Specifically, the original GRPO algorithm suffers from two problems:

- Low data utilization of GRPO: if all sampled answers are incorrect, the zero loss gradient ($\nabla_\theta\mathcal{L}=\mathbf{0}$) invalidates the sample.
- Text-bias of GRPO: the MLLM ignores the real image and instead generates outputs from an image it imagines from the text.

*(Figure: data utilization rate.)*

*(Figure: Qwen2-VL-7B's test accuracy w/ & w/o image in GRPO training.)*
To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. The framework of our method is shown below:
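To make the hint mechanism concrete, here is a minimal sketch of the idea (illustrative only: the function names and the hint-ratio schedule are assumptions, not the exact implementation in `train_model.py`). A sample whose rollouts all fail yields a zero gradient, so it receives a larger fraction of the ground-truth reasoning chain as a hint:

```python
def build_hinted_prompt(question: str, gt_cot_steps: list[str], hint_ratio: float) -> str:
    """Prepend the first `hint_ratio` fraction of ground-truth CoT steps as a hint."""
    n_hint = int(len(gt_cot_steps) * hint_ratio)
    if n_hint == 0:
        return question
    hint = " ".join(gt_cot_steps[:n_hint])
    return f"{question}\nHint: {hint}"


def adapt_hint_ratio(rollout_rewards: list[float], hint_ratio: float, step: float = 0.25) -> float:
    """If every rollout failed (zero advantage, hence zero gradient), raise the hint level."""
    if max(rollout_rewards) == 0.0:
        return min(1.0, hint_ratio + step)
    return hint_ratio
```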
- 📰 [2025.03.31] Our paper is available at arXiv.
- 🤗 [2025.03.31] Our model weights (Hint-GRPO-Qwen2-VL-7B and Hint-GRPO-Qwen2.5-VL-3B) are available at Hugging Face.
- 💬 [2025.06.26] Our paper is accepted by ICCV 2025!
- 🚀 [2025.07.01] Training code is available here.
- 🚀 [2025.07.01] Evaluation code is available here.
The Python packages required for this project are listed below:
```
torch==2.5.1
accelerate
codetiming
datasets
flash-attn==2.7.4
liger-kernel
mathruler
numpy
omegaconf
pandas
peft
pillow
pyarrow
pylatexenc
qwen-vl-utils
ray[default]
tensordict
torchdata
transformers==4.51.0
vllm==0.7.3
wandb
deepspeed==0.15.4
math_verify
```
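Assuming a pip-managed environment, they can be installed as below (`flash-attn` typically needs `torch` installed first, plus `--no-build-isolation` to compile against it):

```bash
pip install torch==2.5.1
pip install flash-attn==2.7.4 --no-build-isolation
pip install transformers==4.51.0 vllm==0.7.3 deepspeed==0.15.4 accelerate datasets \
    peft liger-kernel mathruler math_verify codetiming numpy omegaconf pandas \
    pillow pyarrow pylatexenc qwen-vl-utils "ray[default]" tensordict torchdata wandb
```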
- Train dataset. We choose the LLaVA-CoT dataset as the training dataset. Download it and place it at `/PATH/TO/LLaVA-CoT-100k/`.
- Test dataset. Following R1-V, download the Geo170K dataset and place it at `/PATH/TO/Geo170K/`, then unzip `images.zip`.
- Base models. Two popular pre-trained MLLMs are required for training: Qwen2-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct.
- Evaluation. Our trained Hint-GRPO models, Hint-GRPO-Qwen2-VL-7B and Hint-GRPO-Qwen2.5-VL-3B, can be downloaded from Hugging Face.

The paths of these models will be `/PATH/TO/Qwen2-VL-7B-Instruct/`, `/PATH/TO/Qwen2.5-VL-3B-Instruct/`, `/PATH/TO/Hint-GRPO-Qwen2-VL-7B/`, and `/PATH/TO/Hint-GRPO-Qwen2.5-VL-3B/`, respectively.
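For reference, the layout implied by the paths above looks like this (the `/PATH/TO/` parent is a placeholder for your own directory):

```
/PATH/TO/
├── LLaVA-CoT-100k/            # train dataset
├── Geo170K/                   # test dataset (unzip images.zip inside)
├── Qwen2-VL-7B-Instruct/      # base model
├── Qwen2.5-VL-3B-Instruct/    # base model
├── Hint-GRPO-Qwen2-VL-7B/     # trained model, for evaluation
└── Hint-GRPO-Qwen2.5-VL-3B/   # trained model, for evaluation
```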
- Train Hint-GRPO based on the `Qwen2-VL-7B` model on a node of 8 GPUs (note that `/PATH/TO` in `--model_name_or_path` and `--data_path` should be changed to your own path):

  ```bash
  torchrun --nproc_per_node=7 \
      --nnodes 1 \
      --node_rank 0 \
      --master-addr localhost \
      --master-port 7001 \
      train_model.py \
      --deepspeed local_scripts/zero3_offload.json \
      --model_name_or_path /PATH/TO/Qwen2-VL-7B-Instruct/ \
      --data_path /PATH/TO/LLaVA-CoT-100k/ \
      --dataset_name LLaVA-CoT-100k \
      --max_prompt_length 8192 \
      --max_completion_length 768 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 1 \
      --num_groups 2 \
      --num_generations 4 \
      --logging_steps 1 \
      --bf16 \
      --torch_dtype bfloat16 \
      --gradient_checkpointing true \
      --attn_implementation flash_attention_2 \
      --max_pixels 2359296 \
      --save_total_limit 16 \
      --num_train_epochs 2 \
      --eval_strategy no \
      --reward_funcs accuracy \
      --use_vllm true \
      --vllm_gpu_memory_utilization 0.6 \
      --save_steps 500 \
      --save_only_model true \
      --output_dir output/exp_Qwen2-VL-7B
  ```
- Train Hint-GRPO based on the `Qwen2.5-VL-3B` model on a node of 8 GPUs (note that `/PATH/TO` in `--model_name_or_path` and `--data_path` should be changed to your own path):

  ```bash
  torchrun --nproc_per_node=7 \
      --nnodes 1 \
      --node_rank 0 \
      --master-addr localhost \
      --master-port 7001 \
      train_model.py \
      --deepspeed local_scripts/zero3_offload.json \
      --model_name_or_path /PATH/TO/Qwen2.5-VL-3B-Instruct/ \
      --data_path /PATH/TO/LLaVA-CoT-100k/ \
      --dataset_name LLaVA-CoT-100k \
      --max_prompt_length 8192 \
      --max_completion_length 768 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 1 \
      --num_groups 2 \
      --num_generations 4 \
      --logging_steps 1 \
      --bf16 \
      --torch_dtype bfloat16 \
      --gradient_checkpointing true \
      --attn_implementation flash_attention_2 \
      --max_pixels 2359296 \
      --save_total_limit 16 \
      --num_train_epochs 2 \
      --eval_strategy no \
      --reward_funcs accuracy \
      --use_vllm true \
      --vllm_gpu_memory_utilization 0.6 \
      --save_steps 500 \
      --save_only_model true \
      --output_dir output/exp_Qwen2.5-VL-3B
  ```
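Both commands use `--reward_funcs accuracy`. As a rough sketch of what such a binary accuracy reward typically looks like with the `mathruler` package pinned above (an assumption about the shape of the function; the actual reward is defined in `train_model.py`):

```python
from mathruler.grader import extract_boxed_content, grade_answer

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the final \\boxed{...} answer matches the ground truth, else 0.0."""
    predicted = extract_boxed_content(completion)  # pull the model's final answer
    return 1.0 if grade_answer(predicted, ground_truth) else 0.0
```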
- Inference based on the `Hint-GRPO-Qwen2-VL-7B` model (`/PATH/TO` should be changed to your own path):
  - Without text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2-VL-7B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.0
    ```

  - With text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2-VL-7B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.8
    ```
- Inference based on the `Hint-GRPO-Qwen2.5-VL-3B` model (`/PATH/TO` should be changed to your own path):
  - Without text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2.5-VL-3B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.0
    ```

  - With text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2.5-VL-3B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.8
    ```
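Here `--scale` controls the strength of text-bias calibration, with `--scale 0.0` disabling it. A minimal sketch of one plausible calibration rule, in the spirit of classifier-free guidance (the function name and formula are assumptions; the exact rule lives in `test_model.py`):

```python
import torch

@torch.no_grad()
def calibrate_logits(logits_with_image: torch.Tensor,
                     logits_text_only: torch.Tensor,
                     scale: float) -> torch.Tensor:
    """Push next-token logits toward the image-conditioned prediction.

    With scale = 0.0 this reduces to ordinary decoding; scale > 0 amplifies
    what the real image contributes beyond the text-only (imagined-image)
    prediction, counteracting text-bias.
    """
    return logits_with_image + scale * (logits_with_image - logits_text_only)
```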