This is the official implementation of the paper "Boosting MLLM Reasoning with Text-Debiased Hint-GRPO" [arXiv], which proposes two methods to improve the original GRPO algorithm for MLLMs.

⭐⭐ If you find the resources of this project (e.g., dataset, model weights, training code) helpful, please kindly leave a star here 😊.
Specifically, the original GRPO algorithm suffers from two problems:

- Low data utilization of GRPO: if all sampled answers are incorrect, the zero loss gradient ($\nabla_\theta\mathcal{L}=\mathbf{0}$) invalidates the sample.
- Text-bias of GRPO: the MLLM ignores the real image and instead generates outputs from an image it imagines from the text.

*(Figure: data utilization rate.)*

*(Figure: Qwen2-VL-7B's test accuracy w/ & w/o image in GRPO training.)*
To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. The framework of our method is shown below:
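To make the hint mechanism concrete, here is a minimal sketch of the idea (illustrative only: the function names and the hint-ratio schedule are assumptions, not the exact implementation in `train_model.py`). A sample whose rollouts all fail yields a zero gradient, so it receives a larger fraction of the ground-truth reasoning chain as a hint:

```python
def build_hinted_prompt(question: str, gt_cot_steps: list[str], hint_ratio: float) -> str:
    """Prepend the first `hint_ratio` fraction of ground-truth CoT steps as a hint."""
    n_hint = int(len(gt_cot_steps) * hint_ratio)
    if n_hint == 0:
        return question
    hint = " ".join(gt_cot_steps[:n_hint])
    return f"{question}\nHint: {hint}"


def adapt_hint_ratio(rollout_rewards: list[float], hint_ratio: float, step: float = 0.25) -> float:
    """If every rollout failed (zero advantage, hence zero gradient), raise the hint level."""
    if max(rollout_rewards) == 0.0:
        return min(1.0, hint_ratio + step)
    return hint_ratio
```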
- 📰 [2025.03.31] Our paper is available at arXiv.
- 🤗 [2025.03.31] Our model weights (Hint-GRPO-Qwen2-VL-7B and Hint-GRPO-Qwen2.5-VL-3B) are available at Hugging Face.
- 💬 [2025.06.26] Our paper is accepted by ICCV 2025!
- 🚀 [2025.07.01] Training code is available here.
- 🚀 [2025.07.01] Evaluation code is available here.
The Python packages required for this project are listed below:
```
torch==2.5.1
accelerate
codetiming
datasets
flash-attn==2.7.4
liger-kernel
mathruler
numpy
omegaconf
pandas
peft
pillow
pyarrow
pylatexenc
qwen-vl-utils
ray[default]
tensordict
torchdata
transformers==4.51.0
vllm==0.7.3
wandb
deepspeed==0.15.4
math_verify
```
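Assuming a pip-managed environment, they can be installed as below (`flash-attn` typically needs `torch` installed first, plus `--no-build-isolation` to compile against it):

```bash
pip install torch==2.5.1
pip install flash-attn==2.7.4 --no-build-isolation
pip install transformers==4.51.0 vllm==0.7.3 deepspeed==0.15.4 accelerate datasets \
    peft liger-kernel mathruler math_verify codetiming numpy omegaconf pandas \
    pillow pyarrow pylatexenc qwen-vl-utils "ray[default]" tensordict torchdata wandb
```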
- Train dataset. We choose the LLaVA-CoT dataset as the training dataset. Download it and place it at `/PATH/TO/LLaVA-CoT-100k/`.
- Test dataset. Following R1-V, download the Geo170K dataset and place it at `/PATH/TO/Geo170K/`, then unzip `images.zip`.
- Base models. Two popular pre-trained MLLMs are required for training: Qwen2-VL-7B-Instruct and Qwen2.5-VL-3B-Instruct.
- Evaluation. Our trained Hint-GRPO models, Hint-GRPO-Qwen2-VL-7B and Hint-GRPO-Qwen2.5-VL-3B, can be downloaded from Hugging Face.

The paths of these models will be `/PATH/TO/Qwen2-VL-7B-Instruct/`, `/PATH/TO/Qwen2.5-VL-3B-Instruct/`, `/PATH/TO/Hint-GRPO-Qwen2-VL-7B/`, and `/PATH/TO/Hint-GRPO-Qwen2.5-VL-3B/`, respectively.
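For reference, the layout implied by the paths above looks like this (the `/PATH/TO/` parent is a placeholder for your own directory):

```
/PATH/TO/
├── LLaVA-CoT-100k/            # train dataset
├── Geo170K/                   # test dataset (unzip images.zip inside)
├── Qwen2-VL-7B-Instruct/      # base model
├── Qwen2.5-VL-3B-Instruct/    # base model
├── Hint-GRPO-Qwen2-VL-7B/     # trained model, for evaluation
└── Hint-GRPO-Qwen2.5-VL-3B/   # trained model, for evaluation
```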
- Train Hint-GRPO based on the `Qwen2-VL-7B` model on a node of 8 GPUs (note that `/PATH/TO` in `--model_name_or_path` and `--data_path` should be changed to your own path):

  ```bash
  torchrun --nproc_per_node=7 \
      --nnodes 1 \
      --node_rank 0 \
      --master-addr localhost \
      --master-port 7001 \
      train_model.py \
      --deepspeed local_scripts/zero3_offload.json \
      --model_name_or_path /PATH/TO/Qwen2-VL-7B-Instruct/ \
      --data_path /PATH/TO/LLaVA-CoT-100k/ \
      --dataset_name LLaVA-CoT-100k \
      --max_prompt_length 8192 \
      --max_completion_length 768 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 1 \
      --num_groups 2 \
      --num_generations 4 \
      --logging_steps 1 \
      --bf16 \
      --torch_dtype bfloat16 \
      --gradient_checkpointing true \
      --attn_implementation flash_attention_2 \
      --max_pixels 2359296 \
      --save_total_limit 16 \
      --num_train_epochs 2 \
      --eval_strategy no \
      --reward_funcs accuracy \
      --use_vllm true \
      --vllm_gpu_memory_utilization 0.6 \
      --save_steps 500 \
      --save_only_model true \
      --output_dir output/exp_Qwen2-VL-7B
  ```
- Train Hint-GRPO based on the `Qwen2.5-VL-3B` model on a node of 8 GPUs (note that `/PATH/TO` in `--model_name_or_path` and `--data_path` should be changed to your own path):

  ```bash
  torchrun --nproc_per_node=7 \
      --nnodes 1 \
      --node_rank 0 \
      --master-addr localhost \
      --master-port 7001 \
      train_model.py \
      --deepspeed local_scripts/zero3_offload.json \
      --model_name_or_path /PATH/TO/Qwen2.5-VL-3B-Instruct/ \
      --data_path /PATH/TO/LLaVA-CoT-100k/ \
      --dataset_name LLaVA-CoT-100k \
      --max_prompt_length 8192 \
      --max_completion_length 768 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 1 \
      --num_groups 2 \
      --num_generations 4 \
      --logging_steps 1 \
      --bf16 \
      --torch_dtype bfloat16 \
      --gradient_checkpointing true \
      --attn_implementation flash_attention_2 \
      --max_pixels 2359296 \
      --save_total_limit 16 \
      --num_train_epochs 2 \
      --eval_strategy no \
      --reward_funcs accuracy \
      --use_vllm true \
      --vllm_gpu_memory_utilization 0.6 \
      --save_steps 500 \
      --save_only_model true \
      --output_dir output/exp_Qwen2.5-VL-3B
  ```
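Both commands use `--reward_funcs accuracy`. As a rough sketch of what such a binary accuracy reward typically looks like with the `mathruler` package pinned above (an assumption about the shape of the function; the actual reward is defined in `train_model.py`):

```python
from mathruler.grader import extract_boxed_content, grade_answer

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the final \\boxed{...} answer matches the ground truth, else 0.0."""
    predicted = extract_boxed_content(completion)  # pull the model's final answer
    return 1.0 if grade_answer(predicted, ground_truth) else 0.0
```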
- Inference based on the `Hint-GRPO-Qwen2-VL-7B` model (`/PATH/TO` should be changed to your own path):
  - Without text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2-VL-7B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.0
    ```

  - With text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2-VL-7B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.8
    ```
- Inference based on the `Hint-GRPO-Qwen2.5-VL-3B` model (`/PATH/TO` should be changed to your own path):
  - Without text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2.5-VL-3B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.0
    ```

  - With text-bias calibration:

    ```bash
    python test_model.py \
        --model_path /PATH/TO/Hint-GRPO-Qwen2.5-VL-3B/ \
        --data_path /PATH/TO/Geo170K/ \
        --prompt_path ./eval/prompts/geoqa_test_prompts.jsonl \
        --scale 0.8
    ```
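Here `--scale` controls the strength of text-bias calibration, with `--scale 0.0` disabling it. A minimal sketch of one plausible calibration rule, in the spirit of classifier-free guidance (the function name and formula are assumptions; the exact rule lives in `test_model.py`):

```python
import torch

@torch.no_grad()
def calibrate_logits(logits_with_image: torch.Tensor,
                     logits_text_only: torch.Tensor,
                     scale: float) -> torch.Tensor:
    """Push next-token logits toward the image-conditioned prediction.

    With scale = 0.0 this reduces to ordinary decoding; scale > 0 amplifies
    what the real image contributes beyond the text-only (imagined-image)
    prediction, counteracting text-bias.
    """
    return logits_with_image + scale * (logits_with_image - logits_text_only)
```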