- [2025.07.22] Please check this note to quickly locate the code we modified from the original Verl. We will continue working to make the code more user-friendly.
- [2025.07.18] We release the paper and this GitHub repo.
- [2025.07.17] All data and models can be found here.
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [Paper]
Senqiao Yang,
Junyi Li,
Xin Lai,
Bei Yu,
Hengshuang Zhao,
Jiaya Jia
- Our VisionThink leverages reinforcement learning to autonomously learn whether to reduce visual tokens. Compared with traditional efficient VLM approaches, our method achieves significant improvements on fine-grained benchmarks, such as OCR-related tasks.
- VisionThink improves performance on General VQA tasks while reducing visual tokens by 50%, achieving 102% of the original model’s performance across nine benchmarks.
- VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens (a rough sketch of this idea follows below). We hope this inspires further research into Efficient Reasoning Vision Language Models.
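As a rough illustration of the efficiency mechanism (this is not the repo's code; the resize factor and the use of Pillow are assumptions), downsampling an image before it reaches the vision encoder shrinks the visual-token count roughly quadratically in the scale factor:

```python
from PIL import Image

def downsample_image(path: str, scale: float = 0.5) -> Image.Image:
    """Resize an image to `scale` of its original size per side.

    Halving each side cuts the pixel count (and hence the number of visual
    tokens a patch-based encoder produces) to roughly a quarter; the policy
    then decides whether this reduced view is sufficient to answer.
    """
    img = Image.open(path)
    new_size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(new_size)
```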
The environment setup follows Verl.
git clone https://github.com/dvlab-research/VisionThink.git
cd VisionThink
conda create -n visionthink python=3.11 -y
conda activate visionthink
# veRL
pip3 install -e .
# flash-attn
pip3 install flash-attn --no-build-isolation
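Optionally, a quick import check (a minimal sketch; the `verl` package name is assumed from the editable install above):

```python
# Quick sanity check that the editable install succeeded; run inside the
# `visionthink` conda environment. The package name `verl` is assumed from
# the `pip3 install -e .` step above.
import flash_attn
import verl

print("flash-attn:", flash_attn.__version__)
print("verl imported OK")
```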
If you want to use Qwen3 as the judge model, additionally install:
pip install -U tensordict
pip install transformers==4.51.0
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-General-Train --local-dir datasets/VisionThink-General-Train
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-General-Val --local-dir datasets/VisionThink-General-Val
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Train --local-dir datasets/VisionThink-Smart-Train
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Val --local-dir datasets/VisionThink-Smart-Val
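To sanity-check a download, you can peek at the data with the Hugging Face `datasets` library (a hedged sketch; the split names and fields are whatever the downloaded files actually contain):

```python
from datasets import load_dataset

# Point `load_dataset` at the locally downloaded folder (same path as --local-dir).
# The on-disk format and split names are auto-detected, so they are assumptions here.
ds = load_dataset("datasets/VisionThink-General-Train")

split = next(iter(ds))       # e.g. "train"; actual split names may differ
print(ds)                    # number of examples per split
print(ds[split][0].keys())   # field names of a single example
```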
To use GPT as the reward model, first set the following environment variables:
export AZURE_API_KEY=<your_api_key>
export AZURE_ENDPOINT=<your_endpoint>
export AZURE_API_VERSION=<your_api_version>
Then run:
bash scripts/run_generalqa_4o_judge.sh
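For reference, here is a minimal, hypothetical sketch of how a GPT-4o judge could be queried with the credentials above to produce a binary reward; the deployment name, prompt, and scoring rule are illustrative and not the repo's exact reward logic:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_API_KEY"],
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_version=os.environ["AZURE_API_VERSION"],
)

def judge_reward(question: str, ground_truth: str, prediction: str) -> float:
    """Ask the judge model whether the prediction matches the ground truth.

    Returns 1.0 for a judged-correct answer, 0.0 otherwise. The deployment
    name "gpt-4o" and the prompt wording are placeholders.
    """
    prompt = (
        "You are grading a VQA answer.\n"
        f"Question: {question}\nGround truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name; replace with your own deployment
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    verdict = resp.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("correct") else 0.0
```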
For ease of use, we also support using Qwen3 as the reward model. The relevant logic is implemented in RewardModelWorker, and no additional setup is needed:
bash scripts/run_generalqa_qwen3_judge.sh
After training completes, convert the model to Hugging Face format using:
python scripts/model_merger.py --local_dir <your_model_path>
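The merged checkpoint can then be loaded like any Qwen2.5-VL model in Transformers; a minimal sketch (the path is a placeholder for wherever the converted weights were written):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder path: the directory containing the converted Hugging Face weights.
model_path = "<your_merged_model_path>"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
print(model.config.model_type)  # should report a Qwen2.5-VL model type
```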
Our trained model, VisionThink-General, based on Qwen2.5-VL-7B, is available here.
To train the efficient reasoning model (VisionThink-Efficient), run the following script:
bash scripts/run_efficient_gpt4o_judge.sh
Our trained model, VisionThink-Efficient, based on Qwen2.5-VL-7B, is available here.
The evaluation code follows the structure of Lmms-Eval.
Please copy scripts/vllm_tool.py into the lmms_eval/models/ directory of your Lmms-Eval installation.
The evaluation scripts are similar to those used for other models. For example:
CUDA_VISIBLE_DEVICES="0,1,2,3" python3 -m lmms_eval \
--model vllm_tool \
--model_args model_version=${MODEL_DIR},tensor_parallel_size=4,trust_remote_code=True,max_images=2,prompt=tool_call,enable_tool_call=True,downsample_image=True \
--tasks ${TASKS} \
--batch_size 1024 \
--log_samples \
--log_samples_suffix vllm \
--output_path ./lmms_eval_logs/${MODEL_NAME} \
--verbosity DEBUG
If you find this project useful in your research, please consider citing:
This work is highly motivated by our previous effort on efficient VLMs, VisionZip, which explores token compression for faster inference.
@article{yang2025visionthink,
  title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
  author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
  journal={arXiv preprint arXiv:2507.13348},
  year={2025}
}
@article{yang2024visionzip,
  title={VisionZip: Longer is Better but Not Necessary in Vision Language Models},
  author={Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2412.04467},
  year={2024}
}
- This work is built upon Verl, EasyR1, Lmms-Eval, and MMSearch-R1. We thank them for their excellent open-source contributions.
- We also thank Qwen, DeepSeek-R1, VisionZip, FastV, SparseVLM, and others, whose contributions have provided valuable insights.
- VisionThink is licensed under the Apache License 2.0.