- [2025.07.22] Please check this note to quickly locate the code we modified from the original Verl. We will continue working to make the code more user-friendly.
- [2025.07.18] We release the paper and this GitHub repo.
- [2025.07.17] All data and models can be found here.
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [Paper]
Senqiao Yang,
Junyi Li,
Xin Lai,
Bei Yu,
Hengshuang Zhao,
Jiaya Jia
- Our VisionThink leverages reinforcement learning to autonomously learn whether to reduce visual tokens. Compared with traditional efficient VLM approaches, our method achieves significant improvements on fine-grained benchmarks, such as OCR-related tasks.
- VisionThink improves performance on General VQA tasks while reducing visual tokens by 50%, achieving 102% of the original model’s performance across nine benchmarks.
- VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens (a rough sketch of this idea follows below). We hope this inspires further research into Efficient Reasoning Vision Language Models.
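As a rough illustration of the efficiency mechanism (this is not the repo's code; the resize factor and the use of Pillow are assumptions), downsampling an image before it reaches the vision encoder shrinks the visual-token count roughly quadratically in the scale factor:

```python
from PIL import Image

def downsample_image(path: str, scale: float = 0.5) -> Image.Image:
    """Resize an image to `scale` of its original size per side.

    Halving each side cuts the pixel count (and hence the number of visual
    tokens a patch-based encoder produces) to roughly a quarter; the policy
    then decides whether this reduced view is sufficient to answer.
    """
    img = Image.open(path)
    new_size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(new_size)
```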
The environment setup follows Verl.
git clone https://github.com/dvlab-research/VisionThink.git
cd VisionThink
conda create -n visionthink python=3.11 -y
conda activate visionthink
# veRL
pip3 install -e .
# flash-attn
pip3 install flash-attn --no-build-isolation
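Optionally, a quick import check (a minimal sketch; the `verl` package name is assumed from the editable install above):

```python
# Quick sanity check that the editable install succeeded; run inside the
# `visionthink` conda environment. The package name `verl` is assumed from
# the `pip3 install -e .` step above.
import flash_attn
import verl

print("flash-attn:", flash_attn.__version__)
print("verl imported OK")
```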
If you want to use Qwen3 as the judge model, additionally install:
pip install -U tensordict
pip install transformers==4.51.0
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-General-Train --local-dir datasets/VisionThink-General-Train
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-General-Val --local-dir datasets/VisionThink-General-Val
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Train --local-dir datasets/VisionThink-Smart-Train
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Val --local-dir datasets/VisionThink-Smart-Val
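To sanity-check a download, you can peek at the data with the Hugging Face `datasets` library (a hedged sketch; the split names and fields are whatever the downloaded files actually contain):

```python
from datasets import load_dataset

# Point `load_dataset` at the locally downloaded folder (same path as --local-dir).
# The on-disk format and split names are auto-detected, so they are assumptions here.
ds = load_dataset("datasets/VisionThink-General-Train")

split = next(iter(ds))       # e.g. "train"; actual split names may differ
print(ds)                    # number of examples per split
print(ds[split][0].keys())   # field names of a single example
```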
To use GPT as the reward model, first set the following environment variables:
export AZURE_API_KEY=<your_api_key>
export AZURE_ENDPOINT=<your_endpoint>
export AZURE_API_VERSION=<your_api_version>
Then run:
bash scripts/run_generalqa_4o_judge.sh
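For reference, here is a minimal, hypothetical sketch of how a GPT-4o judge could be queried with the credentials above to produce a binary reward; the deployment name, prompt, and scoring rule are illustrative and not the repo's exact reward logic:

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_API_KEY"],
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_version=os.environ["AZURE_API_VERSION"],
)

def judge_reward(question: str, ground_truth: str, prediction: str) -> float:
    """Ask the judge model whether the prediction matches the ground truth.

    Returns 1.0 for a judged-correct answer, 0.0 otherwise. The deployment
    name "gpt-4o" and the prompt wording are placeholders.
    """
    prompt = (
        "You are grading a VQA answer.\n"
        f"Question: {question}\nGround truth: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name; replace with your own deployment
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    verdict = resp.choices[0].message.content.strip().lower()
    return 1.0 if verdict.startswith("correct") else 0.0
```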
For ease of use, we also support using Qwen3 as the reward model. The relevant logic is implemented in RewardModelWorker, and no additional setup is needed:
bash scripts/run_generalqa_qwen3_judge.sh
After training completes, convert the model to Hugging Face format using:
python scripts/model_merger.py --local_dir <your_model_path>
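The merged checkpoint can then be loaded like any Qwen2.5-VL model in Transformers; a minimal sketch (the path is a placeholder for wherever the converted weights were written):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder path: the directory containing the converted Hugging Face weights.
model_path = "<your_merged_model_path>"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
print(model.config.model_type)  # should report a Qwen2.5-VL model type
```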
Our trained model, VisionThink-General, based on Qwen2.5-VL-7B, is available here.
To train the efficient reasoning model (VisionThink-Efficient), run the following script:
bash scripts/run_efficient_gpt4o_judge.sh
Our trained model, VisionThink-Efficient, based on Qwen2.5-VL-7B, is available here.
The evaluation code follows the structure of Lmms-Eval.
Please copy scripts/vllm_tool.py into the lmms_eval/models/ directory of your Lmms-Eval installation.
The evaluation scripts are similar to those used for other models. For example:
CUDA_VISIBLE_DEVICES="0,1,2,3" python3 -m lmms_eval \
--model vllm_tool \
--model_args model_version=${MODEL_DIR},tensor_parallel_size=4,trust_remote_code=True,max_images=2,prompt=tool_call,enable_tool_call=True,downsample_image=True \
--tasks ${TASKS} \
--batch_size 1024 \
--log_samples \
--log_samples_suffix vllm \
--output_path ./lmms_eval_logs/${MODEL_NAME} \
--verbosity DEBUG
If you find this project useful in your research, please consider citing:
This work is highly motivated by our previous effort on efficient VLMs, VisionZip, which explores token compression for faster inference.
@article{yang2025visionthink,
  title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
  author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
  journal={arXiv preprint arXiv:2507.13348},
  year={2025}
}
@article{yang2024visionzip,
  title={VisionZip: Longer is Better but Not Necessary in Vision Language Models},
  author={Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2412.04467},
  year={2024}
}
- This work is built upon Verl, EasyR1, Lmms-Eval, and MMSearch-R1. We thank them for their excellent open-source contributions.
- We also thank Qwen, DeepSeek-R1, VisionZip, FastV, SparseVLM, and others, whose contributions have provided valuable insights.
- VisionThink is licensed under the Apache License 2.0.