VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning


TABLE OF CONTENTS

  1. News
  2. Highlights
  3. Video
  4. Installation
  5. Train
  6. Evaluation
  7. Citation
  8. Acknowledgement
  9. License

News

  • [2025.07.22] Please check this note to quickly locate the code we modified from the original Verl. We will continue working to make the code more user-friendly.
  • [2025.07.18] We release the paper and this GitHub repo.
  • [2025.07.17] All data and models can be found here.

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [Paper]
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

Highlights


  1. Our VisionThink leverages reinforcement learning to autonomously learn whether to reduce visual tokens. Compared to traditional efficient VLM approaches, our method achieves significant improvements on fine-grained benchmarks, such as OCR-related tasks.

  2. VisionThink improves performance on General VQA tasks while reducing visual tokens by 50%, achieving 102% of the original model’s performance across nine benchmarks.

  3. VisionThink achieves strong performance and efficiency by simply resizing input images to reduce visual tokens. We hope this inspires further research into Efficient Reasoning Vision Language Models.

Video


Installation

The environment setup follows Verl.

git clone https://github.com/dvlab-research/VisionThink.git
conda create -n visionthink python=3.11 -y
conda activate visionthink
# veRL
pip3 install -e . 
# flash-attn
pip3 install flash-attn --no-build-isolation

If you want to use Qwen3 as the judge model, additionally install:

pip install -U tensordict
pip install transformers==4.51.0

Train

Data Preparation

Dataset for General VQA

huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-General-Train --local-dir datasets/VisionThink-General-Train
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-General-Val --local-dir datasets/VisionThink-General-Val

Dataset for Efficient Reasoning VLM

huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Train --local-dir datasets/VisionThink-Smart-Train
huggingface-cli download --repo-type dataset --resume-download Senqiao/VisionThink-Smart-Val --local-dir datasets/VisionThink-Smart-Val
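To sanity-check a download before launching training, you can peek at a few samples with the datasets library. The snippet below is a minimal sketch: it assumes the data is published in a standard Hugging Face dataset format with a train split, so verify the splits and field names against the dataset card.

# Minimal sketch: inspect a few training samples (assumes a standard HF dataset layout).
from datasets import load_dataset

ds = load_dataset("Senqiao/VisionThink-General-Train", split="train")
print(ds)                  # schema and number of rows
print(list(ds[0].keys()))  # fields of a single sample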

General VQA Improvement via Reinforcement Learning

To use GPT as the reward model, first set the following environment variables:

export AZURE_API_KEY=
export AZURE_ENDPOINT=
export AZURE_API_VERSION=

Then run:

bash scripts/run_generalqa_4o_judge.sh

For ease of use, we also support using Qwen3 as the reward model. The relevant logic is implemented in RewardModelWorker, and no additional setup is needed:

bash scripts/run_generalqa_qwen3_judge.sh

After training completes, convert the model to Hugging Face format using:

python scripts/model_merger.py --local_dir <your_model_path>

Our trained model, VisionThink-General, based on Qwen2.5-VL-7B, is available here.
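For a quick qualitative check of the released checkpoint, it can be loaded with the standard Qwen2.5-VL classes in transformers, since the model is based on Qwen2.5-VL-7B. The snippet below is a minimal sketch rather than the official inference recipe: the repository id and image path are assumptions, and the model card should be consulted for the exact prompts (including the tool-call prompt used during training).

# Minimal inference sketch (assumed repo id and placeholder image path; see the model card).
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

repo_id = "Senqiao/VisionThink-General"  # assumption: replace with the actual model id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does the sign in the image say?"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])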


Efficient Reasoning Vision Language Models

Run the following script:

bash scripts/run_efficient_gpt4o_judge.sh

Our trained model, VisionThink-Efficient, based on Qwen2.5-VL-7B, is available here.

Evaluation

The evaluation code follows the structure of lmms-eval. Please place scripts/vllm_tool.py into the lmms_eval/models/ directory (a copy sketch is shown after the example command below). The evaluation scripts are similar to those used for other models. For example:

CUDA_VISIBLE_DEVICES="0,1,2,3" python3 -m lmms_eval \
    --model vllm_tool \
    --model_args model_version=${MODEL_DIR},tensor_parallel_size=4,trust_remote_code=True,max_images=2,prompt=tool_call,enable_tool_call=True,downsample_image=True \
    --tasks ${TASKS} \
    --batch_size 1024 \
    --log_samples \
    --log_samples_suffix vllm \
    --output_path ./lmms_eval_logs/${MODEL_NAME} \
    --verbosity DEBUG
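Before running the command above, make sure vllm_tool.py has been copied into your lmms-eval checkout. A minimal sketch of that step, assuming lmms-eval is cloned at ./lmms-eval next to this repository (a plain cp works just as well):

# Copy the custom vLLM tool-call wrapper so that `--model vllm_tool` resolves
# (assumes lmms-eval is cloned at ./lmms-eval next to this repository).
import shutil
from pathlib import Path

src = Path("scripts/vllm_tool.py")
dst = Path("lmms-eval/lmms_eval/models") / src.name
shutil.copy(src, dst)
print(f"Copied {src} -> {dst}")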

Citation

This work is highly motivated by our previous effort on efficient VLMs, VisionZip, which explores token compression for faster inference.

If you find this project useful in your research, please consider citing:

@article{yang2025visionthink,
  title={VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning},
  author={Yang, Senqiao and Li, Junyi and Lai, Xin and Yu, Bei and Zhao, Hengshuang and Jia, Jiaya},
  journal={arXiv preprint arXiv:2507.13348},
  year={2025}
}
@article{yang2024visionzip,
  title={VisionZip: Longer is Better but Not Necessary in Vision Language Models},
  author={Yang, Senqiao and Chen, Yukang and Tian, Zhuotao and Wang, Chengyao and Li, Jingyao and Yu, Bei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2412.04467},
  year={2024}
}

Acknowledgement

License

  • VisionThink is licensed under the Apache License 2.0.
