EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Visual-Language Models
📄 Paper | 🏠 Project Website
Zekun Wang*, MingHua Ma*, Zexin Wang*, Rongchuan Mu*, Liping Shan, Ming Liu, Bing Qin
Harbin Institute of Technology
We introduce EffiVLM-Bench, a comprehensive benchmark designed to systematically evaluate training-free acceleration methods for Large Visual-Language Models (LVLMs). While LVLMs have achieved remarkable performance across diverse multimodal tasks, their high computational and memory demands hinder practical deployment and scalability. Although various acceleration techniques have been proposed, a lack of unified evaluation across different architectures, datasets, and metrics limits our understanding of their effectiveness and trade-offs.
EffiVLM-Bench investigates the effectiveness of training-free acceleration methods across representative LVLMs and diverse datasets. We focus on mainstream acceleration methods in two categories: token compression and parameter compression. The benchmark provides a unified framework for evaluating not only absolute performance but also the generalization and loyalty of these methods, and it further explores the Pareto-optimal trade-offs between performance and efficiency.
- 2025.05.18 EffiVLM-Bench has been accepted to ACL 2025!
- Exciting updates on the way: new compression methods and more supported models are coming soon!
```bash
conda create -n mllm-efficiency python=3.10
conda activate mllm-efficiency
pip install -r requirements.txt
pip install ninja
pip install omegaconf
pip install flash-attention-softmax-n
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install nvidia/label/cuda-12.1.1::cuda-nvcc
```
```bash
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
```
Create a new file in the activate.d directory and add the following content:
```bash
#!/bin/bash
export CUDA_HOME=$(dirname $(dirname $(which nvcc)))
```
Create a new file in the deactivate.d directory and add the following content:
```bash
#!/bin/bash
unset CUDA_HOME
```
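The two hook files can also be written programmatically. The sketch below is one possible way to do it in Python. The file name `cuda_home.sh` is an assumption (conda sources every `*.sh` file in these directories, so any name should work), and `write_cuda_hooks` is a hypothetical helper, not part of EffiVLM-Bench.

```python
from pathlib import Path

# Script bodies copied from the two snippets above.
ACTIVATE_HOOK = "#!/bin/bash\nexport CUDA_HOME=$(dirname $(dirname $(which nvcc)))\n"
DEACTIVATE_HOOK = "#!/bin/bash\nunset CUDA_HOME\n"


def write_cuda_hooks(prefix, filename="cuda_home.sh"):
    """Write the activate/deactivate hooks under a conda environment prefix.

    `filename` is an assumed name; the README does not prescribe one.
    Returns the two paths that were written (activate first).
    """
    written = []
    hooks = (("activate.d", ACTIVATE_HOOK), ("deactivate.d", DEACTIVATE_HOOK))
    for subdir, body in hooks:
        hook_dir = Path(prefix) / "etc" / "conda" / subdir
        hook_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
        hook_path = hook_dir / filename
        hook_path.write_text(body)
        written.append(hook_path)
    return written
```

From inside the activated environment, calling `write_cuda_hooks(os.environ["CONDA_PREFIX"])` would create both files; the hooks take effect the next time the environment is activated.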
```bash
conda activate mllm-efficiency
echo $CUDA_HOME
which nvcc
pip install flash-attn --no-build-isolation
```
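The two paths printed above should agree: `CUDA_HOME` is expected to be the directory two levels above the `nvcc` binary. A minimal Python sketch of that check follows; `cuda_home_from_nvcc` and `cuda_home_is_consistent` are hypothetical helper names, not part of EffiVLM-Bench.

```python
import os
import shutil
from pathlib import Path


def cuda_home_from_nvcc(nvcc_path):
    """Mirror `dirname $(dirname $(which nvcc))`: drop the trailing `bin/nvcc`."""
    return Path(nvcc_path).parent.parent


def cuda_home_is_consistent(env=None):
    """Return True when CUDA_HOME matches the nvcc found on PATH."""
    env = os.environ if env is None else env
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    cuda_home = env.get("CUDA_HOME")
    if nvcc is None or cuda_home is None:
        return False  # either nvcc is not on PATH or the hook did not run
    return Path(cuda_home) == cuda_home_from_nvcc(nvcc)


print("CUDA_HOME consistent with nvcc:", cuda_home_is_consistent())
```

If this prints `False`, it may be worth re-activating the environment before building `flash-attn`, since the build relies on `CUDA_HOME` to locate the CUDA toolkit.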
```bash
cd lmms-eval
pip install -e .
cd ../llava/
pip install -e .
pip install numpy==2.2.0
```
```bash
cd qwen2vl
pip install -e .
pip install qwen-vl-utils
```
Before running the script, set the following environment variables so that the project modules can be imported correctly.
```bash
export CONDA_DEFAULT_ENV="mllm-efficiency"
export PATH="<your anaconda path>/envs/mllm-efficiency/bin:$PATH"
export PYTHONPATH="<your project path>/EffiVLM-Bench:<your project path>/EffiVLM-Bench/lmms-eval"
```
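A quick way to confirm that the variables took effect is to check whether the expected packages resolve. The sketch below is illustrative only: the module names `lmms_eval` and `llava` are assumptions based on the directory names used in the installation steps, and both helper functions are hypothetical.

```python
import importlib.util
import os


def build_pythonpath(project_root):
    """Join the two directories that the export above puts on PYTHONPATH."""
    return os.pathsep.join([project_root, os.path.join(project_root, "lmms-eval")])


def missing_modules(names):
    """Return the top-level module names that cannot be resolved on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package names; adjust them to whatever the repository actually exposes.
print("unresolved:", missing_modules(["lmms_eval", "llava"]))
```

An empty list suggests the environment is set up as intended; any name that is reported points to a path that is missing from `PYTHONPATH` or a package that was not installed.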
This section explains how to use the predict.py script to run inference and test various KV cache compression and token pruning methods. The script is located at test/predict.py.
You can test various KV cache compression and token pruning methods on the following models:
- llava-onevision-qwen2-7b-ov
- Qwen2-VL-7B-Instruct
- InternVL2_5-38B
Additionally, KV cache methods are supported for the following model:
- InternVL2_5-4B
Usage
To run the script, use the following command structure: `python test/predict.py [arguments]`
Below are the necessary command-line arguments to configure the inference process:
- `--image_path`: str. Path to the input image.
- `--question`: str. The question to ask the model.
- `--pretrained`: str. Path or identifier for the pretrained model.
- `--model_name`: str, choices: ['llava-onevision-qwen2', 'qwen2-vl', 'internvl2_5']. Specify the model name.
- `--method`: str, choices: ['random', 'streamingllm', 'h2o', 'snapkv', 'look-m', 'vl-cache', 'pyramidkv', 'fastv', 'visionzip', 'prumerge+']. The KV cache compression or token pruning method to use.
- `--merge`: bool, default: True. Merge switch for the `look-m` KV cache method.
- `--head_adaptive`: bool, default: True. Enables the head-adaptive strategy for the `h2o`, `snapkv`, and `pyramidkv` methods.
- `--pooling`: str, default: avgpool. Pooling strategy for the `snapkv` and `pyramidkv` methods.
- `--layer_adaptive`: bool, default: True. Enables the layer-adaptive strategy for the `vl-cache` method.
- `--vlcache_different_window_per_layer`: bool, default: False. Enables different window sizes per layer for the `vl-cache` method.
- `--budgets`: float, default: 0.4. Budget for KV cache compression and token pruning methods.
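To make the arguments above concrete, the following sketch assembles a `test/predict.py` command line and rejects model or method names that are not among the documented choices. `build_predict_command` is a hypothetical helper, and the image path, question, and model path are placeholders. The boolean switches are left out because the README does not say how the script parses them on the command line.

```python
import shlex

# Choices copied from the argument list above.
MODEL_NAMES = ("llava-onevision-qwen2", "qwen2-vl", "internvl2_5")
METHODS = (
    "random", "streamingllm", "h2o", "snapkv", "look-m",
    "vl-cache", "pyramidkv", "fastv", "visionzip", "prumerge+",
)


def build_predict_command(image_path, question, pretrained,
                          model_name, method, budgets=0.4):
    """Return an argv list for test/predict.py, validating the choice arguments."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model_name: {model_name!r}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return [
        "python", "test/predict.py",
        "--image_path", image_path,
        "--question", question,
        "--pretrained", pretrained,
        "--model_name", model_name,
        "--method", method,
        "--budgets", str(budgets),
    ]


# Placeholder inputs; replace them with a real image and model path.
command = build_predict_command(
    image_path="/path/to/image.jpg",
    question="What is shown in this image?",
    pretrained="/path/to/Qwen2-VL-7B-Instruct",
    model_name="qwen2-vl",
    method="fastv",
)
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

Building the command as a list keeps the question string intact as a single argument, which also makes it safe to pass to `subprocess.run` without a shell.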
We use lmms-eval to evaluate various benchmarks. For examples of startup scripts, please refer to the run_example.sh file. You only need to substitute your own paths, module names, and parameter names.
`./run_example.sh`
Thanks to KVCache-Factory, ECoFLaP, Wanda, SparseGPT, FastV, VisionZip, and PruMerge for providing open-source code to support the expansion of this project.
@misc{wang2025effivlmbenchcomprehensivebenchmarkevaluating,
title={EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models},
author={Zekun Wang and Minghua Ma and Zexin Wang and Rongchuan Mu and Liping Shan and Ming Liu and Bing Qin},
year={2025},
eprint={2506.00479},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.00479},
}