- Authors: Jaewoo Lee, Keyang Xuan, Chanakya Ekbote, Sandeep Polisetty, Yi R. (May) Fung, Paul Pu Liang
- Paper
Multimodal Large Language Models (MLLMs) have grown in size to address the complexities of multimodal tasks. While beneficial for performance, their colossal size demands substantial computational and memory resources, limiting their practicality in resource-constrained scenarios. Post-training model pruning effectively reduces model size by removing a large number of parameters without compromising performance. However, most existing compression techniques assume unimodal models, limiting their effectiveness in multimodal settings.
We propose TAMP (Token-Adaptive Multimodal Pruning), an effective MLLM pruning pipeline that leverages multimodal token attributes to measure layer importance for layer-wise sparsity (DAS) and computes adaptive input activations to capture the multimodal processing demands of each layer (AMIA).
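To make the idea concrete, below is a minimal, hypothetical sketch of token-adaptive layer-wise pruning. It is not the TAMP implementation: the cosine-distance diversity measure, the diversity-to-sparsity mapping, and the |W|·‖X‖ pruning score (the Wanda criterion, used here as a stand-in) are all illustrative assumptions, and every function name is made up.

```python
# Illustrative sketch only -- not the TAMP codebase.
import torch

def token_diversity(hidden_states: torch.Tensor) -> float:
    """Average pairwise cosine distance among a layer's output tokens.
    hidden_states: (num_tokens, hidden_dim)."""
    x = torch.nn.functional.normalize(hidden_states, dim=-1)
    sim = x @ x.T                                  # (n, n) cosine similarities
    n = x.shape[0]
    mean_off_diag = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return (1.0 - mean_off_diag).item()            # higher = more diverse tokens

def layer_sparsities(diversities, target=0.5, strength=0.1):
    """Map per-layer token diversity to per-layer sparsity: layers with more
    diverse multimodal tokens are pruned less, while the average sparsity
    stays near the global target."""
    d = torch.tensor(diversities)
    d = (d - d.mean()) / (d.std() + 1e-8)          # normalize across layers
    return (target - strength * d).clamp(0.05, 0.95)

def prune_linear_(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float):
    """In-place unstructured pruning with a Wanda-style |W| * ||X|| score,
    zeroing the lowest-scoring fraction of weights in each output row."""
    score = weight.abs() * act_norm.unsqueeze(0)   # (out_dim, in_dim)
    k = int(weight.shape[1] * sparsity)
    if k > 0:
        drop = torch.topk(score, k, dim=1, largest=False).indices
        weight.scatter_(1, drop, 0.0)
```

The intuition mirrors the paper's: layers whose multimodal tokens are more diverse are assumed to be doing more multimodal processing and so receive a lower sparsity ratio, while the per-weight score folds input activations into the importance estimate.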
## Installation

Please follow the installation instructions from LLaVA-NeXT and VideoLLaMA2.
## Data Preparation

Please download the original LLaVA-NeXT visual instruction tuning dataset from LLaVA-NeXT-Data and place it in `TAMP/playground/LLaVA-NeXT-Data`. Then, split the dataset by task names:

```bash
python llava/pruners/split_finetune_llava_next.py
```

Similarly, download VideoLLaMA2's audio-visual instruction tuning dataset (AVInstruct) from AVInstruct and place the downloaded files in `TAMP/datasets`. Then, preprocess the AVInstruct annotation files by transforming them into LLaVA-style files:
```bash
python datasets/transform_to_avinstruct.py --video_dir TAMP/datasets/path_to_video --dataset_path1 TAMP/datasets/avqa_data1.json --dataset_path2 TAMP/datasets/avqa_data2.json --save_path TAMP/datasets/avinstruct_avqa_music.json
```
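For reference, a LLaVA-style record generally follows the common conversation format sketched below. This example is hypothetical: the exact keys and values emitted by `transform_to_avinstruct.py` may differ.

```python
# Hypothetical LLaVA-style record; the exact fields produced by
# transform_to_avinstruct.py may differ from this common format.
record = {
    "id": "avqa_music_0001",              # made-up sample id
    "video": "path_to_video/sample.mp4",  # media path relative to --video_dir
    "conversations": [
        {"from": "human", "value": "<video>\nWhat instrument is being played?"},
        {"from": "gpt", "value": "A violin."},
    ],
}
```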
## Models

In this paper, we focus on two models: llama3-llava-next-8b for vision-language model compression experiments and VideoLLaMA2.1-7B-AV for audio-visual-language model compression experiments. Please download the models into the /checkpoints directory.
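One way to fetch the checkpoints is with `huggingface_hub`, as in the sketch below. The repo ids are assumptions inferred from the public model names; verify them on the Hugging Face Hub before running.

```python
# Sketch: download both checkpoints into ./checkpoints.
# Repo ids are assumed from the public model names; verify on the Hub.
from huggingface_hub import snapshot_download

for repo_id in ("lmms-lab/llama3-llava-next-8b", "DAMO-NLP-SG/VideoLLaMA2.1-7B-AV"):
    snapshot_download(repo_id=repo_id, local_dir=f"checkpoints/{repo_id.split('/')[-1]}")
```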
## Pruning and Evaluation

To prune llama3-llava-next-8b:

```bash
bash scripts/prune/tamp.sh
```

Evaluate the pruned model with the official LLaVA-NeXT evaluation pipeline.
To prune VideoLLaMA2.1-7B-AV:

```bash
bash scripts_videollama2/prune/tamp.sh
```

Evaluate the pruned model with the official VideoLLaMA2 evaluation pipeline.
## Citation

```bibtex
@inproceedings{lee2024tamp,
  title={TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models},
  author={Jaewoo Lee and Keyang Xuan and Chanakya Ekbote and Sandeep Polisetty and Yi R. (May) Fung and Paul Pu Liang},
  year={2025},
  booktitle={Findings of the Association for Computational Linguistics (ACL)},
}
```