
TAMP

TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models, ACL 2025 Findings

Introduction

Multimodal Large Language Models (MLLMs) have grown in size to address the complexities of multimodal tasks. While beneficial for performance, their colossal size demands substantial computational and memory resources, limiting their practicality in resource-constrained scenarios. Post-training model pruning effectively reduces model size by removing a large number of parameters without compromising performance. However, most existing model compression techniques assume unimodal models, limiting their effectiveness in multimodal settings.

Teaser

We propose TAMP (Token-Adaptive Multimodal Pruning), an effective MLLM pruning pipeline that leverages multimodal token attributes to measure layer importance for layer-wise sparsity allocation (DAS) and computes adaptive input activations that capture each layer's multimodal processing demands (AMIA). Both components use multimodal token attributes to guide MLLM pruning.
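For intuition, a minimal sketch of the two components is given below. It is illustrative only: the function names, the cosine-similarity diversity proxy, and the Wanda-style weight-times-activation importance score are assumptions for this sketch, not the repository's exact implementation.

import torch

def diversity_aware_sparsity(token_feats, base_sparsity=0.5, scale=0.1):
    # token_feats: (num_tokens, hidden_dim) multimodal hidden states entering a layer.
    feats = torch.nn.functional.normalize(token_feats, dim=-1)
    diversity = 1.0 - (feats @ feats.T).mean()  # higher value = more diverse tokens
    # Layers that process more diverse multimodal tokens keep more weights (lower sparsity).
    return float(torch.clamp(base_sparsity - scale * diversity, 0.0, 1.0))

def prune_linear(weight, input_acts, sparsity):
    # Wanda-style score: |W| scaled by the L2 norm of each input channel's activations.
    score = weight.abs() * input_acts.norm(p=2, dim=0)
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = score.flatten().kthvalue(k).values
    return weight * (score > threshold)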

Install

Please follow the installation instructions from LLaVA-NeXT and VideoLLaMA2.

Calibration Dataset

LLaVA-NeXT

Please download the original LLaVA-NeXT visual instruction tuning dataset from LLaVA-NeXT-Data and place it in TAMP/playground/LLaVA-NeXT-Data. Then split the dataset by task name:

python llava/pruners/split_finetune_llava_next.py 
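The sketch below illustrates the kind of split this step performs; the annotation file name and the per-entry task field are hypothetical placeholders, so refer to llava/pruners/split_finetune_llava_next.py for the actual fields and paths it uses.

# Hypothetical sketch of splitting the instruction-tuning data by task name.
import json
from collections import defaultdict
from pathlib import Path

data_dir = Path("playground/LLaVA-NeXT-Data")
entries = json.loads((data_dir / "annotations.json").read_text())  # placeholder file name

by_task = defaultdict(list)
for entry in entries:
    by_task[entry.get("task", "unknown")].append(entry)  # "task" field is an assumption

for task, items in by_task.items():
    (data_dir / f"{task}.json").write_text(json.dumps(items))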

VideoLLaMA2

Similarly, download VideoLLaMA2's audio-visual instruction tuning dataset (AVInstruct) from AVinstruct. Place the downloaded files in TAMP/datasets. Then preprocess the AVInstruct annotation files by converting them into LLaVA-style files:

python datasets/transform_to_avinstruct.py --video_dir TAMP/datasets/path_to_video --dataset_path1 TAMP/datasets/avqa_data1.json --dataset_path2 TAMP/datasets/avqa_data2.json --save_path TAMP/datasets/avinstruct_avqa_music.json
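The conversion essentially wraps each audio-visual question-answer pair into a LLaVA-style conversation record. The snippet below is only a sketch of that idea; the input field names ("video_id", "question", "answer") and the exact output schema are assumptions.

# Illustrative AVInstruct -> LLaVA-style record conversion (field names are assumptions).
def to_llava_record(avqa_entry, video_dir):
    return {
        "id": avqa_entry["video_id"],
        "video": f"{video_dir}/{avqa_entry['video_id']}.mp4",
        "conversations": [
            {"from": "human", "value": "<video>\n" + avqa_entry["question"]},
            {"from": "gpt", "value": avqa_entry["answer"]},
        ],
    }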

Models

In this paper, we focus on two models: llama3-llava-next-8b for vision-language model compression experiments and VideoLLaMA2.1-7B-AV for audio-visual-language model compression experiments. Please download the models into the /checkpoints directory.
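One way to fetch the checkpoints is via huggingface_hub; the repository IDs below are assumed to be the public Hugging Face releases of the two models, and the local paths should be adjusted to match your checkpoints directory.

# Assumed Hugging Face repo IDs; adjust if you obtain the checkpoints elsewhere.
from huggingface_hub import snapshot_download

snapshot_download("lmms-lab/llama3-llava-next-8b",
                  local_dir="checkpoints/llama3-llava-next-8b")
snapshot_download("DAMO-NLP-SG/VideoLLaMA2.1-7B-AV",
                  local_dir="checkpoints/VideoLLaMA2.1-7B-AV")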

LLaVA-NeXT experiment

Pruning

bash scripts/prune/tamp.sh

Evaluation

Evaluate the pruned models with the official LLaVA-NeXT evaluation pipeline.

VideoLLaMA2 experiment

Pruning

bash scripts_videollama2/prune/tamp.sh

Evaluation

Evaluate the pruned models with the official VideoLLaMA2 evaluation pipeline.

Bibtex

@inproceedings{lee2024tamp,
      title={TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models},
      author={Jaewoo Lee and Keyang Xuan and Chanakya Ekbote and Sandeep Polisetty and Yi R. (May) Fung and Paul Pu Liang},
      year={2025},
      booktitle={Findings of the Association for Computational Linguistics (ACL)},
}
