
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Accepted at NeurIPS 2025 🔥

Paper | Project Page | GitHub | License

Zichen Wen1,2, Shaobo Wang1, Yufa Zhou3, Junyuan Zhang4,
Qintong Zhang5, Yifeng Gao1, Zhaorun Chen6, Bin Wang2, Weijia Li7,2,
Conghui He2*, Linfeng Zhang1*

1EPIC Lab, Shanghai Jiao Tong University  2Shanghai AI Laboratory
3Duke University  4The University of Hong Kong
5Peking University  6University of Chicago  7Sun Yat-sen University
*Corresponding authors

📰 News

  • 2025.10.13 🎉 We have released our code for EPIC!
  • 2025.10.01 🤗🤗 We released our latest work, EPIC, an efficient framework for progressive consistency distillation in multi-modal large language models.

📖 Overview

EPIC Motivation Figure

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either by modifying model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression: the model's parameter space struggles to adapt quickly to the substantial feature-space perturbations that token compression induces.

EPIC Overview Figure

In this work, we propose to develop Efficient MLLMs via ProgressIve Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively. Both aim to reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory.
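
To make this concrete, below is a minimal sketch of the token consistency objective, written against a generic PyTorch interface rather than this repository's trainer; the function names, the KL formulation, and the response-token alignment are assumptions for illustration only.

import torch
import torch.nn.functional as F

def token_consistency_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    # Both tensors are assumed to be gathered at the same response-token
    # positions (shape [num_tokens, vocab]), so they stay comparable even
    # though the student pass saw fewer visual tokens than the teacher pass.
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

# Illustrative training step -- teacher and student are the SAME weights;
# the teacher pass simply sees all visual tokens and contributes no gradient.
# `gather_response` is a hypothetical helper that selects response positions.
#
#   with torch.no_grad():
#       teacher_logits = model(**batch_full).logits
#   student_out = model(**batch_compressed, labels=labels)
#   loss = student_out.loss + alpha * token_consistency_loss(
#       gather_response(student_out.logits), gather_response(teacher_logits))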

🎯 Key Features

  • Progressive Learning: The token compression ratio increases progressively over training, or the compression layers progressively shift from deeper to shallower layers (see the sketch after this list)
  • Self-Consistency Distillation: The teacher and student share weights, so no extra model is introduced
  • Multiple Token Compression Strategies: Supports DART, FastV, and random token compression

🚀 Quick Start

Prerequisites

# Clone this repository and navigate to EPIC folder
git clone https://github.com/zichenwen1/EPIC
cd EPIC
# Create conda environment
conda create -n EPIC python=3.10 -y
conda activate EPIC

# Install torch and flash attention
pip install torch torchvision torchaudio
pip install flash_attn --no-build-isolation # This may depend on your versions of torch, python, and cuda

# Install dependencies
pip install -r requirements.txt

# Key dependencies include:
# - transformers
# - deepspeed
# - torch (with CUDA support)
# - peft
# - tensorboard
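
Optionally, a quick sanity check that the GPU stack and FlashAttention import correctly; this snippet is not part of the repository and only probes the environment.

import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as err:
    print("flash_attn failed to import:", err)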

Training

1. Prepare Data

Please download the annotation of the final mixture of our instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:

  • COCO: train2017
  • GQA: images
  • OCR-VQA: images
  • TextVQA: train_images
  • Visual Genome: VG_100K, VG_100K_2

After downloading all of them, organize the data as follows in ./playground/data:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2

2. Configure Training

Edit scripts/v1_5/finetune_TCD.sh to set your parameters:

# Model configuration
MODEL_NAME_OR_PATH="/path/to/your/base/model"
VISION_TOWER="/path/to/vision/tower"
MM_PROJECTOR="/path/to/mm_projector"

# Training configuration
BATCH_SIZE=4
LEARNING_RATE=2e-5
MM_VISION_TOWER_LR=2e-6
NUM_EPOCHS=3

# Compression configuration
PRUNING_METHOD="dart"  # Options: dart, fastv, random
PRUNED_LAYER=2
SECOND_PRUNED_LAYER=15
REDUCTION_RATIO=0.5
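
A rough reading of these knobs, stated as an assumption since the scripts define the exact semantics: PRUNED_LAYER and SECOND_PRUNED_LAYER pick the decoder layers at which visual tokens are compressed, and REDUCTION_RATIO=0.5 drops about half of the remaining visual tokens at each of those points.

# Back-of-the-envelope token budget under the assumed semantics above.
visual_tokens = 576                            # LLaVA-1.5 visual tokens (24 x 24 patches)
after_first = int(visual_tokens * (1 - 0.5))   # about 288 tokens remain after layer 2
after_second = int(after_first * (1 - 0.5))    # about 144 tokens remain after layer 15
print(after_first, after_second)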

3. Run Training

bash scripts/v1_5/finetune_TCD.sh dart    # train with DART compression
bash scripts/v1_5/finetune_TCD.sh fastv   # train with FastV compression
bash scripts/v1_5/finetune_TCD.sh random  # train with random compression

πŸ—οΈ Project Structure

EPIC_dev/
├── assets/                         # Project assets and images
│   ├── motivation.jpg
│   └── overview.jpg
├── checkpoints/                    # Model checkpoints and training logs
├── docs/                           # Documentation
│   ├── Customize_Component.md
│   ├── Data.md
│   ├── Evaluation.md
│   ├── Finetune_Custom_Data.md
│   ├── Intel.md
│   ├── LLaVA_Bench.md
│   ├── LLaVA_from_LLaMA2.md
│   ├── LoRA.md
│   ├── macOS.md
│   ├── MODEL_ZOO.md
│   ├── ScienceQA.md
│   └── Windows.md
├── llava/
│   ├── model/
│   │   ├── pruning_methods/        # Token compression methods
│   │   │   ├── config.py           # Configuration management
│   │   │   ├── factory.py          # Model factory
│   │   │   ├── models.py           # Compression model implementations
│   │   │   └── trainer_factory.py  # Trainer factory
│   │   ├── language_model/         # Language model implementations
│   │   │   ├── dart/               # DART compression
│   │   │   ├── fastv/              # FastV compression
│   │   │   ├── random/             # Random compression
│   │   │   ├── llava_llama.py      # LLaVA LLaMA implementation
│   │   │   ├── llava_mistral.py    # LLaVA Mistral implementation
│   │   │   └── llava_mpt.py        # LLaVA MPT implementation
│   │   ├── multimodal_encoder/     # Multimodal encoder components
│   │   ├── multimodal_projector/   # Multimodal projector components
│   │   └── llava_arch.py           # LLaVA architecture
│   ├── train/
│   │   ├── train_mem_KD_TCD.py     # Memory-efficient KD training
│   │   ├── train_TCD.py            # TCD training script
│   │   ├── llava_trainer_KD_from_pretrain_*.py  # Specialized KD trainers
│   │   ├── llava_trainer_KD.py     # Knowledge distillation trainer
│   │   └── llava_trainer.py        # Base trainer
│   ├── eval/                       # Evaluation scripts
│   │   ├── eval_*.py               # Various evaluation scripts
│   │   └── webpage/                # Web-based evaluation interface
│   └── serve/                      # Model serving components
│       ├── cli.py
│       ├── controller.py
│       ├── gradio_web_server.py
│       ├── model_worker.py
│       └── sglang_worker.py
├── scripts/
│   ├── v1_5/
│   │   ├── eval/                   # Evaluation scripts
│   │   │   ├── gqa.sh
│   │   │   ├── mmbench.sh
│   │   │   ├── pope.sh
│   │   │   └── ...
│   │   ├── finetune_TCD.sh         # TCD fine-tuning script
│   │   └── multi_node_train.sh     # Multi-node training
│   ├── finetune*.sh                # Various fine-tuning scripts
│   ├── pretrain*.sh                # Pre-training scripts
│   └── zero3.json                  # DeepSpeed configuration
├── PRUNING_METHODS_REFACTOR.md     # Pruning methods documentation
├── pyproject.toml                  # Python project configuration
├── requirements.txt                # Dependencies
└── README.md                       # This file

Implementation Details

  • Model Factory: llava/model/pruning_methods/factory.py - Creates models based on compression method
  • Trainer Factory: llava/model/pruning_methods/trainer_factory.py - Creates specialized trainers
  • Configuration: llava/model/pruning_methods/config.py - Manages compression parameters
  • TCD Training: llava/train/train_mem_KD_TCD.py - Main training entry point

📊 Results

Performance on Visual Understanding Benchmarks

Results Figure

Inference Efficiency

Inference Efficiency Figure

🔬 Analysis

High ROI vs Low ROI Areas

Our analysis shows that:

  • High ROI: Reducing tokens from 576 to 64 preserves most performance with significant efficiency gains
  • Low ROI: Further compression below 64 tokens yields diminishing returns in speed and sharp accuracy drops

analysis Figure
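
A back-of-the-envelope estimate makes this asymmetry plausible; it assumes prefill attention cost scales with the square of sequence length and a 64-token text prompt, and is a rough model rather than a measurement from the paper.

# Relative prefill attention cost versus the uncompressed 576-token baseline.
text_tokens = 64
baseline = (text_tokens + 576) ** 2
for visual_tokens in (576, 144, 64, 16):
    seq_len = text_tokens + visual_tokens
    print(f"{visual_tokens:>3} visual tokens -> {seq_len ** 2 / baseline:.3f}x baseline cost")

Under this crude model, going from 576 to 64 visual tokens removes roughly 96% of the baseline cost, while dropping from 64 to 16 saves only a further couple of percentage points, which is where the accuracy losses begin to dominate.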

Generalization Across Methods

EPIC enables strong generalization across different token compression strategies: models trained with one method (e.g., DART) also perform well when other methods (FastV, Random) are applied at inference.

generalization Figure

📌 TODO

  • Release checkpoints
  • Release LCD implementation
  • Release evaluation code

📚 Citation

If you find this work useful, please cite our paper:

@article{wen2025efficient,
    title={Efficient Multi-modal Large Language Models via Progressive Consistency Distillation},
    author={Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen, Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and others},
    journal={arXiv preprint arXiv:2510.00515},
    year={2025}
}

🤝 Contributing

We welcome contributions! Please feel free to submit issues and pull requests.

📄 License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.

🙏 Acknowledgments

📞 Contact
