Zichen Wen1,2, Shaobo Wang1, Yufa Zhou3, Junyuan Zhang4, Qintong Zhang5, Yifeng Gao1, Zhaorun Chen6, Bin Wang2, Weijia Li7,2, Conghui He2*, Linfeng Zhang1*

1EPIC Lab, Shanghai Jiao Tong University  2Shanghai AI Laboratory  3Duke University  4The University of Hong Kong  5Peking University  6University of Chicago  7Sun Yat-sen University

*Corresponding authors
- [2025.10.13] We have released our code for EPIC!
- [2025.10.01] We release our latest work EPIC, an efficient framework for progressive consistency distillation in multi-modal large language models.
In this work, we propose to develop Efficient MLLMs via ProgressIve Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation (TCD) and layer consistency distillation (LCD), respectively. Both aim to reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory; a minimal sketch of one training step is given after the feature list below.
- Progressive Learning: Token compression ratio increases progressively over time, or compression layers progressively shift from deeper to shallower layers
- Self-Consistency Distillation: Teacher and student share weights with no extra model introduced
- Multiple Token Compression Strategies: Supports DART, FastV, and Random token compression
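The self-consistency idea can be pictured with a short sketch. Everything below is illustrative rather than the actual EPIC code: the `reduction_ratio` keyword, the linear schedule, and the KL weighting are assumptions, and the real teacher/student settings follow the schedules described in the paper.

import torch
import torch.nn.functional as F

def progressive_ratio(step: int, total_steps: int, max_ratio: float = 0.5) -> float:
    # Hypothetical schedule: the fraction of visual tokens removed grows
    # linearly from 0 to max_ratio as training proceeds.
    return max_ratio * min(step / max(total_steps, 1), 1.0)

def consistency_distillation_step(model, batch, step, total_steps, kd_weight=1.0):
    # Teacher and student share the same weights; they differ only in how
    # aggressively visual tokens are compressed during the forward pass.
    # `reduction_ratio` is an assumed keyword, not the actual EPIC API.
    student_ratio = progressive_ratio(step, total_steps)

    with torch.no_grad():  # teacher pass under a milder setting, no gradients
        teacher_logits = model(**batch, reduction_ratio=0.0).logits

    student_out = model(**batch, reduction_ratio=student_ratio)
    task_loss = student_out.loss  # standard next-token prediction loss

    # Consistency term: keep the compressed (student) predictions close to
    # the uncompressed (teacher) predictions produced by the same weights.
    kd_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return task_loss + kd_weight * kd_loss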
# Clone this repository and navigate to EPIC folder
git clone https://github.com/zichenwen1/EPIC
cd EPIC
# Create conda environment
conda create -n EPIC python=3.10 -y
conda activate EPIC
# Install torch and flash attention
pip install torch torchvision torchaudio
pip install flash_attn --no-build-isolation # This may depend on your versions of torch, python, and cuda
# Install dependencies
pip install -r requirements.txt
# Key dependencies include:
# - transformers
# - deepspeed
# - torch (with CUDA support)
# - peft
# - tensorboard
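Once the environment is set up, a quick sanity check (a throwaway snippet, not part of the repository) confirms that PyTorch sees a CUDA device and that flash attention imports cleanly:

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash_attn:", getattr(flash_attn, "__version__", "installed"))
except ImportError:
    print("flash_attn is not installed; rerun the pip command above before training.")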
Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in ./playground/data:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
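Before launching training, a small standalone check (illustrative only; adjust the root path if your layout differs) can confirm the folders above are in place:

from pathlib import Path

DATA_ROOT = Path("./playground/data")
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

missing = [d for d in EXPECTED if not (DATA_ROOT / d).is_dir()]
if missing:
    print("Missing image folders:", ", ".join(missing))
else:
    print("All expected image folders are in place.")

# The instruction-tuning annotation is assumed to sit alongside the image folders.
annotation = DATA_ROOT / "llava_v1_5_mix665k.json"
print("Annotation found:", annotation.is_file())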
Edit scripts/v1_5/finetune_TCD.sh to set your parameters:
# Model configuration
MODEL_NAME_OR_PATH="/path/to/your/base/model"
VISION_TOWER="/path/to/vision/tower"
MM_PROJECTOR="/path/to/mm_projector"
# Training configuration
BATCH_SIZE=4
LEARNING_RATE=2e-5
MM_VISION_TOWER_LR=2e-6
NUM_EPOCHS=3
# Compression configuration
PRUNING_METHOD="dart" # Options: dart, fastv, random
PRUNED_LAYER=2
SECOND_PRUNED_LAYER=15
REDUCTION_RATIO=0.5
bash scripts/v1_5/finetune_TCD.sh dart
bash scripts/v1_5/finetune_TCD.sh fastv
bash scripts/v1_5/finetune_TCD.sh random
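For reference, the compression settings above map onto a small set of parameters that the configuration layer (llava/model/pruning_methods/config.py, see the project structure below) manages. The dataclass here is purely illustrative of what such a configuration carries; field names mirror the script variables, not the actual class:

from dataclasses import dataclass

@dataclass
class CompressionConfig:
    # Illustrative container mirroring the variables in finetune_TCD.sh;
    # the real config.py may use different names and defaults.
    pruning_method: str = "dart"      # one of "dart", "fastv", "random"
    pruned_layer: int = 2             # layer where compression is first applied (assumed meaning)
    second_pruned_layer: int = 15     # layer for a later compression stage (assumed meaning)
    reduction_ratio: float = 0.5      # fraction of visual tokens removed

    def validate(self) -> None:
        assert self.pruning_method in {"dart", "fastv", "random"}
        assert 0.0 <= self.reduction_ratio < 1.0
        assert self.pruned_layer < self.second_pruned_layer

cfg = CompressionConfig()
cfg.validate()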
EPIC_dev/
├── assets/                               # Project assets and images
│   ├── motivation.jpg
│   └── overview.jpg
├── checkpoints/                          # Model checkpoints and training logs
├── docs/                                 # Documentation
│   ├── Customize_Component.md
│   ├── Data.md
│   ├── Evaluation.md
│   ├── Finetune_Custom_Data.md
│   ├── Intel.md
│   ├── LLaVA_Bench.md
│   ├── LLaVA_from_LLaMA2.md
│   ├── LoRA.md
│   ├── macOS.md
│   ├── MODEL_ZOO.md
│   ├── ScienceQA.md
│   └── Windows.md
├── llava/
│   ├── model/
│   │   ├── pruning_methods/              # Token compression methods
│   │   │   ├── config.py                 # Configuration management
│   │   │   ├── factory.py                # Model factory
│   │   │   ├── models.py                 # Compression model implementations
│   │   │   └── trainer_factory.py        # Trainer factory
│   │   ├── language_model/               # Language model implementations
│   │   │   ├── dart/                     # DART compression
│   │   │   ├── fastv/                    # FastV compression
│   │   │   ├── random/                   # Random compression
│   │   │   ├── llava_llama.py            # LLaVA LLaMA implementation
│   │   │   ├── llava_mistral.py          # LLaVA Mistral implementation
│   │   │   └── llava_mpt.py              # LLaVA MPT implementation
│   │   ├── multimodal_encoder/           # Multimodal encoder components
│   │   ├── multimodal_projector/         # Multimodal projector components
│   │   └── llava_arch.py                 # LLaVA architecture
│   ├── train/
│   │   ├── train_mem_KD_TCD.py           # Memory-efficient KD training
│   │   ├── train_TCD.py                  # TCD training script
│   │   ├── llava_trainer_KD_from_pretrain_*.py  # Specialized KD trainers
│   │   ├── llava_trainer_KD.py           # Knowledge distillation trainer
│   │   └── llava_trainer.py              # Base trainer
│   ├── eval/                             # Evaluation scripts
│   │   ├── eval_*.py                     # Various evaluation scripts
│   │   └── webpage/                      # Web-based evaluation interface
│   └── serve/                            # Model serving components
│       ├── cli.py
│       ├── controller.py
│       ├── gradio_web_server.py
│       ├── model_worker.py
│       └── sglang_worker.py
├── scripts/
│   ├── v1_5/
│   │   ├── eval/                         # Evaluation scripts
│   │   │   ├── gqa.sh
│   │   │   ├── mmbench.sh
│   │   │   ├── pope.sh
│   │   │   └── ...
│   │   ├── finetune_TCD.sh               # TCD fine-tuning script
│   │   └── multi_node_train.sh           # Multi-node training
│   ├── finetune*.sh                      # Various fine-tuning scripts
│   ├── pretrain*.sh                      # Pre-training scripts
│   └── zero3.json                        # DeepSpeed configuration
├── PRUNING_METHODS_REFACTOR.md           # Pruning methods documentation
├── pyproject.toml                        # Python project configuration
├── requirements.txt                      # Dependencies
└── README.md                             # This file
- Model Factory: llava/model/pruning_methods/factory.py - Creates models based on the compression method
- Trainer Factory: llava/model/pruning_methods/trainer_factory.py - Creates specialized trainers
- Configuration: llava/model/pruning_methods/config.py - Manages compression parameters
- TCD Training: llava/train/train_mem_KD_TCD.py - Main training entry point
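The factory pattern behind these components can be summarized with a toy registry. The names and stub classes below are stand-ins; the real dispatch lives in factory.py and trainer_factory.py.

# Toy registry illustrating factory-style dispatch on the pruning method;
# the stub classes stand in for the real compression model implementations.
PRUNING_METHODS = {}

def register_method(name):
    def wrap(cls):
        PRUNING_METHODS[name] = cls
        return cls
    return wrap

@register_method("dart")
class DartModelStub: ...

@register_method("fastv")
class FastVModelStub: ...

@register_method("random")
class RandomModelStub: ...

def build_model(pruning_method: str):
    # Look up and instantiate the class registered for the requested method.
    try:
        return PRUNING_METHODS[pruning_method]()
    except KeyError:
        raise ValueError(f"Unknown pruning method: {pruning_method!r}") from None

model = build_model("dart")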
Our analysis shows that:
- High ROI: Reducing tokens from 576 to 64 preserves most performance with significant efficiency gains
- Low ROI: Further compression below 64 tokens yields diminishing returns in speed and sharp accuracy drops
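As a rough illustration of why the 576-to-64 operating point is attractive, the snippet below estimates relative self-attention cost under the simplifying assumption that cost scales quadratically with sequence length (the 64-token text prompt is an arbitrary example):

def attention_cost_ratio(visual_tokens: int, text_tokens: int = 64, baseline_visual: int = 576) -> float:
    # Relative self-attention cost vs. the uncompressed baseline,
    # assuming cost ~ (sequence length)^2. Real speedups also depend on
    # the MLP blocks, KV cache, and hardware, so treat this as a rough guide.
    full = (baseline_visual + text_tokens) ** 2
    compressed = (visual_tokens + text_tokens) ** 2
    return compressed / full

for kept in (576, 256, 128, 64, 32, 16):
    print(f"{kept:>3} visual tokens kept -> ~{attention_cost_ratio(kept):.1%} of baseline attention cost")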
EPIC enables strong generalization across different token compression strategies: models trained with one compression method (e.g., DART) also perform well when other methods (e.g., FastV, Random) are used at inference time.
- Release checkpoints
- Release LCD implementation
- Release evaluation code
If you find this work useful, please cite our paper:
@article{wen2025efficient,
title={Efficient Multi-modal Large Language Models via Progressive Consistency Distillation},
author={Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen, Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and others},
journal={arXiv preprint arXiv:2510.00515},
year={2025}
}
We welcome contributions! Please feel free to submit issues and pull requests.
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.
- LLaVA for the base multimodal framework
- DeepSpeed for efficient training
- Transformers for model implementations
- Email: [email protected]
- Project Page: https://zichenwen1.github.io/EPIC/
- Paper: arXiv:2510.00515