Zichen Wen1,2, Shaobo Wang1, Yufa Zhou3, Junyuan Zhang4, Qintong Zhang5, Yifeng Gao1, Zhaorun Chen6, Bin Wang2, Weijia Li7,2, Conghui He2*, Linfeng Zhang1*

1EPIC Lab, Shanghai Jiao Tong University  2Shanghai AI Laboratory  3Duke University  4The University of Hong Kong  5Peking University  6University of Chicago  7Sun Yat-sen University

*Corresponding authors
- [2025.10.13] We have released our code for EPIC!
- [2025.10.01] We release our latest work EPIC, an efficient framework for progressive consistency distillation in multi-modal large language models.
In this work, we propose to develop Efficient MLLMs via ProgressIve Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature-space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation (TCD) and layer consistency distillation (LCD), respectively. Both aim to reduce training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory; a minimal sketch of one training step is given after the feature list below.
- Progressive Learning: Token compression ratio increases progressively over time, or compression layers progressively shift from deeper to shallower layers
- Self-Consistency Distillation: Teacher and student share weights with no extra model introduced
- Multiple Token Compression Strategies: Supports DART, FastV, and Random token compression
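The self-consistency idea can be pictured with a short sketch. Everything below is illustrative rather than the actual EPIC code: the `reduction_ratio` keyword, the linear schedule, and the KL weighting are assumptions, and the real teacher/student settings follow the schedules described in the paper.

import torch
import torch.nn.functional as F

def progressive_ratio(step: int, total_steps: int, max_ratio: float = 0.5) -> float:
    # Hypothetical schedule: the fraction of visual tokens removed grows
    # linearly from 0 to max_ratio as training proceeds.
    return max_ratio * min(step / max(total_steps, 1), 1.0)

def consistency_distillation_step(model, batch, step, total_steps, kd_weight=1.0):
    # Teacher and student share the same weights; they differ only in how
    # aggressively visual tokens are compressed during the forward pass.
    # `reduction_ratio` is an assumed keyword, not the actual EPIC API.
    student_ratio = progressive_ratio(step, total_steps)

    with torch.no_grad():  # teacher pass under a milder setting, no gradients
        teacher_logits = model(**batch, reduction_ratio=0.0).logits

    student_out = model(**batch, reduction_ratio=student_ratio)
    task_loss = student_out.loss  # standard next-token prediction loss

    # Consistency term: keep the compressed (student) predictions close to
    # the uncompressed (teacher) predictions produced by the same weights.
    kd_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return task_loss + kd_weight * kd_loss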
# Clone this repository and navigate to EPIC folder
git clone https://github.com/zichenwen1/EPIC
cd EPIC
# Create conda environment
conda create -n EPIC python=3.10 -y
conda activate EPIC
# Install torch and flash attention
pip install torch torchvision torchaudio
pip install flash_attn --no-build-isolation # This may depend on your versions of torch, python, and cuda
# Install dependencies
pip install -r requirements.txt
# Key dependencies include:
# - transformers
# - deepspeed
# - torch (with CUDA support)
# - peft
# - tensorboard
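Once the environment is set up, a quick sanity check (a throwaway snippet, not part of the repository) confirms that PyTorch sees a CUDA device and that flash attention imports cleanly:

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash_attn:", getattr(flash_attn, "__version__", "installed"))
except ImportError:
    print("flash_attn is not installed; rerun the pip command above before training.")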
Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as .jpg
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in ./playground/data:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
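Before launching training, a small standalone check (illustrative only; adjust the root path if your layout differs) can confirm the folders above are in place:

from pathlib import Path

DATA_ROOT = Path("./playground/data")
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

missing = [d for d in EXPECTED if not (DATA_ROOT / d).is_dir()]
if missing:
    print("Missing image folders:", ", ".join(missing))
else:
    print("All expected image folders are in place.")

# The instruction-tuning annotation is assumed to sit alongside the image folders.
annotation = DATA_ROOT / "llava_v1_5_mix665k.json"
print("Annotation found:", annotation.is_file())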
Edit scripts/v1_5/finetune_TCD.sh to set your parameters:
# Model configuration
MODEL_NAME_OR_PATH="/path/to/your/base/model"
VISION_TOWER="/path/to/vision/tower"
MM_PROJECTOR="/path/to/mm_projector"
# Training configuration
BATCH_SIZE=4
LEARNING_RATE=2e-5
MM_VISION_TOWER_LR=2e-6
NUM_EPOCHS=3
# Compression configuration
PRUNING_METHOD="dart" # Options: dart, fastv, random
PRUNED_LAYER=2
SECOND_PRUNED_LAYER=15
REDUCTION_RATIO=0.5
bash scripts/v1_5/finetune_TCD.sh dart
bash scripts/v1_5/finetune_TCD.sh fastv
bash scripts/v1_5/finetune_TCD.sh random
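For reference, the compression settings above map onto a small set of parameters that the configuration layer (llava/model/pruning_methods/config.py, see the project structure below) manages. The dataclass here is purely illustrative of what such a configuration carries; field names mirror the script variables, not the actual class:

from dataclasses import dataclass

@dataclass
class CompressionConfig:
    # Illustrative container mirroring the variables in finetune_TCD.sh;
    # the real config.py may use different names and defaults.
    pruning_method: str = "dart"      # one of "dart", "fastv", "random"
    pruned_layer: int = 2             # layer where compression is first applied (assumed meaning)
    second_pruned_layer: int = 15     # layer for a later compression stage (assumed meaning)
    reduction_ratio: float = 0.5      # fraction of visual tokens removed

    def validate(self) -> None:
        assert self.pruning_method in {"dart", "fastv", "random"}
        assert 0.0 <= self.reduction_ratio < 1.0
        assert self.pruned_layer < self.second_pruned_layer

cfg = CompressionConfig()
cfg.validate()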
EPIC_dev/
├── assets/                               # Project assets and images
│   ├── motivation.jpg
│   └── overview.jpg
├── checkpoints/                          # Model checkpoints and training logs
├── docs/                                 # Documentation
│   ├── Customize_Component.md
│   ├── Data.md
│   ├── Evaluation.md
│   ├── Finetune_Custom_Data.md
│   ├── Intel.md
│   ├── LLaVA_Bench.md
│   ├── LLaVA_from_LLaMA2.md
│   ├── LoRA.md
│   ├── macOS.md
│   ├── MODEL_ZOO.md
│   ├── ScienceQA.md
│   └── Windows.md
├── llava/
│   ├── model/
│   │   ├── pruning_methods/              # Token compression methods
│   │   │   ├── config.py                 # Configuration management
│   │   │   ├── factory.py                # Model factory
│   │   │   ├── models.py                 # Compression model implementations
│   │   │   └── trainer_factory.py        # Trainer factory
│   │   ├── language_model/               # Language model implementations
│   │   │   ├── dart/                     # DART compression
│   │   │   ├── fastv/                    # FastV compression
│   │   │   ├── random/                   # Random compression
│   │   │   ├── llava_llama.py            # LLaVA LLaMA implementation
│   │   │   ├── llava_mistral.py          # LLaVA Mistral implementation
│   │   │   └── llava_mpt.py              # LLaVA MPT implementation
│   │   ├── multimodal_encoder/           # Multimodal encoder components
│   │   ├── multimodal_projector/         # Multimodal projector components
│   │   └── llava_arch.py                 # LLaVA architecture
│   ├── train/
│   │   ├── train_mem_KD_TCD.py           # Memory-efficient KD training
│   │   ├── train_TCD.py                  # TCD training script
│   │   ├── llava_trainer_KD_from_pretrain_*.py  # Specialized KD trainers
│   │   ├── llava_trainer_KD.py           # Knowledge distillation trainer
│   │   └── llava_trainer.py              # Base trainer
│   ├── eval/                             # Evaluation scripts
│   │   ├── eval_*.py                     # Various evaluation scripts
│   │   └── webpage/                      # Web-based evaluation interface
│   └── serve/                            # Model serving components
│       ├── cli.py
│       ├── controller.py
│       ├── gradio_web_server.py
│       ├── model_worker.py
│       └── sglang_worker.py
├── scripts/
│   ├── v1_5/
│   │   ├── eval/                         # Evaluation scripts
│   │   │   ├── gqa.sh
│   │   │   ├── mmbench.sh
│   │   │   ├── pope.sh
│   │   │   └── ...
│   │   ├── finetune_TCD.sh               # TCD fine-tuning script
│   │   └── multi_node_train.sh           # Multi-node training
│   ├── finetune*.sh                      # Various fine-tuning scripts
│   ├── pretrain*.sh                      # Pre-training scripts
│   └── zero3.json                        # DeepSpeed configuration
├── PRUNING_METHODS_REFACTOR.md           # Pruning methods documentation
├── pyproject.toml                        # Python project configuration
├── requirements.txt                      # Dependencies
└── README.md                             # This file
- Model Factory: llava/model/pruning_methods/factory.py - Creates models based on the compression method
- Trainer Factory: llava/model/pruning_methods/trainer_factory.py - Creates specialized trainers
- Configuration: llava/model/pruning_methods/config.py - Manages compression parameters
- TCD Training: llava/train/train_mem_KD_TCD.py - Main training entry point
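The factory pattern behind these components can be summarized with a toy registry. The names and stub classes below are stand-ins; the real dispatch lives in factory.py and trainer_factory.py.

# Toy registry illustrating factory-style dispatch on the pruning method;
# the stub classes stand in for the real compression model implementations.
PRUNING_METHODS = {}

def register_method(name):
    def wrap(cls):
        PRUNING_METHODS[name] = cls
        return cls
    return wrap

@register_method("dart")
class DartModelStub: ...

@register_method("fastv")
class FastVModelStub: ...

@register_method("random")
class RandomModelStub: ...

def build_model(pruning_method: str):
    # Look up and instantiate the class registered for the requested method.
    try:
        return PRUNING_METHODS[pruning_method]()
    except KeyError:
        raise ValueError(f"Unknown pruning method: {pruning_method!r}") from None

model = build_model("dart")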
Our analysis shows that:
- High ROI: Reducing tokens from 576 to 64 preserves most performance with significant efficiency gains
- Low ROI: Further compression below 64 tokens yields diminishing returns in speed and sharp accuracy drops
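As a rough illustration of why the 576-to-64 operating point is attractive, the snippet below estimates relative self-attention cost under the simplifying assumption that cost scales quadratically with sequence length (the 64-token text prompt is an arbitrary example):

def attention_cost_ratio(visual_tokens: int, text_tokens: int = 64, baseline_visual: int = 576) -> float:
    # Relative self-attention cost vs. the uncompressed baseline,
    # assuming cost ~ (sequence length)^2. Real speedups also depend on
    # the MLP blocks, KV cache, and hardware, so treat this as a rough guide.
    full = (baseline_visual + text_tokens) ** 2
    compressed = (visual_tokens + text_tokens) ** 2
    return compressed / full

for kept in (576, 256, 128, 64, 32, 16):
    print(f"{kept:>3} visual tokens kept -> ~{attention_cost_ratio(kept):.1%} of baseline attention cost")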
EPIC enables strong generalization across different token compression strategies: models trained with one compression method (e.g., DART) also perform well when other methods (e.g., FastV, Random) are used at inference time.
- Release checkpoints
- Release LCD implementation
- Release evaluation code
If you find this work useful, please cite our paper:
@article{wen2025efficient,
title={Efficient Multi-modal Large Language Models via Progressive Consistency Distillation},
author={Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen, Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and others},
journal={arXiv preprint arXiv:2510.00515},
year={2025}
}
We welcome contributions! Please feel free to submit issues and pull requests.
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file for details.
- LLaVA for the base multimodal framework
- DeepSpeed for efficient training
- Transformers for model implementations
- Email: [email protected]
- Project Page: https://zichenwen1.github.io/EPIC/
- Paper: arXiv:2510.00515