MAE-Lite (IJCV 2025)
News | Introduction | Getting Started | Main Results | Citation | Acknowledge
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
Jin Gao, Shubo Lin, Shaoru Wang*, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
IJCV 2025
A Closer Look at Self-Supervised Lightweight Vision Transformers
Shaoru Wang, Jin Gao*, Zeming Li, Xiaoqin Zhang, Weiming Hu
ICML 2023
- 2024.12: Our extended version is accepted by IJCV 2025!
- 2023.5: Code & models are released!
- 2023.4: Our paper is accepted by ICML 2023!
- 2022.5: The initial version of our paper was published on arXiv.
MAE-Lite focuses on exploring the pre-training of lightweight Vision Transformers (ViTs). This repo provides the code and models for the studies in the papers.
- We provide advanced pre-training (based on MAE) and fine-tuning recipes for lightweight ViTs and demonstrate that even a vanilla lightweight ViT (e.g., ViT-Tiny) beats most previous SOTA ConvNets and ViT derivatives with delicate network architecture designs. We achieve 79.0% top-1 accuracy on ImageNet with a vanilla ViT-Tiny (5.7M parameters).
- We provide code for the transfer evaluation of pre-trained models on several classification tasks (e.g., Oxford 102 Flower, Oxford-IIIT Pet, FGVC Aircraft, CIFAR, etc.) and COCO detection tasks (based on ViTDet). We find that the self-supervised pre-trained ViTs work worse than the supervised pre-trained ones on data-insufficient downstream tasks.
- We provide code for the analysis tools used in the paper to examine the layer representations and attention distance & entropy for the ViTs.
- We provide code and models for our proposed knowledge distillation method for MAE-based pre-training of lightweight ViTs, which shows superior transfer performance on data-insufficient classification tasks and dense prediction tasks.
Update (2025.02.28)
- We provide benchmarks for more masked image modeling (MIM) pre-training methods (BEiT, BootMAE, MaskFeat) on lightweight ViTs and evaluate their transferability to downstream tasks.
- We provide code and models for our decoupled distillation method during pre-training and transfer it to more dense prediction tasks, including detection, tracking, and semantic segmentation. This enables SOTA performance on ADE20K segmentation (42.8% mIoU) and LaSOT tracking (66.1% AUC) in the lightweight regime; the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
- We extend our distillation method to hierarchical ViTs (Swin and Hiera), validating its generalizability and effectiveness following our observation-analysis-solution flow.
Set up the conda environment:
# Create environment
conda create -n mae-lite python=3.7 -y
conda activate mae-lite
# Install requirements
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch -y
# Clone MAE-Lite
git clone https://github.com/wangsr126/mae-lite.git
cd mae-lite
# Install other requirements
pip3 install -r requirements.txt
python3 setup.py build develop --user
Prepare the ImageNet data in <BASE_FOLDER>/data/imagenet/imagenet_train and <BASE_FOLDER>/data/imagenet/imagenet_val.
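For reference, a minimal sketch of preparing these paths, assuming a standard ImageFolder-style layout (one sub-directory per class); the ILSVRC2012 source paths below are placeholders:

```bash
# Link an existing ImageNet copy into the locations expected above (paths are illustrative).
mkdir -p <BASE_FOLDER>/data/imagenet
ln -s /path/to/ILSVRC2012/train <BASE_FOLDER>/data/imagenet/imagenet_train
ln -s /path/to/ILSVRC2012/val <BASE_FOLDER>/data/imagenet/imagenet_val
```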
To pre-train ViT-Tiny with our recommended MAE recipe:
# batch size 4096 on 8 GPUs:
cd projects/mae_lite
ssl_train -b 4096 -d 0-7 -e 400 -f mae_lite_exp.py --amp \
--exp-options exp_name=mae_lite/mae_tiny_400e
Please download the pre-trained models, e.g., download MAE-Tiny to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar.
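As a sketch of that step (the download link is in the results table below; the local source path is a placeholder):

```bash
# Put the downloaded pre-trained checkpoint where the following commands expect it.
mkdir -p <BASE_FOLDER>/checkpoints
mv /path/to/download/mae_tiny_400e.pth.tar <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar
```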
To fine-tune with the improved recipe:
# batch size 1024 on 8 GPUs:
cd projects/eval_tools
ssl_train -b 1024 -d 0-7 -e 300 -f finetuning_exp.py --amp \
[--ckpt <checkpoint-path>] --exp-options pretrain_exp_name=mae_lite/mae_tiny_400e
<checkpoint-path>: if set to <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar, it will be loaded as the initialization; if not set, the checkpoint at <BASE_FOLDER>/outputs/mae_lite/mae_tiny_400e/last_epoch_ckpt.pth.tar will be loaded automatically.
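For example, a sketch of fine-tuning from the downloaded MAE-Tiny checkpoint explicitly (identical to the command above, with --ckpt filled in):

```bash
cd projects/eval_tools
ssl_train -b 1024 -d 0-7 -e 300 -f finetuning_exp.py --amp \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e
```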
To evaluate MAE-Tiny-FT, download it to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar and run:
# batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_300e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_eval
You should get "Top1: 77.978" if everything is set up correctly.
To evaluate MAE-Tiny-FT-RPE, download it to <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar and run:
# batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval
You should get "Top1: 79.002" if everything is set up correctly.
To evaluate MAE-Tiny-Distill-D²-FT-RPE, download it to <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar and run:
# batch size 1024 on 1 GPU:
python mae_lite/tools/eval.py -b 1024 -d 0 -f projects/eval_tools/finetuning_rpe_exp.py \
--ckpt <BASE_FOLDER>/checkpoints/mae_tiny_distill_d2_400e_ft_rpe_1000e.pth.tar \
--exp-options pretrain_exp_name=mae_lite/mae_tiny_400e/ft_rpe_eval qv_bias=False
You should get "Top1: 79.444" if everything is set up correctly.
For more experiments and tools:
- Pre-training with distillation: please refer to DISTILL.md.
- Transfer evaluation on classification tasks: please refer to TRANSFER.md.
- Transfer to COCO detection: please refer to DETECTION.md.
- Transfer to tracking: please refer to TRACKING.md.
- Transfer to semantic segmentation: please refer to SEGMENTATION.md.
- MoCo-v3 pre-training: please refer to MOCOV3.md.
- Visualization and analysis tools: please refer to VISUAL.md.
Main results of ViT-Tiny on ImageNet:
| pre-train code | pre-train epochs | fine-tune recipe | fine-tune epochs | top-1 acc. (%) | ckpt |
|---|---|---|---|---|---|
| - | - | impr. | 300 | 75.8 | link |
| mae_lite | 400 | - | - | - | link |
| mae_lite | 400 | impr. | 300 | 78.0 | link |
| mae_lite | 400 | impr.+RPE | 1000 | 79.0 | link |
| mae_lite_distill | 400 | - | - | - | link |
| mae_lite_distill | 400 | impr. | 300 | 78.4 | link |
| mae_lite_d2_distill | 400 | - | - | - | link |
| mae_lite_d2_distill | 400 | impr. | 300 | 78.7 | link |
| mae_lite_d2_distill | 400 | impr.+RPE | 1000 | 79.4 | link |
Please cite the following papers if this repo helps your research:
@article{wang2023closer,
  title={A Closer Look at Self-Supervised Lightweight Vision Transformers},
  author={Shaoru Wang and Jin Gao and Zeming Li and Xiaoqin Zhang and Weiming Hu},
  journal={arXiv preprint arXiv:2205.14443},
  year={2023}
}
@article{gao2025experimental,
  title={An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training},
  author={Jin Gao and Shubo Lin and Shaoru Wang and Yutong Kou and Zeming Li and Liang Li and Congxuan Zhang and Xiaoqin Zhang and Yizheng Wang and Weiming Hu},
  journal={International Journal of Computer Vision},
  year={2025},
  doi={10.1007/s11263-024-02327-w},
  publisher={Springer}
}
We thank the authors of timm, MAE, and MoCo-v3 for their code implementations.
This repo is released under the Apache 2.0 license. Please see the LICENSE file for more information.