Official PyTorch implementation of MViTv2, from the following paper:
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. CVPR 2022.
Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*
MViT is a multiscale transformer which serves as a general vision backbone for different visual recognition tasks:
- Image Classification: Included in this repo.
- Object Detection and Instance Segmentation: See MViTv2 in Detectron2.
- Video Action Recognition and Detection: See MViTv2 in PySlowFast.
Models trained on ImageNet-1K:

| name | resolution | acc@1 | #params | FLOPs | ImageNet-1K model |
|---|---|---|---|---|---|
| MViTv2-T | 224x224 | 82.3 | 24M | 4.7G | model | 
| MViTv2-S | 224x224 | 83.6 | 35M | 7.0G | model | 
| MViTv2-B | 224x224 | 84.4 | 52M | 10.2G | model | 
| MViTv2-L | 224x224 | 85.3 | 218M | 42.1G | model | 

Models pretrained on ImageNet-21K:

| name | resolution | acc@1 | #params | FLOPs | ImageNet-21K model | ImageNet-1K model |
|---|---|---|---|---|---|---|
| MViTv2-B | 224x224 | - | 52M | 10.2G | model | - | 
| MViTv2-L | 224x224 | 87.5 | 218M | 42.1G | model | - | 
| MViTv2-H | 224x224 | 88.0 | 667M | 120.6G | model | - | 
Please check INSTALL.md for installation instructions.
To train a standard MViTv2 model from scratch:
python tools/main.py \
  --cfg configs/MViTv2_T.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 256
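
The arguments after `--cfg` come in `KEY VALUE` pairs, where a dotted key such as `TRAIN.BATCH_SIZE` names a nested option in the YAML config. The sketch below illustrates how such pairs could be merged into a nested config. It is a simplified stand-in using a plain dict, not the repository's actual config code; the helper `apply_overrides` and the default values are hypothetical.

```python
import ast


def apply_overrides(cfg, opts):
    """Apply a flat [KEY, VALUE, KEY, VALUE, ...] list to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for dotted_key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # raises KeyError if the section is unknown
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        try:
            value = ast.literal_eval(raw)  # "8" -> 8, "True" -> True
        except (ValueError, SyntaxError):
            value = raw  # keep plain strings such as paths
        node[leaf] = value
    return cfg


# Hypothetical defaults, not the repository's actual values.
cfg = {
    "NUM_GPUS": 1,
    "DATA": {"PATH_TO_DATA_DIR": ""},
    "TRAIN": {"BATCH_SIZE": 64},
}

# The same overrides as in the training command above.
opts = [
    "DATA.PATH_TO_DATA_DIR", "path_to_your_dataset",
    "NUM_GPUS", "8",
    "TRAIN.BATCH_SIZE", "256",
]
apply_overrides(cfg, opts)
print(cfg)
```

Rejecting unknown keys, as this sketch does, means a typo such as `TRAIN.BATCHSIZE` fails immediately instead of being silently ignored.
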
To evaluate a pretrained MViT model:
python tools/main.py \
  --cfg configs/test/MViTv2_T_test.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TEST.BATCH_SIZE 256
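
Evaluation results correspond to the acc@1 column in the tables above, assuming it denotes top-1 accuracy in percent. As a minimal sketch of that metric, the pure-Python function below computes top-k accuracy from per-class scores. The function name `topk_accuracy` and the toy scores are illustrative and are not taken from the repository.

```python
def topk_accuracy(scores, labels, k=1):
    """Return the percentage of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
    [0.6, 0.3, 0.1],  # highest score: class 0
]
labels = [1, 0, 1, 2]

acc1 = topk_accuracy(scores, labels, k=1)
acc2 = topk_accuracy(scores, labels, k=2)
print(acc1, acc2)  # 50.0 75.0
```
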
This repository is built on PySlowFast.
MViT is released under the Apache 2.0 license.
If you find this repository helpful, please consider citing:
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}
@inproceedings{fan2021multiscale,
  title={Multiscale vision transformers},
  author={Fan, Haoqi and Xiong, Bo and Mangalam, Karttikeya and Li, Yanghao and Yan, Zhicheng and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={ICCV},
  year={2021}
}