Releases · MzeroMiko/VMamba
VMamba v0 Segmentation checkpoints
Semantic Segmentation on ADE20K
| Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts |
|---|---|---|---|---|---|---|---|
| Vanilla-VMamba-T | 512x512 | 55M | - | UperNet@160k | 47.3 | 48.3 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-S | 512x512 | 76M | - | UperNet@160k | 49.5 | 50.5 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-B | 512x512 | 110M | - | UperNet@160k | 50.0 | 51.3 | config/log/log(ms)/ckpt |
VMamba v0 Detection checkpoints
Object Detection on COCO
| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla-VMamba-T | 42M | - | MaskRCNN@1x | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | - | MaskRCNN@1x | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-B | 96M | - | MaskRCNN@1x | 48.6 | 70.0 | 53.1 | 43.3 | 67.1 | 46.7 | config/log/ckpt |
| Vanilla-VMamba-T | 42M | - | MaskRCNN@3x | 48.5 | 70.0 | 52.7 | 43.2 | 66.9 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | - | MaskRCNN@3x | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | config/log/ckpt |
VMamba v0 Classification checkpoints
Checkpoints for VMamba (alias of vssm version 0)
These checkpoints correspond to the experiments done before 2024-01-19.
| name | pretrain | resolution | acc@1 | #params | FLOPs | best epoch | use ema | config |
|---|---|---|---|---|---|---|---|---|
| VMamba-T | ImageNet-1K | 224x224 | 82.2 | 22M | - | 292 | didn't add | config |
| VMamba-S | ImageNet-1K | 224x224 | 83.5 | 44M | - | 238 | true | config |
| VMamba-B | ImageNet-1K | 224x224 | 83.2 | 75M | - | 260 | didn't add | config |
| VMamba-B* | ImageNet-1K | 224x224 | 83.7 | 75M | - | 241 | true | config |
Most backbone models are trained without EMA, which does not enhance performance (cf. Swin-Transformer). We use EMA because our model is still under development and has not yet been hyperparameter-tuned.
The checkpoint used in object detection and segmentation is VMamba-B with drop path 0.5 and no EMA. VMamba-B* denotes VMamba-B with drop path 0.6 and EMA; its accuracy is 83.3 without EMA (epoch 262) and 83.7 with EMA (epoch 241).
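For context, EMA here means keeping an exponential moving average of the model weights during training and evaluating that averaged copy. A minimal sketch of the update step, assuming a PyTorch-style training loop (the `ModelEMA` class and the 0.9999 decay are illustrative assumptions, not the repository's exact implementation):

```python
import copy
import torch

class ModelEMA:
    """Minimal EMA-of-weights sketch; decay value is illustrative."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        # shadow copy holds the averaged weights (buffers are copied once here)
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # ema_w <- decay * ema_w + (1 - decay) * current_w, after each optimizer step
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```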
VMamba v2 Segmentation checkpoints
Semantic Segmentation on ADE20K
| Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts |
|---|---|---|---|---|---|---|---|
| VMamba-T[s2l5] | 512x512 | 62M | 948G | UperNet@160k | 48.3 | 48.6 | config/log/log(ms)/ckpt |
| VMamba-S[s2l15] | 512x512 | 82M | 1028G | UperNet@160k | 50.6 | 51.2 | config/log/log(ms)/ckpt |
| VMamba-B[s2l15] | 512x512 | 122M | 1170G | UperNet@160k | 51.0 | 51.6 | config/log/log(ms)/ckpt |
| VMamba-T[s1l8] | 512x512 | 62M | 949G | UperNet@160k | 47.9 | 48.8 | config/log/log(ms)/ckpt |
VMamba v2 Detection checkpoints
Object Detection on COCO
| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
|---|---|---|---|---|---|---|---|---|---|---|
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@1x | 47.4 | 69.5 | 52.0 | 42.7 | 66.3 | 46.0 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x[bs8] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt |
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@3x | 48.9 | 70.6 | 53.6 | 43.7 | 67.7 | 46.8 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.2 | 68.2 | 47.7 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.5 | 43.7 | 67.4 | 47.0 | config/log/ckpt |
- Models in this subsection are initialized from the models trained in classification.
- We now calculate FLOPs with the algorithm @albertgu provides (see the sketch below), which yields larger numbers than the previous calculation (which is based on the `selective_scan_ref` function and ignores the hardware-aware algorithm).
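For reference, a hedged sketch of what such a FLOPs estimate can look like. The `selective_scan_flops` helper is hypothetical; the `9 * B * L * D * N` core term follows the per-element accounting commonly attributed to @albertgu, but the exact bookkeeping here is an assumption, not the repository's counter:

```python
def selective_scan_flops(B: int, L: int, D: int, N: int,
                         with_D: bool = True, with_Z: bool = False) -> int:
    """Rough selective-scan FLOPs estimate (sketch, not the repo's exact counter).

    B: batch, L: sequence length, D: channel dim, N: state dim.
    """
    flops = 9 * B * L * D * N      # discretization + recurrence + output readout
    if with_D:
        flops += B * D * L         # skip connection via the D parameter (assumed)
    if with_Z:
        flops += B * D * L         # gating branch, if present (assumed)
    return flops

# e.g. one scan over a 224x224 image tokenized to L=3136 patches at D=96, N=16:
# selective_scan_flops(1, 3136, 96, 16)
```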
VMamba v2 Classification checkpoints
Classification on ImageNet-1K
| name | pretrain | resolution | acc@1 | #params | FLOPs | TP. | Train TP. | configs/logs/ckpts |
|---|---|---|---|---|---|---|---|---|
| VMamba-T[s2l5] | ImageNet-1K | 224x224 | 82.5 | 31M | 4.9G | 1340 | 464 | config/log/ckpt |
| VMamba-S[s2l15] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt |
| VMamba-B[s2l15] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt |
| VMamba-T[s1l8] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt |
| VMamba-S[s1l20] | ImageNet-1K | 224x224 | 83.3 | 49M | 8.6G | 1106 | 390 | config/log/ckpt |
| VMamba-B[s1l20] | ImageNet-1K | 224x224 | 83.8 | 87M | 15.2G | 827 | 313 | config/log/ckpt |
- Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for `drop_path_rate` and EMA. All models are trained with EMA except for the `Vanilla-VMamba-T`.
- `TP.` (throughput) and `Train TP.` (train throughput) are assessed on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128 (see the probe sketch below). `Train TP.` is tested with mixed resolution, excluding the time consumption of optimizers.
- `FLOPs` and `#params` are now gathered with the head (in previous versions, without the head, so the numbers rise a little).
- We calculate `FLOPs` with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (which is based on the `selective_scan_ref` function and ignores the hardware-aware algorithm).
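For reproducibility, a rough sketch of how such an inference-throughput number can be probed; `measure_throughput`, the warm-up count, and the iteration count are assumptions rather than the exact benchmarking script used for the table:

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model: torch.nn.Module, batch_size: int = 128,
                       resolution: int = 224, iters: int = 30) -> float:
    """Return images/s for inference on a single GPU (sketch)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device="cuda")
    for _ in range(10):              # warm-up to stabilize clocks and allocator
        model(x)
    torch.cuda.synchronize()         # make sure all queued kernels finished
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)
```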
Checkpoints for nightly builds!
| name | pretrain | resolution | acc@1 | #params | FLOPs | best epoch | use ema | config |
|---|---|---|---|---|---|---|---|---|
| VMamba-T | ImageNet-1K | 224x224 | 82.5 | 32M | 5G | 258 | true | config |
We use EMA because our model is still under development and has not yet been hyperparameter-tuned.
This is a pre-release
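For anyone consuming these release assets, a hypothetical loading sketch; the checkpoint path, the `load_release_checkpoint` helper, and the `"model"` key are placeholders, not the repository's confirmed API:

```python
import torch
from torch import nn

def load_release_checkpoint(model: nn.Module, path: str) -> nn.Module:
    """Load a released ckpt into an already-constructed backbone (sketch)."""
    state = torch.load(path, map_location="cpu")
    # release checkpoints often nest the weights under a "model" key; handle both
    state = state.get("model", state)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```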