Releases · MzeroMiko/VMamba
VMamba v0 Segmentation checkpoints
Semantic Segmentation on ADE20K
| Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts |
|---|---|---|---|---|---|---|---|
| Vanilla-VMamba-T | 512x512 | 55M | - | UperNet@160k | 47.3 | 48.3 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-S | 512x512 | 76M | - | UperNet@160k | 49.5 | 50.5 | config/log/log(ms)/ckpt |
| Vanilla-VMamba-B | 512x512 | 110M | - | UperNet@160k | 50.0 | 51.3 | config/log/log(ms)/ckpt |
VMamba v0 Detection checkpoints
Object Detection on COCO
| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla-VMamba-T | 42M | - | MaskRCNN@1x | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | - | MaskRCNN@1x | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-B | 96M | - | MaskRCNN@1x | 48.6 | 70.0 | 53.1 | 43.3 | 67.1 | 46.7 | config/log/ckpt |
| Vanilla-VMamba-T | 42M | - | MaskRCNN@3x | 48.5 | 70.0 | 52.7 | 43.2 | 66.9 | 46.4 | config/log/ckpt |
| Vanilla-VMamba-S | 64M | - | MaskRCNN@3x | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | config/log/ckpt |
VMamba v0 Classification checkpoints
Checkpoints for VMamba (alias of vssm version 0)
These checkpoints correspond to the experiments done before 2024-01-19.
| name | pretrain | resolution | acc@1 | #params | FLOPs | best epoch | use ema | config |
|---|---|---|---|---|---|---|---|---|
| VMamba-T | ImageNet-1K | 224x224 | 82.2 | 22M | - | 292 | didn't add | config |
| VMamba-S | ImageNet-1K | 224x224 | 83.5 | 44M | - | 238 | true | config |
| VMamba-B | ImageNet-1K | 224x224 | 83.2 | 75M | - | 260 | didn't add | config |
| VMamba-B* | ImageNet-1K | 224x224 | 83.7 | 75M | - | 241 | true | config |
Most backbone models are trained without EMA, which does not enhance performance (cf. Swin-Transformer). We use EMA because our model is still under development and has not yet been hyperparameter-tuned.
The checkpoint used in object detection and segmentation is VMamba-B with drop path 0.5 and no EMA. VMamba-B* denotes VMamba-B with drop path 0.6 and EMA; its accuracy is 83.3 without EMA (epoch 262) and 83.7 with EMA (epoch 241).
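For context, EMA here means keeping an exponential moving average of the model weights during training and evaluating that averaged copy. A minimal sketch of the update step, assuming a PyTorch-style training loop (the `ModelEMA` class and the 0.9999 decay are illustrative assumptions, not the repository's exact implementation):

```python
import copy
import torch

class ModelEMA:
    """Minimal EMA-of-weights sketch; decay value is illustrative."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        # shadow copy holds the averaged weights (buffers are copied once here)
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # ema_w <- decay * ema_w + (1 - decay) * current_w, after each optimizer step
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```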
VMamba v2 Segmentation checkpoints
Semantic Segmentation on ADE20K
| Backbone | Input | #params | FLOPs | Segmentor | mIoU(SS) | mIoU(MS) | configs/logs/logs(ms)/ckpts |
|---|---|---|---|---|---|---|---|
| VMamba-T[s2l5] | 512x512 | 62M | 948G | UperNet@160k | 48.3 | 48.6 | config/log/log(ms)/ckpt |
| VMamba-S[s2l15] | 512x512 | 82M | 1028G | UperNet@160k | 50.6 | 51.2 | config/log/log(ms)/ckpt |
| VMamba-B[s2l15] | 512x512 | 122M | 1170G | UperNet@160k | 51.0 | 51.6 | config/log/log(ms)/ckpt |
| VMamba-T[s1l8] | 512x512 | 62M | 949G | UperNet@160k | 47.9 | 48.8 | config/log/log(ms)/ckpt |
VMamba v2 Detection checkpoints
Object Detection on COCO
| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
|---|---|---|---|---|---|---|---|---|---|---|
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@1x | 47.4 | 69.5 | 52.0 | 42.7 | 66.3 | 46.0 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x[bs8] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt |
| VMamba-T[s2l5] | 50M | 270G | MaskRCNN@3x | 48.9 | 70.6 | 53.6 | 43.7 | 67.7 | 46.8 | config/log/ckpt |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.2 | 68.2 | 47.7 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.5 | 43.7 | 67.4 | 47.0 | config/log/ckpt |
- Models in this subsection are initialized from the models trained in classification.
- We now calculate FLOPs with the algorithm @albertgu provides (see the sketch below), which yields larger numbers than the previous calculation (which is based on the `selective_scan_ref` function and ignores the hardware-aware algorithm).
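For reference, a hedged sketch of what such a FLOPs estimate can look like. The `selective_scan_flops` helper is hypothetical; the `9 * B * L * D * N` core term follows the per-element accounting commonly attributed to @albertgu, but the exact bookkeeping here is an assumption, not the repository's counter:

```python
def selective_scan_flops(B: int, L: int, D: int, N: int,
                         with_D: bool = True, with_Z: bool = False) -> int:
    """Rough selective-scan FLOPs estimate (sketch, not the repo's exact counter).

    B: batch, L: sequence length, D: channel dim, N: state dim.
    """
    flops = 9 * B * L * D * N      # discretization + recurrence + output readout
    if with_D:
        flops += B * D * L         # skip connection via the D parameter (assumed)
    if with_Z:
        flops += B * D * L         # gating branch, if present (assumed)
    return flops

# e.g. one scan over a 224x224 image tokenized to L=3136 patches at D=96, N=16:
# selective_scan_flops(1, 3136, 96, 16)
```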
VMamba v2 Classification checkpoints
Classification on ImageNet-1K
| name | pretrain | resolution | acc@1 | #params | FLOPs | TP. | Train TP. | configs/logs/ckpts |
|---|---|---|---|---|---|---|---|---|
| VMamba-T[s2l5] | ImageNet-1K | 224x224 | 82.5 | 31M | 4.9G | 1340 | 464 | config/log/ckpt |
| VMamba-S[s2l15] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt |
| VMamba-B[s2l15] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt |
| VMamba-T[s1l8] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt |
| VMamba-S[s1l20] | ImageNet-1K | 224x224 | 83.3 | 49M | 8.6G | 1106 | 390 | config/log/ckpt |
| VMamba-B[s1l20] | ImageNet-1K | 224x224 | 83.8 | 87M | 15.2G | 827 | 313 | config/log/ckpt |
- Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for `drop_path_rate` and EMA. All models are trained with EMA except for the `Vanilla-VMamba-T`.
- `TP.` (throughput) and `Train TP.` (train throughput) are assessed on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128 (see the probe sketch below). `Train TP.` is tested with mixed resolution, excluding the time consumption of optimizers.
- `FLOPs` and `#params` are now gathered with the head (in previous versions, without the head, so the numbers rise a little).
- We calculate `FLOPs` with the algorithm @albertgu provides, which yields larger numbers than the previous calculation (which is based on the `selective_scan_ref` function and ignores the hardware-aware algorithm).
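For reproducibility, a rough sketch of how such an inference-throughput number can be probed; `measure_throughput`, the warm-up count, and the iteration count are assumptions rather than the exact benchmarking script used for the table:

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model: torch.nn.Module, batch_size: int = 128,
                       resolution: int = 224, iters: int = 30) -> float:
    """Return images/s for inference on a single GPU (sketch)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device="cuda")
    for _ in range(10):              # warm-up to stabilize clocks and allocator
        model(x)
    torch.cuda.synchronize()         # make sure all queued kernels finished
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)
```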
Checkpoints for nightly builds!
| name | pretrain | resolution | acc@1 | #params | FLOPs | best epoch | use ema | config |
|---|---|---|---|---|---|---|---|---|
| VMamba-T | ImageNet-1K | 224x224 | 82.5 | 32M | 5G | 258 | true | config |
We use EMA because our model is still under development and has not yet been hyperparameter-tuned.
This is a pre-release
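For anyone consuming these release assets, a hypothetical loading sketch; the checkpoint path, the `load_release_checkpoint` helper, and the `"model"` key are placeholders, not the repository's confirmed API:

```python
import torch
from torch import nn

def load_release_checkpoint(model: nn.Module, path: str) -> nn.Module:
    """Load a released ckpt into an already-constructed backbone (sketch)."""
    state = torch.load(path, map_location="cpu")
    # release checkpoints often nest the weights under a "model" key; handle both
    state = state.get("model", state)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```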