[CVPR 2025 Highlight] Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Xin Zhang, Robby T. Tan
National University of Singapore
CVPR 2025
[Project Page] [Paper]
The requirements can be installed with:
conda create -n mfuser python=3.9 numpy=1.26.4
conda activate mfuser
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
pip install xformers==0.0.20
pip install mmcv-full==1.5.1
pip install mamba_ssm==2.2.2
pip install causal_conv1d==1.4.0
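To confirm the environment is usable before moving on, a quick import check like the one below can help (a minimal sketch, assuming the `mfuser` environment above is active):

```python
# Minimal environment sanity check for the mfuser setup above.
import torch
import mmcv
import xformers
import mamba_ssm        # Mamba selective-scan ops
import causal_conv1d    # causal conv kernels used by mamba_ssm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__)
```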
Please download the pre-trained VFM and VLM models and save them in the `./pretrained` folder.

| Model | Type | Link |
| --- | --- | --- |
| DINOv2 | dinov2_vitl14_pretrain.pth | download link |
| CLIP | ViT-L-14-336px.pt | download link |
| EVA02-CLIP | EVA02_CLIP_L_336_psz14_s6B.pt | download link |
| SIGLIP | siglip_vitl16_384.pth | download link |
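Before training, it can be handy to confirm that all four checkpoints from the table above are in place. The snippet below is a small sketch that assumes the folder is `./pretrained` and uses the filenames listed in the table:

```python
# Check that the VFM/VLM checkpoints listed above exist under ./pretrained.
from pathlib import Path

pretrained = Path("pretrained")
expected = [
    "dinov2_vitl14_pretrain.pth",     # DINOv2
    "ViT-L-14-336px.pt",              # CLIP
    "EVA02_CLIP_L_336_psz14_s6B.pt",  # EVA02-CLIP
    "siglip_vitl16_384.pth",          # SIGLIP
]
missing = [name for name in expected if not (pretrained / name).is_file()]
print("All pretrained checkpoints found." if not missing else f"Missing: {missing}")
```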
You can download the MFuser model checkpoints and save them in the `./work_dirs_d` folder. By default, all experiments below use DINOv2-L as the VFM.

| Model | Pretrained | Trained on | Config | Link |
| --- | --- | --- | --- | --- |
| mfuser-clip-vit-l-city | CLIP | Cityscapes | config | download link |
| mfuser-clip-vit-l-gta | CLIP | GTA5 | config | download link |
| mfuser-eva02-clip-vit-l-city | EVA02-CLIP | Cityscapes | config | download link |
| mfuser-eva02-clip-vit-l-gta | EVA02-CLIP | GTA5 | config | download link |
| mfuser-siglip-vit-l-city | SIGLIP | Cityscapes | config | download link |
| mfuser-siglip-vit-l-gta | SIGLIP | GTA5 | config | download link |
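To sanity-check a downloaded checkpoint before evaluation, it can be loaded on CPU and inspected. The sketch below uses the same `[MODEL]` placeholder as the evaluation command further down, and the `state_dict` unwrapping is an assumption based on the usual mmcv-style checkpoint format:

```python
# Inspect a downloaded MFuser checkpoint without needing a GPU.
import torch

ckpt = torch.load("work_dirs_d/[MODEL]", map_location="cpu")  # replace [MODEL] with the real filename
# mmcv-style checkpoints usually wrap the weights in a "state_dict" entry (assumption).
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print("top-level keys:", list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
print("number of weight tensors:", len(state_dict))
```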
To set up the datasets, please follow the instructions in the official TLDR repo.
After downloading the datasets, edit the data folder root in the dataset config files to match your environment:
src_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
tgt_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
The final folder structure should look like this:
MFuser
├── ...
├── pretrained
│ ├── dinov2_vitl14_pretrain.pth
│ ├── EVA02_CLIP_L_336_psz14_s6B.pt
│ ├── siglip_vitl16_384.pth
│ ├── ViT-L-14-336px.pt
├── data
│ ├── cityscapes
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── bdd100k
│ │ ├── images
│ │ │ ├── 10k
│ │ │ │ ├── train
│ │ │ │ ├── val
│ │ ├── labels
│ │ │ ├── sem_seg
│ │ │ │ ├── masks
│ │ │ │ │ ├── train
│ │ │ │ │ ├── val
│ ├── mapillary
│ │ ├── training
│ │ ├── cityscapes_trainIdLabel
│ │ ├── half
│ │ │ ├── val_img
│ │ │ ├── val_label
│ ├── gta
│ │ ├── images
│ │ ├── labels
├── ...
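A quick way to catch dataset-path mistakes early is to check a few of the sub-folders above before launching training. The sketch below assumes the data root is `./data` and should be adjusted to match `data_root` in your configs:

```python
# Verify that the expected dataset layout exists under the data root.
from pathlib import Path

data_root = Path("data")  # adjust to match data_root in the dataset configs
expected_dirs = [
    "cityscapes/leftImg8bit/train", "cityscapes/gtFine/train",
    "bdd100k/images/10k/val", "bdd100k/labels/sem_seg/masks/val",
    "mapillary/half/val_img", "mapillary/half/val_label",
    "gta/images", "gta/labels",
]
for rel in expected_dirs:
    status = "ok" if (data_root / rel).is_dir() else "MISSING"
    print(f"{status:8s} {data_root / rel}")
```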
Run the training:
python train.py configs/[TRAIN_CONFIG]
Run the evaluation:
python test.py configs/[TEST_CONFIG] work_dirs_d/[MODEL] --eval mIoU
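Domain-generalized models are typically evaluated on several unseen target datasets, so it can be convenient to loop the command above over multiple test configs. This is only a sketch; the config names are placeholders to be replaced with the actual files under `configs/`:

```python
# Evaluate one checkpoint on several target-dataset configs in sequence.
import subprocess

checkpoint = "work_dirs_d/[MODEL]"        # placeholder, as in the command above
test_configs = [                          # placeholders: substitute the real config filenames
    "configs/[TEST_CONFIG_CITYSCAPES]",
    "configs/[TEST_CONFIG_BDD100K]",
    "configs/[TEST_CONFIG_MAPILLARY]",
]
for cfg in test_configs:
    subprocess.run(["python", "test.py", cfg, checkpoint, "--eval", "mIoU"], check=True)
```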
If you find our code helpful, please cite our paper:
@inproceedings{zhang2025mamba,
  title     = {Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation},
  author    = {Zhang, Xin and Tan, Robby T.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}

This project is based on the following open-source projects. We thank the authors for sharing their code.