[CVPR 2025 Highlight] Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Xin Zhang, Robby T. Tan
National University of Singapore
CVPR 2025
[Project Page] [Paper]
The requirements can be installed with:
conda create -n mfuser python=3.9 numpy=1.26.4
conda activate mfuser
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
pip install xformers==0.0.20
pip install mmcv-full==1.5.1
pip install mamba_ssm==2.2.2
pip install causal_conv1d==1.4.0
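To confirm the environment is usable before moving on, a quick import check like the one below can help (a minimal sketch, assuming the `mfuser` environment above is active):

```python
# Minimal environment sanity check for the mfuser setup above.
import torch
import mmcv
import xformers
import mamba_ssm        # Mamba selective-scan ops
import causal_conv1d    # causal conv kernels used by mamba_ssm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv:", mmcv.__version__)
```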
Please download the pre-trained VFM and VLM models and save them in the `./pretrained` folder.

| Model | Type | Link |
| --- | --- | --- |
| DINOv2 | dinov2_vitl14_pretrain.pth | download link |
| CLIP | ViT-L-14-336px.pt | download link |
| EVA02-CLIP | EVA02_CLIP_L_336_psz14_s6B.pt | download link |
| SIGLIP | siglip_vitl16_384.pth | download link |
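Before training, it can be handy to confirm that all four checkpoints from the table above are in place. The snippet below is a small sketch that assumes the folder is `./pretrained` and uses the filenames listed in the table:

```python
# Check that the VFM/VLM checkpoints listed above exist under ./pretrained.
from pathlib import Path

pretrained = Path("pretrained")
expected = [
    "dinov2_vitl14_pretrain.pth",     # DINOv2
    "ViT-L-14-336px.pt",              # CLIP
    "EVA02_CLIP_L_336_psz14_s6B.pt",  # EVA02-CLIP
    "siglip_vitl16_384.pth",          # SIGLIP
]
missing = [name for name in expected if not (pretrained / name).is_file()]
print("All pretrained checkpoints found." if not missing else f"Missing: {missing}")
```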
You can download the MFuser model checkpoints and save them in the `./work_dirs_d` folder. By default, all experiments below use DINOv2-L as the VFM.

| Model | Pretrained | Trained on | Config | Link |
| --- | --- | --- | --- | --- |
| mfuser-clip-vit-l-city | CLIP | Cityscapes | config | download link |
| mfuser-clip-vit-l-gta | CLIP | GTA5 | config | download link |
| mfuser-eva02-clip-vit-l-city | EVA02-CLIP | Cityscapes | config | download link |
| mfuser-eva02-clip-vit-l-gta | EVA02-CLIP | GTA5 | config | download link |
| mfuser-siglip-vit-l-city | SIGLIP | Cityscapes | config | download link |
| mfuser-siglip-vit-l-gta | SIGLIP | GTA5 | config | download link |
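To sanity-check a downloaded checkpoint before evaluation, it can be loaded on CPU and inspected. The sketch below uses the same `[MODEL]` placeholder as the evaluation command further down, and the `state_dict` unwrapping is an assumption based on the usual mmcv-style checkpoint format:

```python
# Inspect a downloaded MFuser checkpoint without needing a GPU.
import torch

ckpt = torch.load("work_dirs_d/[MODEL]", map_location="cpu")  # replace [MODEL] with the real filename
# mmcv-style checkpoints usually wrap the weights in a "state_dict" entry (assumption).
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print("top-level keys:", list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
print("number of weight tensors:", len(state_dict))
```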
To set up the datasets, please follow the instructions in the official TLDR repo.
After downloading the datasets, edit the data folder root in the dataset config files to match your environment:
src_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
tgt_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
The final folder structure should look like this:
MFuser
├── ...
├── pretrained
│ ├── dinov2_vitl14_pretrain.pth
│ ├── EVA02_CLIP_L_336_psz14_s6B.pt
│ ├── siglip_vitl16_384.pth
│ ├── ViT-L-14-336px.pt
├── data
│ ├── cityscapes
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── bdd100k
│ │ ├── images
│ │ │ ├── 10k
│ │ │ │ ├── train
│ │ │ │ ├── val
│ │ ├── labels
│ │ │ ├── sem_seg
│ │ │ │ ├── masks
│ │ │ │ │ ├── train
│ │ │ │ │ ├── val
│ ├── mapillary
│ │ ├── training
│ │ ├── cityscapes_trainIdLabel
│ │ ├── half
│ │ │ ├── val_img
│ │ │ ├── val_label
│ ├── gta
│ │ ├── images
│ │ ├── labels
├── ...
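A quick way to catch dataset-path mistakes early is to check a few of the sub-folders above before launching training. The sketch below assumes the data root is `./data` and should be adjusted to match `data_root` in your configs:

```python
# Verify that the expected dataset layout exists under the data root.
from pathlib import Path

data_root = Path("data")  # adjust to match data_root in the dataset configs
expected_dirs = [
    "cityscapes/leftImg8bit/train", "cityscapes/gtFine/train",
    "bdd100k/images/10k/val", "bdd100k/labels/sem_seg/masks/val",
    "mapillary/half/val_img", "mapillary/half/val_label",
    "gta/images", "gta/labels",
]
for rel in expected_dirs:
    status = "ok" if (data_root / rel).is_dir() else "MISSING"
    print(f"{status:8s} {data_root / rel}")
```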
Run the training:
python train.py configs/[TRAIN_CONFIG]
Run the evaluation:
python test.py configs/[TEST_CONFIG] work_dirs_d/[MODEL] --eval mIoU
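Domain-generalized models are typically evaluated on several unseen target datasets, so it can be convenient to loop the command above over multiple test configs. This is only a sketch; the config names are placeholders to be replaced with the actual files under `configs/`:

```python
# Evaluate one checkpoint on several target-dataset configs in sequence.
import subprocess

checkpoint = "work_dirs_d/[MODEL]"        # placeholder, as in the command above
test_configs = [                          # placeholders: substitute the real config filenames
    "configs/[TEST_CONFIG_CITYSCAPES]",
    "configs/[TEST_CONFIG_BDD100K]",
    "configs/[TEST_CONFIG_MAPILLARY]",
]
for cfg in test_configs:
    subprocess.run(["python", "test.py", cfg, checkpoint, "--eval", "mIoU"], check=True)
```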
If you find our code helpful, please cite our paper:
@inproceedings{zhang2025mamba,
  title     = {Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation},
  author    = {Zhang, Xin and Tan, Robby T.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}

This project is based on the following open-source projects. We thank the authors for sharing their code.