Our SMoE-Stereo framework fuses Vision Foundation Models (VFMs) with a Selective-MoE design to unlock robust stereo matching at minimal computational cost. Its standout features are:
- Our SMoE dynamically selects the most suitable experts for each input, adapting to varying input characteristics and thereby generalizing seamlessly to diverse "in-the-wild" scenes and domain shifts.
- Unlike existing stereo matching methods that apply the same rigid, sequential processing pipeline to every input, SMoE-Stereo allocates computation intelligently, engaging only the most relevant MoEs for simpler scenes. This adaptive architecture balances accuracy and processing speed according to the available resources.
- Remarkably, despite being trained exclusively on standard datasets (the KITTI 2012/2015, Middlebury, and ETH3D training splits) without any additional data, SMoE-Stereo achieves top ranking on the Robust Vision Challenge (RVC) leaderboards.
Exciting Update! Our framework now comprehensively supports mainstream PEFT strategies for stereo matching, including:
- Visual Prompt Tuning (ECCV 2022)
- LoRA (ICLR 2022)
- AdaptFormer (NeurIPS 2022)
- Adapter Tuning (ECCV 2024)
- LoRA MoE, Adapter MoE
- Our SMoE strategy
Additionally, the framework is compatible with multiple leading vision foundation models (VFMs), including SAM, DINOv2, Depth Anything (DAM), and Depth Anything V2 (DAMv2).
All of these models can leverage our PEFT implementation for enhanced performance and flexibility. Please choose the model variant you want!
Example arguments:
parser.add_argument('--peft_type', default='smoe', choices=["lora", "smoe", "adapter", "tuning", "vpt", "ff"], type=str)
parser.add_argument('--vfm_type', default='damv2', choices=["sam", "dam", "damv2", "dinov2"], type=str)
parser.add_argument('--vfm_size', default='vitl', choices=["vitb", "vits", "vitl"], type=str)
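For example, the RVC checkpoint could be evaluated with the SMoE adapter on a Depth Anything V2 ViT-L backbone using a command like the one below (shown for illustration, assuming these flags are exposed by evaluate_stereo.py as in the snippet above):

python evaluate_stereo.py --resume ./pretrained/SMoEStereo_RVC.pth --eval_dataset kitti --peft_type smoe --vfm_type damv2 --vfm_size vitl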
- Upload the ViT-small weights of SMoE-Stereo.
- Add the SMoE-IGEV backbone.
- Add the KITTI demo.mp4.
We use RAFT-Stereo as our backbone and replace its feature extractor with VFMs, while keeping the remaining structure unchanged.
Our MoE modules and the experts within each MoE layer can be selectively activated to adapt to different input characteristics, facilitating scene-specific adaptation and enabling robust stereo matching across diverse real-world scenarios.
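To make the routing idea concrete, here is a minimal, self-contained PyTorch sketch of a selective MoE adapter layer. It is an illustration under our own assumptions (the class names `SelectiveMoELayer` and `Expert`, the bottleneck expert design, and the mean-pooled routing descriptor are not taken from the released code): a lightweight router picks the top-k experts per input, and a layer-level gate can down-weight, and effectively skip, the MoE branch for easy scenes.

```python
# Illustrative sketch only; not the exact SMoE-Stereo implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A lightweight bottleneck adapter acting as a single expert."""

    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class SelectiveMoELayer(nn.Module):
    """Selects the top-k experts per input; a layer-level gate can
    down-weight (effectively skip) the MoE branch for easy inputs."""

    def __init__(self, dim, num_experts=4, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(Expert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # expert-level selection
        self.layer_gate = nn.Linear(dim, 2)        # layer-level keep/skip decision
        self.top_k = top_k

    def forward(self, x):  # x: (B, N, C) token features from the frozen VFM
        ctx = x.mean(dim=1)                                      # per-image descriptor, (B, C)
        keep_prob = F.softmax(self.layer_gate(ctx), dim=-1)[:, 1]  # (B,), chance of engaging this layer
        scores = F.softmax(self.router(ctx), dim=-1)             # (B, E) expert weights
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)       # keep only the top-k experts
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, idx in zip(topk_w[b], topk_idx[b]):
                out[b] = out[b] + w * self.experts[int(idx)](x[b])
        # Residual connection: the gated MoE branch refines the frozen VFM features.
        return x + keep_prob.view(-1, 1, 1) * out


if __name__ == "__main__":
    layer = SelectiveMoELayer(dim=256, num_experts=4, top_k=1)
    tokens = torch.randn(2, 196, 256)  # two images, 196 tokens, 256 channels
    print(layer(tokens).shape)         # torch.Size([2, 196, 256])
```

In this sketch, only the routers and experts would be trainable while the VFM encoder stays frozen, which is what keeps the parameter-efficient fine-tuning cost small.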
- NVIDIA RTX A6000
- Python 3.8.13
conda create -n smoestereo python=3.8
conda activate smoestereo
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm
pip install scipy
pip install opencv-python
pip install scikit-image
pip install tensorboard
pip install matplotlib
pip install timm==0.5.4
pip install thop
pip install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1/index.html
pip install accelerate==1.0.1
pip install gradio_imageslider
pip install gradio==4.29.0
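Optionally, you can verify that the CUDA-enabled PyTorch build is picked up (a quick sanity check, not part of the original setup steps):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"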
| Model | Link |
|---|---|
| sceneflow | Google Drive |
| RVC (mix of all training datasets) | Google Drive |
The mix_all model is trained on all of the datasets mentioned above and achieves the best zero-shot generalization. Place the downloaded model weights in the ckpt folder.
To evaluate the zero-shot performance of SMoE-Stereo on Scene Flow, KITTI, ETH3D, vkitti, DrivingStereo, or Middlebury, run
python evaluate_stereo.py --resume ./pretrained/damv2_sceneflow.pth --eval_dataset *(select one of ["eth3d", "kitti", "middlebury", "robust_weather", "robust"])

or use the model trained on all datasets, which is better for zero-shot generalization:
python evaluate_stereo.py --resume ./pretrained/SMoEStereo_RVC.pth --eval_dataset *(select one of ["eth3d", "kitti", "sceneflow", "vkitti", "driving"])

If you find our work useful in your research, please consider citing our paper:
@article{wang2025moe,
  title={Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts},
  author={Yun Wang and Longguang Wang and Chenghao Zhang and Yongjian Zhang and Zhanjie Zhang and Ao Ma and Chenyou Fan and Tin Lun Lam and Junjie Hu},
  journal={arXiv preprint arXiv:2507.04631},
  year={2025}
}

This project is based on RAFT-Stereo and GMStereo. We thank the original authors for their excellent work.