Junha Lee¹,²*, Chunghyun Park¹,²*, Jaesung Choe¹, Frank Wang¹, Jan Kautz¹, Minsu Cho², Chris Choy¹
*equal contribution
¹NVIDIA, ²POSTECH
We present Mosaic3D, a comprehensive solution for open-vocabulary 3D scene understanding that addresses three essential requirements: precise 3D region segmentation, rich textual descriptions, and large dataset scale. Our approach combines state-of-the-art open-vocabulary image segmentation models with region-aware vision-language models in an automatic pipeline for generating high-quality 3D mask-text pairs.
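To make the pipeline concrete, the sketch below shows the general shape of such a frame-level annotation step, assuming hypothetical `segment_regions` and `caption_region` callables for the open-vocabulary segmenter and region-aware VLM, plus standard RGB-D back-projection with known intrinsics and camera pose. It is illustrative only and does not reproduce the exact models or filtering used to build Mosaic3D-5.6M.

```python
import numpy as np

def backproject(depth, intrinsics, cam_to_world):
    """Lift a depth map (H, W) to world-space 3D points (H, W, 3)."""
    fx, fy, cx, cy = intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # homogeneous
    pts_world = pts_cam @ cam_to_world.T                             # (H, W, 4)
    return pts_world[..., :3]

def annotate_frame(rgb, depth, intrinsics, cam_to_world,
                   segment_regions, caption_region):
    """Produce (3D points, caption) pairs for a single RGB-D frame.

    `segment_regions` and `caption_region` are placeholders for an
    open-vocabulary image segmenter and a region-aware VLM captioner.
    """
    points = backproject(depth, intrinsics, cam_to_world)
    pairs = []
    for mask in segment_regions(rgb):      # boolean (H, W) region masks
        valid = mask & (depth > 0)         # drop pixels with no depth
        pairs.append((points[valid], caption_region(rgb, mask)))
    return pairs
```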
- Mosaic3D-5.6M Dataset: The largest 3D mask-text paired dataset to date, encompassing over 30K indoor scenes and approximately 1M RGB-D frames, yielding 5.6M region captions with 30M total text tokens
- Mosaic3D Model: A 3D visual foundation model (3D-VFM) combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation
- State-of-the-art Performance: Achieves leading results on open-vocabulary 3D semantic and instance segmentation benchmarks including ScanNet200, Matterport3D, and ScanNet++
Our Mosaic3D-5.6M dataset offers significant advantages over existing datasets:
- Scale: 5.6M mask-text pairs across 30K+ scenes (significantly larger than existing datasets)
- Precision: Leverages advanced open-vocabulary segmentation for precise region boundaries
- Rich Descriptions: Captures object attributes, spatial relationships, and scene context
- Quality: Uses robust region-aware VLMs to produce comprehensive textual annotations
The dataset is hosted on Hugging Face. Follow the instructions there to download and organize the data into the required directory structure.
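If you prefer a scripted download, the dataset files can be fetched with the `huggingface_hub` client, as in the sketch below; the repository ID is a placeholder, so substitute the one listed on the dataset page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id -- use the dataset repository listed on the
# Mosaic3D Hugging Face page.
snapshot_download(
    repo_id="<org>/<mosaic3d-dataset>",
    repo_type="dataset",
    local_dir="/path/to/datasets/mosaic3d",
)
```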
```bash
# Build docker image
bash docker/docker_build.sh

# Run docker container with dataset path
bash docker/docker_run.sh /path/to/datasets
```

```bash
# Create conda environment
conda env create -f environment.yaml

# Install requirements
pip install -r requirements.txt

# Install pre-commit hooks
pre-commit install
```

Mosaic3D employs a two-stage training approach:
- Per-point Language Alignment: Trains a 3D encoder using contrastive learning to align 3D point features with textual descriptions (a minimal sketch of this stage is shown below)
- Mask Decoder Training: Trains a lightweight mask decoder to predict instance segments from the aligned features
This design enables effective open-vocabulary 3D semantic and instance segmentation across diverse indoor scenes.
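As an illustration of the first stage, here is a minimal mask-text contrastive loss in the InfoNCE style, assuming per-point features from the 3D encoder are average-pooled inside each region mask and compared against caption embeddings from a frozen text encoder. Names, shapes, and the pooling choice are illustrative, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def mask_text_contrastive_loss(point_feats, masks, text_embeds, temperature=0.07):
    """InfoNCE-style loss aligning 3D region features with caption embeddings.

    point_feats: (N, D) per-point features from the 3D encoder
    masks:       (M, N) boolean region masks, one row per caption
    text_embeds: (M, D) caption embeddings from a frozen text encoder
    """
    # Average-pool point features inside each region mask.
    masks = masks.float()
    region_feats = masks @ point_feats / masks.sum(dim=1, keepdim=True).clamp(min=1)

    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    logits = region_feats @ text_embeds.t() / temperature  # (M, M) similarity
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over region-to-text and text-to-region matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```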
```bash
# Train Mosaic3D model with default configuration
python src/train.py experiment=train_spunet_multidata_ppt data=sc trainer=ddp trainer.devices=8 logger=wandb
```

```bash
# Download Segment3D checkpoint
python src/models/networks/opensegment3d/download_ckpt.py

# Train a lightweight mask decoder with default configuration
python src/train.py experiment=train_opensegment3d_scannet model.net.backbone_ckpt=/path/to/encoder.ckpt trainer=ddp trainer.devices=8 logger=wandb
```

You can override any configuration parameter from the command line:

```bash
python src/train.py experiment=train_spunet_multidata_ppt data=sc+ar model=spunet34c trainer.max_epochs=100
```

The model achieves state-of-the-art results on multiple benchmarks:
- Annotation-free 3D semantic segmentation: ScanNet20 & ScanNet200, Matterport3D, ScanNet++
- Annotation-free 3D instance segmentation: ScanNet200
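For context, annotation-free evaluation in this setting is typically done by comparing per-point features against text embeddings of the benchmark category names; the sketch below assumes a CLIP-style text encoder (passed in as `encode_text`) and is illustrative rather than the repository's exact evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_labels(point_feats, class_names, encode_text):
    """Assign a benchmark class to every point without 3D annotations.

    point_feats: (N, D) per-point features from the trained 3D encoder
    class_names: list of category strings (e.g., the ScanNet200 label set)
    encode_text: callable mapping a list of prompts to (C, D) embeddings,
                 a placeholder for a CLIP-style text encoder
    """
    prompts = [f"a {name} in a scene" for name in class_names]
    text_embeds = F.normalize(encode_text(prompts), dim=-1)  # (C, D)
    point_feats = F.normalize(point_feats, dim=-1)           # (N, D)
    similarity = point_feats @ text_embeds.t()               # (N, C)
    return similarity.argmax(dim=-1)                         # per-point class ids
```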
Run the following commands to evaluate the pretrained models on the ScanNet20 and ScanNet200 validation sets.
```bash
python src/eval.py experiment=train_spunet_multidata_ppt data=sc ckpt_path=[path/to/model/checkpoint]
```

We provide pretrained models for both the `model_scale` and `data_scale` experiments. All models are available on Hugging Face.
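Individual checkpoints can also be fetched programmatically with `huggingface_hub`; the repository ID and filename below are placeholders, so substitute the values listed on the model page.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo_id and filename -- use the values listed on the
# Mosaic3D Hugging Face model page.
ckpt_path = hf_hub_download(
    repo_id="<org>/<mosaic3d-checkpoints>",
    filename="sc+ar+sc++.ckpt",
)
print(ckpt_path)
```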
Models trained on different combinations of datasets:
| Model | Training Data | Size | f-mIoU on ScanNet200 |
|---|---|---|---|
| `sc.ckpt` | ScanNet | 2.74 GB | 13.0% |
| `sc+ar.ckpt` | ScanNet + ARKitScenes | 2.74 GB | 14.8% |
| `sc+ar+sc++.ckpt` | ScanNet + ARKitScenes + ScanNet++ | 2.74 GB | 15.5% |
| `sc+ar+sc+++ma.ckpt` | ScanNet + ARKitScenes + ScanNet++ + Matterport3D | 2.74 GB | 15.4% |
| `sc+ar+sc+++ma+st.ckpt` | ScanNet + ARKitScenes + ScanNet++ + Matterport3D + Structured3D | 2.74 GB | 15.7% |
Evaluate the impact of training data scale by testing models trained on different dataset combinations.
Set the data configuration to match the training data of your checkpoint:
```bash
# Model trained on ScanNet only
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc \
  ckpt_path=ckpt_raw/converted/sc.ckpt

# Model trained on ScanNet + ARKitScenes
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar \
  ckpt_path=ckpt_raw/converted/sc+ar.ckpt

# Model trained on ScanNet + ARKitScenes + ScanNet++
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc++ \
  ckpt_path=ckpt_raw/converted/sc+ar+sc++.ckpt

# Model trained on four datasets (+ Matterport3D)
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc+++ma \
  ckpt_path=ckpt_raw/converted/sc+ar+sc+++ma.ckpt

# Model trained on all five datasets (+ Structured3D)
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc+++ma+st \
  ckpt_path=ckpt_raw/converted/sc+ar+sc+++ma+st.ckpt
```

Models with different backbone architectures (trained on ScanNet + ARKitScenes + ScanNet++):
| Model | Architecture | Parameters | Size | f-mIoU on ScanNet200 |
|---|---|---|---|---|
| `spunet14a.ckpt` | SparseUNet-14A | ~14M | 2.61 GB | 13.2% |
| `spunet18a.ckpt` | SparseUNet-18A | ~18M | 2.64 GB | 14.5% |
| `spunet34c.ckpt` | SparseUNet-34C | ~34M | 2.74 GB | 15.5% |
| `spunet50.ckpt` | SparseUNet-50 | ~50M | 2.97 GB | 15.8% |
| `spunet101.ckpt` | SparseUNet-101 | ~101M | 3.59 GB | 16.0% |
Evaluate different model architectures on the combined dataset (ScanNet + ARKitScenes + ScanNet++):
```bash
# SparseUNet-14A (smallest, fastest)
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc++ \
  model=spunet14a+ppt \
  ckpt_path=ckpt_raw/converted/spunet14a.ckpt

# SparseUNet-18A
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc++ \
  model=spunet18a+ppt \
  ckpt_path=ckpt_raw/converted/spunet18a.ckpt

# SparseUNet-34C (recommended for balance)
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc++ \
  model=spunet34c+ppt \
  ckpt_path=ckpt_raw/converted/spunet34c.ckpt

# SparseUNet-50
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc++ \
  model=spunet50+ppt \
  ckpt_path=ckpt_raw/converted/spunet50.ckpt

# SparseUNet-101 (largest, best performance)
python src/eval.py experiment=train_spunet_multidata_ppt \
  data=sc+ar+sc++ \
  model=spunet101+ppt \
  ckpt_path=ckpt_raw/converted/spunet101.ckpt
```

If you find this work useful, please consider citing:
```bibtex
@inproceedings{lee2025mosaic3d,
  title={Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation},
  author={Lee, Junha and Park, Chunghyun and Choe, Jaesung and Wang, Yu-Chiang Frank and Kautz, Jan and Cho, Minsu and Choy, Chris},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14089--14101},
  year={2025}
}
```

Our work builds upon several fantastic open-source projects. We'd like to express our gratitude to the authors of: