[NeurIPS 2025 Spotlight] Official implementation of SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

This repository is the official implementation of SIU3R.

SIU3R is a feed-forward method that can achieve simultaneous 3D scene understanding and reconstruction given unposed images. In particular, SIU3R does not require feature alignment with 2D VFMs (e.g., CLIP, LSeg) to enable understanding, which unleashes its potential as a unified model to achieve multiple 3D understanding tasks (i.e., semantic, instance, panoptic and text-referred segmentation). Moreover, tailored designs for mutual benefits can further boost SIU3R's performance by encouraging bi-directional promotion between reconstruction and understanding.


Demo video: siu3r_page.mp4

📰 News

  • [2025-09-19] Our code is now released! 🎉
  • [2025-09-18] Our paper is accepted by NeurIPS 2025 as a Spotlight paper! 🌟
  • [2025-07-03] Our paper is available on arXiv! 🎉 Paper: https://arxiv.org/abs/2507.02705

🛠️ Installation

We recommend using uv to create a virtual environment for this project. The following instructions assume you have uv installed. Our code is tested with Python 3.10 and PyTorch 2.4.1 with CUDA 11.8.

To set up the environment, simply run the uv sync command.

⚡️ Inference

To run inference, you can download the pre-trained model from here and place it in the pretrained_weights directory.

Then, you can run the inference script:

python inference.py --image_path1 <path_to_image1> --image_path2 <path_to_image2> --output_path <output_directory> [--cx <cx_value>] [--cy <cy_value>] [--fx <fx_value>] [--fy <fy_value>]

An output.ply file will be generated in the specified output directory, containing the reconstructed Gaussian splats. The cx, cy, fx, and fy parameters are optional and can be used to specify the camera intrinsics; if not provided, default values will be used.
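For a quick sanity check of the exported file, you can inspect its header with the plyfile package (a minimal sketch, assuming the splats are stored as a standard PLY vertex element; the exact per-Gaussian property names depend on the export format):

```python
# Minimal sketch: inspect the exported Gaussian splat PLY.
# Assumes a standard "vertex" element; per-Gaussian property names
# (e.g. opacity, scale, rotation) depend on the exporter and may differ.
from plyfile import PlyData

ply = PlyData.read("output/output.ply")  # path assumed from --output_path above
vertex = ply["vertex"]
print(f"Number of Gaussians: {vertex.count}")
print("Per-Gaussian properties:", [p.name for p in vertex.properties])
```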

You can view the results in the online viewer by running:

python viewer.py --output_ply <output_directory/output.ply>

📚 Dataset

We use the ScanNet dataset for training and evaluation. You can download the processed dataset from here and place it in the data directory. The dataset should have the following structure:

data/
├── scannet/
│   ├── train/
│   │   ├── scene0000_00
│   │   │   ├── color
│   │   │   ├── depth
│   │   │   ├── extrinsic
│   │   │   ├── instance
│   │   │   ├── intrinsic.txt
│   │   │   ├── iou.png
│   │   │   ├── iou.pt
│   │   │   ├── panoptic
│   │   │   └── semantic
│   │   └── ...
│   └── val/
│       ├── scene0011_00
│       │   ├── color
│       │   ├── depth
│       │   ├── extrinsic
│       │   ├── instance
│       │   ├── intrinsic.txt
│       │   ├── iou.png
│       │   ├── iou.pt
│       │   ├── panoptic
│       │   └── semantic
│       └── ...
├── train_refer_seg_data.json
├── val_pair.json
├── val_refer_pair.json
└── val_refer_seg_data.json
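As a quick way to confirm the data is unpacked as expected, the short script below walks the layout above and reports anything missing (a minimal sketch; the scene folder names shown above are examples and will vary):

```python
# Minimal sketch: verify the expected ScanNet layout described above.
from pathlib import Path

DATA_ROOT = Path("data")
EXPECTED_JSON = [
    "train_refer_seg_data.json",
    "val_pair.json",
    "val_refer_pair.json",
    "val_refer_seg_data.json",
]
SCENE_SUBDIRS = ["color", "depth", "extrinsic", "instance", "panoptic", "semantic"]
SCENE_FILES = ["intrinsic.txt", "iou.png", "iou.pt"]

# Top-level JSON files live directly under data/.
for name in EXPECTED_JSON:
    assert (DATA_ROOT / name).is_file(), f"missing {name}"

# Each scene folder should contain the per-scene subdirectories and files.
for split in ["train", "val"]:
    for scene in sorted((DATA_ROOT / "scannet" / split).iterdir()):
        missing = [d for d in SCENE_SUBDIRS if not (scene / d).is_dir()]
        missing += [f for f in SCENE_FILES if not (scene / f).is_file()]
        if missing:
            print(f"{scene.name}: missing {missing}")
```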

📝 Training

If you want to train the model, download the pretrained MASt3R weights from here and our pretrained panoptic segmentation head weights from here, then place them in the pretrained_weights directory.

To train the model, you can use the following command:

python src/run.py experiment=siu3r_train

This will start the training process using the configuration specified in configs/main.yaml. You can modify the configuration file to adjust the training parameters, such as devices, learning rate, batch size, and number of epochs.

📐 Evaluation

To evaluate the model, you can use the following command:

python src/run.py experiment=siu3r_test mode=test ckpt_path={your_ckpt_path}

This will start the evaluation process, which loads the ScanNet validation set and generates novel view synthesis (NVS) and segmentation results for the pairs defined in val_pair.json. The evaluator then computes the metrics and writes them to a JSON file.
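Once evaluation finishes, the metrics can be inspected programmatically (a minimal sketch; metrics.json is a placeholder name, substitute the path of the JSON file written by the evaluator):

```python
# Minimal sketch: pretty-print the evaluation metrics.
# "metrics.json" is a placeholder; use the file produced by the evaluator.
import json

with open("metrics.json") as f:
    metrics = json.load(f)

for name, value in metrics.items():
    print(f"{name}: {value}")
```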

📷 Camera Conventions

Our camera conventions are the same as pixelSplat's. The camera intrinsic matrices are normalized (the first row is divided by the image width, and the second row by the image height). The camera extrinsic matrices are OpenCV-style camera-to-world matrices (+X right, +Y down, +Z pointing into the screen).
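As a small illustration of the intrinsics convention (a minimal sketch; the pixel-space values below are placeholders, not dataset defaults):

```python
# Minimal sketch: normalize a pixel-space intrinsic matrix as described above
# (first row divided by image width, second row by image height).
import numpy as np

def normalize_intrinsics(K: np.ndarray, width: int, height: int) -> np.ndarray:
    K_norm = K.astype(np.float64).copy()
    K_norm[0, :] /= width   # fx, skew, cx
    K_norm[1, :] /= height  # 0,  fy,   cy
    return K_norm

# Placeholder pixel-space intrinsics for a 640x480 image.
K = np.array([
    [577.0,   0.0, 320.0],
    [  0.0, 577.0, 240.0],
    [  0.0,   0.0,   1.0],
])
print(normalize_intrinsics(K, width=640, height=480))
```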

📖 Citation

If you find our work useful, please consider citing our paper:

@misc{xu2025siu3r,
      title={SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment}, 
      author={Qi Xu and Dongxu Wei and Lingzhe Zhao and Wenpu Li and Zhangchi Huang and Shunping Ji and Peidong Liu},
      year={2025},
      eprint={2507.02705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02705}, 
}
