This is the official repository for the paper:
3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Ting Huang*, Zeyu Zhang*†, and Hao Tang#
*Equal contribution. †Project lead. #Corresponding author.
Note
This and the following visualizations show zero-shot results of 3D-R1 in a variety of complex scenes, demonstrating its strong generalization ability and state-of-the-art performance.
teaser.mp4
If you find our code or paper helpful, please consider starring ⭐ us and citing:
@article{huang20253d,
title={3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding},
author={Huang, Ting and Zhang, Zeyu and Tang, Hao},
journal={arXiv preprint arXiv:2507.23478},
year={2025}
}
3D-R1 is an open-source generalist model that enhances the reasoning of 3D VLMs for unified scene understanding.
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage an RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward, and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding.
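For intuition, here is a minimal sketch of what a dynamic view-selection step could look like: candidate rendered views are scored by text relevance and scene coverage, and the top-k most informative ones are kept. All names, the scoring heuristic, and the weighting below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: score candidate views by point coverage and
# text relevance, then keep the top-k. The heuristic and all names here
# are assumptions, not 3D-R1's actual view-selection implementation.
import numpy as np

def select_views(view_features, text_feature, coverage, k=4, alpha=0.5):
    """Pick the k most informative views.

    view_features: (N, D) per-view image embeddings (unit-normalized).
    text_feature:  (D,)  embedding of the question/instruction (unit-normalized).
    coverage:      (N,)  fraction of scene points visible in each view, in [0, 1].
    """
    relevance = view_features @ text_feature           # cosine similarity per view
    score = alpha * relevance + (1.0 - alpha) * coverage
    return np.argsort(score)[::-1][:k]                 # indices of the best views

# Toy usage with random features
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 512)); feats /= np.linalg.norm(feats, axis=1, keepdims=True)
text = rng.normal(size=512); text /= np.linalg.norm(text)
cov = rng.uniform(size=12)
print(select_views(feats, text, cov, k=4))
```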
2025/08/07: Our paper has been shared by Deep Blue AI.
2025/08/05: Our paper has been shared by AK.
2025/08/04: Our paper has been promoted by AIxiv.
2025/08/03: Our paper has been promoted by Learn AI with us.
2025/08/01: Our paper has been promoted by 52CV.
Important
General response on visualization: we know that some users are looking for detailed visualization code. We are currently fixing the bounding-box drift issue in the visualizations and will update the visualization results accordingly, along with a detailed visualization tutorial.
- Upload our paper to arXiv and build project pages.
- Upload the code.
- Release Scene-30K dataset. (see Scene-30K)
- Release RL part code.
- Release visualization script.
- Add a demo on huggingface.
Note
If you'd like to learn more about our paper, be sure to check out this YouTube video by @AIResearchRoundup.
Our code is tested with CUDA 11.8 and Python 3.9.16. To run the code, first install the following packages:
h5py
scipy
cython
plyfile
'trimesh>=2.35.39,<2.35.40'
'networkx>=2.2,<2.3'
'torch==2.0.1+cu118'
google-generativeai
peft>=0.7.0
transformers>=4.35.0
accelerate>=0.20.0
tqdm
orjson
clip @ git+https://github.com/openai/CLIP.git
git+https://github.com/LiheYoung/Depth-Anything.git
After that, build the pointnet2 and accelerated giou from source:
# PointNet++
cd third_party/pointnet2
python setup.py install

# accelerated giou
cd utils
python cython_compile.py build_ext --inplace
You can download the pre-processed data from here. To process the 3D data yourself, follow the instructions here and download the ScanNetV2 dataset.
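To sanity-check a scene, you can peek at a ScanNet mesh with trimesh (listed in the requirements). This is a minimal sketch; the scene path below is hypothetical, and the released pre-processed data may use a different format (e.g. per-scene .npy files).

```python
# Minimal sketch of inspecting one ScanNet scene mesh with trimesh.
# The path is hypothetical; the pre-processed release may differ in format.
import numpy as np
import trimesh

mesh = trimesh.load("scans/scene0000_00/scene0000_00_vh_clean_2.ply", process=False)
points = np.asarray(mesh.vertices)                       # (N, 3) vertex positions
colors = np.asarray(mesh.visual.vertex_colors)[:, :3]    # (N, 3) RGB

# Uniformly subsample a fixed-size point cloud, as most 3D VLM pipelines do
idx = np.random.choice(len(points), size=40_000, replace=len(points) < 40_000)
point_cloud = np.concatenate([points[idx], colors[idx] / 255.0], axis=1)
print(point_cloud.shape)  # (40000, 6): xyz + rgb
```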
To train the model, you need to prepare language annotations from ScanRefer, Nr3D, ScanQA, and the ScanNet part of 3D-LLM.
- ScanRefer: follow the commands here to download the ScanRefer dataset.
- Nr3D: follow the commands here to download the Nr3D dataset.
- ScanQA: follow the commands here to download the ScanQA dataset.
- 3D-LLM: the data are located here.
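After downloading, the annotations are plain JSON and can be sanity-checked in a few lines. The file name and keys below follow the public ScanRefer release, but treat the exact path and fields as assumptions:

```python
# Sanity-check the downloaded ScanRefer annotations. The path and keys follow
# the public ScanRefer release; treat them as assumptions here.
import json

with open("data/ScanRefer_filtered_train.json") as f:
    anns = json.load(f)

print(len(anns), "referring expressions")
sample = anns[0]
# Each record pairs a scene/object with one free-form description
print(sample["scene_id"], sample["object_id"], sample["object_name"])
print(sample["description"])
```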
You can synthesize Scene-30K by running:
bash script/synthesize_scene30K.sh
Alternatively, you can download it from Hugging Face.
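For context, the synthesis script drives a Gemini-based data engine: for each existing 3D-VL sample it prompts Gemini 2.5 Pro to produce a chain-of-thought trace for the question-answer pair. The sketch below uses the google-generativeai package from the requirements; the prompt wording, model id string, and the toy record are assumptions, not the released pipeline.

```python
# Illustrative sketch of the CoT data engine behind Scene-30K: prompt Gemini to
# write a reasoning trace for an existing 3D-VL question/answer pair.
# Prompt wording, model id, and the toy record are assumptions, not the released pipeline.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

def synthesize_cot(scene_description: str, question: str, answer: str) -> str:
    prompt = (
        "You are given a 3D scene description, a question about the scene, and "
        "its ground-truth answer. Write a concise step-by-step reasoning trace "
        "that leads to the answer.\n\n"
        f"Scene: {scene_description}\nQuestion: {question}\nAnswer: {answer}\nReasoning:"
    )
    return model.generate_content(prompt).text

# Toy usage on one record of a hypothetical seed file
record = {"scene": "a bedroom with a desk next to the window",
          "question": "What is next to the window?", "answer": "a desk"}
record["cot"] = synthesize_cot(record["scene"], record["question"], record["answer"])
print(json.dumps(record, indent=2))
```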
If your server has no trouble auto-downloading weights from Hugging Face 🤗, feel free to skip this step.
Download the files from the Qwen2.5-VL-7B-Instruct checkpoint on Hugging Face.
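If you prefer to fetch the weights explicitly, a minimal sketch with huggingface_hub is shown below; the local target directory is only an example.

```python
# Download the Qwen2.5-VL-7B-Instruct weights to a local folder so training does
# not rely on on-the-fly downloads. The target directory is only an example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-VL-7B-Instruct",
    local_dir="./checkpoints/Qwen2.5-VL-7B-Instruct",
)
```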
We provide training scripts in the script folder for different LLM backends. Feel free to modify the hyperparameters in those commands.
SFT on Scene-30K as a cold start:
bash script/train.generalist.sh
RL training with GRPO:
bash script/train.rl.sh
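For reference, the RL stage optimizes the three rewards named in the paper: a perception reward, a semantic similarity reward, and a format reward. The sketch below shows one plausible shape of such rewards (3D IoU for perception, embedding cosine for semantics, a tag check for format); the tag names, weights, and exact formulations are assumptions, not the released implementation.

```python
# Plausible shape of the three GRPO rewards described in the paper. The tag
# names, weights, and exact formulas are assumptions, not 3D-R1's code.
import re
import numpy as np

def format_reward(text: str) -> float:
    # Reward outputs that follow a <think>...</think><answer>...</answer> template
    # (the tag names are an assumption about the expected response format).
    pattern = r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text) else 0.0

def perception_reward(pred_box, gt_box) -> float:
    # Axis-aligned 3D IoU between predicted and ground-truth boxes,
    # each given as (x1, y1, z1, x2, y2, z2).
    pred_box, gt_box = np.asarray(pred_box, float), np.asarray(gt_box, float)
    lo = np.maximum(pred_box[:3], gt_box[:3])
    hi = np.minimum(pred_box[3:], gt_box[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = lambda b: float(np.prod(b[3:] - b[:3]))
    return float(inter / (vol(pred_box) + vol(gt_box) - inter + 1e-8))

def semantic_reward(pred_emb, gt_emb) -> float:
    # Cosine similarity between sentence embeddings of predicted and reference answers.
    pred_emb, gt_emb = np.asarray(pred_emb, float), np.asarray(gt_emb, float)
    denom = np.linalg.norm(pred_emb) * np.linalg.norm(gt_emb) + 1e-8
    return float(pred_emb @ gt_emb / denom)

def total_reward(text, pred_box, gt_box, pred_emb, gt_emb, weights=(1.0, 1.0, 0.5)):
    # Weighted sum of the three rewards; the weights are illustrative only.
    return (weights[0] * perception_reward(pred_box, gt_box)
            + weights[1] * semantic_reward(pred_emb, gt_emb)
            + weights[2] * format_reward(text))
```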
| Task | Demo video |
| --- | --- |
| 3D Scene Dense Captioning (3D-DC) | 3d_dc_demo.mp4 |
| 3D Object Captioning | 3d_object_captioning_demo.mp4 |
| 3D Visual Grounding (3D-VG) | 3d_vg_demo.mp4 |
| 3D Question Answering (3D-QA) | 3d_qa_demo.mp4 |
| 3D Dialogue | 3d_dialogue_demo.mp4 |
| 3D Reasoning | 3d_reasoning_demo.mp4 |
| 3D Planning | 3d_planning_demo.mp4 |
We thank the authors of Qwen, LSceneLLM, ARKit, and DeepSeek-Math for their open-source code.