D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation
CoRL 2024, Munich, Germany.
This is the official repository of D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation.
For more information, please visit our project page.
Songlin Wei, Haoran Geng, Jiayi Chen, Congyue Deng, Wenbo Cui, Chengyang Zhao, Xiaomeng Fang, Leonidas Guibas, and He Wang
- We just released example code for generating IR stereo images using Isaac Sim 4.0.0
- We just released a new model variant (Cond. on RGB+Raw); please check out the updated inference.py
- Training protocols and datasets
Our method robustly predicts the depths of transparent (bottles) and specular (basin and cups) objects in tabletop environments and beyond.
conda create --name d3roma python=3.8
conda activate d3roma
# install dependencies with pip
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install huggingface_hub==0.24.5
pip install diffusers opencv-python scikit-image matplotlib transformers datasets accelerate tensorboard imageio open3d kornia
pip install hydra-core --upgrade
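If you want a quick sanity check of the environment, something like the following (not part of the repo) should confirm the CUDA build is active:

```python
# Sanity-check the install: CUDA build of PyTorch plus key dependencies.
import torch
import diffusers
import open3d

print(torch.__version__)          # expect 1.12.1+cu113
print(torch.cuda.is_available())  # expect True on a CUDA machine
```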
- For model variant Cond. Left+Right+Raw: Google Drive, Baidu Cloud
- For model variant Cond. RGB+Raw: Google Drive, Baidu Cloud
# Download the pretrained weights from Google Drive
# Extract them under the project folder
You can run the following script to test our model. We provide two variants: left+right+raw for stereo cameras and rgb+raw for any RGBD camera:
python inference.py
This will generate three files under the folder _outputs.{variant}:
_outputs.{variant}/pred.png: the pseudo-colored depth map
_outputs.{variant}/pred.ply: the point cloud obtained by back-projecting the predicted depth
_outputs.{variant}/raw.ply: the point cloud obtained by back-projecting the camera raw depth
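The back-projection itself is standard pinhole geometry. Below is a minimal sketch using Open3D; the helper `backproject` and the intrinsics `fx, fy, cx, cy` are illustrative, not values from the repo:

```python
import numpy as np
import open3d as o3d

def backproject(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into an Open3D point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    points = points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    return pcd

# e.g. o3d.io.write_point_cloud("pred.ply", backproject(pred_depth, fx, fy, cx, cy))
```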
All the datasets will be linked under the datasets folder:
- Download SceneFlow stereo
- Download DREDS
- Download HISS
- Download Clearpose
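One way to create the links (a sketch; the target paths are examples matching the tree below and should point at wherever you extracted each dataset):

```python
import os

# Map link names to the locations where the datasets were extracted.
links = {
    "datasets/sceneflow": "/raid/songlin/Data/sceneflow",
    "datasets/clearpose": "/raid/songlin/Data/clearpose",
    "datasets/DREDS/train": "/raid/songlin/Data/DREDS_ECCV2022/DREDS-CatKnown/train",
    "datasets/HISS/train": "/raid/songlin/Data/hssd-isaac-sim-100k",
}
for link, target in links.items():
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if not os.path.islink(link):
        os.symlink(target, link)
```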
Example datasets folder structure:
datasets
├── clearpose -> /raid/songlin/Data/clearpose
│   ├── clearpose_downsample_100
│   │   ├── downsample.py
│   │   ├── model
│   │   ├── set1
│   │   └── ...
│   ├── metadata
│   │   ├── set1
│   │   └── ...
│   ├── model
│   │   ├── 003_cracker_box
│   │   └── ...
│   ├── set1
│   │   ├── scene1
│   │   └── ...
│   └── ...
├── DREDS
│   ├── test -> /raid/songlin/Data/DREDS_ECCV2022/DREDS-CatKnown/test
│   │   └── shapenet_generate_1216_val_novel
│   ├── test_std_catknown -> /raid/songlin/Data/DREDS_ECCV2022/STD-CatKnown
│   │   ├── test_0
│   │   └── ...
│   ├── test_std_catnovel -> /raid/songlin/Data/DREDS_ECCV2022/STD-CatNovel
│   │   └── real_data_novel
│   ├── train -> /raid/songlin/Data/DREDS_ECCV2022/DREDS-CatKnown/train
│   │   ├── part0
│   │   └── ...
│   └── val -> /raid/songlin/Data/DREDS_ECCV2022/DREDS-CatKnown/val
│       └── shapenet_generate_1216
├── HISS
│   ├── train -> /raid/songlin/Data/hssd-isaac-sim-100k
│   │   ├── 102344049
│   │   ├── 102344280
│   │   ├── 103997586_171030666
│   │   ├── 107734119_175999932
│   │   └── bad_his.txt
│   └── val -> /raid/songlin/Data/hssd-isaac-sim-300hq
│       ├── 102344049
│       ├── 102344280
│       ├── 103997586_171030666
│       ├── 107734119_175999932
│       ├── 300hq.tar.gz
│       ├── bad_his.txt
│       └── simulation2
├── sceneflow -> /raid/songlin/Data/sceneflow
│   ├── bad_sceneflow_test.txt
│   ├── bad_sceneflow_train.txt
│   ├── Driving
│   │   ├── disparity
│   │   ├── frames_cleanpass
│   │   ├── frames_finalpass
│   │   ├── raw_cleanpass
│   │   └── raw_finalpass
│   ├── FlyingThings3D
│   │   ├── disparity
│   │   ├── frames_cleanpass
│   │   ├── frames_finalpass
│   │   ├── raw_cleanpass
│   │   └── raw_finalpass
│   └── Monkaa
│       ├── disparity
│       ├── frames_cleanpass
│       ├── frames_finalpass
│       ├── raw_cleanpass
│       └── raw_finalpass
└── README.md
- We resize the DREDS dataset from 1280x720 to 640x360, and convert raw depth to raw disparity using the resized resolution (see the sketch after this list).
- If the dataset does not provide raw disparities, we pre-compute them by running stereo matching algorithms:
# please make necessary changes to file paths, focal lengths and baselines etc.
# we adapted this file from DREDS.
python scripts/stereo_matching.py
We also tried using libSGM to precompute disparity maps for SceneFlow.
The precomputed raw disparities are placed under raw_cleanpass and raw_finalpass with the same sub-folder paths.
You can also download the precomputed SceneFlow raw disparities here.
- Sometimes the source stereo images are too challenging for stereo matching, so we filter them out during training. Run the following scripts to detect very bad raw disparities and exclude them in the dataloader:
python scripts/check_sceneflow.py
python scripts/check_stereo.py
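For reference, the two preprocessing steps above boil down to semi-global matching plus the pinhole relation disparity = focal_length × baseline / depth. The sketch below uses OpenCV's SGBM as a stand-in for the repo's matcher; function names and parameter values are illustrative:

```python
import cv2
import numpy as np

def raw_disparity_sgbm(left_gray, right_gray, num_disp=128, block=5):
    """Raw disparity from a rectified grayscale stereo pair via semi-global matching."""
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0, numDisparities=num_disp, blockSize=block,
        P1=8 * block * block, P2=32 * block * block)
    # StereoSGBM returns fixed-point disparities scaled by 16
    return sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0

def depth_to_disparity(depth_resized, fx_original, baseline, scale=0.5):
    """Convert a raw depth map (already resized, in meters) to disparity.
    Resizing 1280x720 -> 640x360 (scale=0.5) scales the focal length too."""
    fx = fx_original * scale
    disp = np.zeros_like(depth_resized, dtype=np.float32)
    valid = depth_resized > 0
    disp[valid] = fx * baseline / depth_resized[valid]
    return disp
```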
We use the v2.1 (resolution 768) version of Stable Diffusion.
Download the Stable Diffusion v2.1-768 checkpoints and put them under checkpoint/stable-diffusion.
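If you prefer scripting the download, the checkpoint can also be fetched with huggingface_hub (an alternative to the manual download; `stabilityai/stable-diffusion-2-1` is the official Hugging Face repo id):

```python
from huggingface_hub import snapshot_download

# Fetch the v2.1-768 checkpoint into the folder the repo expects.
snapshot_download(repo_id="stabilityai/stable-diffusion-2-1",
                  local_dir="checkpoint/stable-diffusion")
```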
Example folder structure after downloading (we downloaded the checkpoint files manually):
checkpoint
└── stable-diffusion -> /home/songlin/Projects/diff-stereo/checkpoint/stable-diffusion
    ├── feature_extractor
    │   └── preprocessor_config.json
    ├── model_index.json
    ├── scheduler
    │   └── scheduler_config.json
    ├── text_encoder
    │   ├── config.json
    │   └── model.safetensors
    ├── tokenizer
    │   ├── merges.txt
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   └── vocab.json
    ├── unet
    │   ├── config.json
    │   └── diffusion_pytorch_model.safetensors
    ├── v2-1_768-nonema-pruned.safetensors
    └── vae
        ├── config.json
        └── diffusion_pytorch_model.safetensors
# Because we already downloaded Stable Diffusion's pretrained weights, we can run offline
export HF_HUB_OFFLINE=True
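With the variable set, diffusers resolves everything from the local folder. A minimal sketch of loading components offline (illustrative only; the actual loading happens inside the training code):

```python
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # resolve models locally, no hub requests

from diffusers import AutoencoderKL, UNet2DConditionModel

vae = AutoencoderKL.from_pretrained("checkpoint/stable-diffusion", subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("checkpoint/stable-diffusion", subfolder="unet")
```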
We use Hugging Face accelerate and train on 8× A100-40G GPUs:
cd <Project Dir>
conda activate d3roma
accelerate config
We train the left+right+raw variant on the SceneFlow, DREDS, and HISS datasets. This variant is suitable for stereo cameras.
accelerate launch train.py \
task=train_ldm_mixed_left+right+raw \
task.tag=release \
task.eval_num_batch=10 \
task.val_every_global_steps=5000
We train the rgb+raw variant on the DREDS, HISS, and ClearPose datasets. This variant is suitable for RGBD cameras.
accelerate launch train.py \
task=train_ldm_mixed_rgb+raw \
task.tag=release \
task.eval_num_batch=10 \
task.val_every_global_steps=5000
tensorboard --logdir experiments --port 20000
If you want to run evaluation in parallel on the test datasets:
accelerate launch distributed_evaluate.py task=...
You can also train or reproduce results on a single dataset:
accelerate launch train.py task=train_dreds_reprod
accelerate launch train.py task=train_clearpose
accelerate launch train.py task=train_syntodd_rgbd
accelerate launch train.py task=train_sceneflow
If you have any questions, please contact us:
Songlin Wei: [email protected], Haoran Geng: [email protected], He Wang: [email protected]
@inproceedings{wei2024droma,
  title={D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation},
  author={Songlin Wei and Haoran Geng and Jiayi Chen and Congyue Deng and Wenbo Cui and Chengyang Zhao and Xiaomeng Fang and Leonidas Guibas and He Wang},
  booktitle={8th Annual Conference on Robot Learning},
  year={2024},
  url={https://openreview.net/forum?id=7E3JAys1xO}
}
This work and the dataset are licensed under CC BY-NC 4.0.