D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement
Yixuan Wang1*, Zhuoran Li2, 3*, Mingtong Zhang1, Katherine Driggs-Campbell1, Jiajun Wu2, Li Fei-Fei2, Yunzhu Li1, 2
1University of Illinois Urbana-Champaign,
2Stanford University,
3National University of Singapore
In this notebook, we show how to build D3Fields and visualize the reconstructed mesh, mask fields, and descriptor fields. We also demonstrate how to track keypoints in a video.
We recommend Mambaforge over the standard Anaconda distribution for faster installation:
# create conda environment
mamba env create -f env.yaml
conda activate d3fields
# download pretrained models
bash scripts/download_ckpts.sh
bash scripts/download_data.sh
python vis_repr.py # visualize the representation
python vis_tracking.py # visualize the tracking
Fusion is the core class of D3Fields. It contains the following key functions:
- update: it takes in the observation and updates the internal states.
- text_queries_for_inst_mask: it will query the instance mask according to the text query and thresholds.
- text_queries_for_inst_mask_no_track: it is similar to text_queries_for_inst_mask, but it will not invoke the underlying XMem tracking module.
- eval: it will evaluate associated features for arbitrary 3D points.
- batch_eval: for a large batch of points, it will evaluate them batch by batch to avoid out-of-memory errors.
The important attributes of Fusion are:
- curr_obs_torch: a dictionary containing the following keys:
  - color: multiview color images as np.uint8 BGR numpy arrays
  - color_tensor: multiview color images as float32 BGR torch tensors
  - depth: multiview depth images as float32 torch tensors, in meters
  - mask: multiview instance mask images as uint8 torch tensors of shape (V, H, W, num_inst)
  - consensus_mask_label: mask labels aggregated from all views, as a list of strings
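The batch-by-batch idea behind batch_eval can be sketched in a few lines. This is a generic illustration, not the Fusion implementation: eval_in_chunks, toy_eval, and chunk_size are hypothetical names, and the toy evaluation function only stands in for a real per-point feature query.

```python
import numpy as np

def eval_in_chunks(eval_fn, pts, chunk_size=1000):
    """Apply eval_fn to pts (N, 3) chunk by chunk and concatenate the results.

    Only one chunk is passed to eval_fn at a time, which bounds peak memory.
    """
    outputs = []
    for start in range(0, len(pts), chunk_size):
        outputs.append(eval_fn(pts[start:start + chunk_size]))
    return np.concatenate(outputs, axis=0)

# Hypothetical stand-in for a per-point evaluation: distance to the origin.
def toy_eval(chunk):
    return np.linalg.norm(chunk, axis=1, keepdims=True)

pts = np.random.default_rng(0).uniform(-1.0, 1.0, size=(2500, 3))
feats = eval_in_chunks(toy_eval, pts, chunk_size=1000)
print(feats.shape)
```

A smaller chunk_size lowers peak memory at the cost of more calls to the evaluation function.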
 
To run D3Fields on your own dataset, follow these steps:
- Prepare dataset in the following structure:
dataset_name
├── camera_0
│   ├── color
│   │   ├── 0.png
│   │   ├── 1.png
│   │   ├── ...
│   ├── depth
│   │   ├── 0.png
│   │   ├── 1.png
│   │   ├── ...
│   ├── camera_extrinsics.npy
│   ├── camera_params.npy
├── camera_1
├── ...
camera_extrinsics.npy and camera_params.npy are defined as follows:
camera_extrinsics.npy: (4, 4) numpy array, the extrinsics of the camera, which transform a point from world coordinates to camera coordinates
camera_params.npy: (4,) numpy array, the camera parameters in the following order: fx, fy, cx, cy
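To make these conventions concrete, here is a minimal sketch of how a depth image could be back-projected into world-frame points under a standard pinhole model. The function name depth_to_world is hypothetical, and the depth is assumed to be already converted to meters; the on-disk depth encoding is not specified here, so the example uses synthetic arrays in place of the .npy and .png files.

```python
import numpy as np

def depth_to_world(depth, cam_params, world_to_cam):
    """Back-project a depth image (H, W), in meters, to world-frame points (N, 3).

    cam_params is (fx, fy, cx, cy); world_to_cam is the (4, 4) extrinsics matrix
    that maps world coordinates to camera coordinates.
    """
    fx, fy, cx, cy = cam_params
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    valid = depth > 0                               # skip pixels without depth
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous, (N, 4)
    cam_to_world = np.linalg.inv(world_to_cam)      # invert world-to-camera
    return (pts_cam @ cam_to_world.T)[:, :3]

# Synthetic example: a flat depth image one meter from the camera.
depth = np.ones((4, 4), dtype=np.float32)
cam_params = np.array([2.0, 2.0, 1.5, 1.5])
world_to_cam = np.eye(4)
world_to_cam[2, 3] = 0.5  # camera frame is shifted 0.5 m along z
pts_world = depth_to_world(depth, cam_params, world_to_cam)
print(pts_world.shape)
```

If the points from several cameras do not line up after this transform, the extrinsics are likely stored in the opposite direction (camera to world) or the depth scale is off.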
- Prepare the PCA pickle file for the query texts. Find four images of the query text (e.g. mug) with a clean background and a centered object. Change obj_type within scripts/prepare_pca.py and run it.
- Specify the workspace boundary as x_lower, x_upper, y_lower, y_upper, z_lower, z_upper.
- Run python vis_repr_custom.py, for example:
python vis_repr_custom.py --data_path data/2023-09-15-13-21-56-171587 --pca_path pca_model/mug.pkl --query_texts mug --query_thresholds 0.3 --x_lower -0.4 --x_upper 0.4 --y_upper 0.3 --y_lower -0.4 --z_upper 0.02 --z_lower -0.2
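As an aid for choosing the six boundary values, the following sketch crops a world-frame point cloud to an axis-aligned box. It assumes the boundary is meant to keep points inside the box; crop_to_workspace is a hypothetical helper, not a function from this repo, and the sample points are made up.

```python
import numpy as np

def crop_to_workspace(pts, x_lower, x_upper, y_lower, y_upper, z_lower, z_upper):
    """Keep only the points of pts (N, 3) that lie inside the axis-aligned box."""
    lower = np.array([x_lower, y_lower, z_lower])
    upper = np.array([x_upper, y_upper, z_upper])
    inside = np.all((pts >= lower) & (pts <= upper), axis=1)
    return pts[inside]

# Three made-up points: one inside, one outside in x, one outside in z.
pts = np.array([
    [0.0, 0.0, -0.1],
    [0.5, 0.0, -0.1],
    [0.0, 0.0, 0.1],
])
kept = crop_to_workspace(pts, -0.4, 0.4, -0.4, 0.3, -0.2, 0.02)
print(kept)
```

If the crop removes the object of interest, the boundary values or the camera extrinsics are the first things to check.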
Tips for debugging:
- Make sure the transformation is right by visualizing pcd within vis_repr_custom.py using Open3D.
- If the GPU is out of memory, run vis_repr_custom.py with a smaller step. This will generate a sparser voxel grid.
- Make sure Grounded SAM outputs reasonable results by checking curr_obs_torch['mask'] and curr_obs_torch['consensus_mask_label'] of the Fusion class.
If you find this repo useful for your research, please consider citing the paper:
@article{wang2023d3fields,
    title={D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation},
    author={Wang, Yixuan and Li, Zhuoran and Zhang, Mingtong and Driggs-Campbell, Katherine and Wu, Jiajun and Fei-Fei, Li and Li, Yunzhu},
    journal={arXiv preprint arXiv:2309.16118},
    year={2023}
}