Figure 3. The pipeline of our point cloud pre-training via neural rendering (Ponder). Given multi-view RGB-D images, we first construct the point cloud by back-projection, then use a point cloud encoder fp to extract per-point features E. The features E are organized into a 3D feature volume (visualized as an image in this figure) by average pooling. Finally, the 3D feature volume is rendered to multi-view RGB-D images via differentiable neural rendering, and the renderings are compared with the original input multi-view RGB-D images as supervision. The point cloud encoder fp and the color decoder fc are used for transfer learning.
coder (Section 3.2) and organize it into a 3D feature volume (Section 3.3), which is used to reconstruct the neural scene representation and render images in a differentiable manner (Section 3.4).

3.1. Constructing point cloud from RGB-D images

The proposed method makes use of sequential RGB-D images {(Ii, Di)}_{i=1}^N, the camera intrinsic parameters {Ki}_{i=1}^N, and the extrinsic poses {ξi}_{i=1}^N ∈ SE(3), where N is the number of input views. SE(3) refers to the Special Euclidean Group representing 3D rotations and translations. The camera parameters can be easily obtained from SfM or SLAM. We construct the point cloud X by back-projecting the RGB-D images into a unified world coordinate system:

    X = ⋃_{i=1}^{N} π⁻¹(Ii, Di, ξi, Ki),    (1)

where π⁻¹ back-projects an RGB-D image into 3D world space using the camera parameters. Note that, different from previous methods which only consider the point location, our method attributes each point with both its location and its RGB color. The details of π⁻¹ are provided in the supplementary material.
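The exact form of π⁻¹ is deferred to the supplementary material; for illustration only, the following is a minimal NumPy sketch of a common pinhole-camera back-projection that keeps the per-point RGB attribute, in the spirit of Eq. (1). The camera-to-world convention for ξ, the depth-validity mask, and the function name are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def back_project(rgb, depth, cam2world, K):
    """Sketch of pi^{-1}: lift one RGB-D image to a colored point cloud in
    world coordinates. Assumes a pinhole camera with intrinsics K, a 4x4
    camera-to-world extrinsic, and depth where 0 marks invalid pixels.
    rgb: (H, W, 3), depth: (H, W)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel grid
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]                   # (u - cx) * z / fx
    y = (v[valid] - K[1, 2]) * z / K[1, 1]                   # (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous camera coords
    pts_world = (cam2world @ pts_cam.T).T[:, :3]             # transform to world frame
    colors = rgb[valid]                                      # keep the RGB attribute per point
    return np.concatenate([pts_world, colors], axis=1)       # (M, 6): xyz + rgb
```

Repeating this per view and concatenating the outputs gives the unified point cloud X of Eq. (1).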
3.2. Point cloud encoder for feature extraction

Given the point cloud X constructed from the RGB-D images, a point cloud encoder fp is used to extract per-point feature embeddings E:

    E = fp(X).    (2)

The encoder fp, pre-trained with the method described in Section 3.4, serves as a good initialization for various downstream tasks.

empty space, we perform average pooling, followed by a 3D CNN, to aggregate features from nearby points. The dense 3D volume is denoted as V.

3.4. Pre-training with Neural Rendering

This section introduces how to reconstruct the implicit scene representation and render images differentiably. We first give a brief introduction to neural scene representations, then illustrate how to integrate them into our point cloud pre-training pipeline. Last, we show the differentiable rendering formulation used to render color and depth images from the neural scene representation.

Brief introduction of neural scene representation. A neural scene representation aims to represent the scene geometry and appearance through a neural network. In this paper, we use the Signed Distance Function (SDF), which measures the distance between a query point and the surface boundary, to represent the scene geometry implicitly. The SDF is capable of representing high-quality geometric details. For any query point of the scene, the neural network takes the point's features as input and outputs the corresponding SDF value and RGB value. In this way, the neural network captures both the geometry and appearance information of a specific scene. Following NeuS [49], the scene can be reconstructed as:

    s(p) = f̃s(p),    c(p, d) = f̃c(p, d),    (3)

where f̃s is the SDF decoder and f̃c is the RGB color decoder. f̃s takes the point location p as input and predicts the SDF value s. f̃c takes the point location p and viewing direction d as input, and outputs the RGB color value c. Both f̃s and f̃c are implemented as simple MLP networks.
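As a concrete illustration, here is a minimal PyTorch sketch of two such MLP decoders for Eq. (3). The hidden width and depth are arbitrary, and conditioning on interpolated features from the volume V as well as any positional encoding are omitted; this is a simplified sketch under those assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """Sketch of f~s: maps a query point p to a signed distance s(p)."""
    def __init__(self, in_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # scalar SDF value
        )

    def forward(self, p):
        return self.net(p)

class ColorDecoder(nn.Module):
    """Sketch of f~c: maps a query point p and view direction d to RGB."""
    def __init__(self, in_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + 3, hidden), nn.ReLU(),  # concatenate p and d
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),        # RGB in [0, 1]
        )

    def forward(self, p, d):
        return self.net(torch.cat([p, d], dim=-1))
```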
Table 1. 3D object detection AP25 and AP50 on ScanNet and SUN RGB-D. VoteNet [38] and 3DETR [32] are two baseline 3D object de-
tection models. The DepthContrast [61] and Point-BERT [58] results are adopted from IAE [55] and MaskPoint [25]. Ponder outperforms
both VoteNet-based and 3DETR-based point cloud pre-training methods with fewer training epochs.
The 3D detection results are shown in Table 1. Our method improves over the VoteNet baseline without pre-training by a large margin, boosting AP50 by 7.5% and 3.7% on ScanNet and SUN RGB-D, respectively. IAE [55] is a pre-training method that represents the inherent 3D geometry in a continuous manner. Our learned point cloud representation achieves higher accuracy because it is able to recover both the geometry and the appearance of the scene. The AP50 and AP25 of our method are higher than those of IAE by 1.2% and 2.1% on ScanNet, respectively. MaskPoint [25] is another method that aims to learn a continuous surface by classifying whether a query point is occupied. However, its performance can be constrained by the noisy labeling of the query-point occupancy values. As presented in Table 1, even with a weaker backbone (PointNet++ vs. 3DETR), our method is able to achieve better accuracy with fewer pre-training epochs.

3D semantic segmentation. 3D semantic segmentation is another fundamental scene understanding task. Following [43, 47, 55], we choose DGCNN [51] as our baseline for a fair comparison. DGCNN applies a dynamic graph CNN as the backbone. For pre-training, we use DGCNN as the point cloud encoder fp and pre-train the model on ScanNet. We validate the effectiveness of our method by transferring the weights to the Stanford Large-Scale 3D Indoor Spaces (S3DIS) [3] dataset, an indoor 3D understanding dataset containing 6 large-scale indoor scenes with point-wise semantic annotations. Following the same setting as [51], we use the overall accuracy (OA) and mean IoU (mIoU) on points as the evaluation metrics, and report the average results across six folds.
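For reference, a minimal NumPy sketch of the standard OA and mIoU definitions assumed here (per-fold averaging and ignore labels omitted); this is not the paper's evaluation code.

```python
import numpy as np

def overall_accuracy_and_miou(pred, gt, num_classes):
    """Compute overall point accuracy (OA) and mean IoU (mIoU).

    pred, gt: integer arrays of per-point class labels, shape (N,).
    Standard definitions are assumed; the S3DIS six-fold averaging is
    applied outside this function."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    oa = float((pred == gt).mean())

    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                       # skip classes absent from both
            ious.append(inter / union)
    miou = float(np.mean(ious)) if ious else 0.0
    return oa, miou
```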
Table 2 shows the quantitative results. Compared with the DGCNN baseline, the proposed method boosts the segmentation performance by a large margin, improving OA and mIoU by 2.1% and 5%, respectively. Jigsaw and OcCo use ShapeNet as the pre-training dataset. Although they obtain improvements over the baseline, the limited scale of the training data constrains their transfer ability. IAE achieves significant improvements by leveraging a large-scale dataset and an implicit reconstruction objective. Compared with IAE, the proposed approach achieves higher semantic segmentation performance with the DGCNN backbone (+0.3% OA and +0.4% mIoU). Besides, IAE requires a large number of 3D meshes for supervision. Our approach, in contrast, only requires RGB-D images as supervision, which are much cheaper and easier to obtain.

4.2.2 Low-level 3D Tasks

Low-level 3D tasks like scene reconstruction and image synthesis are receiving increasing attention due to their wide applications. However, most such models are trained from scratch, and a way to pre-train them with a good initialization is highly desirable. Ours is the first pre-training work to demonstrate strong transfer ability to such low-level 3D tasks.

3D scene reconstruction. The 3D scene reconstruction task aims to recover the scene geometry, e.g., a mesh, from a point cloud input. We choose ConvONet [37] as the baseline model, whose architecture is widely adopted in [9, 26, 56]. Following the same setting as ConvONet, we conduct experiments on the Synthetic Indoor Scene Dataset (SISD) [37], a synthetic dataset containing 5000 scenes with multiple ShapeNet [5] objects. We pre-train the PointNet encoder, which is the same as the original Con-
Method | OA↑ | mIoU↑
DGCNN [51] | 84.1 | 56.1
Jigsaw [43] | 84.4 | 56.6
OcCo [47] | 85.1 | 58.5
IAE [55] | 85.9 | 60.7
Ponder | 86.2 | 61.1

Table 2. 3D semantic segmentation OA and mIoU on the S3DIS dataset with the DGCNN model. Ponder outperforms previous state-of-the-art models.

Method | Encoder | IoU↑ | Normal Consistency↑ | F-Score↑
ConvONet [37] | PointNet | 84.9 | 0.915 | 0.964
Ponder | PointNet | 85.7 | 0.917 | 0.965
ConvONet | PointNet++ | 77.8 | 0.887 | 0.906
Ponder | PointNet++ | 80.2 | 0.893 | 0.920

Table 3. 3D scene reconstruction IoU, Normal Consistency (NC), and F-Score on the SISD dataset with PointNet and PointNet++ encoders. For both PointNet and PointNet++, Ponder is able to boost the reconstruction performance.
Image synthesis from point clouds. We also validate the effectiveness of our method on another low-level task, image synthesis from point clouds. We use Point-NeRF [53] as the baseline. Point-NeRF uses neural 3D point clouds with associated neural features to render images. It can be used both in a generalizable setting across various scenes and in a single-scene fitting setting. In our experiments, we mainly focus on the generalizable setting of Point-NeRF. We re-

Influence on Rendering Targets. The rendering part of our method contains two terms: the RGB color image and the depth image. We study the influence of each term on the transfer task of 3D detection. The results are presented in Table 4. Combining depth and color images for reconstruction gives the best detection results. In addition, using depth reconstruction yields better performance than color reconstruction for 3D detection.
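For illustration, a minimal sketch of how a combined color-and-depth rendering objective of this kind could be written in PyTorch; the L1 penalties, the depth validity mask, and the weights w_rgb and w_depth are assumptions for the example rather than the paper's exact loss.

```python
import torch

def rendering_loss(pred_rgb, gt_rgb, pred_depth, gt_depth,
                   w_rgb=1.0, w_depth=1.0):
    """Compare rendered RGB-D against the input RGB-D (hypothetical weights).

    pred_rgb/gt_rgb: (B, N_rays, 3); pred_depth/gt_depth: (B, N_rays).
    Pixels with depth 0 are treated as invalid and excluded from the depth term."""
    loss_rgb = torch.abs(pred_rgb - gt_rgb).mean()
    mask = gt_depth > 0                               # assumes at least one valid pixel
    loss_depth = torch.abs(pred_depth - gt_depth)[mask].mean()
    return w_rgb * loss_rgb + w_depth * loss_depth
```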
Figure 6 columns, left to right: Input Point Cloud, Projected Point Cloud, Reconstruction, Image Synthesis, Reference Image, Depth Synthesis, Reference Depth.
Figure 6. Direct applications of Ponder on the ScanNet validation set. The proposed Ponder model can be directly used for various
applications, such as 3D reconstruction and image synthesis. The input point clouds are drawn as spheres for better clarity.
Number of input RGB-D views. Our method utilizes N RGB-D images, where N is the number of input views. We study the influence of N with experiments on 3D detection, as shown in Table 5. Using multi-view supervision helps to reduce single-view ambiguity; similar observations are also found in the multi-view reconstruction task [27]. Compared with a single view, multiple views achieve higher accuracy, boosting AP50 by 0.9% and 1.2% on the ScanNet and SUN RGB-D datasets, respectively.

View number | ScanNet | SUN RGB-D
1 view | 40.1 | 35.4
3 views | 40.8 | 36.0
5 views | 41.0 | 36.6

Table 5. Ablation study on the view number: 3D detection AP50 on ScanNet and SUN RGB-D. Using multi-view supervision for point cloud pre-training achieves better performance.

4.4. Other applications

In previous sections, we showed that the proposed pipeline can be used for transfer learning. In this section, we show that the pre-trained model from our pipeline Ponder can itself also be directly used for surface reconstruction and image synthesis from sparse point clouds.

3D reconstruction from sparse point clouds. The learned model has the capability to recover the scene surface from sparse point clouds. Specifically, after learning the neural scene representation, we query SDF values in the 3D space and leverage Marching Cubes [28] to extract the surface. We show the reconstruction results in Figure 6. The results show that even though the input is a sparse point cloud from a complex scene, our method is able to recover high-fidelity meshes.
point clouds from complex scenes, our method is able to Several directions could be explored in future works.
recover high-fidelity meshes. First, there are various types of neural rendering, which
could also be leveraged for point cloud representation learn-
Image synthesis from sparse point clouds. Another in- ing. Second, other 3D domain-specific designs could be
teresting experiment to explore is that our pipeline is able integrated into point cloud pre-training pipelines. Third,
to render realistic images from sparse point cloud input. As exploring the proposed pre-training pipeline Ponder on a
shown in Figure 6, our method is able to recover similar larger dataset and more downstream tasks is also a potential
color images with the ground truth. Also, the recovered research direction.
References

[1] Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9902–9912, 2022. 2
[2] Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. In European Conference on Computer Vision, pages 696–712. Springer, 2020. 2
[3] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1534–1543, 2016. 6
[4] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021. 2
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015. 6
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020. 2
[7] Yujin Chen, Matthias Nießner, and Angela Dai. 4dcontrast: Contrastive learning with dynamic correspondences for 3d scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2022. 1, 2
[8] Zhang Chen, Yinda Zhang, Kyle Genova, Sean Fanello, Sofien Bouaziz, Christian Häne, Ruofei Du, Cem Keskin, Thomas Funkhouser, and Danhang Tang. Multiresolution deep implicit functions for 3d shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13087–13096, 2021. 12
[9] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6970–6981, 2020. 6, 12
[10] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 5
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1
[12] Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He. Masked autoencoders as spatiotemporal learners. arXiv preprint arXiv:2205.09113, 2022. 1
[13] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020. 5
[14] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022. 1, 2
[15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020. 2
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 1
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 1
[18] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15587–15597, 2021. 1, 2
[19] Ji Hou, Saining Xie, Benjamin Graham, Angela Dai, and Matthias Nießner. Pri3d: Can 3d priors help 2d representation learning? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5693–5702, 2021. 2
[20] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3d point clouds. arXiv preprint arXiv:2109.00179, 2021. 1, 2, 6
[21] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014. 7
[22] Li Jiang, Shaoshuai Shi, Zhuotao Tian, Xin Lai, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Guided point contrastive learning for semi-supervised point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6423–6432, 2021. 1, 2
[23] Lanxiao Li and Michael Heizmann. A closer look at invariances in self-supervised pre-training for 3d vision. arXiv preprint arXiv:2207.04997, 2022. 2
[24] Lanxiao Li and Michael Heizmann. A closer look at invariances in self-supervised pre-training for 3d vision. In ECCV, 2022. 14
[25] Haotian Liu, Mu Cai, and Yong Jae Lee. Masked discrimination for self-supervised learning on point clouds. arXiv preprint arXiv:2203.11183, 2022. 1, 2, 5, 6, 13
[26] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020. 6
[27] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. arXiv preprint arXiv:2206.05737, 2022. 8
[28] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM siggraph computer graphics, 21(4):163–169, 1987. 8
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[30] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 2, 4, 12
[31] Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. Voxel-mae: Masked autoencoders for pre-training large-scale point clouds. arXiv preprint arXiv:2206.09900, 2022. 1, 2
[32] Ishan Misra, Rohit Girdhar, and Armand Joulin. An End-to-End Transformer Model for 3D Object Detection. In ICCV, 2021. 6
[33] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022. 2
[34] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021. 2
[35] Joseph Ortiz, Alexander Clegg, Jing Dong, Edgar Sucar, David Novotny, Michael Zollhoefer, and Mustafa Mukadam. isdf: Real-time neural signed distance fields for robot perception. arXiv preprint arXiv:2204.02296, 2022. 5
[36] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. arXiv preprint arXiv:2203.06604, 2022. 1, 2
[37] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In European Conference on Computer Vision, pages 523–540. Springer, 2020. 6, 7, 12, 13
[38] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, 2019. 5, 6
[39] Ruslan Rakhimov, Andrei-Timotei Ardelean, Victor Lempitsky, and Evgeny Burnaev. Npbg++: Accelerating neural point-based graphics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15969–15979, 2022. 2
[40] Yongming Rao, Benlin Liu, Yi Wei, Jiwen Lu, Cho-Jui Hsieh, and Jie Zhou. Randomrooms: Unsupervised pre-training from synthetic shapes and randomized layouts for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3283–3292, 2021. 1, 2, 6
[41] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps, 2021. 2
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. 1
[43] Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32, 2019. 6, 7
[44] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015. 5
[45] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3405–3414, 2019. 7
[46] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022. 1
[47] Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, and Matt J Kusner. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9782–9792, 2021. 1, 6, 7
[48] Jingwen Wang, Tymoteusz Bleja, and Lourdes Agapito. Go-surf: Neural feature grid optimization for fast, high-fidelity rgb-d surface reconstruction. arXiv preprint arXiv:2206.14735, 2022. 5, 12
[49] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021. 2, 3, 4, 12
[50] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. arXiv preprint arXiv:2102.13090, 2021. 2
[51] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 2019. 6, 7
[52] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In European conference on computer vision, pages 574–591. Springer, 2020. 1, 2, 6
[53] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022. 2, 7, 12
[54] Ryosuke Yamada, Hirokatsu Kataoka, Naoya Chiba, Yukiyasu Domae, and Tetsuya Ogata. Point cloud pre-training with natural 3d structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21283–21293, 2022. 6
[55] Siming Yan, Zhenpei Yang, Haoxiang Li, Li Guan, Hao Kang, Gang Hua, and Qixing Huang. Implicit autoencoder for point cloud self-supervised representation learning. arXiv preprint arXiv:2201.00785, 2022. 1, 2, 6, 7, 13
[56] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021. 2, 6
[57] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. https://arxiv.org/abs/2012.02190, 2020. 2
[58] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 1, 2, 6
[59] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NERF++: Analyzing and improving neural radiance fields. https://arxiv.org/abs/2010.07492, 2020. 2
[60] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. arXiv preprint arXiv:2205.14401, 2022. 1, 2
[61] Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan Misra. Self-supervised pretraining of 3d features on any point-cloud. 2021. 1, 2, 6
Ponder: Point Cloud Pre-training via Neural Rendering
Supplementary Material
Table 9. 3D object detection AP25 and AP50 on ScanNet and SUN RGB-D. * means a different but stronger version of VoteNet.
Figure 8. More results of application examples of Ponder on the ScanNet validation set (part 1). The input point clouds are drawn as
spheres for better clarity.
Figure 9. More results of application examples of Ponder on the ScanNet validation set (part 2). The input point clouds are drawn as
spheres for better clarity.