
Ponder: Point Cloud Pre-training via Neural Rendering

Di Huang1,2 Sida Peng3 Tong He2,† Xiaowei Zhou3 Wanli Ouyang2


The University of Sydney1 Shanghai AI Laboratory2 Zhejiang University3

Abstract

We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods. The code will be released at https://dihuangdh.github.io/ponder.

Figure 1. This work proposes a novel point cloud pre-training method via neural rendering, named Ponder. Ponder is directly trained with RGB-D image supervision and can be used for various applications, e.g. 3D object detection, 3D semantic segmentation, 3D scene reconstruction, and image synthesis. (Panel labels: RGB-D, neural scene representation, render, compare; downstream tasks: 3D object detection, 3D semantic segmentation, 3D scene reconstruction, image synthesis.)

1. Introduction

We have witnessed the widespread success of supervised learning in developing vision tasks, such as image classification [11, 17] and object detection [16, 42]. In contrast to the 2D image domain, current 3D point cloud benchmarks only maintain limited annotations, in terms of quantity and diversity, due to the extremely high cost of laborious labeling. Self-supervised learning (SSL) for point clouds [7, 18, 20, 22, 25, 31, 36, 40, 47, 52, 55, 58, 60, 61], consequently, has become one of the main driving forces and has attracted increasing attention in the 3D research community.

Previous SSL methods for learning effective 3D representations can be roughly categorized into two groups: contrast-based [7, 18, 20, 22, 40, 52, 61] and completion-based [25, 31, 36, 47, 55, 58, 60]. Contrast-based methods are designed to maintain invariant representations under different transformations. To achieve this, informative samples are required. In the 2D image domain, this challenge is addressed by (1) introducing efficient positive/negative sampling methods, (2) using a large batch size and storing representative samples, and (3) applying various data augmentation policies. Inspired by these works, many methods [7, 18, 20, 22, 40, 52, 61] have been proposed to learn geometry-invariant features on 3D point clouds.

Completion-based methods are another line of research for 3D SSL, which use the pre-training task of reconstructing a masked point cloud from partial observations. By maintaining a high masking ratio, such a simple task encourages the model to learn a holistic understanding of the input beyond low-level statistics. Although masked autoencoders have been successfully applied for SSL in images [14] and videos [12, 46], this direction remains challenging and still under exploration due to the inherent irregularity and sparsity of point cloud data.

Different from the two groups of methods above, we propose point cloud pre-training via neural rendering (Ponder). Our motivation is that neural rendering, one of the most remarkable advances and domain-specific designs in 3D vision, can be leveraged to enforce the point cloud features to encode rich geometry and appearance cues. As illustrated in Figure 1, we address the task of learning representative 3D features via point cloud rendering. To the best of our knowledge, this is the first exploration of neural rendering for pre-training 3D point cloud models. Specifically, given one or a sequence of RGB-D images, we lift them to 3D space and obtain a set of colored points. Points are then forwarded to a 3D encoder to learn the geometry and appearance of the scene via a neural representation. Provided specific parameters of the camera and the neural representation from the encoder, neural rendering is leveraged to render the RGB and depth images in a differentiable way. The network is trained to minimize the difference between rendered and observed 2D images. In doing so, our approach enjoys multiple advantages:

• Our method is able to learn effective point cloud representations, which encode rich geometry and appearance cues by leveraging neural rendering.

• Our method can be flexibly integrated into various tasks. For the first time, we validate the effectiveness of the proposed pre-training method on low-level tasks like surface reconstruction and image synthesis.

• The proposed method can leverage rich RGB-D images for pre-training. The easier accessibility of RGB-D data enables the possibility of 3D pre-training on a large amount of data.

We conduct comprehensive experiments on a host of tasks. The consistent improvements demonstrate the effectiveness of our proposed Ponder. Our approach can serve as a strong alternative to contrast-based and completion-based methods in 3D point cloud pre-training.

† denotes the corresponding author.

Figure 2. Different types of point cloud pre-training. (Panels: contrast-based — contrast between features of two augmented point clouds; completion-based — encoder and decoder trained with point cloud supervision; Ponder — encoder, 3D feature volume, and neural rendering trained with RGB-D supervision.)

2. Related Work

Neural rendering. Neural rendering is a family of rendering techniques that use neural networks to differentiably render images from a 3D scene representation. NeRF [30] is one of the representative neural rendering methods, which represents the scene as a neural radiance field and renders images via volume rendering. Based on NeRF, a series of works [4, 33, 34, 41, 49, 50, 56, 57, 59] try to improve the NeRF representation, including accelerating NeRF training, boosting the quality of geometry, and so on. Another type of neural rendering leverages neural point clouds as the scene representation. [2, 39] take point locations and corresponding descriptors as input, rasterize the points with a z-buffer, and use a rendering network to get the final image. The later work PointNeRF [53] renders realistic images from a neural point cloud representation using a NeRF-like rendering process. Our work is inspired by this recent progress in neural rendering.

Self-supervised learning in point clouds. Current methods can be roughly categorized into two categories: contrast-based and completion-based. Inspired by works [6, 15] from the 2D image domain, PointContrast [52] is one of the pioneering works for 3D contrastive learning. Similarly, it encourages the network to learn invariant 3D representations under different transformations. Some works [7, 18, 20, 22, 40, 61] follow this pipeline by either devising new sampling strategies to select informative positive/negative training pairs or exploring various types of data augmentation. Another line of work is completion-based methods [25, 31, 36, 55, 58, 60], which take inspiration from Masked Autoencoders [14]. PointMAE [36] proposes restoring the masked points via a set-to-set Chamfer distance. VoxelMAE [31] instead recovers the underlying geometry by distinguishing whether a voxel contains points. Another work, MaskPoint [25], pre-trains a point cloud encoder by performing binary classification to check whether a sampled point is occupied. Later, IAE [55] proposes to pre-train a point cloud encoder by recovering continuous 3D geometry in an implicit manner. Different from the above pipelines, we propose a novel framework for point cloud pre-training via neural rendering.

Multi-modal point cloud pre-training. Some recent works explore pre-training pipelines with multi-modal data of 2D images and 3D point clouds. Pri3D [19] uses 3D point clouds and multi-view images to pre-train 2D image networks. CrossPoint [1] aligns 2D image features and 3D point cloud features through a contrastive learning pipeline. [23] proposes a unified framework for exploring the invariances with different input data formats, including 2D images and 3D point clouds. Different from previous methods, most of which attempt to align 2D images and 3D point clouds in the feature space, our method connects 2D and 3D in the RGB-D image domain via differentiable rendering.

3. Methods

An overview of our Ponder is presented in Figure 3. Provided the camera poses, 3D point clouds are obtained by projecting the RGB-D images back to 3D space (Section 3.1). Then, we extract point-wise features using a point cloud encoder (Section 3.2) and organize them into a 3D feature volume (Section 3.3), which is used to reconstruct the neural scene representation and render images in a differentiable manner (Section 3.4).
Figure 3. The pipeline of our point cloud pre-training via neural rendering (Ponder). Given multi-view RGB-D images, we first construct the point cloud by back-projection, then use a point cloud encoder fp (e.g. PointNet, PointNet++, or DGCNN) to extract per-point features E. E are organized into a 3D feature volume V by average pooling. Finally, the 3D feature volume is rendered to multi-view RGB-D images via differentiable neural rendering, and these are compared with the original input multi-view RGB-D images as the supervision. The point cloud encoder fp and color decoder fc are used for transfer learning. (Panel labels: multi-view RGB-D images, point cloud X, point cloud encoder fp, 3D feature volume V, query point, SDF decoder fs, color decoder fc, rendered RGB-D images, supervision.)

3.1. Constructing point cloud from RGB-D images

The proposed method makes use of sequential RGB-D images {(I_i, D_i)}_{i=1}^{N}, the camera intrinsic parameters {K_i}_{i=1}^{N}, and extrinsic poses {ξ_i}_{i=1}^{N} ∈ SE(3). N is the input view number. SE(3) refers to the Special Euclidean Group representing 3D rotations and translations. The camera parameters can be easily obtained from SfM or SLAM. We construct the point cloud X by back-projecting the RGB-D images to point clouds in a unified world coordinate system:

X = ∪_{i=1}^{N} π^{-1}(I_i, D_i, ξ_i, K_i),   (1)

where π^{-1} back-projects an RGB-D image to 3D world space using the camera poses. Note that, different from previous methods which only consider the point locations, our method attributes each point with both its location and its RGB color. The details of π^{-1} are provided in the supplementary material.
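To make Eq. (1) concrete, the following is a minimal sketch (not the authors' released code) of back-projecting a single RGB-D frame into a colored world-space point cloud; the function name and array conventions (3x3 intrinsics K, world-to-camera rotation R and translation t, depth in meters, zero marking invalid pixels) are our own assumptions.

```python
import numpy as np

def backproject_rgbd(rgb, depth, K, R, t):
    """Back-project one RGB-D frame to a colored point cloud in world coordinates.

    rgb:   (H, W, 3) color image
    depth: (H, W) depth map in meters (0 marks invalid pixels)
    K:     (3, 3) camera intrinsics
    R, t:  world-to-camera rotation (3, 3) and translation (3,)
    Returns an (M, 6) array of [x, y, z, r, g, b] rows.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel grid
    valid = depth > 0                                        # keep pixels with valid depth
    uv1 = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)  # (3, M)

    # Pinhole model: s [u, v, 1]^T = K (R X_w + t)  =>  X_w = R^T (s K^{-1} [u, v, 1]^T - t)
    cam_pts = np.linalg.inv(K) @ uv1 * depth[valid]          # points in the camera frame, (3, M)
    world_pts = R.T @ (cam_pts - t[:, None])                 # camera frame -> world frame

    colors = rgb[valid].astype(np.float32)                   # (M, 3) per-point RGB
    return np.concatenate([world_pts.T, colors], axis=1)
```

The full point cloud X of Eq. (1) is then simply the concatenation of the per-frame outputs over all N views.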
3.2. Point cloud encoder for feature extraction

Given the point cloud X constructed from the RGB-D images, a point cloud encoder fp is used to extract the per-point feature embeddings E:

E = fp(X).   (2)

The encoder fp pre-trained with the method described in Section 3.4 serves as a good initialization for various downstream tasks.

3.3. Building feature volume

Once the feature extraction is done, we map the point embeddings E to a 3D sparse feature volume. To fill in the empty space, we perform average pooling, followed by a 3D CNN, to aggregate features from the nearby points. The dense 3D volume is denoted as V.
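As a concrete illustration of Section 3.3, the sketch below scatter-averages per-point features into a regular voxel grid. It is a simplified stand-in (a dense grid instead of a sparse structure, and a single 3D convolution in place of the full 3D CNN); all tensor shapes and the `voxelize_features` name are our own assumptions.

```python
import torch
import torch.nn as nn

def voxelize_features(points, feats, resolution=64):
    """Average-pool per-point features into a dense (C, R, R, R) voxel grid.

    points: (N, 3) coordinates already normalized to the unit cube [0, 1]^3
    feats:  (N, C) per-point embeddings E from the encoder f_p
    """
    N, C = feats.shape
    idx = (points.clamp(0, 1 - 1e-6) * resolution).long()          # (N, 3) voxel indices
    flat = idx[:, 0] * resolution**2 + idx[:, 1] * resolution + idx[:, 2]

    volume = feats.new_zeros(resolution**3, C)
    count = feats.new_zeros(resolution**3, 1)
    volume.index_add_(0, flat, feats)                               # sum features per voxel
    count.index_add_(0, flat, torch.ones_like(feats[:, :1]))
    volume = volume / count.clamp(min=1)                            # average pooling

    return volume.T.reshape(C, resolution, resolution, resolution)

# A single 3D convolution standing in for the 3D CNN that densifies the pooled volume.
class VolumeCNN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, volume):                                      # (B, C, R, R, R)
        return self.conv(volume)
```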
3.4. Pre-training with Neural Rendering

This section introduces how to reconstruct the implicit scene representation and render images differentiably. We first give a brief introduction to neural scene representations, then illustrate how to integrate them into our point cloud pre-training pipeline. Last, we show the differentiable rendering formulation used to render color and depth images from the neural scene representation.

Brief introduction of neural scene representation. A neural scene representation aims to represent the scene geometry and appearance through a neural network. In this paper, we use the Signed Distance Function (SDF), which measures the distance between a query point and the surface boundary, to represent the scene geometry implicitly. The SDF is capable of representing high-quality geometric details. For any query point of the scene, the neural network takes the query point as input and outputs the corresponding SDF value and RGB value. In this way, the neural network captures both the geometry and appearance information of a specific scene. Following NeuS [49], the scene can be represented as:

s(p) = f̃s(p),   c(p, d) = f̃c(p, d),   (3)

where f̃s is the SDF decoder and f̃c is the RGB color decoder. f̃s takes the point location p as input and predicts the SDF value s. f̃c takes the point location p and viewing direction d as input and outputs the RGB color value c. Both f̃s and f̃c are implemented by simple MLP networks.

Neural scene representation from point cloud input in Ponder. To predict a neural scene representation from the input point cloud, we change the scene formulation to take the 3D feature volume V as an additional input. Specifically, given a 3D query point p and viewing direction d, the feature embedding V(p) can be extracted from the processed feature volume V by trilinear interpolation. The scene is then represented as:

s(p) = fs(p, V(p)),   c(p, d) = fc(p, d, V(p)),   (4)

where V is predicted by the point cloud encoder fp and encodes the information of each scene. fs and fc are the SDF and RGB decoders shared across all scenes. Different from Equation (3), which stores single-scene information in {f̃s, f̃c}, the formulation in Equation (4) includes an extra input V(p) to facilitate representing the information of multiple scenes.
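The following sketch shows one way Equation (4) can be realized: the query point trilinearly interpolates the feature volume (via `grid_sample`), and small MLPs decode the SDF and color. It is a minimal illustration under our own assumptions about layer sizes and tensor layouts, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_volume(volume, pts):
    """Trilinearly interpolate features V(p) at query points.

    volume: (1, C, R, R, R) dense feature volume V
    pts:    (M, 3) query points normalized to [-1, 1]^3
    Returns (M, C) interpolated features.
    """
    # grid_sample expects (B, D, H, W, 3); its last dim is ordered (x, y, z) ~ (W, H, D),
    # so the convention must match how the volume was voxelized.
    grid = pts.reshape(1, 1, 1, -1, 3)
    feat = F.grid_sample(volume, grid, mode="bilinear", align_corners=True)  # (1, C, 1, 1, M)
    return feat.reshape(volume.shape[1], -1).T

class SDFColorDecoder(nn.Module):
    """Shared decoders f_s and f_c of Eq. (4); hidden width is an assumption."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.sdf_mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        self.rgb_mlp = nn.Sequential(
            nn.Linear(3 + 3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, p, d, vp):
        sdf = self.sdf_mlp(torch.cat([p, vp], dim=-1))        # s(p) = f_s(p, V(p))
        rgb = self.rgb_mlp(torch.cat([p, d, vp], dim=-1))     # c(p, d) = f_c(p, d, V(p))
        return sdf.squeeze(-1), rgb
```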

Differentiable rendering. Given the dense 3D volume V and a viewing point, we use differentiable volume rendering to render the projected color and depth images. For each rendering ray with camera origin o and viewing direction d, we sample a set of ray points {p(z) | p(z) = o + zd, z ∈ [zn, zf]} along the ray, where z denotes the distance along the ray. Note that o and d can be calculated from the paired camera parameters {(Ki, ξi)}. zn and zf denote the near and far bounds of the ray. Different from previous methods [30, 49], we automatically determine {zn, zf} by intersecting the ray with the 3D feature volume box, using the axis-aligned bounding box (AABB) algorithm. Then, the ray color and depth value can be aggregated as:

Ĉ = ∫_{zn}^{zf} w(z) c(p(z), d) dz,   (5)

D̂ = ∫_{zn}^{zf} w(z) z dz,   (6)

where Ĉ is the ray color and D̂ is the ray depth. We follow NeuS [49] to build an unbiased and occlusion-aware weight function w(z):

w(z) = T(z) · ρ(z).   (7)

T(z) measures the accumulated transmittance from zn to z, and ρ(z) is the opaque density function; they are defined as:

T(z) = exp(− ∫_{zn}^{z} ρ(u) du),   (8)

ρ(z) = max( −(dΦh/dz)(s(p(z))) / Φh(s(p(z))), 0 ),   (9)

where Φh(x) is the Sigmoid function Φh(x) = (1 + e^{−hx})^{−1}, and h^{−1} is treated as a trainable parameter; h^{−1} approaches zero as the network training converges. In practice, we use a numerically approximated version by quadrature. We make the decoder networks {fs, fc} relatively smaller than those in [30, 49] to accelerate the training process.
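A minimal sketch of the quadrature version of Equations (5)-(9) is given below: ray/box intersection supplies the near and far bounds, discrete NeuS-style weights are accumulated, and color and depth are alpha-composited. The discretization details (how alpha is formed from consecutive Φh values, the sign convention of the SDF, the fixed h) follow the NeuS convention but are our own simplification, not the paper's exact implementation.

```python
import torch

def ray_aabb(origins, dirs, box_min, box_max):
    """Near/far bounds z_n, z_f from ray / axis-aligned-box intersection (slab method)."""
    inv = 1.0 / dirs
    t0 = (box_min - origins) * inv
    t1 = (box_max - origins) * inv
    z_near = torch.minimum(t0, t1).amax(dim=-1).clamp(min=0.0)
    z_far = torch.maximum(t0, t1).amin(dim=-1)
    return z_near, z_far

def render_rays(sdf_fn, rgb_fn, origins, dirs, z_near, z_far, n_samples=128, h=50.0):
    """Quadrature rendering of ray color and depth with NeuS-style weights."""
    B = origins.shape[0]
    u = torch.linspace(0.0, 1.0, n_samples, device=origins.device).expand(B, n_samples)
    z = z_near[:, None] + (z_far - z_near)[:, None] * u              # (B, S) samples in [z_n, z_f]
    pts = origins[:, None, :] + z[..., None] * dirs[:, None, :]      # p(z) = o + z d

    sdf = sdf_fn(pts)                                                # (B, S) signed distances
    phi = torch.sigmoid(h * -sdf)                                    # Phi_h applied to the SDF (sign convention assumed)
    # Discrete opacity between consecutive samples, clamped to [0, 1] as in NeuS.
    alpha = ((phi[:, :-1] - phi[:, 1:]) / (phi[:, :-1] + 1e-6)).clamp(0.0, 1.0)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-6], dim=-1), dim=-1)[:, :-1]
    weights = trans * alpha                                          # w = T * rho, discretized

    rgb = rgb_fn(pts[:, :-1, :], dirs[:, None, :].expand_as(pts[:, :-1, :]))  # (B, S-1, 3)
    color = (weights[..., None] * rgb).sum(dim=1)                    # Eq. (5)
    depth = (weights * z[:, :-1]).sum(dim=1)                         # Eq. (6)
    return color, depth
```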
Rendered examples. The rendered color images and depth images are shown in Figure 4. As shown in the figure, even though the input point cloud is pretty sparse, our method is still capable of rendering color and depth images similar to the reference images.

Figure 4. Rendered images by Ponder on the ScanNet validation set. The projected point clouds are visualized in the first column. Even though the input point clouds are very sparse, our model is still capable of rendering color and depth images similar to the reference images. (Columns: projected points, rendered color, reference color, rendered depth, reference depth.)

3.5. Pre-training loss

We leverage the input {Ii, Di} to supervise the neural scene representation reconstruction. The total loss function contains five parts,

L = λc Lc + λd Ld + λe Le + λs Ls + λf Lf,   (10)

which are the loss functions responsible for color supervision Lc, depth supervision Ld, Eikonal regularization Le, near-surface SDF supervision Ls, and free-space SDF supervision Lf. These loss functions are detailed in the following.

Color and depth loss. Lc and Ld are the color loss and depth loss, which measure the consistency between the rendered pixels and the ground-truth pixels. Assume that we sample Nr rays for each image and Np points for each ray; then Lc and Ld can be written as:

Lc = (1 / Nr) Σ_{i=1}^{Nr} ||Ĉ_i − C_i||²₂,   (11)

Ld = (1 / Nr) Σ_{i=1}^{Nr} ||D̂_i − D_i||²₂,   (12)

where C_i and D_i are the ground-truth color and depth for each ray, and Ĉ_i and D̂_i are their corresponding rendered values from Eq. (5) and Eq. (6).
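A compact sketch of the color and depth terms, assuming per-pixel ray origins and directions are already available and the renderer returns per-ray predictions as in Eqs. (5)-(6); all tensor names are ours.

```python
import torch
import torch.nn.functional as F

def sample_rays_and_losses(render_fn, rgb_img, depth_img, rays_o, rays_d, n_rays=128):
    """Randomly pick n_rays pixels of one view, render them, and compute L_c, L_d.

    rgb_img:   (H, W, 3) ground-truth color
    depth_img: (H, W)    ground-truth depth
    rays_o, rays_d: (H, W, 3) per-pixel ray origins and directions
    render_fn: callable returning (color, depth) for a batch of rays
    """
    H, W = depth_img.shape
    idx = torch.randint(0, H * W, (n_rays,))
    gt_rgb = rgb_img.reshape(-1, 3)[idx]
    gt_depth = depth_img.reshape(-1)[idx]
    pred_rgb, pred_depth = render_fn(rays_o.reshape(-1, 3)[idx], rays_d.reshape(-1, 3)[idx])

    l_c = F.mse_loss(pred_rgb, gt_rgb)        # Eq. (11)
    l_d = F.mse_loss(pred_depth, gt_depth)    # Eq. (12)
    return l_c, l_d
```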
Loss for SDF regularization. Le is the widely used Eikonal loss [13] for SDF regularization:

Le = (1 / (Nr Np)) Σ_{i,j} ( |∇s(p_{i,j})| − 1 )²,   (13)

where ∇s(p_{i,j}) denotes the gradient of the SDF s at location p_{i,j}. Since the SDF is a distance measure, Le encourages this distance field to have a unit-norm gradient at the query points.

Near-surface and free-space loss for SDF. To stabilize the training and improve the reconstruction performance, similar to iSDF [35] and GO-Surf [48], we add additional approximate SDF supervision to help the SDF estimation. Specifically, for near-surface points, the difference between the rendered depth and the ground-truth depth can be viewed as pseudo-SDF ground-truth supervision; for points far from the surface, a free-space loss is additionally used to regularize irregular SDF values. To calculate the approximate SDF supervision, we first define an indicator b(z) for each sampled ray point with ray length z and corresponding ground-truth depth D:

b(z) = D − z.   (14)

b(z) can be viewed as the approximate SDF value, which is credible only when b(z) is small. Let t be a manually defined threshold, which is set to 0.05 in this paper. For sampled ray points that satisfy b(z) ≤ t, we leverage the near-surface SDF loss to constrain the SDF prediction s(z_{i,j}):

Ls = (1 / (Nr Np)) Σ_{i,j} | s(z_{i,j}) − b(z_{i,j}) |.   (15)

For the remaining sampled ray points, we use a free-space loss:

Lf = (1 / (Nr Np)) Σ_{i,j} max( 0, e^{−α·s(z_{i,j})} − 1, s(z_{i,j}) − b(z_{i,j}) ),   (16)

where α is set to 5, following [35, 48]. Note that, due to the noisy depth images, we only apply Ls and Lf on the rays that have valid depth values.

In our experiments, we follow similar loss weights to GO-Surf [48], which sets λc to 10.0, λd to 1.0, λs to 10.0, and λf to 1.0. We observe that the Eikonal term in our method can easily lead to over-smooth reconstructions, so we use a small weight of 0.01 for the Eikonal loss.
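Below is a small sketch of how the SDF regularization terms and the weighted total of Eq. (10) could be computed: the Eikonal gradient is obtained with autograd, and the near-surface/free-space split follows Eqs. (14)-(16). The function names, the way `sdf_fn` is called, and the normalization over the selected points (rather than over Nr·Np) are assumptions on our side.

```python
import torch

def sdf_losses(sdf_fn, pts, z_vals, gt_depth, t=0.05, alpha=5.0):
    """Eikonal (Eq. 13), near-surface (Eq. 15), and free-space (Eq. 16) losses.

    pts:      (N_r, N_p, 3) sampled ray points
    z_vals:   (N_r, N_p) distances of the samples along their rays
    gt_depth: (N_r,) ground-truth depth per ray (only rays with valid depth are passed in)
    """
    pts = pts.detach().requires_grad_(True)
    sdf = sdf_fn(pts)                                               # (N_r, N_p)
    grad = torch.autograd.grad(sdf.sum(), pts, create_graph=True)[0]
    l_e = ((grad.norm(dim=-1) - 1.0) ** 2).mean()                   # Eikonal regularization

    b = gt_depth[:, None] - z_vals                                  # pseudo-SDF indicator b(z) = D - z
    near = b <= t                                                   # near-surface samples
    l_s = (sdf[near] - b[near]).abs().mean()
    far = ~near                                                     # remaining (free-space) samples
    l_f = torch.stack([torch.exp(-alpha * sdf[far]) - 1.0,
                       sdf[far] - b[far],
                       torch.zeros_like(sdf[far])], dim=0).max(dim=0).values.mean()
    return l_e, l_s, l_f

def total_loss(l_c, l_d, l_e, l_s, l_f,
               lam_c=10.0, lam_d=1.0, lam_e=0.01, lam_s=10.0, lam_f=1.0):
    """Weighted sum of Eq. (10) using the weights reported in the paper."""
    return lam_c * l_c + lam_d * l_d + lam_e * l_e + lam_s * l_s + lam_f * l_f
```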
4. Experiments

4.1. Pre-training

Datasets. We use ScanNet [10] RGB-D images as our pre-training data. ScanNet is a widely used real-world indoor dataset, which contains more than 1,500 indoor scenes. Each scene is carefully scanned by an RGB-D camera, leading to about 2.5 million RGB-D frames in total. We follow the same train/val split as VoteNet [38].

Data preparation. During pre-training, a mini-batch of batch size 8 includes point clouds from 8 scenes. The point cloud of a scene, serving as the input of the point cloud encoder in our approach, is back-projected from 5 RGB-D frames of the scene's video, sampled with an interval of 20 frames. The same 5 frames are also used as the supervision of the network.

Data augmentation. We augment the point cloud by random sampling, normalization, and random masking. First, we randomly down-sample the point cloud to 20,000 points. Then, the point cloud is normalized into a 3D unit cube. Finally, we apply the same masking strategy as used in MaskPoint [25]. Specifically, we use FPS to split the point cloud into 2,048 groups, each group containing 64 points, then mask the point groups with a mask ratio of 90%.

Implementation details. We train the proposed pipeline for 100 epochs using an AdamW optimizer [29] with a weight decay of 0.05. The learning rate is initialized to 1e-4 with exponential scheduling. For the rendering process, we randomly choose 128 rays for each image and sample 128 points for each ray. More implementation details can be found in the supplementary material.
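For concreteness, the sketch below mirrors the augmentation recipe just described (random down-sampling to 20,000 points, unit-cube normalization, FPS grouping into 2,048 groups of 64 points, 90% group masking) and the reported optimizer setup. The FPS helper and the exact exponential-decay gamma are stand-ins we chose, not the authors' code.

```python
import torch

def farthest_point_sampling(xyz, k):
    """Simple O(N*k) farthest point sampling returning k center coordinates."""
    n = xyz.shape[0]
    chosen = torch.zeros(k, dtype=torch.long)
    dists = torch.full((n,), float("inf"), device=xyz.device)
    chosen[0] = torch.randint(0, n, (1,))
    for i in range(1, k):
        dists = torch.minimum(dists, (xyz - xyz[chosen[i - 1]]).norm(dim=-1))
        chosen[i] = dists.argmax()
    return xyz[chosen]

def augment_point_cloud(points, n_points=20000, n_groups=2048, group_size=64, mask_ratio=0.9):
    """Random sampling -> unit-cube normalization -> FPS grouping -> group masking."""
    # 1) Random down-sampling to a fixed number of points.
    pts = points[torch.randperm(points.shape[0])[:n_points]]
    # 2) Normalize the coordinates (first 3 columns) into the unit cube [0, 1]^3.
    xyz = pts[:, :3]
    xyz = (xyz - xyz.min(dim=0).values) / (xyz.max(dim=0).values - xyz.min(dim=0).values + 1e-6)
    pts = torch.cat([xyz, pts[:, 3:]], dim=1)
    # 3) FPS centers define groups of nearest points; 90% of the groups are dropped.
    centers = farthest_point_sampling(xyz, n_groups)
    group_idx = torch.cdist(centers, xyz).topk(group_size, largest=False).indices   # (n_groups, 64)
    keep = torch.randperm(n_groups)[: int(n_groups * (1 - mask_ratio))]
    visible = group_idx[keep].reshape(-1)          # groups may overlap, so points can repeat
    return pts[visible]

def build_optimizer(model):
    """AdamW + exponential LR decay as reported in Section 4.1 (gamma is our assumption)."""
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    sched = torch.optim.lr_scheduler.ExponentialLR(optim, gamma=0.98)
    return optim, sched
```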
4.2. Transfer Learning

In contrast to previous methods, our approach is able to encode rich geometry and appearance cues into the point cloud representations via neural rendering. These strengths make it flexible to apply to various tasks, including not only 3D semantic segmentation and 3D detection tasks but also low-level surface reconstruction and image synthesis.

4.2.1 High-level 3D Tasks

3D object detection. For transfer learning on the 3D object detection task, we use VoteNet [38] as the baseline. VoteNet leverages a voting mechanism to generate object centers, which are used for 3D bounding box proposals. Two datasets are used to verify the effectiveness of our method: ScanNet [10] and SUN RGB-D [44]. Different from ScanNet, which contains fully reconstructed 3D scenes, SUN RGB-D is a single-view RGB-D dataset with 3D bounding box annotations. It has 10,335 RGB-D images for 37 object categories. For pre-training, we use PointNet++ as the point cloud encoder fp, which is identical to the backbone used in VoteNet. We pre-train the point cloud encoder on the ScanNet dataset and transfer the weights as the VoteNet initialization. Following [38], we use average precision with 3D detection IoU thresholds of 0.25 and 0.5 as the evaluation metrics.

Method             | Detection Model | Pre-training Type | Pre-training Data | Pre-training Epochs | ScanNet AP50↑ | ScanNet AP25↑ | SUN RGB-D AP50↑ | SUN RGB-D AP25↑
3DETR [32]         | 3DETR   | -          | -             | -    | 37.5 | 62.7 | 30.3 | 58.0
Point-BERT [58]    | 3DETR   | Completion | 3D Model      | 300  | 38.3 | 61.0 | -    | -
MaskPoint [25]     | 3DETR   | Completion | Depth         | 300  | 40.6 | 63.4 | -    | -
VoteNet [38]       | VoteNet | -          | -             | -    | 33.5 | 58.6 | 32.9 | 57.7
STRL [20]          | VoteNet | Contrast   | Depth         | 100  | 38.4 | 59.5 | 35.0 | 58.2
RandomRooms [40]   | VoteNet | Contrast   | Synthesis     | 300  | 36.2 | 61.3 | 35.4 | 59.2
PointContrast [52] | VoteNet | Contrast   | 3D Model      | -    | 38.0 | 59.2 | 34.8 | 57.5
PC-FractalDB [54]  | VoteNet | Contrast   | Synthesis     | -    | 38.3 | 61.9 | 33.9 | 59.4
DepthContrast [61] | VoteNet | Contrast   | Depth         | 1000 | 39.1 | 62.1 | 35.4 | 60.4
IAE [55]           | VoteNet | Completion | 3D Model      | 1000 | 39.8 | 61.5 | 36.0 | 60.4
Ponder             | VoteNet | Rendering  | Depth         | 100  | 40.9 | 64.2 | 36.1 | 60.3
Ponder             | VoteNet | Rendering  | Color & Depth | 100  | 41.0 | 63.6 | 36.6 | 61.0

Table 1. 3D object detection AP25 and AP50 on ScanNet and SUN RGB-D. VoteNet [38] and 3DETR [32] are two baseline 3D object detection models. The DepthContrast [61] and Point-BERT [58] results are adopted from IAE [55] and MaskPoint [25]. Ponder outperforms both VoteNet-based and 3DETR-based point cloud pre-training methods with fewer training epochs.

The 3D detection results are shown in Table 1. Our method improves the baseline of VoteNet without pre-training by a large margin, boosting AP50 by 7.5% and 3.7% for ScanNet and SUN RGB-D, respectively. IAE [55] is a pre-training method that represents the inherent 3D geometry in a continuous manner. Our learned point cloud representation achieves higher accuracy because it is able to recover both the geometry and appearance of the scene. The AP50 and AP25 of our method are higher than those of IAE by 1.2% and 2.1% on ScanNet, respectively. MaskPoint [25] is another method aiming to learn a continuous surface by classifying whether a query point is occupied. However, its performance can be constrained by the noisy labeling of the query point occupancy values. As presented in Table 1, even with an inferior backbone (PointNet++ vs 3DETR), our method is able to achieve better accuracy with fewer pre-training epochs.

3D semantic segmentation. 3D semantic segmentation is another fundamental scene understanding task. Following [43, 47, 55], we choose DGCNN [51] as our baseline for a fair comparison. DGCNN applies a dynamic graph CNN as the backbone. For pre-training, we use DGCNN as the point cloud encoder fp and pre-train the model on ScanNet. We validate the effectiveness of our method by transferring the weights to the Stanford Large-Scale 3D Indoor Spaces (S3DIS) [3] dataset, which is an indoor 3D understanding dataset containing 6 large-scale indoor scenes with point semantic annotations. Following the same setting as [51], we use the overall accuracy (OA) and mean IoU (mIoU) on points as the evaluation metrics, and report the average evaluation results across six folds.

Table 2 shows the quantitative results. Compared with the DGCNN baseline, the proposed method boosts the segmentation performance by a large margin, improving OA and mIoU by 2.1% and 5%, respectively. Jigsaw and OcCo use ShapeNet as the pre-training dataset. Although they obtain improvements compared with the baseline, the limited scale of the training data constrains their transferring ability. IAE achieves significant improvements by leveraging a large-scale dataset and an implicit reconstruction objective. Compared with IAE, the proposed approach achieves a higher semantic segmentation performance with the DGCNN backbone (+0.3% for OA and +0.4% for mIoU). Besides, IAE requires a large amount of 3D mesh data for supervision. Our approach, in contrast, only requires RGB-D images as the supervision, which are much cheaper and easier to obtain.

Method      | OA↑  | mIoU↑
DGCNN [51]  | 84.1 | 56.1
Jigsaw [43] | 84.4 | 56.6
OcCo [47]   | 85.1 | 58.5
IAE [55]    | 85.9 | 60.7
Ponder      | 86.2 | 61.1

Table 2. 3D semantic segmentation OA and mIoU on the S3DIS dataset with the DGCNN model. Ponder outperforms previous state-of-the-art models.

4.2.2 Low-level 3D Tasks

Low-level 3D tasks like scene reconstruction and image synthesis are getting increasing attention due to their wide applications. However, most such models are trained from scratch, and a way to pre-train a model with a good initialization is needed. We are the first pre-training work to demonstrate a strong transferring ability to such low-level 3D tasks.

3D scene reconstruction. The 3D scene reconstruction task aims to recover the scene geometry, e.g. a mesh, from the point cloud input. We choose ConvONet [37] as the baseline model, whose architecture is widely adopted in [9, 26, 56]. Following the same setting as ConvONet, we conduct experiments on the Synthetic Indoor Scene Dataset (SISD) [37], which is a synthetic dataset containing 5,000 scenes with multiple ShapeNet [5] objects. We pre-train the PointNet encoder, which is the same as in the original ConvONet implementation, and test the reconstruction quality on the SISD dataset. Additionally, to compare with another self-supervised learning method, IAE [55], we add extra experiments using a VoteNet-style PointNet++ as the encoder of ConvONet. Following [37], we use Volumetric IoU, Normal Consistency, and F-Score [45] with a threshold value of 1% as the evaluation metrics.

The results are shown in Table 3. Compared to the baseline ConvONet model with PointNet, the proposed approach is able to improve the reconstruction quality (+0.8% for IoU). By replacing the encoder of ConvONet from PointNet with PointNet++, ours achieves a larger accuracy improvement (+2.4% for IoU and +0.014 for F-Score). Our method also gets better reconstruction results than IAE. Check our supplementary materials for more details.

Method        | Encoder    | IoU↑ | Normal Consistency↑ | F-Score↑
ConvONet [37] | PointNet   | 84.9 | 0.915 | 0.964
Ponder        | PointNet   | 85.7 | 0.917 | 0.965
ConvONet      | PointNet++ | 77.8 | 0.887 | 0.906
Ponder        | PointNet++ | 80.2 | 0.893 | 0.920

Table 3. 3D scene reconstruction IoU, NC, and F-Score on the SISD dataset with PointNet and PointNet++ models. For both PointNet and PointNet++, Ponder is able to boost the reconstruction performance.

Image synthesis from point clouds. We also validate the effectiveness of our method on another low-level task, image synthesis from point clouds. We use Point-NeRF [53] as the baseline. Point-NeRF uses neural 3D point clouds with associated neural features to render images. It can be used both in a generalizable setting for various scenes and in a single-scene fitting setting. In our experiments, we mainly focus on the generalizable setting of Point-NeRF. We replace the 2D image features of Point-NeRF with point features extracted by a DGCNN network. Following the same setting as Point-NeRF, we use DTU [21] as the evaluation dataset. The DTU dataset is a multi-view stereo dataset containing 80 scenes with paired images and camera poses. We transfer both the DGCNN encoder and the color decoder as the weight initialization of Point-NeRF. We use PSNR as the metric for synthesized image quality evaluation.

The results are shown in Figure 5. By leveraging the pre-trained weights of our method, the image synthesis model is able to converge faster with fewer training steps and achieve better final image quality than training from scratch.

Figure 5. Comparison of image synthesis from point clouds. Compared with training from scratch, our Ponder model is able to converge faster and achieve better image synthesis results.

4.3. Ablation study

In this section, we conduct two ablation experiments. First, we show the effectiveness of using different 2D supervision. Then, we test how the view number affects the final performance. For both experiments, we use the 3D object detection task as the transfer learning task.

Influence of rendering targets. The rendering part of our method contains two items: the RGB color image and the depth image. We study the influence of each item with the transfer task of 3D detection. The results are presented in Table 4. Combining depth and color images for reconstruction shows the best detection results. In addition, using depth reconstruction presents better performance than color reconstruction for 3D detection.

Supervision   | ScanNet | SUN RGB-D
Depth         | 40.9    | 36.1
Color         | 40.5    | 35.8
Color + Depth | 41.0    | 36.6

Table 4. Ablation study for supervision type. 3D detection AP50 on ScanNet and SUN RGB-D. Combining color supervision and depth supervision leads to better detection performance than using a single type of supervision.

Number of input RGB-D views. Our method utilizes N RGB-D images, where N is the input view number. We study the influence of N and conduct experiments on 3D detection, as shown in Table 5. Using multi-view supervision helps to reduce single-view ambiguity. Similar observations are also found in the multi-view reconstruction task [27]. Compared with a single view, multiple views achieve higher accuracy, boosting AP50 by 0.9% and 1.2% for the ScanNet and SUN RGB-D datasets, respectively.

View number | ScanNet | SUN RGB-D
1 view      | 40.1    | 35.4
3 views     | 40.8    | 36.0
5 views     | 41.0    | 36.6

Table 5. Ablation study for view number. 3D detection AP50 on ScanNet and SUN RGB-D. Using multi-view supervision for point cloud pre-training achieves better performance.

4.4. Other applications

In the previous sections, we showed that the proposed pipeline can be used for transfer learning. In this section, we show that the pre-trained model from our pipeline Ponder itself can also be directly used for surface reconstruction and image synthesis from sparse point clouds.

3D reconstruction from sparse point clouds. The learned model has the capability to recover the scene surface from sparse point clouds. Specifically, after learning the neural scene representation, we query the SDF values in 3D space and leverage Marching Cubes [28] to extract the surface. We show the reconstruction results in Figure 6. The results show that even though the input is sparse point clouds from complex scenes, our method is able to recover high-fidelity meshes.

Image synthesis from sparse point clouds. Another interesting observation is that our pipeline is able to render realistic images from sparse point cloud input. As shown in Figure 6, our method is able to recover color images similar to the ground truth. Also, the recovered depth may even look better than the ground-truth depth image, which has irregular values.

Figure 6. Direct applications of Ponder on the ScanNet validation set. The proposed Ponder model can be directly used for various applications, such as 3D reconstruction and image synthesis. The input point clouds are drawn as spheres for better clarity. (Columns: input point cloud, projected point cloud, reconstruction, image synthesis, reference image, depth synthesis, reference depth.)

5. Conclusion

In this paper, we show that differentiable neural rendering is a powerful tool for point cloud representation learning. The proposed pre-training pipeline, Ponder, is able to encode rich geometry and appearance cues into the point cloud representation via neural rendering. For the first time, our model can be transferred not only to high-level 3D perception tasks but also to low-level 3D tasks, like 3D reconstruction and image synthesis from point clouds. Also, the learned Ponder model can be directly used for 3D reconstruction and image synthesis from sparse point clouds.

Several directions could be explored in future works. First, there are various types of neural rendering, which could also be leveraged for point cloud representation learning. Second, other 3D domain-specific designs could be integrated into point cloud pre-training pipelines. Third, exploring the proposed pre-training pipeline Ponder on a larger dataset and more downstream tasks is also a potential research direction.
References

[1] Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9902–9912, 2022.
[2] Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. In European Conference on Computer Vision, pages 696–712. Springer, 2020.
[3] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.
[4] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
[7] Yujin Chen, Matthias Nießner, and Angela Dai. 4dcontrast: Contrastive learning with dynamic correspondences for 3d scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
[8] Zhang Chen, Yinda Zhang, Kyle Genova, Sean Fanello, Sofien Bouaziz, Christian Häne, Ruofei Du, Cem Keskin, Thomas Funkhouser, and Danhang Tang. Multiresolution deep implicit functions for 3d shape representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13087–13096, 2021.
[9] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6970–6981, 2020.
[10] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[12] Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He. Masked autoencoders as spatiotemporal learners. arXiv preprint arXiv:2205.09113, 2022.
[13] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
[14] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15587–15597, 2021.
[19] Ji Hou, Saining Xie, Benjamin Graham, Angela Dai, and Matthias Nießner. Pri3d: Can 3d priors help 2d representation learning? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5693–5702, 2021.
[20] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3d point clouds. arXiv preprint arXiv:2109.00179, 2021.
[21] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014.
[22] Li Jiang, Shaoshuai Shi, Zhuotao Tian, Xin Lai, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Guided point contrastive learning for semi-supervised point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6423–6432, 2021.
[23] Lanxiao Li and Michael Heizmann. A closer look at invariances in self-supervised pre-training for 3d vision. arXiv preprint arXiv:2207.04997, 2022.
[24] Lanxiao Li and Michael Heizmann. A closer look at invariances in self-supervised pre-training for 3d vision. In ECCV, 2022.
[25] Haotian Liu, Mu Cai, and Yong Jae Lee. Masked discrimination for self-supervised learning on point clouds. arXiv preprint arXiv:2203.11183, 2022.
[26] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
[27] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. arXiv preprint arXiv:2206.05737, 2022.
[28] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[30] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[31] Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. Voxel-mae: Masked autoencoders for pre-training large-scale point clouds. arXiv preprint arXiv:2206.09900, 2022.
[32] Ishan Misra, Rohit Girdhar, and Armand Joulin. An end-to-end transformer model for 3d object detection. In ICCV, 2021.
[33] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022.
[34] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021.
[35] Joseph Ortiz, Alexander Clegg, Jing Dong, Edgar Sucar, David Novotny, Michael Zollhoefer, and Mustafa Mukadam. isdf: Real-time neural signed distance fields for robot perception. arXiv preprint arXiv:2204.02296, 2022.
[36] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. arXiv preprint arXiv:2203.06604, 2022.
[37] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In European Conference on Computer Vision, pages 523–540. Springer, 2020.
[38] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
[39] Ruslan Rakhimov, Andrei-Timotei Ardelean, Victor Lempitsky, and Evgeny Burnaev. Npbg++: Accelerating neural point-based graphics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15969–15979, 2022.
[40] Yongming Rao, Benlin Liu, Yi Wei, Jiwen Lu, Cho-Jui Hsieh, and Jie Zhou. Randomrooms: Unsupervised pre-training from synthetic shapes and randomized layouts for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3283–3292, 2021.
[41] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps, 2021.
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[43] Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32, 2019.
[44] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[45] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3405–3414, 2019.
[46] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022.
[47] Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, and Matt J Kusner. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9782–9792, 2021.
[48] Jingwen Wang, Tymoteusz Bleja, and Lourdes Agapito. Go-surf: Neural feature grid optimization for fast, high-fidelity rgb-d surface reconstruction. arXiv preprint arXiv:2206.14735, 2022.
[49] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
[50] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. arXiv preprint arXiv:2102.13090, 2021.
[51] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 2019.
[52] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In European Conference on Computer Vision, pages 574–591. Springer, 2020.
[53] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022.
[54] Ryosuke Yamada, Hirokatsu Kataoka, Naoya Chiba, Yukiyasu Domae, and Tetsuya Ogata. Point cloud pre-training with natural 3d structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21283–21293, 2022.
[55] Siming Yan, Zhenpei Yang, Haoxiang Li, Li Guan, Hao Kang, Gang Hua, and Qixing Huang. Implicit autoencoder for point cloud self-supervised representation learning. arXiv preprint arXiv:2201.00785, 2022.
[56] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
[57] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. https://arxiv.org/abs/2012.02190, 2020.
[58] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[59] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NERF++: Analyzing and improving neural radiance fields. https://arxiv.org/abs/2010.07492, 2020.
[60] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. arXiv preprint arXiv:2205.14401, 2022.
[61] Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan Misra. Self-supervised pretraining of 3d features on any point-cloud. 2021.
Ponder: Point Cloud Pre-training via Neural Rendering
Supplementary Material

A. Implementation Details

In this section, we give more implementation details of our Ponder model. Our code will be released upon acceptance.

A.1. Pre-training Details

3D feature volume. In our experiments, we build a hierarchical feature volume V with resolutions of [16, 32, 64]. Building a 3D hierarchical feature volume has been widely used for recovering detailed 3D geometry, e.g. [8, 9]. After processing the 3D feature volume with a 3D CNN, we use trilinear interpolation to get the feature of the query point p, denoted as V(p). We use the drop-in replacement of the grid sampler from [48] to accelerate the training.
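The hierarchical query can be illustrated as follows: the query point is interpolated in each resolution level and the per-level features are concatenated. This is a plain `grid_sample`-based sketch; the speed-optimized grid sampler from GO-Surf [48] mentioned above is not reproduced here, and the channel sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def query_hierarchical_volume(volumes, pts):
    """Concatenate trilinear samples from a multi-resolution feature volume.

    volumes: list of tensors [(1, C, 16, 16, 16), (1, C, 32, 32, 32), (1, C, 64, 64, 64)]
    pts:     (M, 3) query points normalized to [-1, 1]^3
    Returns (M, 3 * C) features V(p).
    """
    grid = pts.reshape(1, 1, 1, -1, 3)
    feats = []
    for vol in volumes:
        # mode="bilinear" on a 5-D input performs trilinear interpolation
        f = F.grid_sample(vol, grid, mode="bilinear", align_corners=True)   # (1, C, 1, 1, M)
        feats.append(f.reshape(vol.shape[1], -1).T)                         # (M, C)
    return torch.cat(feats, dim=-1)
```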
Ray sampling strategy. Similar to [30, 49], we sample twice for each rendering ray. First, we uniformly sample coarse points between the near bound zn and the far bound zf. Then, we use importance sampling with the coarse probability estimation to sample fine points. Following [49], the coarse probability is calculated based on Φh(s). With this sampling strategy, our method can automatically determine sample locations and collect more points near the surface, which makes the training process more efficient.
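A minimal sketch of this coarse-to-fine scheme: uniform samples are weighted by a surface-concentrated probability derived from Φh(s) at the coarse samples, and extra samples are drawn from that distribution by (approximate) inverse-transform sampling. The way the probability is formed from Φh(s) here, and the midpoint placement of the fine samples, are our simplifications rather than the paper's exact procedure.

```python
import torch

def coarse_to_fine_samples(sdf_fn, origins, dirs, z_near, z_far,
                           n_coarse=64, n_fine=64, h=50.0):
    """Uniform coarse samples followed by importance sampling near the surface."""
    B = origins.shape[0]
    u = torch.linspace(0.0, 1.0, n_coarse, device=origins.device).expand(B, n_coarse)
    z_coarse = z_near[:, None] + (z_far - z_near)[:, None] * u          # (B, n_coarse)
    pts = origins[:, None, :] + z_coarse[..., None] * dirs[:, None, :]

    phi = torch.sigmoid(h * -sdf_fn(pts))                                # Phi_h(s) at coarse samples
    prob = (phi[:, :-1] - phi[:, 1:]).clamp(min=1e-5)                    # mass concentrated at the surface
    prob = prob / prob.sum(dim=-1, keepdim=True)

    # Draw n_fine extra depths from the coarse distribution.
    cdf = torch.cumsum(prob, dim=-1)
    r = torch.rand(B, n_fine, device=origins.device)
    idx = torch.searchsorted(cdf, r).clamp(max=n_coarse - 2)
    z_fine = 0.5 * (z_coarse.gather(-1, idx) + z_coarse.gather(-1, idx + 1))

    z_all, _ = torch.sort(torch.cat([z_coarse, z_fine], dim=-1), dim=-1)
    return z_all
```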
Back projection. Here we give the details of the back-projection function π^{-1} used to get point clouds from depth images. Let K be the camera intrinsic parameters and ξ = [R|t] be the camera extrinsic parameters, where R is the rotation matrix and t is the translation vector. Xuv is the projected point location in the image and Xw is the point location in the 3D world coordinate system. Then, according to the pinhole camera model:

s Xuv = K (R Xw + t),   (17)

where s is the depth value. Expanding Xuv and Xw:

s [u, v, 1]^T = K (R [X, Y, Z]^T + t).   (18)

Then, the 3D point location can be calculated as follows:

[X, Y, Z]^T = R^{-1} (K^{-1} s [u, v, 1]^T − t).   (19)

The above Equation (19) is the back-projection function π^{-1} used in this paper.
Training Time. The Ponder model is trained with 8 NVIDIA A100 GPUs for 96 hours.

A.2. Transfer Learning Details

3D scene reconstruction. ConvONet [37] reconstructs scene geometry from the point cloud input. It follows a two-step manner, which first encodes the point cloud into a 3D feature volume or multiple feature planes, then decodes the occupancy probability for each query point. To test the transfer learning ability of our point cloud encoder, we directly replace the point cloud encoder of ConvONet, without any other modification. We choose the highest-performing configuration of ConvONet as the baseline setting, which uses a 3D feature volume with a resolution of 64. For the training of ConvONet, we follow the same training setting as the released code (https://github.com/autonomousvision/convolutional_occupancy_networks).

Image synthesis from point clouds. Point-NeRF [53] renders images from a neural point cloud representation. It first generates neural point clouds from multi-view images, then uses point-based volume rendering to synthesize images. To transfer the learned network weights to the Point-NeRF pipeline, we 1) replace the 2D image feature backbone with the pre-trained point cloud encoder to get the neural point cloud features, 2) replace the color decoder with the pre-trained color decoder, and 3) keep the other Point-NeRF modules untouched. Since a large point cloud is hard to process directly with the point cloud encoder, we down-sample the point cloud to 1%, which decreases the rendering quality but helps reduce the GPU memory requirements. We report the PSNR of the unmasked region as the evaluation metric, which is directly adopted from the original codebase (https://github.com/Xharlie/pointnerf). For training Point-NeRF, we follow the same setting as Point-NeRF.

B. Supplementary Experiments

B.1. Ablation Study

Influence of mask ratio. In this paper, we use random masking as one type of point cloud augmentation. We apply the same mask ratio as MaskPoint [25]. Here, we give additional experimental results to show the influence of using different mask ratios in Table 6. For the mask ratio of 0%, we do not apply any masking strategy to the input point cloud.

Mask ratio | ScanNet | SUN RGB-D
0%         | 40.7    | 37.3
25%        | 40.7    | 36.2
50%        | 40.3    | 36.9
75%        | 41.7    | 37.0
90%        | 41.0    | 36.6

Table 6. Ablation study for mask ratio. 3D detection AP50 on ScanNet and SUN RGB-D.

3D feature volume resolution. As mentioned in Section A, Ponder builds a 3D feature volume with resolutions of [16, 32, 64], which is inspired by the recent progress of multi-resolution representations in 3D reconstruction. However, building such a 3D feature volume with large resolutions requires heavy GPU memory usage. We conduct experiments in Table 7 to test the performance with a smaller resolution. As shown in the table, even with a small resolution, Ponder is still able to achieve comparable accuracy, demonstrating its robustness to the feature volume resolution.

Resolution | ScanNet | SUN RGB-D
16         | 40.7    | 36.6
16+32+64   | 41.0    | 36.6

Table 7. Ablation study for feature volume resolution. 3D detection AP50 on ScanNet and SUN RGB-D.

B.2. Transfer Learning

3D scene reconstruction. As mentioned in the paper, we transfer the learned PointNet++ model of IAE to the 3D reconstruction task. The results are shown in Table 8. Compared with the ConvONet baseline, the IAE pre-trained model gets a better F-Score by 0.004 but gets worse results on the IoU metric. Our method, on the other hand, gets better reconstruction performance than both ConvONet and IAE.

Method       | IoU↑  | Normal Consistency↑ | F-Score↑
ConvOcc [37] | 0.778 | 0.887 | 0.906
IAE [55]     | 0.757 | 0.887 | 0.910
Ours         | 0.802 | 0.893 | 0.920

Table 8. 3D scene reconstruction IoU, NC, and F-Score on the SISD dataset with the PointNet++ model.

Label efficiency training. We also conduct experiments to show the performance of our method with limited labeling for the downstream task. Specifically, we test label-efficient training on the 3D object detection task for ScanNet. Following the same setting as IAE [55], we use 20%, 40%, 60%, and 80% of the ground-truth annotations. The results are shown in Figure 7. We show consistently improved results over training from scratch, especially when only 20% of the data is available.

Figure 7. Label efficiency training. We show the 3D object detection results using limited downstream data. Our pre-trained model is capable of achieving better performance than training from scratch using the same percentage of data, or requires less data to reach the same detection accuracy.

Color information for downstream tasks. Different from previous works, since our pre-training model uses a colored point cloud as the input, we also use color information for the downstream tasks. Results are shown in Table 9. Using color as an additional point feature can help the VoteNet baseline achieve better performance on the SUN RGB-D dataset, but brings little improvement on the ScanNet dataset. This shows that directly concatenating point positions and colors as point features has limited robustness to application scenarios. By leveraging the proposed Ponder pre-training method, the network is well initialized to handle the point position and color features, and achieves better detection accuracy.

More comparisons on 3D detection. More detection accuracy comparisons are given in Table 9. Even using an inferior backbone, our Ponder model is able to achieve detection accuracy similar to DPCo and IPCo [24] on ScanNet and better accuracy on SUN RGB-D.
Method            | Detection Model | Pre-training Type | Pre-training Data | Pre-training Epochs | ScanNet AP50↑ | ScanNet AP25↑ | SUN RGB-D AP50↑ | SUN RGB-D AP25↑
VoteNet*          | VoteNet* | -         | -             | -   | 37.6 | 60.0 | 33.3 | 58.4
DPCo [24]         | VoteNet* | Contrast  | Depth         | 120 | 41.5 | 64.2 | 35.6 | 59.8
IPCo [24]         | VoteNet* | Contrast  | Color & Depth | 120 | 40.9 | 63.9 | 35.5 | 60.2
VoteNet (w color) | VoteNet  | -         | -             | -   | 33.4 | 58.8 | 34.3 | 58.3
Ponder            | VoteNet  | Rendering | Depth         | 100 | 40.9 | 64.2 | 36.1 | 60.3
Ponder            | VoteNet  | Rendering | Color & Depth | 100 | 41.0 | 63.6 | 36.6 | 61.0

Table 9. 3D object detection AP25 and AP50 on ScanNet and SUN RGB-D. * means a different but stronger version of VoteNet.

B.3. More application examples

As mentioned in the paper, the pre-trained Ponder model can be directly used for surface reconstruction and image synthesis tasks. We give more application examples in Figure 8 and Figure 9.
Figure 8. More results of application examples of Ponder on the ScanNet validation set (part 1). The input point clouds are drawn as spheres for better clarity. (Columns: input point cloud, projected point cloud, reconstruction, image synthesis, depth synthesis.)

Figure 9. More results of application examples of Ponder on the ScanNet validation set (part 2). The input point clouds are drawn as spheres for better clarity. (Columns: input point cloud, projected point cloud, reconstruction, image synthesis, depth synthesis.)
