Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields

Tao Hu1, Xiaogang Xu1*, Shu Liu2, Jiaya Jia1,2
1The Chinese University of Hong Kong   2SmartMore
{taohu, xgxu, leojia}@cse.cuhk.edu.hk, [email protected]
*Corresponding author.

arXiv:2303.16482v1 [cs.CV] 29 Mar 2023

Abstract

Synthesizing photo-realistic images from a point cloud is challenging because of the sparsity of point cloud representation. Recent Neural Radiance Fields and extensions have been proposed to synthesize realistic images from 2D input. In this paper, we present Point2Pix as a novel point renderer to link 3D sparse point clouds with 2D dense image pixels. Taking advantage of the point cloud 3D prior and the NeRF rendering pipeline, our method can synthesize high-quality images from colored point clouds, generally for novel indoor scenes. To improve the efficiency of ray sampling, we propose point-guided sampling, which focuses on valid samples. Also, we present Point Encoding to build Multi-scale Radiance Fields that provide discriminative 3D point features. Finally, we propose Fusion Decoding to efficiently synthesize high-quality images. Extensive experiments on the ScanNet and ARKitScenes datasets demonstrate the effectiveness and generalization of our method.

1. Introduction

Point cloud rendering aims to synthesize images from point clouds at given camera parameters, and has been frequently utilized in 3D visualization, navigation, and augmented reality. There are many advantages to the point cloud representation, such as flexible shape and a general 3D prior. However, since point clouds are generally produced by 3D scanners (RGBD or LiDAR) [3, 9, 14] or by Multi-View Stereo (MVS) from images [4, 15, 52], the points are usually sparsely distributed in 3D scenes. Although traditional graphics-based renderers [38, 42, 50, 58] can render point clouds to images without training or finetuning, the quality is not satisfying, with hole artifacts and missing details [10].

Recently, Neural Radiance Fields (NeRF) [29] were proposed for 3D representation and high-fidelity novel view synthesis. NeRF employs an implicit function to directly map each point's spatial information (location and direction) to attributes (color and density). However, most NeRF-based methods [23, 49, 54, 55] are scene-specific, thus taking much time to train from scratch for novel scenes with abundant multi-view images, which limits practical applications.

In this work, we bridge the gap between point clouds and NeRF, proposing a novel point cloud renderer, called Point2Pix, to synthesize photo-realistic images from colored point clouds. Compared with most NeRF-based methods [23, 29, 53, 54], ours does not necessarily require multi-view images or fine-tuning procedures for indoor scenes.

First, point clouds are treated as underlying anchors of NeRF. The training process of NeRF is to learn 3D point attributes from given locations. Because there is no mapping ground truth for multi-view images, NeRF-based methods [23, 29] indirectly train their networks with a pixel reconstruction loss. Note that point clouds are exactly made up of points with locations and attributes, which can thus provide training pairs for the mapping function to conduct supervised learning and improve performance.

Then, point clouds can also improve the efficiency of ray sampling. NeRF-based methods [23, 29, 54] learn the 3D shape and structure from multi-view images, which does not provide a geometric prior in novel scenes. Thus, dense uniform sampling [5, 46] or coarse-to-fine sampling [29, 54] was used to synthesize high-quality images. These strategies are inefficient because most of the locations in 3D scenes are empty [23, 31]. Since point clouds represent a relatively fine shape of 3D scenes, the area around existing points deserves more attention. Based on this observation, we propose a point-guided sampling strategy that mainly focuses on the local area around points in the point cloud. It can significantly reduce the number of required samples while maintaining decent synthesis accuracy.

Further, point-based networks can provide 3D feature priors for subsequent applications, general for novel scenes. Although many methods [7, 18, 36, 37] have been proposed for various point cloud understanding tasks, they are usually designed for existing points only. In this work, we propose Multi-scale Radiance Fields, consisting of a Point Encoder and MLP networks, to extract multi-scale features for any location in the scene. These 3D point features are discriminative and general, ensuring finetuning-free rendering.
Also, inspired by recent NeRF-based image generators [19, 22, 57], we render the 3D point features as multi-scale feature maps. Our fusion decoder gradually synthesizes high-resolution images. It can not only fill possible holes but also improve the quality of rendered images. Our main contributions are summarized as follows.

• We propose Point2Pix to link point clouds with image space, rendering point clouds into photo-realistic images.

• We present an efficient ray sampling strategy and a fusion decoder to greatly decrease the number of samples in each ray and the total number of rays, thus accelerating the rendering process.

• We propose Multi-scale Radiance Fields, which extract discriminative 3D priors for arbitrary 3D locations.

• Extensive experiments and ablation studies on indoor datasets demonstrate the effectiveness and generalization of the proposed method.

2. Related Work

In this section, we briefly review the related works, including various point renderers, point-based networks for extracting 3D point features, and NeRF-based synthesis.

2.1. Point-based Rendering

Traditional point rendering [38, 42, 58] is based on computer graphics, which generates images from point clouds by simulating the physical process of imaging, considering geometry [44], material [12], BRDF [16], and lighting [21]. The rendering pipeline is general for arbitrary scenes, but it cannot fill missing points and thus generates vacant pixels.

In the deep learning era, neural-based point renderers [10, 11, 41] made great progress in generating images from point clouds. They first extract multi-scale projected features from point clouds and then use a decoder to generate images. NPBG [11] augments each point with a neural descriptor to encode local geometry and appearance. NPCR [10] represents point features with a 3D Convolutional Neural Network (CNN) and converts the 3D representation to multi-plane images. Different from these methods, our Point2Pix combines a more discriminative point encoder with the rendering pipeline of NeRF, thus achieving better performance.

2.2. Point-based Networks

Point-based networks have been developed for many years [7, 8, 17, 18, 36, 37]. For general point understanding, PointNet [36] utilizes point-wise Multi-Layer Perceptron (MLP) and pooling to extract features for 3D classification and semantic segmentation, while not capturing local structures. PointNet++ [37] introduces hierarchical feature learning to encode local point features. Although 3D CNNs can also deal with point cloud data after voxelization, the maximal resolution of 3D volumes is low because of huge memory consumption and slow computation. Thus, sparse 3D CNNs [7, 17, 20, 45] draw more attention. SparseConvNet [18] differs from previous sparse convolutional networks since it uses a rectangular grid, instead of dilating the observation. In this paper, we adopt a more efficient sparse 3D CNN, MinkowskiEngine [7], as the basic point encoder to extract 3D priors from point clouds.

2.3. NeRF-based Synthesis

NeRF [29] well balances neural networks and physical rendering, thus achieving state-of-the-art performance in novel view synthesis. To handle dynamic scenes where objects move, a deformable function is learned to align different frames [24, 33]. For human reconstruction and synthesis, many methods, such as Neural-Body [35], Neural-Actor [25], and Anim-NeRF [6], introduce the parameterized human model SMPL [26] as a strong 3D prior and achieve impressive performance. To generate high-resolution images, the methods of [19, 22, 32, 57] first render low-resolution feature maps instead of images, then upsample the features into the final images via a 2D CNN. NeRF's input is only multi-view images and camera parameters; when combined with 3D priors, such as depth and point clouds, the performance can be further improved [1, 13, 31, 51]. Our model also combines deep learning with physical rendering, while taking more advantage of point clouds as 3D priors to render decent-quality images in general indoor scenes.

3. Our Approach

For a point cloud $\mathbf{P} = \cup_{k=1}^{K} \{\mathbf{p}_k, \mathbf{c}_k\}$ with $K$ points, where $\mathbf{p}_k = (x_k, y_k, z_k) \in \mathbb{R}^3$ and the corresponding colors are $\mathbf{c}_k = (r_k, g_k, b_k) \in \mathbb{R}^3$, our goal is to synthesize a high-fidelity image $\hat{I}$ at the given camera parameter $V$ via our proposed renderer (Point2Pix) $\mathcal{R}$, formulated as

$$\hat{I} = \mathcal{R}(\mathbf{P}, V), \tag{1}$$

where $V$ is represented by $H \times W$ rays $\mathbf{r}$. Each ray starts from the camera center $\mathbf{o}$ in pixel direction $\mathbf{d}$.

To begin with, we introduce the background knowledge of NeRF in Sec. 3.1. Then, we propose an efficient point-based ray sampling strategy in Sec. 3.2. We also build a network to extract multi-scale 3D priors from point clouds in Sec. 3.3. Finally, we show how to combine the point features with NeRF to render target images in Sec. 3.4. The overview of our framework is shown in Fig. 1.
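To make the camera-to-ray correspondence concrete, the sketch below builds the H × W rays for a view V. It is an illustrative example rather than the paper's implementation, and it assumes a pinhole camera described by a 3 × 3 intrinsic matrix K and a 4 × 4 camera-to-world pose c2w, neither of which is specified in the paper.

```python
import numpy as np

def generate_rays(H, W, K, c2w):
    """Build the H x W rays (o, d) that represent a camera V.

    Assumptions (not from the paper): K is a pinhole intrinsic matrix and
    c2w a camera-to-world pose with rotation c2w[:3, :3] and translation c2w[:3, 3].
    """
    u, v = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32))
    # Back-project every pixel to a direction in the camera frame.
    dirs_cam = np.stack([(u - K[0, 2]) / K[0, 0],
                         (v - K[1, 2]) / K[1, 1],
                         np.ones_like(u)], axis=-1)            # (H, W, 3)
    # Rotate into the world frame and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    d = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Each ray starts at the camera center o.
    o = np.broadcast_to(c2w[:3, 3], d.shape)
    return o, d
```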
Figure 1. Overview of our proposed Point2Pix. (a) Multi-scale Radiance Fields. For an input point cloud, we first extract multiple 3D feature volumes in four scales. Next, for any queried point, we first linearly interpolate the coarse features from these feature volumes and then infer the final features through MLP networks. (b) Fusion and Upsampling. Four 2D feature maps are respectively rendered through NeRF. They are fused with the previous 2D CNN output and then upsampled by 2 times. (c) Fusion Decoding. We finally design a neural renderer to gradually synthesize target images from the projected feature maps.

3.1. Preliminary

Given multi-view camera-calibrated images of a scene, NeRF [29] synthesizes high-quality novel view images. It mainly consists of ray sampling, an implicit function, and volume rendering.

Ray Sampling. Starting from the camera center o, ray sampling is the process that obtains a series of positions x_i along ray r with direction d as

$$\mathbf{x}_i = \mathbf{o} + z_i \cdot \mathbf{d}, \quad i = 1, 2, \ldots, N, \tag{2}$$

where z_i is the sampling depth and N is the number of samples on each ray.

Implicit Function. An implicit function f_θ is trained as a mapping from each queried location x_i and direction d to the corresponding color c_i = (r_i, g_i, b_i) and density σ_i, as

$$(\mathbf{c}_i, \sigma_i) = f_\theta(\mathbf{x}_i, \mathbf{d}), \tag{3}$$

where f_θ is an MLP network and θ is its parameter.

Volume Rendering. Each ray (or pixel) color ĉ is calculated via volume rendering [28] as

$$\begin{aligned} \hat{\mathbf{c}} &= \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i, \\ \alpha_i &= 1 - \exp(-\sigma_i \delta_i), \\ T_i &= \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \end{aligned} \tag{4}$$

where δ_j is the distance between neighboring samples along the ray r.
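As a concrete reference for Eq. (4), the compositing of a single ray can be written in a few lines. The following is a minimal NumPy sketch for illustration, not the authors' code; it takes precomputed per-sample densities, colors, and inter-sample distances.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Volume rendering of one ray as in Eq. (4).

    sigmas: (N,) densities, colors: (N, 3) RGB values, deltas: (N,)
    distances between neighboring samples. Illustrative sketch only.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # alpha_i
    # Transmittance accumulated over the first i samples.
    trans = np.exp(-np.cumsum(sigmas * deltas))
    T = np.concatenate([[1.0], trans[:-1]])            # shift so that T_1 = 1
    weights = T * alphas                               # T_i * alpha_i
    return (weights[:, None] * colors).sum(axis=0)     # composited ray color
```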
3.2. Point-guided Sampling

According to Eq. (4), increasing the number of samples N along each ray can generate more realistic results [29]. However, the required computing resources and running time also grow linearly. Our model is based on a point cloud with a relatively fine shape prior. Thus, we propose point-guided sampling to achieve more efficient ray sampling through the guidance of the point cloud.

For any queried point x_i, we first find its nearest neighbour point p_i, then check whether x_i is located in p_i's ball area with radius r or not, as

$$\parallel \mathbf{p}_i - \mathbf{x}_i \parallel_2 \leq r. \tag{5}$$

If the above condition is satisfied, we treat the queried point x_i as a valid sample and obtain its point feature as described in Sec. 3.3. If there is no valid sample along a ray, we adopt uniform sampling from the default near to far depth, as illustrated in Fig. 2. Compared with previous uniform and coarse-to-fine sampling [5, 29, 46], our sampling strategy reduces computation and memory costs.

Figure 2. The proposed point-guided sampling. For any queried point x_i, we find its nearest point p_i in the point cloud. If x_i is located in the ball area (with radius r) of p_i, it is a valid sample. Invalid samples are omitted to improve sampling efficiency.
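The validity test of Eq. (5) amounts to a nearest-neighbour query against the point cloud. Below is a minimal sketch of the idea for a single ray; the use of a SciPy KD-tree, the uniform candidate depths, and the near/far defaults are our assumptions rather than details from the paper, while the 0.08 m radius matches the value reported in Sec. 4.1.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_guided_sampling(o, d, points, radius=0.08, n_samples=128,
                          near=0.1, far=8.0):
    """Keep only ray samples inside a ball of `radius` around their nearest
    cloud point (Eq. (5)). Illustrative sketch for one ray.
    """
    z = np.linspace(near, far, n_samples)             # uniform candidate depths
    x = o[None, :] + z[:, None] * d[None, :]          # candidate samples (Eq. (2))
    dist, _ = cKDTree(points).query(x)                # distance to nearest point
    valid = dist <= radius                            # the test of Eq. (5)
    if not valid.any():                               # fall back to uniform sampling
        return x, z
    return x[valid], z[valid]
```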
Upsampling. We adopt PixelShuffle [43] as our upsam-
pling modules that upsample the fused feature F l−1 by 2
that are converted from point clouds. As illustrated in Fig. times at each stage, instead of using bilinear or nearest in-
1, our point encoder extracts multiple 3D feature volumes terpolation. PixelShuffle [43] is frequently adopted in the
from the raw point cloud in L different scales. We select Fl super-resolution task, which utilizes a convolution layer to
at scale l to construct multi-scale radiance fields. extend the channel size and reshape them into the spatial
For each valid sample xi at each scale l, we query fea- size, as
ture in Fl to obtain the interpolated feature Fil and employ
an implicit function Φl to infer density σil and final point \mathcal {F}^{l}&= \textbf {Pixelshuffle}(\textbf {Conv2D}(\mathcal {F}^{l-1}), 2). (9)
feature fil for xi as
ToRGB. Finally, we introduce the decoder of the present
(\sigma _i^l, \textbf {f}_i^l) = \Phi _l(\textbf {F}_i^l) = \Phi _l(\textbf {F}^l[\textbf {x}_i]). \label {eq:feature_function} (6) large-scale generator, like [39], as a post-process to gener-
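For readers unfamiliar with sparse 3D CNNs, the sketch below shows how a colored point cloud can be voxelized into a MinkowskiEngine sparse tensor and passed through strided sparse convolutions to obtain coarser feature volumes. It is only a schematic stand-in for the MinkUNet14A backbone used in the paper; the voxel size and channel widths are illustrative choices.

```python
import torch
import MinkowskiEngine as ME

def encode_point_cloud(xyz, rgb, voxel_size=0.05):
    """Convert a colored point cloud into sparse feature volumes.

    xyz: (K, 3) float array of point locations, rgb: (K, 3) colors in [0, 1].
    Illustrative sketch: two strided sparse convolutions stand in for the
    multi-scale MinkUNet encoder described in the paper.
    """
    # Quantize continuous coordinates into voxel indices.
    coords, feats = ME.utils.sparse_quantize(
        coordinates=xyz, features=rgb, quantization_size=voxel_size)
    x = ME.SparseTensor(
        features=torch.as_tensor(feats, dtype=torch.float32),
        coordinates=ME.utils.batched_coordinates([coords]))
    conv1 = ME.MinkowskiConvolution(3, 32, kernel_size=3, stride=2, dimension=3)
    conv2 = ME.MinkowskiConvolution(32, 64, kernel_size=3, stride=2, dimension=3)
    f1 = conv1(x)   # feature volume at 1/2 of the input voxel resolution
    f2 = conv2(f1)  # feature volume at 1/4 of the input voxel resolution
    return f1, f2
```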
NeRF-based Feature Rendering. We then render the queried 3D point features to 2D feature maps and generate images at different scales. At each feature scale l, we aggregate the densities σ_i^l and features f_i^l to generate the 2D feature map f^l by the volume rendering of NeRF [29] as

$$\begin{aligned} \mathbf{f}^l &= \sum_{i=1}^{N} \mathbf{w}_i^l \mathbf{f}_i^l, \\ \mathbf{w}_i^l &= \exp\Big(-\sum_{j=1}^{i-1} \sigma_j^l \delta_j\Big)\big(1 - \exp(-\sigma_i^l \delta_i)\big). \end{aligned} \tag{7}$$

So far, we have obtained L rendered feature maps {f^l} ∈ R^{C_l × H_l × W_l}, where H_l and W_l respectively represent the feature height and width, and C_l is the number of channels.

3.4. Fusion Decoding

Although we propose an efficient ray sampling strategy to reduce memory consumption, it still requires more than 20 GB of GPU memory to render target images with a size of 480 × 640, as shown in Tab. 4. In addition, there are still many holes to be filled in the 2D-rendered image space. To address these issues, we design a Fusion Decoder as a neural renderer that synthesizes final images from the rendered feature maps f^l, l ∈ [1, L], by conditional convolution and upsampling modules.

Fusion. Our conditional convolution fuses the previous layer's feature F^{l-1} with the rendered feature f^l, treating the rendered feature at each scale l as the conditional input. This module is inspired by SPADE [34], while we use Layer Normalization [2]. Specifically, as shown in Eq. (8), for the rendered feature map f^l, we calculate the conditional parameters, including the scale γ and bias β, with a Conv2D module. Then, for the feature F^{l-1} from the previous stage, we normalize it by Layer Normalization and scale it by γ. Finally, the fused feature is obtained by adding the bias β as

$$\begin{aligned} (\gamma, \beta) &= \mathrm{Conv2D}(\mathbf{f}^l), \\ \mathcal{F}^{l-1} &= \gamma \cdot \mathrm{LayerNorm}(\mathcal{F}^{l-1}) + \beta. \end{aligned} \tag{8}$$

Upsampling. We adopt PixelShuffle [43] as our upsampling module, which upsamples the fused feature F^{l-1} by 2 times at each stage, instead of using bilinear or nearest interpolation. PixelShuffle [43] is frequently adopted in the super-resolution task; it utilizes a convolution layer to extend the channel size and then reshapes channels into the spatial dimensions, as

$$\mathcal{F}^{l} = \mathrm{PixelShuffle}(\mathrm{Conv2D}(\mathcal{F}^{l-1}), 2). \tag{9}$$

ToRGB. Finally, we introduce the decoder of a recent large-scale generator, like [39], as a post-processing step to generate the final rendered image Î for the whole point cloud renderer.
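One fusion-and-upsampling stage (Eqs. (8) and (9)) can be sketched in PyTorch as below. This is an illustrative module, not the released implementation; the kernel sizes and channel widths are assumptions, layer normalization is applied over each sample's (C, H, W) dimensions, and the rendered feature map is assumed to match the spatial size of the previous-stage feature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseUpBlock(nn.Module):
    """One fusion + upsampling stage of the decoder (Eqs. (8)-(9)). Sketch only."""

    def __init__(self, feat_ch, cond_ch):
        super().__init__()
        # Predict scale (gamma) and bias (beta) from the rendered map f^l.
        self.to_gamma_beta = nn.Conv2d(cond_ch, 2 * feat_ch, kernel_size=3, padding=1)
        # Widen channels 4x so PixelShuffle(2) restores them while doubling resolution.
        self.expand = nn.Conv2d(feat_ch, 4 * feat_ch, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, prev_feat, rendered_feat):
        gamma, beta = self.to_gamma_beta(rendered_feat).chunk(2, dim=1)
        # LayerNorm over (C, H, W) of each sample, then modulate (Eq. (8)).
        normed = F.layer_norm(prev_feat, prev_feat.shape[1:])
        fused = gamma * normed + beta
        # PixelShuffle-based 2x upsampling (Eq. (9)).
        return self.shuffle(self.expand(fused))
```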
3.5. Loss Function

For both the NeRF-based rendered images and the neural rendered images, the optimization targets are the ground-truth images at the target camera parameters. We employ a point cloud loss, a NeRF rendering loss, a neural rendering loss, and a perceptual loss to train the parameters of the proposed point encoder and fusion decoder as

$$\mathcal{L} = \lambda_{pc} \mathcal{L}_{pc} + \lambda_{nr} \mathcal{L}_{nr} + \lambda_{per} \mathcal{L}_{per}, \tag{10}$$

where λ_pc, λ_nr, and λ_per respectively control the weights of these losses.

Point Cloud Loss. All points p_k in the raw point cloud provide a ground-truth mapping from locations x_k to densities σ̂_k and colors ĉ_k. Denoting the densities and colors queried from Point2Pix at point p_k as σ_k and c_k, the point cloud loss can be represented as

$$\mathcal{L}_{pc} = \sum_{k=1}^{K} \Big( \parallel \hat{\mathbf{c}}_k - \mathbf{c}_k \parallel^2 + \frac{1}{D} \max(0, D - \sigma_k) \Big). \tag{11}$$

We encourage the predicted densities at p_k to be greater than a threshold D.

Neural Rendering Loss. L_nr is the MSE between the rendered image I from the fusion decoder and the ground truth Î, as

$$\mathcal{L}_{nr} = \parallel \hat{I} - I \parallel_2^2. \tag{12}$$
Perceptual Loss. L_per is a loss frequently used in image synthesis, which improves the realism of the generated images, as

$$\mathcal{L}_{per} = \sum_{l=1}^{L} \parallel \phi(\hat{I}^l) - \phi(I^l) \parallel_1, \tag{13}$$

where φ(·) denotes extracting VGG features.
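Putting the objectives together, the total loss of Eq. (10) with the weights reported in Sec. 4.1 can be sketched as follows. The tensor shapes and the VGG feature extractor vgg_feats are assumptions made for illustration; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_sigma, pred_rgb, gt_rgb_pts, rendered_img, gt_img,
               vgg_feats, lambda_pc=0.1, lambda_nr=1.0, lambda_per=0.1, D=10.0):
    """Combine the training objectives of Eq. (10). Illustrative sketch only.

    pred_sigma: (K,) densities predicted at the input points, pred_rgb /
    gt_rgb_pts: (K, 3) predicted and point-cloud colors, rendered_img /
    gt_img: (B, 3, H, W) images, vgg_feats: callable returning feature maps.
    """
    # Point cloud loss (Eq. (11)): color error plus a hinge that pushes the
    # predicted density at every input point above the threshold D.
    l_pc = ((gt_rgb_pts - pred_rgb) ** 2).sum(dim=-1).mean() \
         + (torch.clamp(D - pred_sigma, min=0.0) / D).mean()
    # Neural rendering loss (Eq. (12)): MSE between decoder output and GT.
    l_nr = F.mse_loss(rendered_img, gt_img)
    # Perceptual loss (Eq. (13)): L1 distance between VGG features.
    l_per = sum(F.l1_loss(a, b) for a, b in zip(vgg_feats(rendered_img),
                                                vgg_feats(gt_img)))
    return lambda_pc * l_pc + lambda_nr * l_nr + lambda_per * l_per
```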
4. Experiments

In this section, we conduct experiments to demonstrate the effectiveness of our proposed method. First, we introduce the indoor datasets and evaluation metrics. Then, we quantitatively and qualitatively compare the proposed method with state-of-the-art point cloud renderers to show our advantages. Next, ablation studies are performed to validate the effect of each proposed module, including point-guided sampling, the point encoder, and the fusion decoder. Finally, we apply our method to point cloud applications.

4.1. Experimental Settings

Datasets. We perform experiments on indoor datasets containing point clouds and multi-view images, including ScanNet [9] and ARKitScenes [3]. ScanNet [9] is an RGBD scanned dataset, which contains 2.5 million images at different views in 1,513 scenes. The dataset has been annotated with calibrated cameras and colored point clouds. We split the first 1,200 scenes as the training set and the rest as the testing set. ARKitScenes [3] is a 3D indoor-scene understanding dataset whose scenes are captured by an Apple iPad Pro. There are around 5,000 scenes, and we choose the first 4,500 scenes for training and the remaining 500 scenes for testing.

Metrics. We adopt three common metrics, PSNR, SSIM [48], and LPIPS [56], to evaluate the performance of Point2Pix. They measure the reconstruction accuracy between the rendered images and the ground-truth images.

Evaluation. We evaluate the rendering quality of different methods in two settings: non-finetuning and finetuning. The non-finetuning evaluation means directly measuring the rendering quality on the testing datasets. As for the finetuning evaluation, methods can refine their results on specific scenes to improve performance. Since finetuning evaluation in each case usually consumes much time and many resources, we randomly choose 8 testing scenes from the ScanNet dataset and the same number from the ARKitScenes dataset for finetuning evaluation.

Implementation Details. We adopt MinkUnet14A as our Point Encoder. The radius r for point-guided sampling is 0.08 meters. The maximal number N of samples for each ray is 128. We extract feature volumes at L = 4 scales; the scales are 1/8, 1/4, 1/2, and 1, respectively. The resolution of the final rendered images is 640 × 480. During training, the initial learning rate is 0.004 with the AdamW [27] optimizer, and it exponentially decays to 0.0004 by the end of training (500 epochs). We set λ_pc = 0.1, λ_nr = 1.0, and λ_per = 0.1 empirically. The density threshold is D = 10. We train our model on 4 NVIDIA Titan-V GPUs with a batch size of 1 per GPU.

4.2. Comparison with Point Renderers

We first compare our method with different point rendering methods by non-finetuning evaluation. After training, all methods are directly tested in novel scenes. The competitors include the graphics-based point cloud renderer Pytorch3D [38], previous neural-based point renderers NPBG [11] and NPCR [10], and the image generator Pix2PixHD [47]. For Pix2PixHD, we use it as an image-to-image generator to translate the rendered images from the graphics-based point renderer [38] to the ground-truth images. The evaluation results are shown in Tab. 1. Ours achieves significantly higher accuracy than other solutions, which reflects its great advantage in practical applications.

Method            | ScanNet [9]: PSNR↑ SSIM↑ LPIPS↓ | ARKitScenes [3]: PSNR↑ SSIM↑ LPIPS↓
Pytorch3D [38]    | 13.62  0.528  0.779 | 15.21  0.581  0.756
Pix2PixHD [47]    | 15.59  0.601  0.611 | 15.94  0.636  0.605
NPCR [10]         | 16.22  0.659  0.574 | 16.84  0.661  0.518
NPBG++ [11]       | 16.81  0.671  0.585 | 17.23  0.692  0.511
ADOP [41]         | 16.83  0.699  0.577 | 17.32  0.707  0.495
Point-NeRF [51]   | 17.53  0.685  0.517 | 17.61  0.715  0.508
Point2Pix (Ours)  | 18.47  0.723  0.484 | 18.84  0.734  0.471

Table 1. Comparing our method with different point renderers on the ScanNet [9] and ARKitScenes [3] datasets. There is no finetuning process in this experiment, which demonstrates the generalization in novel scenes.

4.3. Comparison with NeRF-based Synthesis

We also compare the proposed method with NeRF-based synthesis in both the non-finetuning and finetuning evaluations. In this experiment, we adopt the same coarse-to-fine ray sampling as NeRF-based methods [29, 54] for a fair comparison. The results are illustrated in Tab. 2.
Figure 3. Qualitative comparison between different point renderers on ScanNet [9] (columns: Pytorch3D, Pix2PixHD, NPCR, Ours non-finetuning, Ours finetuning, Ground Truth).

To achieve general view synthesis in novel scenes, MVSNeRF [5] and IBRNet [46] combine an image prior with NeRF, while ours combines the point cloud prior. For a fair comparison, we also pre-train these two methods on the same ScanNet pretraining set. Our method achieves the highest performance among all. Although NeRF [29], NSVF [23], and PlenOctrees [54] can achieve competitive accuracy, their training time is much longer. When training Instant-NGP [30], Plenoxels [53], Point-NeRF [51], and our Point2Pix for 20 minutes, ours achieves better performance. This experiment demonstrates the advantage of the point cloud prior when combined with NeRF.

Method            | Time       | PSNR↑ | SSIM↑ | LPIPS↓
Point-NeRF [51]   | 0 mins     | 17.53 | 0.685 | 0.517
Point2Pix (Ours)  | 0 mins     | 18.47 | 0.723 | 0.484
NeRF [29]         | ∼30 hours  | 21.33 | 0.788 | 0.355
NSVF [23]         | ∼40 hours  | 22.47 | 0.791 | 0.337
PlenOctrees [54]  | ∼30 hours  | 22.02 | 0.795 | 0.341
Instant-NGP [30]  | 20 mins    | 21.94 | 0.775 | 0.363
Plenoxels [53]    | 20 mins    | 22.35 | 0.780 | 0.346
Point-NeRF [51]   | 20 mins    | 22.55 | 0.792 | 0.336
Point2Pix (Ours)  | 20 mins    | 23.02 | 0.815 | 0.318

Table 2. Comparing our method with NeRF-based methods on the ScanNet dataset [9]. "Time" means the average finetuning time for all scenes.

4.4. Qualitative Comparison

We also qualitatively compare our Point2Pix with other point renderers and NeRF-based synthesis methods. The visualizations are shown in Fig. 3 and Fig. 4. The graphics-based point renderer Pytorch3D [38] usually generates images with holes because of sparse points. Due to the missing 3D prior, the generated images are not realistic. Ours achieves the best visual quality, which shows Point2Pix's superiority.

4.5. Ablation Studies

Effect of Point-guided Sampling. To prove the efficiency of our point-guided sampling, we replace our sampling with other strategies used in NeRF-based methods, including uniform [5, 46] and coarse-to-fine sampling [29, 54]. Uniform sampling means uniformly obtaining N points on each ray between the near and far depths. Coarse-to-fine sampling was proposed by the original NeRF [29]: the coarse stage uniformly samples N/2 points, and the fine stage samples another N/2 points according to the probability distribution from the coarse stage. Our point-guided sampling uniformly samples N points while only inferring the valid samples. The results are shown in Tab. 3. Since our sampling strategy considers the point cloud 3D prior, we significantly decrease the average number of samples and reduce the rendering time and GPU memory.
Figure 4. Qualitative comparison between different point renderers and NeRF-based methods on the ARKitScenes [3] dataset (columns: Pytorch3D, Pix2PixHD, NPBG++, Ours non-finetuning, Ours finetuning, Ground Truth).

Ray Sampling        | # Val. Sampl. | PSNR↑ | Render Time (seconds, ↓) | Train Memory (GB, ↓)
Uniform             | 128   | 17.96 | 3.13 | 8.54
Coarse-to-Fine      | 128   | 18.69 | 5.78 | 9.17
Point-Guided (Ours) | 15.6  | 18.47 | 0.92 | 6.68

Table 3. Comparison between different ray sampling strategies on ScanNet [9]. In our method, the mean sampling number for each ray is only 15.6, which is significantly less than the others.

Effect of Multi-scale Radiance Fields. Previous methods, like GIRAFFE [32], HeadNeRF [22], StyleNeRF [19], and CIPS-3D [57], also render feature maps via NeRF, which can effectively reduce memory and rendering time. However, different from ours, they only adopt a single scale of radiance fields. We validate our Multi-scale Radiance Fields in this experiment and show the results in Tab. 4.

We first study the effect of the number of rays. In the single-scale condition, if we directly render the image at the final resolution by NeRF, the memory consumption is heavy and the running time is long. Decreasing the number of sampled rays and increasing the upsampling scale improve both the synthesis quality and the rendering efficiency. With an increasing number of NeRF scales, accuracy further improves, which validates the effectiveness of our multi-scale radiance fields. We finally adopt the combination in the last column to balance accuracy and time.

Selection of Point Encoder. In our Point2Pix, the Point Encoder is the backbone that provides multi-scale 3D features. In the literature of point cloud analysis, many point-based networks [7, 18, 36, 37] have been proposed. We compare different backbones in this experiment. The candidate networks include PointNet++ [37], SparseConvNet [18], and MinkUnet [7] (MinkUnet14A and MinkUnet34C). We evaluate them by only changing the point encoder, and the results are shown in Tab. 5. PointNet++ [37] consumes the largest memory and takes the longest rendering time, while its final synthesis accuracy is low. MinkUnet achieves the best results and is faster than SparseConvNet [18]. We select MinkUnet14A as our point encoder since it is more efficient than MinkUnet34C.

Effect of Fusion Decoder. Previous neural point renderers usually adopt an image-to-image translator [40, 47] to render images from projected feature maps. We conduct this experiment to analyze the effect of our Fusion Decoder. We construct different alternatives by combining different decoder and fusion strategies, as shown in Tab. 6. The combination of U-Net [40] and the concatenation strategy is the most frequently adopted [10, 11, 41], while its performance is not high. When the decoder is replaced with PixelShuffle [43], accuracy improves, which shows that the neural renderer does not require receptive fields as large as U-Net's. When the concatenation strategy is replaced with our proposed fusion module, the performance is further improved, showing the rationality of our design.
# Scales                     | 1         | 1         | 1       | 2       | 4
# Rays                       | 640 × 480 | 320 × 240 | 80 × 60 | 80 × 60 | 80 × 60
Upsampling                   | ×1        | ×2        | ×8      | ×8      | ×8
PSNR (↑)                     | 17.49     | 17.86     | 18.05   | 18.16   | 18.47
Rendering Time (seconds, ↓)  | 13.12     | 3.56      | 0.85    | 0.92    | 0.96
Training Memory (GB, ↓)      | 22.93     | 11.12     | 6.27    | 6.68    | 6.95

Table 4. Effect of different combinations in terms of the number of scales, the number of rays, and the upsampling scale. We choose the combination in the last column, which achieves the best performance and is also efficient.

Point Encoder      | Training Memory (GB, ↓) | Rendering Time (seconds, ↓) | PSNR (dB, ↑)
PointNet++ [37]    | 9.14 | 3.41 | 16.53
SparseConvNet [18] | 8.18 | 2.18 | 18.26
MinkUnet14A [7]    | 6.95 | 0.96 | 18.47
MinkUnet34C [7]    | 8.26 | 1.45 | 18.45

Table 5. Comparison between different point encoders. We adopt MinkUnet14A [7] to extract the basic 3D prior since it achieves accurate synthesized results and is also lightweight.

Decoder           | Fusion Strategy   | PSNR (↑) | SSIM (↑) | LPIPS (↓)
U-Net [40]        | Concatenate       | 18.04    | 0.708    | 0.511
PixelShuffle [43] | Concatenate       | 18.13    | 0.712    | 0.499
PixelShuffle [43] | SPADE (LayerNorm) | 18.47    | 0.723    | 0.484

Table 6. Comparison between different neural renderers.

λpc          | 0.0   | 0.1   | 1.0
PSNR (dB, ↑) | 18.23 | 18.47 | 18.30

Table 7. The effect of the point cloud loss.

Figure 5. Our Point2Pix application in point cloud in-painting and upsampling. We upsample and fix the missing parts of the raw point cloud by random and dense sampling around existing points (rows: Point Inpainting, Point Upsampling; columns: Raw Point Cloud, Point2Pix (Ours)).

Effect of Point Cloud Loss. We perform this experiment to validate the effect of the point cloud loss. By setting different loss weights λpc, we obtain the results in Tab. 7. We conclude that the point cloud loss indeed promotes the mapping from point features to 3D attributes, thus improving the performance. Interestingly, a larger λpc does not necessarily achieve better accuracy, which demonstrates the minor difference between point cloud attributes and NeRF's attributes.

4.6. Application: Point Cloud Sampling

Since our point encoder can extract multi-scale point features and predict density and color attributes, we upsample the raw point cloud by densely sampling in the nearby area of existing points and predicting the corresponding 3D attributes. As illustrated in Fig. 5, although there are no ground-truth dense points for us to perform supervised learning, our Point2Pix can still in-paint the missing points and insert many details into input point clouds.

5. Conclusion

In this paper, we have proposed a general point renderer, which can be directly utilized to render photo-realistic images of indoor scenes. We introduce the advantages of the point cloud representation to NeRF: existing points provide ground-truth pairs during training, the point area can guide the ray sampling process, and the 3D prior features can be generalized to novel scenes. We propose Multi-scale Radiance Fields to extract discriminative 3D features, point-guided sampling to efficiently reduce the number of valid samples, and a Fusion Decoder to synthesize realistic images. Experiments and ablation studies demonstrate that our Point2Pix achieves state-of-the-art synthesis performance. Our Point2Pix can also be directly employed to upsample and in-paint raw point clouds of indoor scenes.

Limitation and Future Work. There are still common limitations in our proposed Point2Pix. First, the overall rendering time is long compared with recent caching-based rendering [53]. Second, apart from indoor scenes, it is still difficult to directly render photo-realistic images for arbitrary environments. In future work, we will accelerate the rendering speed by combining our method with other 3D scene representations, such as octrees. We will also extend our present work to more real-world situations, like the human body.

Acknowledgments

This work is partially supported by Shenzhen Science and Technology Program KQTD20210811090149095.
References

[1] Dejan Azinovic, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In CVPR, 2022.
[2] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv, 2016.
[3] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
[4] Dan Cernea. OpenMVS: Multi-view stereo reconstruction library. 2020.
[5] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021.
[6] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Xu Jia, and Huchuan Lu. Animatable neural radiance fields from monocular RGB videos. arXiv, 2021.
[7] Christopher B. Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In CVPR, 2019.
[8] Ruihang Chu, Yukang Chen, Tao Kong, Lu Qi, and Lei Li. ICM-3D: Instantiated category modeling for 3D instance segmentation. IEEE Robotics and Automation Letters, 7(1):57-64, 2021.
[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[10] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, 2020.
[11] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, 2020.
[12] Paul E. Debevec, Yizhou Yu, and George Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. In Rendering Techniques '98, Proceedings of the Eurographics Workshop in Vienna, Austria, 1998.
[13] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In CVPR, 2022.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[15] Michael Goesele, Brian Curless, and Steven M. Seitz. Multi-view stereo revisited. In CVPR, 2006.
[16] Dan B. Goldman, Brian Curless, Aaron Hertzmann, and Steven M. Seitz. Shape and spatially-varying BRDFs from photometric stereo. IEEE TPAMI, 2010.
[17] Ben Graham. Sparse 3D convolutional neural networks. In BMVC, 2015.
[18] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. arXiv, 2017.
[19] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. In ICLR, 2021.
[20] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM TOG, 2018.
[21] Yannick Hold-Geoffroy, Kalyan Sunkavalli, Sunil Hadap, Emiliano Gambaretto, and Jean-François Lalonde. Deep outdoor illumination estimation. In CVPR, 2017.
[22] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. HeadNeRF: A real-time NeRF-based parametric head model. In CVPR, 2021.
[23] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
[24] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
[25] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural Actor: Neural free-view synthesis of human actors with pose control. ACM TOG, 2021.
[26] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM TOG, 2015.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[28] Nelson Max. Optical models for direct volume rendering. IEEE TVCG, 1995.
[29] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[30] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv, 2022.
[31] Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H. Mueller, Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, and Markus Steinberger. DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks. Computer Graphics Forum, 2021.
[32] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In CVPR, 2020.
[33] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In ICCV, 2021.
[34] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[35] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, 2021.
[36] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[37] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[38] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv, 2020.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10674-10685, 2022.
[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[41] Darius Rückert, Linus Franke, and Marc Stamminger. ADOP: Approximate differentiable one-pixel point rendering. ACM TOG, 2022.
[42] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In ICRA, 2011.
[43] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
[44] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. ACM TOG, 2006.
[45] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
[46] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
[47] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
[48] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE TIP, 2004.
[49] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv, 2021.
[50] Mason Woo, Jackie Neider, Tom Davis, and Dave Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., Inc., 1999.
[51] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-NeRF: Point-based neural radiance fields. In CVPR, 2022.
[52] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[53] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
[54] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
[55] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv, 2020.
[56] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[57] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv, 2021.
[58] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv, 2018.
