Point2Pix: Photo-Realistic Point Cloud Rendering Via Neural Radiance Fields
Figure 1. Overview of our proposed Point2Pix. (a) Multi-scale Radiance Fields. For an input point cloud, we first extract multiple 3D feature volumes at four scales. For any queried point, we then linearly interpolate coarse features from these feature volumes and infer the final features through MLP networks. (b) Fusion and Upsampling. Four 2D feature maps are respectively rendered through NeRF, fused with the previous 2D CNN output, and upsampled by a factor of 2. (c) Fusion Decoding. We finally design a neural renderer to gradually synthesize target images from the projected feature maps.
mainly consists of ray sampling, an implicit function, and volume rendering.

Ray Sampling. Starting from the camera center o, ray sampling obtains a series of positions x_i along a ray r with direction d as

\textbf{x}_i = \textbf{o} + z_i \cdot \textbf{d}, \quad i = 1, 2, ..., N, (2)

where z_i is the sampling depth and N is the number of samples on each ray.

Implicit Function. An implicit function f_θ is trained as a mapping from each queried location x_i and direction d to the corresponding color c_i = (r_i, g_i, b_i) and density σ_i, as

(\textbf{c}_i, \sigma_i) = f_\theta(\textbf{x}_i, \textbf{d}), (3)

where f_θ is an MLP network and θ denotes its parameters.

Volume Rendering. Each ray (or pixel) color ĉ is calculated via volume rendering [28] as

\begin{aligned} \hat{\textbf{c}} &= \sum_{i=1}^{N} T_i \alpha_i \textbf{c}_i, \\ \alpha_i &= 1 - \textbf{exp}(-\sigma_i \delta_i), \\ T_i &= \textbf{exp}(-\sum_{j=1}^{i-1}\sigma_j\delta_j), \end{aligned} (4)

where δ_j is the distance between neighboring samples along the ray r.
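To make the volume-rendering preliminaries concrete, the following is a minimal NumPy sketch of Eqs. (2)-(4): it uniformly samples N depths along one ray, queries a color/density function standing in for f_θ, and composites the pixel color with the weights T_i α_i. The function name render_ray, the near/far bounds, and the query_fn interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def render_ray(o, d, query_fn, near=0.1, far=5.0, n_samples=128):
    """Composite a single ray color following Eqs. (2)-(4).

    o, d     : (3,) camera origin and ray direction.
    query_fn : callable (x, d) -> (color (N, 3), sigma (N,)), standing in
               for the implicit function f_theta of Eq. (3).
    """
    # Eq. (2): uniform sampling depths z_i and 3D positions x_i.
    z = np.linspace(near, far, n_samples)                  # (N,)
    x = o[None, :] + z[:, None] * d[None, :]               # (N, 3)

    color, sigma = query_fn(x, d)                          # (N, 3), (N,)

    # delta_i: distance between neighboring samples (last one padded).
    delta = np.append(z[1:] - z[:-1], 1e10)                # (N,)

    # Eq. (4): alpha compositing with accumulated transmittance T_i.
    alpha = 1.0 - np.exp(-sigma * delta)                   # (N,)
    T = np.exp(-np.cumsum(np.append(0.0, sigma[:-1] * delta[:-1])))  # (N,)
    weights = T * alpha                                    # (N,)

    return (weights[:, None] * color).sum(axis=0)          # (3,) pixel color
```

The same T_i α_i weights reappear in Eq. (7), where per-sample features rather than colors are composited.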
3.2. Point-guided Sampling

According to Eq. (4), increasing the number of samples N along each ray can generate more realistic results [29]. However, the required computing resources and running time also grow linearly. Our model is based on a point cloud, which provides a relatively fine shape prior. Thus, we propose point-guided sampling to achieve more efficient ray sampling under the guidance of the point cloud.

For any queried point x_i, we first find the nearest neighbour point p_i and then check whether x_i is located within p_i's ball of radius r, as

\parallel \textbf{p}_i - \textbf{x}_i \parallel_2 \leq r. (5)

If this condition is satisfied, we treat the queried point x_i as a valid sample and obtain its point feature as described in Sec. 3.3. If there is no valid sample along a ray, we adopt uniform sampling from the default near to far depth, as illustrated in Fig. 2. Compared with previous uniform and coarse-to-fine sampling [5, 29, 46], our sampling strategy reduces computation and memory costs.

Figure 2. The proposed point-guided sampling. For any queried point x_i, we find its nearest point p_i in the point cloud. If x_i is located in the ball area (with radius r) of p_i, it is a valid sample. Invalid samples are omitted to improve sampling efficiency.
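As a sketch of the validity test in Eq. (5), the snippet below builds a k-d tree over the input point cloud and keeps only the ray samples whose nearest point lies within the radius r (0.08 m in our experiments). The k-d tree and the function name valid_sample_mask are our illustrative choices; the paper does not prescribe a particular nearest-neighbour structure.

```python
import numpy as np
from scipy.spatial import cKDTree

def valid_sample_mask(samples, points, radius=0.08):
    """Eq. (5): keep ray samples whose nearest cloud point lies within `radius`.

    samples : (M, 3) candidate positions x_i along the rays.
    points  : (P, 3) input point cloud.
    radius  : ball radius r (0.08 m in the paper).
    """
    tree = cKDTree(points)              # nearest-neighbour structure (our choice)
    dist, _ = tree.query(samples, k=1)  # distance to the nearest point p_i
    return dist <= radius               # boolean mask of valid samples
```

Rays whose mask is entirely False fall back to uniform sampling between the default near and far depths, as described above.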
3.3. Multi-scale Radiance Fields

We extract discriminative 3D point and ray features by constructing Multi-scale Radiance Fields, which include Point Encoding and NeRF-based Feature Rendering.

Point Encoding. Point Encoding outputs a discriminative 3D point feature for each valid sample x_i. We adopt the 3D sparse UNet from the Minkowski Engine (ME) [7] as the backbone of our point encoder. ME is an auto-differentiation library for building 3D CNNs on sparse tensors converted from point clouds. As illustrated in Fig. 1, our point encoder extracts multiple 3D feature volumes from the raw point cloud at L different scales. We select F^l at scale l to construct the multi-scale radiance fields.

For each valid sample x_i at each scale l, we query F^l to obtain the interpolated feature F_i^l and employ an implicit function Φ_l to infer the density σ_i^l and the final point feature f_i^l for x_i as

(\sigma_i^l, \textbf{f}_i^l) = \Phi_l(\textbf{F}_i^l) = \Phi_l(\textbf{F}^l[\textbf{x}_i]). (6)

NeRF-based Feature Rendering. We then render the queried 3D point features into 2D feature maps and generate images at different scales. At each feature scale l, we aggregate the densities σ_i^l and features f_i^l to generate the 2D feature map f^l by the volume rendering of NeRF [29] as

\begin{aligned} \textbf{f}^l &= \sum_{i=1}^{N} \textbf{w}_i^l \textbf{f}_i^l, \\ \textbf{w}_i^l &= \textbf{exp}(-\sum_{j=1}^{i-1}\sigma_j^l\delta_j)(1 - \textbf{exp}(-\sigma_i^l \delta_i)). \end{aligned} (7)

So far, we have obtained L rendered feature maps {f^l} ∈ R^{C_l × H_l × W_l}, where H_l and W_l respectively denote the feature height and width, and C_l is the number of channels.
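The sketch below illustrates Eqs. (6)-(7) for a single scale l: a sample feature is trilinearly interpolated from a feature volume, a small MLP plays the role of Φ_l and outputs (σ_i^l, f_i^l), and the per-sample features are composited into one pixel of f^l with NeRF-style weights. The dense grid (instead of sparse Minkowski tensors), the grid_sample-based interpolation, and the MLP widths are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

class ScaleField(torch.nn.Module):
    """Sketch of one scale l of the multi-scale radiance fields (Eqs. (6)-(7)).
    A dense feature volume stands in for the sparse Minkowski features."""

    def __init__(self, feat_dim=32, out_dim=32):
        super().__init__()
        self.out_dim = out_dim
        self.phi = torch.nn.Sequential(                    # implicit function Phi_l
            torch.nn.Linear(feat_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 1 + out_dim))              # -> (sigma_i^l, f_i^l)

    def forward(self, volume, xyz, deltas):
        """volume : (1, C, D, H, W) feature volume F^l for this scale.
           xyz    : (N, 3) valid samples along one ray, normalized to [-1, 1]
                    in grid_sample's (x, y, z) convention.
           deltas : (N,) distances between neighboring samples."""
        # Eq. (6): trilinear interpolation F^l[x_i], then Phi_l.
        grid = xyz.view(1, 1, 1, -1, 3)                          # (1, 1, 1, N, 3)
        feat = F.grid_sample(volume, grid, align_corners=True)   # (1, C, 1, 1, N)
        feat = feat.view(volume.shape[1], -1).t()                # (N, C)
        out = self.phi(feat)
        sigma = F.relu(out[:, 0])                                # density sigma_i^l
        f = out[:, 1:]                                           # point feature f_i^l

        # Eq. (7): composite per-sample features with NeRF-style weights w_i^l.
        alpha = 1.0 - torch.exp(-sigma * deltas)
        T = torch.exp(-torch.cumsum(
            torch.cat([sigma.new_zeros(1), sigma[:-1] * deltas[:-1]]), dim=0))
        return ((T * alpha).unsqueeze(-1) * f).sum(dim=0)        # one pixel of f^l
```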
3.4. Fusion Decoding

Although the proposed ray sampling strategy reduces memory consumption, rendering target images of size 480 × 640 still requires more than 20 GB of GPU memory, as shown in Tab. 4. In addition, there are still many holes to be filled in the 2D-rendered image space. To address these issues, we design a Fusion Decoder as a neural renderer that synthesizes the final images from the rendered feature maps f^l, l ∈ [1, L], through conditional convolution and upsampling modules.

Fusion. Our conditional convolution fuses the previous layer's feature F^{l−1} with the rendered feature f^l, treating the rendered feature at each scale l as the conditional input. This module is inspired by SPADE [34], while we use Layer Normalization [2]. Specifically, as shown in Eq. (8), for the rendered feature map f^l, we calculate the conditional parameters, namely the scale γ and the bias β, with a Conv2D module. Then, for the feature F^{l−1} from the previous stage, we normalize it with Layer Normalization and scale it by γ. Finally, the fused feature is obtained by adding the bias β as

\begin{aligned} (\gamma, \beta) &= \textbf{Conv2D}(\textbf{f}^l), \\ \mathcal{F}^{l-1} &= \gamma \cdot \textbf{LayerNorm}(\mathcal{F}^{l-1}) + \beta. \end{aligned} (8)

Upsampling. We adopt PixelShuffle [43] as our upsampling module, which upsamples the fused feature F^{l−1} by a factor of 2 at each stage, instead of using bilinear or nearest interpolation. PixelShuffle [43] is frequently adopted in super-resolution; it utilizes a convolution layer to extend the channel dimension and then reshapes the extra channels into the spatial dimensions, as

\mathcal{F}^{l} = \textbf{PixelShuffle}(\textbf{Conv2D}(\mathcal{F}^{l-1}), 2). (9)

ToRGB. Finally, we introduce the decoder of a recent large-scale generator, like [39], as a post-processing step to generate the final rendered image Î for the whole point cloud renderer.
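A minimal PyTorch sketch of one fusion-and-upsampling stage (Eqs. (8)-(9)): a Conv2D predicts (γ, β) from the rendered map f^l, the LayerNorm-normalized previous feature is modulated, and a convolution followed by PixelShuffle doubles the spatial resolution. The channel widths, kernel sizes, and class name FusionUpsample are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class FusionUpsample(nn.Module):
    """One decoder stage: SPADE-style fusion with LayerNorm (Eq. (8))
    followed by PixelShuffle x2 upsampling (Eq. (9))."""

    def __init__(self, feat_ch, rend_ch):
        super().__init__()
        # Predicts the conditional scale gamma and bias beta from f^l.
        self.to_gamma_beta = nn.Conv2d(rend_ch, 2 * feat_ch, kernel_size=3, padding=1)
        # Extends channels by 4x so PixelShuffle(2) can fold them into space.
        self.expand = nn.Conv2d(feat_ch, 4 * feat_ch, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, prev_feat, rendered_feat):
        """prev_feat     : F^{l-1}, (B, feat_ch, H, W)
           rendered_feat : f^l,     (B, rend_ch, H, W)"""
        gamma, beta = self.to_gamma_beta(rendered_feat).chunk(2, dim=1)

        # Eq. (8): LayerNorm over (C, H, W), then conditional modulation.
        normed = nn.functional.layer_norm(prev_feat, prev_feat.shape[1:])
        fused = gamma * normed + beta

        # Eq. (9): channel expansion + PixelShuffle doubles H and W.
        return self.shuffle(self.expand(fused))   # (B, feat_ch, 2H, 2W)
```

Stacking L such stages and feeding the rendered maps f^1, ..., f^L from coarse to fine roughly mirrors the decoder pipeline in Fig. 1(b).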
3.5. Loss Function

For both the NeRF-based rendered images and the neural-rendered images, the optimization targets are the ground-truth images under the target camera parameters. We employ a point cloud loss, a NeRF rendering loss, a neural rendering loss, and a perceptual loss to train the parameters of the proposed point encoder and fusion decoder as

\mathcal{L} = \lambda_{pc} \mathcal{L}_{pc} + \lambda_{nr} \mathcal{L}_{nr} + \lambda_{per} \mathcal{L}_{per}, (10)

where λ_pc, λ_nr, and λ_per respectively control the weights of these losses.

Point Cloud Loss. All points p_k in the raw point cloud provide a ground-truth mapping from locations x_k to densities σ̂_k and colors ĉ_k. Denoting the densities and colors queried from Point2Pix at point p_k as σ_k and c_k, the point cloud loss can be represented as

\mathcal{L}_{pc} = \sum_{k=1}^{K} \left( \parallel \hat{\textbf{c}}_k - \textbf{c}_k \parallel^2 + \frac{1}{D} \max(0, D - \sigma_k) \right). (11)

We encourage the predicted densities at p_k to be greater than a threshold D.

Neural Rendering Loss. L_nr is the MSE between the rendered image I from the fusion decoder and the ground truth Î as

\mathcal{L}_{nr} = \parallel \hat{I} - I \parallel^2_2. (12)
Dataset ScanNet [9] ARKitScenes [3]
Metrics PSNR ↑ SSIM ↑ LPIPS ↓ PSNR↑ SSIM↑ LPIPS↓
Pytorch3D [38] 13.62 0.528 0.779 15.21 0.581 0.756
Pix2PixHD [47] 15.59 0.601 0.611 15.94 0.636 0.605
NPCR [10] 16.22 0.659 0.574 16.84 0.661 0.518
NPBG++ [11] 16.81 0.671 0.585 17.23 0.692 0.511
ADOP [41] 16.83 0.699 0.577 17.32 0.707 0.495
Point-NeRF [51] 17.53 0.685 0.517 17.61 0.715 0.508
Point2Pix (Ours) 18.47 0.723 0.484 18.84 0.734 0.471
Table 1. Comparison of our method with different point renderers on the ScanNet [9] and ARKitScenes [3] datasets. No finetuning is performed in this experiment, which demonstrates generalization to novel scenes.
Perceptual Loss. L_per is a loss frequently used in image synthesis, which improves the realism of the generated images, as

\mathcal{L}_{per} = \sum_{l=1}^{L} \parallel \phi(\hat{I}^l) - \phi(I^l) \parallel_1, (13)

where φ(·) denotes extracting VGG features.
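A hedged sketch of the overall objective in Eqs. (10)-(13), shown at a single image scale. The density hinge follows Eq. (11) with max(0, D − σ_k)/D (averaged here rather than summed), and a frozen torchvision VGG16 feature extractor serves as one common choice of φ(·); the exact layers, the omitted input normalization, and the function name point2pix_loss are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG16 features as phi(.) (one common choice, an assumption).
_vgg = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def point2pix_loss(pred_c, gt_c, pred_sigma, pred_img, gt_img,
                   D=10.0, w_pc=0.1, w_nr=1.0, w_per=0.1):
    """Eq. (10): L = w_pc * L_pc + w_nr * L_nr + w_per * L_per."""
    # Eq. (11): color error at the input points + hinge pushing sigma above D.
    l_pc = ((gt_c - pred_c) ** 2).sum(-1).mean() \
         + (torch.clamp(D - pred_sigma, min=0.0) / D).mean()

    # Eq. (12): MSE between the decoded image and the ground truth.
    l_nr = F.mse_loss(pred_img, gt_img)

    # Eq. (13): L1 distance between VGG features of the two images.
    l_per = F.l1_loss(_vgg(pred_img), _vgg(gt_img))

    return w_pc * l_pc + w_nr * l_nr + w_per * l_per
```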
4. Experiments

In this section, we conduct experiments to demonstrate the effectiveness of the proposed method. First, we introduce the indoor datasets and evaluation metrics. Then, we quantitatively and qualitatively compare the proposed method with state-of-the-art point cloud renderers to show our advantages. Next, ablation studies are performed to validate the effect of each proposed module, including point-guided sampling, the point encoder, and the fusion decoder. Finally, we apply our method to point cloud applications.

For the finetuning evaluation, methods can refine their results on specific scenes to improve performance. Since finetuning evaluation on each scene usually consumes considerable resources and time, we randomly choose 8 testing scenes from the ScanNet dataset and the same number from the ARKitScenes dataset for the finetuning evaluation.

Implementation Details. We adopt MinkUnet14A as our Point Encoder. The radius r for point-guided sampling is 0.08 meters. The maximal number N of samples for each ray is 128. We extract feature volumes at L = 4 scales; the scales are 1/8, 1/4, 1/2, and 1, respectively. The resolution of the final rendered images is 640 × 480. During training, the initial learning rate is 0.004 with the AdamW [27] optimizer, and it exponentially decays to 0.0004 by the end of training (500 epochs). We set λ_pc = 0.1, λ_nr = 1.0, and λ_per = 0.1 empirically. The density threshold is D = 10. We train our model on 4 NVIDIA Titan V GPUs with a batch size of 1 per GPU.
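The hyper-parameters listed above can be gathered into a small training configuration, sketched below together with the AdamW optimizer and an exponential decay that takes the learning rate from 0.004 to 0.0004 over 500 epochs (gamma = (0.1)^{1/500} ≈ 0.9954). The config layout and the per-epoch scheduler stepping are our assumptions.

```python
import torch

config = dict(
    radius=0.08,             # r for point-guided sampling (meters)
    max_samples=128,         # maximal N samples per ray
    num_scales=4,            # L feature-volume scales: 1/8, 1/4, 1/2, 1
    image_size=(640, 480),
    lr=4e-3, lr_final=4e-4, epochs=500,
    lambda_pc=0.1, lambda_nr=1.0, lambda_per=0.1,
    density_threshold=10.0, batch_size_per_gpu=1,
)

def build_optimizer(model):
    opt = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    # Exponential decay reaching lr_final after `epochs` steps (stepped per epoch).
    gamma = (config["lr_final"] / config["lr"]) ** (1.0 / config["epochs"])
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=gamma)
    return opt, sched
```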
Figure 3. Qualitative comparison between different point renderers on the ScanNet [9] dataset. Columns include Ours (non-finetuning), Ours (finetuning), and Ground Truth.

Figure 4. Qualitative comparison between different point renderers and NeRF-based methods on the ARKitScenes [3] dataset. Columns include Ours (non-finetuning), Ours (finetuning), and Ground Truth.
Ray Sampling | # Val. Sampl. | PSNR (↑) | Render Time (seconds, ↓) | Train Memory (GB, ↓)
Uniform | 128 | 17.96 | 3.13 | 8.54
Coarse-to-Fine | 128 | 18.69 | 5.78 | 9.17
Point-Guided (Ours) | 15.6 | 18.47 | 0.92 | 6.68

Table 3. Comparison between different ray sampling strategies on ScanNet [9]. In our method, the mean number of samples per ray is only 15.6, which is significantly less than the others.

Methods like GIRAFFE [32], HeadNeRF [22], StyleNeRF [19], and CIPS-3D [57] also render feature maps via NeRF, which can effectively reduce the memory and rendering time. However, different from ours, they adopt only a single scale of radiance fields. We validate our Multi-scale Radiance Fields in this experiment and show the results in Tab. 4.

We first study the effect of the number of rays. With a single scale, directly rendering the image at the final resolution by NeRF consumes heavy memory and takes a long running time. Decreasing the number of sampled rays and increasing the upsampling scale improve both the synthesis quality and the rendering efficiency. With an increasing number of NeRF scales, the accuracy further improves, which validates the effectiveness of our multi-scale radiance fields. We finally adopt the combination in the last column to balance accuracy and time.

Selection of Point Encoder. In our Point2Pix, the Point Encoder is the backbone that provides multi-scale 3D features. In the literature of point cloud analysis, many point-based networks [7, 18, 36, 37] have been proposed. We compare different backbones in this experiment. The candidate networks include PointNet++ [37], SparseConvNet [18], and MinkUnet [7] (MinkUnet14A and MinkUnet34C). We evaluate them by changing only the point encoder, and the results are shown in Tab. 5. PointNet++ [37] consumes the most memory and takes the longest rendering time, while its final synthesis accuracy is low. MinkUnet achieves the best results and is faster than SparseConvNet [18]. We select MinkUnet14A as our point encoder since it is more efficient than MinkUnet34C.

Effect of Fusion Decoder. Previous neural point renderers usually adopt an image-to-image translator [40, 47] to render images from projected feature maps. We conduct this experiment to analyze the effect of our Fusion Decoder. We construct different alternatives by combining different decoder and fusion strategies, as shown in Tab. 6. The combination of U-Net [40] and the concatenation strategy is the most frequently adopted [10, 11, 41], but its performance is not high. When the decoder is replaced with PixelShuffle [43], the accuracy improves, which shows that the neural renderer does not require receptive fields as large as U-Net's. When the concatenation strategy is replaced with our proposed fusion module, the performance is further improved, showing the rationality of our design.

Effect of Point Cloud Loss. We perform this experiment
# Scales 1 1 1 2 4
# Rays 640 × 480 320 × 240 80 × 60 80 × 60 80 × 60
Upsampling ×1 ×2 ×8 ×8 ×8
PSNR (↑) 17.49 17.86 18.05 18.16 18.47
Rendering Time (seconds, ↓) 13.12 3.56 0.85 0.92 0.96
Training Memory (GB, ↓) 22.93 11.12 6.27 6.68 6.95
Table 4. Effect of different combinations in terms of the number of scales, number of rays and scale of upsampling. We choose the
combination in the last column, which achieves the best performance and is also efficient.
5. Conclusion

In this paper, we have proposed a general point renderer, which can be directly utilized to render photo-realistic images.

Acknowledgments

This work is partially supported by Shenzhen Science and Technology Program KQTD20210811090149095.
References

[1] Dejan Azinovic, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In CVPR, 2022.
[2] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv, 2016.
[3] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
[4] Dan Cernea. OpenMVS: Multi-view stereo reconstruction library. 2020.
[5] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021.
[6] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Xu Jia, and Huchuan Lu. Animatable neural radiance fields from monocular RGB videos. arXiv, 2021.
[7] Christopher B. Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR, 2019.
[8] Ruihang Chu, Yukang Chen, Tao Kong, Lu Qi, and Lei Li. ICM-3D: Instantiated category modeling for 3D instance segmentation. IEEE Robotics and Automation Letters, 7(1):57–64, 2021.
[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[10] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, 2020.
[11] Peng Dai, Yinda Zhang, Zhuwen Li, Shuaicheng Liu, and Bing Zeng. Neural point cloud rendering via multi-plane projection. In CVPR, 2020.
[12] Paul E. Debevec, Yizhou Yu, and George Borshukov. Efficient view-dependent image-based rendering with projective texture-mapping. In Rendering Techniques '98 (Proceedings of the Eurographics Workshop), 1998.
[13] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In CVPR, 2022.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[15] Michael Goesele, Brian Curless, and Steven M. Seitz. Multi-view stereo revisited. In CVPR, 2006.
[16] Dan B. Goldman, Brian Curless, Aaron Hertzmann, and Steven M. Seitz. Shape and spatially-varying BRDFs from photometric stereo. IEEE TPAMI, 2010.
[17] Ben Graham. Sparse 3D convolutional neural networks. In BMVC, 2015.
[18] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. arXiv, 2017.
[19] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. In ICLR, 2021.
[20] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM TOG, 2018.
[21] Yannick Hold-Geoffroy, Kalyan Sunkavalli, Sunil Hadap, Emiliano Gambaretto, and Jean-François Lalonde. Deep outdoor illumination estimation. In CVPR, 2017.
[22] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. HeadNeRF: A real-time NeRF-based parametric head model. In CVPR, 2021.
[23] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
[24] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
[25] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. ACM TOG, 2021.
[26] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM TOG, 2015.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[28] Nelson Max. Optical models for direct volume rendering. IEEE TVCG, 1995.
[29] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[30] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv, 2022.
[31] Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H. Mueller, Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, and Markus Steinberger. DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks. Computer Graphics Forum, 2021.
[32] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In CVPR, 2020.
[33] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In ICCV, 2021.
[34] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[35] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, 2021.
[36] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[37] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
[38] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv, 2020.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[41] Darius Rückert, Linus Franke, and Marc Stamminger. ADOP: Approximate differentiable one-pixel point rendering. ACM TOG, 2022.
[42] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In ICRA, 2011.
[43] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
[44] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. ACM TOG, 2006.
[45] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SplatNet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
[46] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
[47] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
[48] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE TIP, 2004.
[49] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv, 2021.
[50] Mason Woo, Jackie Neider, Tom Davis, and Dave Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., Inc., 1999.
[51] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-NeRF: Point-based neural radiance fields. In CVPR, 2022.
[52] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In ECCV, 2018.
[53] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
[54] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
[55] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv, 2020.
[56] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[57] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv, 2021.
[58] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv, 2018.