PGSR
Fig. 1: PGSR representation. We present a Planar-based Gaussian Splatting Reconstruction representation for efficient and high-fidelity surface reconstruction from multi-view RGB images without any geometric prior (depth or normals from pre-trained models). The courthouse reconstructed by our method demonstrates that PGSR can recover geometric details, such as the textual details on the building. From left to right: input SfM points, planar-based Gaussian ellipsoids, rendered view, textured mesh, surface, and normals.
Abstract—Recently, 3D Gaussian Splatting (3DGS) has attracted widespread attention due to its high-quality rendering and ultra-fast training and rendering speed. However, due to the unstructured and irregular nature of Gaussian point clouds, it is difficult to guarantee geometric reconstruction accuracy and multi-view consistency simply by relying on the image reconstruction loss. Although many studies on surface reconstruction based on 3DGS have emerged recently, the quality of their meshes is generally unsatisfactory. To address this problem, we propose a fast planar-based Gaussian splatting reconstruction representation (PGSR) to achieve high-fidelity surface reconstruction while ensuring high-quality rendering. Specifically, we first introduce an unbiased depth rendering method, which directly renders the distance from the camera origin to the Gaussian plane and the corresponding normal map based on the Gaussian distribution of the point cloud, and divides the two to obtain the unbiased depth. We then introduce single-view geometric, multi-view photometric, and geometric regularization to preserve global geometric accuracy. We also propose a camera exposure compensation model to cope with scenes with large illumination variations. Experiments on indoor and outdoor scenes show that our method achieves fast training and rendering while maintaining high-fidelity rendering and geometric reconstruction, outperforming 3DGS-based and NeRF-based methods. Our code will be made publicly available, and more information can be found on our project page (https://zju3dv.github.io/pgsr/).

Index Terms—Planar-Based Gaussian Splatting, Surface Reconstruction, Neural Rendering, Neural Radiance Fields.

H. Bao, G. Zhang, W. Ye and H. Li are with the State Key Lab of CAD&CG, Zhejiang University. E-mails: {baohujun, zhangguofeng}@zju.edu.cn.
D. Chen and W. Xie are with the State Key Lab of CAD&CG, Zhejiang University and SenseTime Research. D. Chen is also affiliated with Tetras.AI. E-mails: [email protected], [email protected].
Y. Wang is with Shanghai AI Laboratory.
S. Zhai, N. Wang and H. Liu are with SenseTime Research.
† Corresponding author
Fig. 2: Unbiased depth rendering. (a) Illustration of the rendered depth: we take a single Gaussian, flatten it into a plane, and fit it onto the surface as an example. Our rendered depth is the intersection of the ray with the surface, matching the actual surface. In contrast, the depth from previous methods [11], [24] corresponds to a curved surface and may deviate from the actual surface. (b) We use the true depth to supervise two different depth rendering methods. After optimization, we map the positions of all Gaussian points: Gaussians from our method fit well onto the actual surface, while the previous method results in noise and poor adherence to the surface.
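For intuition, the ray and plane intersection sketched in (a) has a simple closed form. Using our own notation rather than the paper's: if the flattened Gaussian lies on the plane $\mathbf{n}_i^{\top}\mathbf{x} = d_i$ with unit normal $\mathbf{n}_i$ and plane-to-origin distance $d_i$, and a pixel's viewing ray is $\mathbf{x} = \mathbf{o} + t\,\mathbf{v}$ with camera center $\mathbf{o}$ and direction $\mathbf{v}$, the rendered depth corresponds to
$$t = \frac{d_i - \mathbf{n}_i^{\top}\mathbf{o}}{\mathbf{n}_i^{\top}\mathbf{v}}, \qquad \mathbf{x} = \mathbf{o} + t\,\mathbf{v},$$
so it follows the supporting plane exactly, whereas compositing per-Gaussian center depths yields the curved surface attributed to previous methods.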
overall changes in image brightness, further improving reconstruction quality. Finally, we validate the rendering and reconstruction quality on the Mip-NeRF360, DTU [23], and Tanks and Temples (TnT) [28] datasets. Experimental results demonstrate that, while maintaining the original Gaussian rendering quality and rendering speed, our method achieves state-of-the-art reconstruction accuracy. Moreover, our training requires only one hour on a single GPU, while the state-of-the-art method based on NeRF [33] requires eight GPUs for over two days. In summary, our method makes the following contributions:
• We propose a novel unbiased depth rendering method. Based on this rendering method, we can render reliable plane parameters for each pixel, facilitating the incorporation of various geometric constraints.
• We introduce single-view and multi-view regularizations to optimize the plane parameters of each pixel, achieving high-precision global geometric consistency.
• The proposed exposure compensation simply and effectively enhances reconstruction accuracy.
• Our method, while maintaining the high rendering accuracy and speed of the original 3DGS, achieves state-of-the-art reconstruction accuracy, and our training is nearly 100 times faster than state-of-the-art reconstruction methods based on NeRF [33].

Fig. 3: Rendered Depth. The original depth in 3DGS exhibits significant noise, while our depth is smoother and more accurate.

II. RELATED WORK

Surface reconstruction is a cornerstone of computer graphics and computer vision, aimed at generating intricate and accurate surface representations from sparse or noisy input data. Obtaining high-fidelity 3D models of real-world environments is pivotal for enabling immersive experiences in augmented reality (AR) and virtual reality (VR). This paper focuses exclusively on surface reconstruction under given poses, which can be readily computed using SLAM [5], [7], [8] or SfM [43], [51], [57] methods.

A. Traditional Surface Reconstruction

Traditional methods adhere to the universal multi-view stereo (MVS) pipeline and can be roughly categorized by the intermediate representation they rely on, such as point clouds [16], [30], volumes [29], depth maps [4], [17], [52], etc. The commonly used approach splits the overall MVS problem into several stages: dense point clouds are first extracted from multi-view images through block-based matching [1], followed by the construction of surface structures either through triangulation [6] or implicit surface fitting [25], [26]. Despite being well-established and extensively utilized in academia and industry, these traditional methods are susceptible to artifacts stemming from erroneous matching or noise introduced along the pipeline. In response, several approaches aim to enhance reconstruction completeness and accuracy by integrating deep neural networks into the matching process [50], [54].

B. Neural Surface Reconstruction

Numerous pioneering efforts have leveraged pure deep neural networks to predict surface models directly from single or multiple images using point clouds [14], [34], voxels [12], [58], triangular meshes [32], [55], or implicit fields [40], [47] in an end-to-end manner. However, these methods often incur significant computational overhead during network inference and demand extensively labeled training 3D models, hindering their real-time and real-world applicability.

With the rapid advancement of neural surface reconstruction, a meticulously designed scene recovery method named NeRF [41] emerged. NeRF-based methods take 5D ray information as input and predict density and color sampled in continuous space, yielding notably more realistic rendering results. However, this representation falls short in capturing high-fidelity surfaces.

Consequently, several approaches have transformed NeRF-based network architectures into surface reconstruction frameworks by incorporating intermediate representations such as occupancy [46] or signed distance fields [56], [60]. Despite the potent surface reconstruction capabilities exhibited by NeRF-based frameworks, the stacked multi-layer perceptron (MLP) layers impose constraints on inference time and representation ability. To address this challenge, various follow-up studies aim to reduce the dependency on MLP layers by decomposing scene information into separable structures, such as points [59] and voxels [31], [33], [35].

C. Gaussian Splatting based Surface Reconstruction

SuGaR [19] proposed a method to extract a mesh from 3DGS. They introduced regularization terms to encourage Gaussians to fit the scene surface. By sampling 3D points from the Gaussians using the density field, they utilized Poisson reconstruction to extract a mesh from these sampled point clouds. While encouraging Gaussians to fit the surface enhances geometric reconstruction accuracy, irregular 3D Gaussian shapes make modeling smooth geometric surfaces challenging. Moreover, due to the discreteness and disorder of the Gaussians, relying solely on the image reconstruction loss can lead to overfitting, resulting in incomplete geometric information and surface mismatch. 2DGS [21] achieves view-consistent geometry by collapsing the 3D volume into a set of 2D oriented planar Gaussian disks. GOF [69] establishes a Gaussian opacity field, enabling geometry extraction by directly identifying its level set. However, these 3DGS-based methods still produce biased depth, and multi-view geometric consistency is not guaranteed. To address these issues, we flatten the Gaussian into a planar shape, which is more
Fig. 4: PGSR Overview. We compress Gaussians into flat planes and render distance and normal maps, which are then transformed into unbiased depth maps. Single-view and multi-view geometric regularization ensures high precision in global geometry. The exposure-compensated RGB loss enhances reconstruction accuracy.
suitable for modeling actual surfaces and facilitates rendering parameters such as normals and distances from the plane to the origin. Based on these plane parameters, we propose unbiased depth estimation, allowing us to extract geometric parameters from the Gaussians. We then introduce geometric regularization terms from single and multiple views to optimize these geometric parameters, achieving globally consistent, high-precision geometric reconstruction.

III. PRELIMINARY OF 3D GAUSSIAN SPLATTING

3DGS [27] explicitly represents 3D scenes with a set of 3D Gaussians {Gi}. Each Gaussian is defined by a Gaussian function:
$$G_i(\mathbf{x} \mid \mu_i, \Sigma_i) = e^{-\frac{1}{2}(\mathbf{x}-\mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x}-\mu_i)},$$
where µi ∈ R³ and Σi ∈ R³ˣ³ are the center of a point pi ∈ P and the corresponding 3D covariance matrix, respectively. The covariance matrix Σi is factorized into a scaling matrix Si ∈ R³ˣ³ and a rotation matrix Ri ∈ R³ˣ³:
$$\Sigma_i = R_i S_i S_i^{\top} R_i^{\top}.$$
3DGS allows fast α-blending for rendering. Given a transformation matrix W and an intrinsic matrix K, µi and Σi can be transformed to the camera coordinate frame corresponding to W and then projected to 2D coordinates:
$$\mu_i' = K W [\mu_i, 1]^{\top}, \qquad \Sigma_i' = J W \Sigma_i W^{\top} J^{\top},$$
where J is the Jacobian of the affine approximation of the projective transformation. The rendered color C ∈ R³ of a pixel u is obtained by α-blending:
$$C = \sum_{i \in N} T_i \alpha_i c_i, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j),$$
where αi is calculated by evaluating Gi(u | µi', Σi') multiplied with a learnable opacity corresponding to Gi, and the view-dependent color ci ∈ R³ is represented by spherical harmonics (SH) of the Gaussian Gi. Ti is the accumulated transmittance, and N is the number of Gaussians that the ray passes through.

The center µi of a Gaussian Gi can be projected into the camera coordinate system as:
$$[x_i, y_i, z_i, 1]^{\top} = W [\mu_i, 1]^{\top}.$$
Previous methods [11], [24] render depth under the current viewpoint as:
$$D = \sum_{i \in N} T_i \alpha_i z_i.$$

IV. METHOD

Given multi-view RGB images of static scenes, our goal is to achieve efficient and high-fidelity scene geometry reconstruction and rendering quality. Compared to 3DGS, we achieve global consistency in geometry reconstruction while maintaining similar rendering quality. Initially, we improve the modeling of scene geometry attributes by compressing 3D Gaussians into a 2D flat plane representation, which is used to generate plane distance and normal maps that are subsequently converted into unbiased depth maps. We then introduce single-view geometric, multi-view photometric, and geometric consistency losses to ensure global geometry consistency. Additionally, an exposure compensation model further improves reconstruction accuracy.

A. Planar-based Gaussian Splatting Representation

In this section, we discuss how to transform 3D Gaussians into a 2D flat plane representation. Based on this plane representation, we introduce an unbiased depth rendering method, which renders plane-to-camera distance and normal maps that can then be converted into depth maps. With geometric depth, distance, and normal maps available, it becomes easier to introduce single-view and multi-view regularization in the following sections.
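To make the blending equations above concrete, the following minimal NumPy sketch composites color, plane distance, and normal along one ray with the weights Tᵢαᵢ, and then converts the composited distance and normal into a per-pixel depth. The function names and the exact distance-to-depth normalization (dividing the rendered distance by the dot product of the normal and the pixel ray) are our own assumptions based on the description that the two rendered maps are divided to obtain the unbiased depth; this is a sketch, not the paper's implementation.

```python
import numpy as np

def composite_along_ray(alphas, colors, distances, normals):
    # Front-to-back alpha compositing for one camera ray:
    # C = sum_i T_i * alpha_i * c_i with T_i = prod_{j<i} (1 - alpha_j).
    # Besides color, we also composite each Gaussian's plane-to-camera
    # distance and normal, yielding the distance and normal map values.
    T = 1.0                      # accumulated transmittance
    color = np.zeros(3)
    dist = 0.0                   # composited plane-to-camera distance
    normal = np.zeros(3)
    for a, c, d, n in zip(alphas, colors, distances, normals):
        w = T * a                # blending weight T_i * alpha_i
        color += w * np.asarray(c)
        dist += w * d
        normal += w * np.asarray(n)
        T *= (1.0 - a)
    return color, dist, normal

def unbiased_depth(dist, normal, pixel, K):
    # Hypothetical distance-to-depth conversion: divide the composited plane
    # distance by the dot product between the composited normal and the
    # camera ray through the pixel (our reading of "divides the two";
    # the exact normalization is an assumption, not taken from the paper).
    u, v = pixel
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    denom = float(normal @ ray)
    return dist / denom if abs(denom) > 1e-8 else 0.0
```

In an actual tile-based rasterizer the same weights would be accumulated per pixel over all projected Gaussians sorted front to back, so the color, distance, and normal maps come out of a single pass.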
Fig. 5: Rendering and mesh reconstruction results of our method in various indoor and outdoor scenes. PGSR achieves high-precision geometric reconstruction from a series of RGB images without requiring any prior knowledge.
Fig. 8: Qualitative comparison on Tanks and Temples dataset. We visualize surface quality using a normal map generated from the reconstructed mesh.
PGSR outperforms other baseline approaches in capturing scene details, whereas baseline methods exhibit missing or noisy surfaces.
affect our final convergence. This is because the single-view regularization term and the use of sparse 3D Gaussians to represent dense scenes will gradually propagate high-precision geometry, eventually leading all Gaussians to converge to the correct positions. V is the set of all pixels in the image excluding those with high forward and backward projection error.

Multi-View Photometric Consistency: Drawing inspiration from multi-view stereo (MVS) methods [4], [15], [51], we employ photometric multi-view consistency constraints based on plane patches. We map an 11×11 pixel patch Pr centered at pr to the neighboring frame patch Pn using the homography matrix Hrn. Focusing on geometric details, we convert color images into grayscale. Multi-view photometric regularization requires that Pr and Pn be as consistent as possible. We use the normalized cross-correlation (NCC) [68] of patches in the reference frame and the neighboring frame to measure the photometric consistency:
$$\mathcal{L}_{mvrgb} = \frac{1}{|V|} \sum_{p_r \in V} \left(1 - \mathrm{NCC}\big(I_r(p_r),\, I_n(H_{rn} p_r)\big)\right), \qquad (10)$$
where V is the set of all pixels in the image, excluding those with high forward and backward projection errors.
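As a concrete illustration of the per-pixel term in Eq. (10), the sketch below computes 1 − NCC between an 11×11 grayscale patch in the reference image and a patch around the homography-warped location in the neighboring image. It is a simplification under our own assumptions: only the patch center is warped with Hrn and an axis-aligned patch is sampled around it, whereas the full method warps the entire patch.

```python
import numpy as np

def ncc(p, q, eps=1e-6):
    # Normalized cross-correlation between two grayscale patches.
    p = p.astype(np.float64).ravel(); p -= p.mean()
    q = q.astype(np.float64).ravel(); q -= q.mean()
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + eps))

def mv_photometric_term(ref_gray, nbr_gray, pr, H_rn, half=5):
    # Per-pixel term (1 - NCC) of Eq. (10): compare the 11x11 patch centered
    # at pixel pr in the reference image with a patch around the location
    # obtained by warping pr into the neighboring image with H_rn.
    x, y = pr
    ph = H_rn @ np.array([x, y, 1.0])
    xn, yn = int(round(ph[0] / ph[2])), int(round(ph[1] / ph[2]))
    P_r = ref_gray[y - half:y + half + 1, x - half:x + half + 1]
    P_n = nbr_gray[yn - half:yn + half + 1, xn - half:xn + half + 1]
    if P_r.shape != (2 * half + 1, 2 * half + 1) or P_n.shape != P_r.shape:
        return 0.0   # patch leaves the image; skip this pixel
    return 1.0 - ncc(P_r, P_n)
```

Pixels whose forward and backward projection error is too high, i.e., those excluded from V in Eq. (10), would simply be skipped before accumulating this term.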
TABLE I: Quantitative results of rendering quality for novel view synthesis on the Mip-NeRF360 dataset. "Red", "Orange", and "Yellow" denote the best, second-best, and third-best results. PGSR achieves results close to 3DGS and outperforms the comparable reconstruction method SuGaR.

                        Indoor scenes            Outdoor scenes           Average on all scenes
                        PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓
NeRF-based
  NeRF [41]             26.84  0.790  0.370      21.46  0.458  0.515      24.15  0.624  0.443
  Deep Blending [20]    26.40  0.844  0.261      21.54  0.524  0.364      23.97  0.684  0.313
  INGP [44]             29.15  0.880  0.216      22.90  0.566  0.371      26.03  0.723  0.294
  M-NeRF360 [2]         31.72  0.917  0.180      24.47  0.691  0.283      28.10  0.804  0.232
  NeuS [56]             25.10  0.789  0.319      21.93  0.629  0.600      23.74  0.720  0.439
GS-based
  3DGS [27]             30.99  0.926  0.199      24.24  0.705  0.283      27.24  0.803  0.246
  SuGaR [19]            29.44  0.911  0.216      22.76  0.631  0.349      26.10  0.771  0.283
  2DGS [21]             30.39  0.923  0.183      24.33  0.709  0.284      27.03  0.804  0.239
  GOF [69]              30.80  0.928  0.167      24.76  0.742  0.225      27.78  0.835  0.196
  PGSR                  30.41  0.930  0.161      24.45  0.730  0.224      27.43  0.830  0.193
Fig. 9: Multi-view photometric and geometric loss.

3) Geometric Regularization Loss: Finally, the geometric regularization loss includes single-view geometric, multi-view geometric, and multi-view photometric consistency constraints:
$$\mathcal{L}_{geo} = \lambda_2 \mathcal{L}_{svgeo} + \lambda_3 \mathcal{L}_{mvrgb} + \lambda_4 \mathcal{L}_{mvgeom}. \qquad (11)$$

C. Exposure Compensation Image Loss

Due to changes in external lighting conditions, cameras may have different exposure times at different shooting moments, leading to overall brightness variations across images. The original 3DGS does not consider brightness changes, which can result in floating artifacts in practical scenes. To model the overall brightness variation at different times, we assign two exposure coefficients, a and b, to each image. Images with exposure compensation are then obtained by a simple computation with the exposure coefficients:
$$I_i^a = \exp(a_i)\, I_i^r + b_i, \qquad (12)$$
where I_i^r is the rendered image and I_i^a is the exposure-adjusted image. We employ the following image loss:
$$\mathcal{L}_{rgb} = (1-\lambda)\,\mathcal{L}_1(\tilde{I} - I_i) + \lambda\, \mathcal{L}_{SSIM}(I_i^r - I_i), \qquad (13)$$
$$\tilde{I} = \begin{cases} I_i^a, & \text{if } \mathcal{L}_{SSIM}(I_i^r - I_i) < 0.5, \\ I_i^r, & \text{if } \mathcal{L}_{SSIM}(I_i^r - I_i) \geq 0.5, \end{cases} \qquad (14)$$
where I_i is the ground truth image. The L1 loss ensures that the exposure-adjusted image is consistent with the ground truth image, while the SSIM loss requires the rendered image to have a structure similar to the ground truth image. To enhance the robustness of exposure coefficient estimation, we need to ensure that the rendered image and the ground truth image have sufficient structural similarity before performing the estimation. After training, I_i^r is required to be globally consistent and to maintain structural similarity with the ground truth image, while I_i^a can adjust the brightness of images to match the ground truth image perfectly.

D. Training

In summary, our final training loss L consists of the image reconstruction loss L_rgb, the flattening 3D Gaussian loss L_s, and the geometric loss L_geo:
$$\mathcal{L} = \mathcal{L}_{rgb} + \lambda_1 \mathcal{L}_s + \mathcal{L}_{geo}. \qquad (15)$$
We set λ1 = 100. For the image reconstruction loss, we set λ = 0.2. For the geometric loss, we set λ2 = 0.01, λ3 = 0.2, and λ4 = 0.05.

V. EXPERIMENTS

Datasets: To validate the effectiveness of our method, we conducted experiments on various real-world datasets, including objects and indoor and outdoor environments. We chose the widely used Mip-NeRF360 dataset [2] for evaluating novel view synthesis performance. The large and complex scenes of the TnT dataset [28] and 15 object-centric scenes of the DTU dataset [23] were selected to assess reconstruction quality.

Evaluation Criteria: We chose three widely used image evaluation metrics to validate novel view synthesis: peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS) [70]. For assessing surface quality, we employed the F1 score and chamfer distance.

Implementation Details: Our training strategy and hyperparameters are generally consistent with 3DGS [27]. The training iterations for all scenes are set to 30,000. We adopt the densification strategy of AbsGS [67]. The learning rate for the exposure coefficients is 0.001. We begin by rendering the depth for each training view, then utilize the TSDF fusion algorithm [45] to generate the corresponding TSDF field. Subsequently, we extract the mesh [38] from the TSDF field. We only apply exposure compensation on the Tanks and Temples dataset. All experiments in this paper are conducted on an Nvidia RTX 4090 GPU.
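To connect the loss definitions above, the following minimal NumPy sketch evaluates the exposure-compensated image loss of Eqs. (12)-(14) and the total loss of Eq. (15) for a single training view. Here ssim_loss is a placeholder for an SSIM-based loss supplied by the reader, and the flattening and geometric losses are assumed to be computed elsewhere; this is a sketch under those assumptions rather than the released implementation.

```python
import numpy as np

def exposure_compensated_rgb_loss(I_r, I_gt, a, b, ssim_loss, lam=0.2):
    # Image loss of Eqs. (12)-(14): apply the per-image exposure coefficients
    # (a, b), then mix an L1 term on the exposure-adjusted image with an SSIM
    # term on the raw rendering. The SSIM gate at 0.5 falls back to the raw
    # rendering until rendering and ground truth are structurally similar.
    I_a = np.exp(a) * I_r + b                  # Eq. (12): exposure-adjusted image
    L_ssim = ssim_loss(I_r, I_gt)              # SSIM-based loss (placeholder)
    I_tilde = I_a if L_ssim < 0.5 else I_r     # Eq. (14): gated image selection
    L1 = np.abs(I_tilde - I_gt).mean()         # L1 term of Eq. (13)
    return (1.0 - lam) * L1 + lam * L_ssim     # Eq. (13)

def total_loss(L_rgb, L_s, L_geo, lam1=100.0):
    # Eq. (15): image loss + flattening loss + geometric loss.
    return L_rgb + lam1 * L_s + L_geo
```

In training, the per-image coefficients a and b would be optimized jointly with the Gaussian parameters (the paper uses a learning rate of 0.001 for them), and exposure compensation is only enabled on the Tanks and Temples scenes.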
TABLE II: Quantitative results of chamfer distance (mm)↓ on the DTU dataset [23]. PGSR achieves the highest reconstruction accuracy and is over 100 times faster than the NeRF-based SDF method.
[21] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. arXiv preprint arXiv:2403.17888, 2024.
[22] Chenxi Huang, Yuenan Hou, Weicai Ye, Di Huang, Xiaoshui Huang, Binbin Lin, Deng Cai, and Wanli Ouyang. Nerf-det++: Incorporating semantic cues and perspective-aware depth supervision for indoor multi-view 3d detection. arXiv preprint arXiv:2402.14464, 2024.
[23] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014.
[24] Yingwenqi Jiang, Jiadong Tu, Yuan Liu, Xifeng Gao, Xiaoxiao Long, Wenping Wang, and Yuexin Ma. Gaussianshader: 3d gaussian splatting with shading functions for reflective surfaces. arXiv preprint arXiv:2311.17977, 2023.
[25] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006.
[26] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):1–13, 2013.
[27] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
[28] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
[29] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. International journal of computer vision, 38:199–218, 2000.
[30] Maxime Lhuillier and Long Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE transactions on pattern analysis and machine intelligence, 27(3):418–433, 2005.
[31] Hai Li, Xingrui Yang, Hongjia Zhai, Yuqian Liu, Hujun Bao, and Guofeng Zhang. Vox-surf: Voxel-based implicit surface representation. IEEE Transactions on Visualization and Computer Graphics, 2022.
[32] Hai Li, Weicai Ye, Guofeng Zhang, Sanyuan Zhang, and Hujun Bao. Saliency guided subdivision for single-view mesh reconstruction. In 2020 International Conference on 3D Vision (3DV), pages 1098–1107. IEEE, 2020.
[33] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023.
[34] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In Conference on Artificial Intelligence, pages 7114–7121, 2018.
[35] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems, pages 15651–15663, 2020.
[36] Xiangyu Liu, Weicai Ye, Chaoran Tian, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Coxgraph: multi-robot collaborative, globally consistent, online dense reconstruction system. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8722–8728. IEEE, 2021.
[37] Xiaoxiao Long, Yuhang Zheng, Yupeng Zheng, Beiwen Tian, Cheng Lin, Lingjie Liu, Hao Zhao, Guyue Zhou, and Wenping Wang. Adaptive surface normal constraint for geometric estimation from monocular images. arXiv preprint arXiv:2402.05869, 2024.
[38] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998.
[39] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. arXiv preprint arXiv:2312.00109, 2023.
[40] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
[41] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[42] Yuhang Ming, Weicai Ye, and Andrew Calway. idf-slam: End-to-end rgb-d slam with neural implicit mapping and deep feature tracking. arXiv preprint arXiv:2209.07919, 2022.
[43] Pierre Moulon, Pascal Monasse, and Renaud Marlet. Adaptive structure from motion with a contrario model estimation. In Proceedings of the Asian Computer Vision Conference (ACCV 2012), pages 257–270. Springer Berlin Heidelberg, 2012.
[44] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.
[45] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE international symposium on mixed and augmented reality, pages 127–136. IEEE, 2011.
[46] Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3501–3512, 2020.
[47] Jeong Joon Park, Peter Florence, Julian Straub, Richard A. Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.
[48] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[49] Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, Raquel Urtasun, and Jiaya Jia. Geonet: Geometric neural network for joint depth and surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–291, 2018.
[50] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.
[51] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
[52] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[53] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
[54] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patchmatchnet: Learned multi-view patchmatch stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14194–14203, 2021.
[55] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3D mesh models from single RGB images. In European Conference on Computer Vision, volume 11215, pages 55–71, 2018.
[56] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
[57] Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013, pages 127–134. IEEE, 2013.
[58] Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen Zhou, and Shengping Zhang. Pix2Vox: Context-aware 3D reconstruction from single and multi-view images. In IEEE/CVF International Conference on Computer Vision, pages 2690–2698, 2019.
[59] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022.
[60] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Advances in Neural Information Processing Systems, pages 4805–4815, 2021.
[61] Weicai Ye, Shuo Chen, Chong Bao, Hujun Bao, Marc Pollefeys, Zhaopeng Cui, and Guofeng Zhang. IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[62] Weicai Ye, Xinyu Chen, Ruohao Zhan, Di Huang, Xiaoshui Huang, Haoyi Zhu, Hujun Bao, Wanli Ouyang, Tong He, and Guofeng Zhang. Dynamic-Aware Tracking Any Point for Structure from Motion in the Wild. arXiv preprint, 2024.
[63] Weicai Ye, Chenhao Ji, Zheng Chen, Junyao Gao, Xiaoshui Huang, Song-Hai Zhang, Wanli Ouyang, Tong He, Cairong Zhao, and Guofeng Zhang. DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion. arXiv preprint, 2024.
[64] Weicai Ye, Xinyue Lan, Shuo Chen, Yuhang Ming, Xingyuan Yu, Hujun
Bao, Zhaopeng Cui, and Guofeng Zhang. Pvo: Panoptic visual odometry.
In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 9579–9589, June 2023.
[65] Weicai Ye, Hai Li, Tianxiang Zhang, Xiaowei Zhou, Hujun Bao, and
Guofeng Zhang. SuperPlane: 3D plane detection and description from
a single image. In 2021 IEEE Virtual Reality and 3D User Interfaces
(VR), pages 207–215. IEEE, 2021.
[66] Weicai Ye, Xingyuan Yu, Xinyue Lan, Yuhang Ming, Jinyu Li, Hujun
Bao, Zhaopeng Cui, and Guofeng Zhang. Deflowslam: Self-supervised
scene motion decomposition for dynamic dense slam. arXiv preprint
arXiv:2207.08794, 2022.
[67] Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, and Yong Dou. Absgs:
Recovering fine details for 3d gaussian splatting. arXiv preprint
arXiv:2404.10484, 2024.
[68] Jae-Chern Yoo and Tae Hee Han. Fast normalized cross-correlation.
Circuits, systems and signal processing, 28:819–843, 2009.
[69] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian opacity fields:
Efficient and compact surface reconstruction in unbounded scenes. arXiv
preprint arXiv:2404.10772, 2024.
[70] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver
Wang. The unreasonable effectiveness of deep features as a perceptual
metric. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 586–595, 2018.
[71] Tianxiang Zhang, Chong Bao, Hongjia Zhai, Jiazhen Xia, Weicai Ye,
and Guofeng Zhang. Arcargo: Multi-device integrated cargo load-
ing management system with augmented reality. In 2021 IEEE Intl
Conf on Dependable, Autonomic and Secure Computing, Intl Conf on
Pervasive Intelligence and Computing, Intl Conf on Cloud and Big
Data Computing, Intl Conf on Cyber Science and Technology Congress
(DASC/PiCom/CBDCom/CyberSciTech), pages 341–348. IEEE, 2021.
Fig. 15: Qualitative comparisons in surface reconstruction between PGSR, 2DGS, and GOF.
Fig. 16: PGSR achieves high-precision geometric reconstruction in various indoor and outdoor scenes from a series of RGB images without requiring any
prior knowledge.