SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views
1 The University of Hong Kong   2 Tencent Games   3 Texas A&M University
1 Introduction
Fig. 1: (a) Input images; (b) Ours (inference); (c) MVSNerf (inference); (d) Ours (fine-tuning, 12 mins); (e) NeuS (optimizing, 15 h). Our method can generalize across diverse scenes and reconstruct neural surfaces from only three input images (a) via fast network inference (b). The reconstruction quality of the fast inference step is more accurate and faithful than the result of MVSNerf [3] (c). Our inference result can be further improved by a per-scene fine-tuning process. Compared to NeuS [44] (e), our per-scene optimization result (d) not only achieves noticeably better reconstruction quality, but also takes much less time to converge (12 minutes vs. 15 hours).
to obtain the high-level geometry, and a fine volume guided by the coarse level
to refine the geometry. A per-scene fine-tuning process is further incorporated into this scheme, which is conditioned on the inferred geometry to recover subtle details and generate even finer-grained surfaces. This multi-level scheme divides the task of high-quality reconstruction into several steps. Each step builds upon the geometry from the preceding step and focuses on constructing a finer level of detail. Moreover, due to the hierarchical nature of the scheme, the reconstruction efficiency is significantly boosted, because numerous samples far from the coarse surface can be discarded so that they do not burden the computation of the fine-level geometry reasoning.
The second important strategy that we propose is a multi-scale color blending
scheme for novel view synthesis. Given the limited information in the sparse
images, the network would struggle to directly regress accurate colors for render-
ing novel views. Thus, we mitigate this issue by predicting the linear blending
weights of the input image pixels to derive colors. Specifically, we adopt both
pixel-based and patch-based blending to jointly evaluate local and contextual
radiance consistency. This multi-scale blending scheme yields more reliable color
predictions when the input is sparse.
Another challenge in multi-view 3D reconstruction is that 3D surface points often do not have consistent projections across different views, due to occlusion or image noise. With only a small number of input views, the geometry reasoning depends more heavily on each image, which aggravates the problem and results in distorted geometry. To tackle this challenge, we propose a consistency-aware fine-tuning scheme in the fine-tuning stage. This scheme automatically detects regions that lack consistent projections and excludes these regions from the optimization. This strategy proves effective in making the fine-tuned surface less susceptible to occlusion and noise, and thus more accurate and cleaner, contributing to a high-quality reconstruction.
We evaluate our method on the DTU [11] and BlendedMVS [48] datasets, and show that it outperforms state-of-the-art unsupervised neural implicit surface reconstruction methods both quantitatively and qualitatively.
In summary, our main contributions are:
2 Related Work
3 Method
Given a few (i.e., three) views with known camera parameters, we present a
novel method that hierarchically recovers surfaces and generalizes across scenes.
As illustrated in Figure 2, our pipeline can be divided into three parts: (1) Ge-
ometry reasoning. SparseNeuS first constructs cascaded geometry encoding
volumes that encode local surface geometry information, and recovers surfaces
from the volumes in a coarse-to-fine manner (see Section 3.1). (2) Appearance prediction. SparseNeuS leverages a multi-scale color blending module to predict colors by aggregating information from input images, and then combines the estimated geometry with predicted colors to render synthesized views using volume rendering (see Section 3.2). (3) Per-scene fine-tuning. Finally, a consistency-aware fine-tuning scheme is proposed to further improve the obtained geometry with fine-grained details (see Section 3.3).
Fig. 2: Pipeline overview: volume encoding and geometry-guided encoding refinement, multi-scale color blending, volume rendering of the input views, and a consistency-aware rendering loss that masks out inconsistent points.
denotes the projected pixel location of v on the feature map F_i. For simplicity, we abbreviate F_i(π_i(v)) as F_i(v).
The geometry encoding volume M is constructed using all the projected features {F_i(v)}_{i=0}^{N-1} of each vertex. Following prior methods [46,3], we first calculate the variance of all the projected features of a vertex to build a cost volume B, and then apply a sparse 3D CNN Ψ to aggregate the cost volume B to obtain the geometry encoding volume M:
\label {eq_geo_volume} B(v)=\operatorname {Var}\left (\{F_{i}(v)\}_{i=0}^{N-1}\right ), \quad M=\Psi (B), (1)
where Var is the variance operation, which computes the variance of all the projected features {F_i(v)}_{i=0}^{N-1} of each vertex v.
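To make the construction concrete, a minimal sketch of the variance-based cost volume is given below; the tensor layout and the handling of invalid projections are assumptions, and the sparse 3D CNN Ψ is treated as a given module.

```python
import torch

def build_cost_volume(proj_feats):
    """Variance-based cost volume B over the volume vertices.

    proj_feats: (N, V, C) projected features F_i(v) of each of the V vertices in
    the N input views (vertices falling outside a view are assumed to be already
    masked or filled). The per-vertex variance across views is the cost B(v),
    which is then aggregated by the sparse 3D CNN Psi into the geometry
    encoding volume M (not shown here).
    """
    return proj_feats.var(dim=0, unbiased=False)  # (V, C)
```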
Surface extraction. Given an arbitrary 3D location q, an MLP network f_θ takes as input the combination of the 3D coordinate and the corresponding feature M(q) interpolated from the geometry encoding volume, and predicts the Signed Distance Function (SDF) value s(q) for surface representation. Specifically, positional encoding PE is applied to the 3D coordinates, and the surface extraction operation is expressed as: s(q) = f_θ(PE(q), M(q)).
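As an illustration, the following PyTorch sketch queries s(q) = f_θ(PE(q), M(q)); the 4 hidden layers of width 256 and the 6 positional-encoding frequencies follow the supplementary, while the volume feature dimension, the activation, and the trilinear interpolation via grid_sample are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def positional_encoding(x, num_freqs=6):
    # PE(x): sin/cos features at exponentially growing frequencies
    # (6 frequencies for 3D locations, per the supplementary).
    feats = [x]
    for i in range(num_freqs):
        feats.append(torch.sin((2.0 ** i) * x))
        feats.append(torch.cos((2.0 ** i) * x))
    return torch.cat(feats, dim=-1)

class SDFNetwork(nn.Module):
    # f_theta: maps PE(q) and the interpolated volume feature M(q) to an SDF value s(q).
    # 4 hidden layers of width 256 follow the supplementary; the Softplus activation
    # and the volume feature dimension are assumptions.
    def __init__(self, vol_feat_dim=16, hidden=256, num_layers=4, num_freqs=6):
        super().__init__()
        in_dim = 3 + 3 * 2 * num_freqs + vol_feat_dim
        layers = []
        for i in range(num_layers):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.Softplus(beta=100)]
        layers.append(nn.Linear(hidden, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, q, volume, vol_origin, vol_size):
        # q: (B, 3) points; volume: (1, C, D, H, W) geometry encoding volume M.
        # Normalize q into [-1, 1]^3 and trilinearly interpolate M(q) with grid_sample
        # (assuming the volume axes match grid_sample's (x, y, z) ordering).
        q_norm = (q - vol_origin) / vol_size * 2.0 - 1.0
        grid = q_norm.view(1, -1, 1, 1, 3)                         # (1, B, 1, 1, 3)
        m_q = F.grid_sample(volume, grid, align_corners=True)      # (1, C, B, 1, 1)
        m_q = m_q.view(volume.shape[1], -1).t()                    # (B, C)
        return self.mlp(torch.cat([positional_encoding(q), m_q], dim=-1))  # s(q): (B, 1)
```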
Cascaded volumes scheme. To balance computational efficiency and reconstruction accuracy, SparseNeuS constructs cascaded geometry encoding volumes of two resolutions to perform geometry reasoning in a coarse-to-fine manner. A coarse geometry encoding volume is first constructed to infer the fundamental geometry, which captures the global structure of the scene but is relatively less accurate due to the limited volume resolution. Guided by the obtained coarse geometry, a fine-level geometry encoding volume is constructed to further refine the surface details. Numerous vertices far from the coarse surfaces can be discarded in the fine-level volume, which significantly reduces the memory burden and improves efficiency.
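A simplified sketch of how the coarse geometry can guide the fine level is shown below: fine-level vertices are kept only if their coarse SDF magnitude is below a threshold, and the survivors form the sparse vertex set of the fine volume. The threshold value and helper names are assumptions, not the authors' exact implementation.

```python
import torch

def prune_fine_vertices(fine_vertices, coarse_sdf_fn, sdf_threshold=0.05, chunk=65536):
    """Keep only fine-level vertices that lie near the coarse surface.

    fine_vertices: (V, 3) coordinates of the dense fine-level grid vertices.
    coarse_sdf_fn: callable mapping (B, 3) points to (B, 1) coarse SDF values,
                   e.g. the coarse-level surface prediction from the previous stage.
    Returns the surviving vertices and a boolean mask over the input; only the
    survivors are fed to the fine-level sparse 3D CNN.
    """
    keep = []
    with torch.no_grad():
        for i in range(0, fine_vertices.shape[0], chunk):
            sdf = coarse_sdf_fn(fine_vertices[i:i + chunk]).squeeze(-1)
            keep.append(sdf.abs() < sdf_threshold)  # discard vertices far from the surface
    mask = torch.cat(keep, dim=0)
    return fine_vertices[mask], mask
```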
feature F_i'(q). Next, we feed the new feature F_i'(q), the viewing direction of the query ray relative to the viewing direction of the i-th input image ∆d_i = d − d_i, and the trilinearly interpolated volume encoding feature M(q) into an MLP network f_c to generate the blending weight: w_i^q = f_c(F_i'(q), M(q), ∆d_i). Finally, the blending weights {w_i^q}_{i=0}^{N-1} are normalized using a Softmax operator.
Pixel-based color blending. With the obtained blending weights, the color c_q of a 3D location q is predicted as the weighted sum of its projected colors {I_i(q)}_{i=0}^{N-1} on the input images. To render the color of a query ray, we first predict the colors and SDF values of 3D points sampled on the ray. The colors and SDF values of the sampled points are aggregated to obtain the final color of the ray using SDF-based volume rendering [44]. Since the color of a query ray corresponds to a pixel of the synthesized image, we name this operation pixel-based blending. Although supervision on the colors rendered by pixel-based blending already induces effective geometry reasoning, the information of a pixel is local and lacks context, which usually leads to inconsistent surface patches when the input is sparse.
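A minimal sketch of pixel-based blending, assuming the blending scores from f_c and the projected colors I_i(q) have already been gathered:

```python
import torch
import torch.nn.functional as F

def blend_pixel_colors(blend_logits, projected_colors):
    """Pixel-based color blending for a batch of sampled 3D points.

    blend_logits:     (P, N) raw blending scores w_i^q from the blending MLP f_c.
    projected_colors: (P, N, 3) colors I_i(q) obtained by projecting each point
                      into the N input images and bilinearly interpolating.
    Returns (P, 3) blended colors c_q.
    """
    weights = F.softmax(blend_logits, dim=-1)                 # normalize over the N views
    return (weights.unsqueeze(-1) * projected_colors).sum(dim=1)
```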
Patch-based color blending. Inspired by classical patch matching, we con-
sider enforcing the synthesized colors and ground truth colors to be contextually
consistent; that is, not only in pixel level but also in patch level. To render the
colors of a patch of size k × k, a naive implementation is to query the colors of k² rays using volume rendering, which causes a huge amount of computation. We therefore leverage a local surface plane assumption and homography transformation to achieve a more efficient implementation.
The key idea is to estimate a local plane of a sampled point to efficiently derive the local patch. Given a sampled point q, we leverage the property of the SDF network s(q) to estimate the normal direction n_q by computing the spatial gradient, i.e., n_q = ∇s(q). Then, we sample a set of points on the local plane (q, n_q), project the sampled points to each view, and obtain the colors by interpolation on each input image. All the points on the local plane share the same blending weights as q, so only one query of the blending weights is needed. Using the local plane assumption, we consider the neighboring geometric information of a query 3D position, which encodes contextual information of local patches and enforces better geometric consistency. By adopting patch-based volume rendering, synthesized regions contain more global information than single pixels, thus producing more informative and consistent shape context, especially in regions with weak texture and changing intensity.
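The sketch below illustrates one way to realize the local-plane patch sampling: the normal n_q is the normalized SDF gradient, two in-plane axes are built from it, and a k × k grid of 3D points is generated around q. The patch spacing, the helper axis, and the explicit per-point projection (instead of the homography described in the supplementary) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def sample_local_patch(q, sdf_network, k=5, spacing=0.01):
    """Sample a k x k grid of 3D points on the local tangent plane of each point q.

    q: (P, 3) sampled points on the query rays. The normal n_q is the normalized
    spatial gradient of the SDF at q; all k*k patch points share q's blending weights.
    The degenerate case where n_q is parallel to the helper axis is ignored for brevity.
    """
    q = q.clone().requires_grad_(True)
    sdf = sdf_network(q)
    n_q = torch.autograd.grad(sdf.sum(), q, create_graph=False)[0]
    n_q = F.normalize(n_q, dim=-1)                                     # (P, 3)

    # Build an orthonormal in-plane basis (u, v) orthogonal to n_q.
    helper = torch.tensor([0.0, 0.0, 1.0], device=q.device).expand_as(n_q)
    u = F.normalize(torch.cross(n_q, helper, dim=-1), dim=-1)
    v = torch.cross(n_q, u, dim=-1)

    offsets = (torch.arange(k, device=q.device, dtype=q.dtype) - (k - 1) / 2.0) * spacing
    gu, gv = torch.meshgrid(offsets, offsets, indexing="ij")           # (k, k) grids
    patch = (q[:, None, None, :]
             + gu[None, :, :, None] * u[:, None, None, :]
             + gv[None, :, :, None] * v[:, None, None, :])             # (P, k, k, 3)
    return patch.detach(), n_q.detach()
```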
Volume rendering. To render the pixel-based color C(r) or patch-based color P(r) of a ray r passing through the scene, we query the pixel-based colors c_i, patch-based colors p_i, and SDF values s_i of M samples on the ray, and then utilize [44] to convert the SDF values s_i into densities σ_i. Finally, the densities are used to accumulate the pixel-based and patch-based colors along the ray:
\label {eq_volume_rendering} U(r)=\sum _{i=1}^{M} T_{i}\left (1-\exp \left (-\sigma _{i}\right )\right ) u_{i}, \quad \text {where} \quad T_{i}=\exp \left (-\sum _{j=1}^{i-1} \sigma _{j}\right ). (2)
Here U (r) denotes C (r) or P (r), while ui denotes the pixel-based color ci or
patch-based color pi of the ith sample on the ray.
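For clarity, Eq. 2 can be written directly in a few lines; this sketch follows the equation literally and treats the SDF-to-density conversion of [44] as given.

```python
import torch

def accumulate_along_ray(sigma, values):
    """Volume rendering accumulation of Eq. 2.

    sigma:  (R, M) densities sigma_i of the M samples on each of R rays.
    values: (R, M, C) per-sample pixel colors c_i (C=3) or flattened patch colors p_i.
    Returns the accumulated colors U(r) of shape (R, C) and the per-sample weights.
    """
    alpha = 1.0 - torch.exp(-sigma)                      # 1 - exp(-sigma_i)
    # T_i = exp(-sum_{j<i} sigma_j): exclusive cumulative sum along each ray.
    T = torch.exp(-(torch.cumsum(sigma, dim=-1) - sigma))
    weights = T * alpha                                  # (R, M)
    return (weights.unsqueeze(-1) * values).sum(dim=1), weights
```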
With the generalizable priors and effective geometry reasoning framework, given
sparse images from a new scene, SparseNeuS can already recover geometry sur-
faces via fast network inference. However, due to the limited information in the
sparse input views and the high diversity and complexity of different scenes,
the geometry obtained by the generic model may contain inaccurate outliers
and lack subtle details. Therefore, we propose a novel fine-tuning scheme, which
is conditioned on the inferred geometry, to reconstruct subtle details and gen-
erate finer-grained surfaces. Thanks to the initialization given by the network inference, the per-scene optimization quickly converges to a high-quality surface.
Fine-tuning networks. During fine-tuning, we directly optimize the obtained fine-level geometry encoding volume and the signed distance function (SDF) network f_θ, while the 2D feature extraction network and the 3D sparse CNN are discarded. Moreover, the CNN-based blending network used in the generic setting is replaced by a tiny MLP network. Although the CNN-based network can also be used in per-scene fine-tuning, we found experimentally that a new tiny MLP speeds up the fine-tuning without loss of performance, since the MLP is much smaller than the CNN-based network. The MLP network still outputs blending weights {w_i^q}_{i=0}^{N-1} for a query 3D position q, but it takes as input the combination of the 3D coordinate q, the surface normal n_q, the ray direction d, the predicted SDF s(q), and the interpolated feature of the geometry encoding volume M(q). Specifically, positional encoding PE is applied to the 3D position q and the ray direction d. The MLP network f_c' is defined as: {w_i^q}_{i=0}^{N-1} = f_c'(PE(q), PE(d), n_q, s(q), M(q)), where {w_i^q}_{i=0}^{N-1} are the predicted blending weights and N is the number of input images.
Consistency-aware color loss. We observe that in multi-view stereo, 3D surface points often do not have consistent projections across different views, since the projections may be occluded or contaminated by image noise. As a result, the optimization of such regions gets stuck in sub-optima, and their predicted surfaces are often inaccurate and distorted. To tackle this problem, we propose a consistency-aware color loss to automatically detect the regions lacking consistent projections and exclude them from the optimization:
\label {ca_color_loss} \begin {split} \mathcal {L}_{color} & = \sum _{r \in \mathbb {R}} O \left (r \right ) \cdot \mathcal {D}_{pix}\left (C \left (r \right ), \tilde {C}\left (r \right )\right ) + \sum _{r \in \mathbb {R}} O \left (r \right ) \cdot \mathcal {D}_{pat}\left (P\left (r \right ),\tilde {P}\left (r \right )\right ) \\ & + \lambda _{0} \sum _{r \in \mathbb {R}} \log \left (O\left (r \right )\right ) + \lambda _{1} \sum _{r \in \mathbb {R}} \log \left (1- O\left (r \right )\right ), \end {split} (3)
where r is a query ray, R is the set of all query rays, and O(r) is the sum of the accumulated weights along the ray r obtained by volume rendering; from Eq. 2, we can easily derive O(r) = Σ_{i=1}^{M} T_i (1 − exp(−σ_i)). C(r) and C̃(r) are the rendered and ground-truth pixel-based colors of the query ray, P(r) and P̃(r) are the rendered and ground-truth patch-based colors, and D_pix and D_pat are the loss metrics for the rendered pixel colors and patch colors respectively. Empirically, we choose D_pix to be the L1 loss and D_pat the Normalized Cross Correlation (NCC) loss.
The rationale behind this formulation is that points with inconsistent projections always have relatively large color errors that cannot be minimized during optimization. Therefore, if the color errors are difficult to minimize, we force the sum of the accumulated weights O(r) towards zero, so that the inconsistent regions are excluded from the optimization. To control the level of consistency, we introduce two logistic regularization terms: decreasing the ratio λ_0/λ_1 leads to more regions being kept, while increasing it excludes more regions and yields cleaner surfaces.
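A minimal sketch of Eq. 3 under simplifying assumptions: D_pix is an L1 distance, D_pat is implemented as 1 − NCC over flattened patches, and a small epsilon guards the logarithms (an implementation detail not specified in the text).

```python
import torch

def consistency_aware_color_loss(O, c_pred, c_gt, p_pred, p_gt,
                                 lambda0=0.01, lambda1=0.015, eps=1e-4):
    """Consistency-aware color loss of Eq. 3 (per-scene fine-tuning).

    O:            (R,)   accumulated weights O(r) along each ray.
    c_pred, c_gt: (R, 3) rendered / ground-truth pixel colors.
    p_pred, p_gt: (R, K) rendered / ground-truth patch colors, flattened to K values.
    lambda0 and lambda1 default to the values reported in the supplementary.
    """
    d_pix = (c_pred - c_gt).abs().mean(dim=-1)           # L1 pixel metric D_pix

    def ncc(a, b):                                       # normalized cross correlation
        a = a - a.mean(dim=-1, keepdim=True)
        b = b - b.mean(dim=-1, keepdim=True)
        return (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)

    d_pat = 1.0 - ncc(p_pred, p_gt)                      # patch metric D_pat as 1 - NCC

    O = O.clamp(eps, 1.0 - eps)                          # keep the logarithms finite
    loss = (O * d_pix).sum() + (O * d_pat).sum()
    loss = loss + lambda0 * torch.log(O).sum() + lambda1 * torch.log(1.0 - O).sum()
    return loss
```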
The overall loss combines the color loss with two regularization terms:
\label {total_loss} \mathcal {L}= \mathcal {L}_{color}+\alpha \mathcal {L}_{eik} + \beta \mathcal {L}_{sparse}. (4)
We note that, in the early stage of generic training, the estimated geometry is relatively inaccurate, and 3D surface points may have large errors that do not provide clear clues on whether the corresponding regions are radiance-consistent or not. We therefore utilize the consistency-aware color loss only in the per-scene fine-tuning, and remove the last two consistency-aware logistic terms of Eq. 3 when training the generic model.
An Eikonal term [9] is applied on the sampled points to regularize the SDF
values derived from the surface prediction network fθ :
\mathcal {L}_{eik}=\frac {1}{\left \|\mathbb {Q} \right \|} \sum _{q \in \mathbb {Q}}\left ({\left \|\nabla f_{\theta }\left (q\right )\right \|}_{2}-1\right )^{2}, (5)
where q is a sampled 3D point, Q is the set of all sampled points, ∇f_θ(q) is the gradient of the network f_θ with respect to the sampled point q, and ∥·∥_2 is the L2 norm. The Eikonal term enforces the gradients of f_θ to have unit L2 norm, which encourages f_θ to generate smooth surfaces.
Besides, due to the property of accumulated transmittance in volume rendering, the invisible query samples behind the visible surfaces lack supervision, which causes uncontrollable free surfaces behind the visible ones. To enable our framework to generate compact geometry surfaces, we adopt a sparseness regularization term to penalize such uncontrollable free surfaces:
\label {sparse_lossterm} \mathcal {L}_{sparse}=\frac {1}{\left \|\mathbb {Q} \right \| } \sum _{q \in \mathbb {Q}} \exp \left (-\tau \cdot \left |s(q) \right |\right ), (6)
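Here τ is the SDF scaling parameter (set to 100 in the supplementary). Both regularizers are a few lines of code; the sketch below combines Eq. 5 and Eq. 6, obtaining ∇f_θ(q) by automatic differentiation.

```python
import torch

def eikonal_and_sparseness_losses(sdf_network, q, tau=100.0):
    """Eikonal term (Eq. 5) and sparseness term (Eq. 6) on a batch of sampled points.

    q: (Q, 3) sampled 3D points; sdf_network maps (Q, 3) -> (Q, 1) SDF values s(q).
    tau is the SDF scaling parameter of the sparseness term.
    """
    q = q.clone().requires_grad_(True)
    s = sdf_network(q)
    # grad f_theta(q) via autograd; create_graph=True so the loss can be backpropagated.
    grad = torch.autograd.grad(s.sum(), q, create_graph=True)[0]       # (Q, 3)

    l_eik = ((grad.norm(dim=-1) - 1.0) ** 2).mean()      # unit-norm gradient constraint
    l_sparse = torch.exp(-tau * s.abs()).mean()          # penalize |s(q)| near zero
    return l_eik, l_sparse
```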
Datasets. We train our framework on the DTU [11] dataset to learn a generalizable network. We use 15 scenes for testing, the same as those used in IDR [50], and the remaining non-overlapping 75 scenes for training. All the evaluation results on the testing scenes are generated using three views with a resolution of 600 × 800, and each scene contains two sets of three images. The foreground masks provided by IDR [50] are used for evaluating the testing scenes. For memory efficiency, we use center-cropped images with a resolution of 512 × 640 for training. We observe that the images of the DTU dataset contain large black backgrounds and that these regions contain considerable image noise, so we utilize a simple threshold-based denoising strategy to alleviate the noise in such regions of the training images. Optionally, the black backgrounds with zero RGB values can be used as a simple dataset prior to encourage the geometry predictions of such regions to be empty. We further test on 7 challenging scenes from the BlendedMVS [48] dataset. For each scene, we select one set of three images with a resolution of 768 × 576 as input. Note that, in the per-scene fine-tuning stage, we still use only the three input images for optimization, without any new images.
Implementation details. Feature Pyramid Network [20] is used as the image
feature extraction network to extract multi-scale features from input images. We
implement the sparse 3D CNN networks using a U-Net like architecture, and use
torchsparse [41] as the implementation of 3D sparse convolution. The resolutions
of the coarse level and fine level geometry encoding volumes are 96 × 96 × 96
and 192 × 192 × 192 respectively. The patch size used in patch-based blending is
5 Experiments
We compare our method with the state-of-the-art approaches from three classes:
1) generic neural rendering methods, including PixelNerf [51], IBRNet [45] and
MVSNerf [3], where we use a density threshold to extract meshes from the
learned implicit field; 2) per-scene optimization based neural surface reconstruc-
tion methods, including IDR [50], NeuS [44], VolSDF [49], and UniSurf [32];
3) a widely used classic MVS method COLMAP [35], where we reconstruct a
mesh from the output point cloud of COLMAP with Screened Poisson Surface
Reconstruction [16]. All the methods take three images as input.
5.1 Comparisons
Quantitative comparisons. We perform quantitative comparisons with the SOTA methods on the DTU dataset. We measure the Chamfer Distances between the predicted meshes and the ground truth point clouds, and report the results in Table 1. The results show that our method outperforms the SOTA methods by a large margin in both the generic setting and the per-scene optimization setting. Our results obtained by per-scene fine-tuning with 10k iterations (20 mins) show remarkable improvements over those of per-scene optimization methods. Note
that IDR [50] needs extra object masks for per-scene optimization while the others do not; we provide the results of IDR for reference.
We further perform fine-tuning with 10k iterations for IBRNet and MVSNerf with the three input images. With fine-tuning, the results of IBRNet are improved compared with its generic setting but are still worse than our fine-tuned results. MVSNerf fails to perform fine-tuning with the three input images, and therefore no meaningful geometries can be extracted. Furthermore, we observe that MVSNerf usually needs more than 10 images to perform a successful fine-tuning, so the failure might be caused by the radiance ambiguity problem.
Qualitative comparisons. We conduct qualitative comparisons with MVSNerf [3], COLMAP [35] and NeuS [44] on the DTU [11] and BlendedMVS [48] datasets. As shown in Figure 3, our results obtained via network inference are much smoother and less noisy than those of MVSNerf. The extracted meshes of MVSNerf are noisy because its density-based implicit field representation does not sufficiently constrain the level sets of the 3D geometry surfaces.
After a short per-scene fine-tuning, our results are largely improved with fine-grained details and become more accurate and cleaner. Compared with the results of COLMAP, our results are more complete, especially for objects with weak textures. With only three input images, NeuS suffers from the radiance ambiguity problem and its geometry surfaces are distorted and incomplete.
To validate the generalizability and robustness of our method, we further perform a cross-dataset evaluation on the BlendedMVS dataset. As shown in Figure 4, although our method is not trained on BlendedMVS, our generic model shows strong generalizability and produces cleaner and more complete results than those of MVSNerf. Taking the fourth scene in Figure 4 as an example, our method successfully recovers subtle details like the hose, while COLMAP misses the fine-grained geometry. For scenes with weak textures, NeuS can only produce rough shapes and struggles to recover the details of the geometry.
detects the regions lacking photographic consistency and excludes these regions from the fine-tuning. As shown in (b) of Table 2 and Figure 5, the result using the consistency-aware scheme is noticeably better than the one without it: it is cleaner and free of distorted geometry.
Per-scene optimization with or without priors. Owing to the good initialization provided by the learned priors, the per-scene optimization of our method converges much faster and avoids the sub-optimal solutions caused by the radiance ambiguity problem. To validate the effectiveness of the learned priors, we directly perform an optimization without using the learned priors. As shown in Figure 6, the Chamfer Distance of the result with priors is 1.65 while that without prior-based initialization is 1.98. Obviously, the result with learned priors is more complete and smooth, in stark contrast to the direct optimization.
Fig. 6: Per-scene optimization with or without priors: a) w/ priors (Chamfer distance 1.65); b) w/o priors (Chamfer distance 1.98).
6 Conclusions
We propose SparseNeuS, a novel neural rendering based surface reconstruction method to recover surfaces from multi-view images. Our method generalizes to new scenes and produces high-quality reconstructions from sparse images, a setting that prior works [44,49,32] struggle with. To make our method generalize to new scenes, we introduce geometry encoding volumes to encode geometry information for generic geometry reasoning. Moreover, a series of strategies are proposed to handle the difficult sparse-view setting. First, we propose a multi-level geometry reasoning framework to recover the surfaces in a coarse-to-fine manner. Second, we adopt a multi-scale color blending scheme, which jointly evaluates local and contextual radiance consistency for more reliable color prediction. Third, a consistency-aware fine-tuning scheme is used to handle the inconsistent regions caused by occlusion and image noise, yielding accurate and clean reconstructions. Experiments show that our method achieves better performance than the state of the art in both reconstruction quality and computational efficiency. Due to the adopted signed distance field, our method can only produce closed-surface reconstructions. Possible future directions include utilizing other representations, such as unsigned distance fields, to reconstruct open-surface objects.
Acknowledgements
We thank the reviewers for their valuable feedback. Xiaoxiao Long is supported by the Hong Kong PhD Fellowship Scheme.
References
1. Atzmon, M., Lipman, Y.: Sal: Sign agnostic learning of shapes from raw data.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 2565–2574 (2020)
2. Campbell, N.D., Vogiatzis, G., Hernández, C., Cipolla, R.: Using multiple hy-
potheses to improve depth-maps for multi-view stereo. In: European Conference
on Computer Vision. pp. 766–779. Springer (2008)
3. Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., Su, H.: Mvsnerf: Fast
generalizable radiance field reconstruction from multi-view stereo. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision. pp. 14124–14133
(2021)
4. Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 5939–5948 (2019)
5. Chibane, J., Bansal, A., Lazova, V., Pons-Moll, G.: Stereo radiance fields (srf):
Learning view synthesis for sparse views of novel scenes. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7911–
7920 (2021)
6. Darmon, F., Bascle, B., Devaux, J.C., Monasse, P., Aubry, M.: Improving neural
implicit surfaces geometry with patch warping. arXiv preprint arXiv:2112.09648
(2021)
7. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE
transactions on pattern analysis and machine intelligence 32(8), 1362–1376 (2009)
8. Galliani, S., Lasinger, K., Schindler, K.: Massively parallel multiview stereopsis by
surface normal diffusion. In: Proceedings of the IEEE International Conference on
Computer Vision. pp. 873–881 (2015)
9. Gropp, A., Yariv, L., Haim, N., Atzmon, M., Lipman, Y.: Implicit geometric reg-
ularization for learning shapes. arXiv preprint arXiv:2002.10099 (2020)
10. Gu, X., Fan, Z., Zhu, S., Dai, Z., Tan, F., Tan, P.: Cascade cost volume for
high-resolution multi-view stereo and stereo matching. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2495–
2504 (2020)
11. Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view
stereopsis evaluation. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 406–413 (2014)
12. Ji, M., Gall, J., Zheng, H., Liu, Y., Fang, L.: Surfacenet: An end-to-end 3d neu-
ral network for multiview stereopsis. In: Proceedings of the IEEE International
Conference on Computer Vision. pp. 2307–2315 (2017)
13. Ji, M., Zhang, J., Dai, Q., Fang, L.: Surfacenet+: An end-to-end 3d neural network
for very sparse multi-view stereopsis. IEEE Transactions on Pattern Analysis and
Machine Intelligence 43(11), 4078–4093 (2020)
14. Jiang, Y., Ji, D., Han, Z., Zwicker, M.: Sdfdiff: Differentiable rendering of signed
distance fields for 3d shape optimization. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition. pp. 1251–1261 (2020)
15. Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. Advances in
neural information processing systems 30 (2017)
16. Kazhdan, M., Hoppe, H.: Screened poisson surface reconstruction. ACM Transac-
tions on Graphics (ToG) 32(3), 1–13 (2013)
17. Kellnhofer, P., Jebe, L.C., Jones, A., Spicer, R., Pulli, K., Wetzstein, G.: Neural
lumigraph rendering. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 4287–4297 (2021)
18. Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. International
journal of computer vision 38(3), 199–218 (2000)
19. Lhuillier, M., Quan, L.: A quasi-dense approach to surface reconstruction from un-
calibrated images. IEEE transactions on pattern analysis and machine intelligence
27(3), 418–433 (2005)
20. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 2117–2125 (2017)
21. Liu, L., Gu, J., Zaw Lin, K., Chua, T.S., Theobalt, C.: Neural sparse voxel fields.
Advances in Neural Information Processing Systems 33, 15651–15663 (2020)
22. Liu, S., Zhang, Y., Peng, S., Shi, B., Pollefeys, M., Cui, Z.: Dist: Rendering deep
implicit signed distance function with differentiable sphere tracing. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
2019–2028 (2020)
23. Liu, Y., Peng, S., Liu, L., Wang, Q., Wang, P., Theobalt, C., Zhou, X.,
Wang, W.: Neural rays for occlusion-aware image-based rendering. arXiv preprint
arXiv:2107.13421 (2021)
24. Lombardi, S., Simon, T., Saragih, J., Schwartz, G., Lehrmann, A., Sheikh, Y.:
Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint
arXiv:1906.07751 (2019)
25. Long, X., Lin, C., Liu, L., Li, W., Theobalt, C., Yang, R., Wang, W.: Adaptive
surface normal constraint for depth estimation. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 12849–12858 (2021)
26. Long, X., Liu, L., Li, W., Theobalt, C., Wang, W.: Multi-view depth estimation
using epipolar spatio-temporal networks. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition. pp. 8258–8267 (2021)
27. Long, X., Liu, L., Theobalt, C., Wang, W.: Occlusion-aware depth estimation with
adaptive normal constraints. In: European Conference on Computer Vision. pp.
640–657. Springer (2020)
28. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy
networks: Learning 3d reconstruction in function space. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4460–
4470 (2019)
29. Michalkiewicz, M., Pontes, J.K., Jack, D., Baktashmotlagh, M., Eriksson, A.: Im-
plicit surface representations as layers in neural networks. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision. pp. 4743–4752 (2019)
30. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: Eu-
ropean conference on computer vision. pp. 405–421. Springer (2020)
31. Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric
rendering: Learning implicit 3d representations without 3d supervision. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion. pp. 3504–3515 (2020)
32. Oechsle, M., Peng, S., Geiger, A.: Unisurf: Unifying neural implicit surfaces and
radiance fields for multi-view reconstruction. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. pp. 5589–5599 (2021)
33. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: Learning
continuous signed distance functions for shape representation. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 165–
174 (2019)
34. Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu:
Pixel-aligned implicit function for high-resolution clothed human digitization. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. pp.
2304–2314 (2019)
35. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings
of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113
(2016)
36. Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection
for unstructured multi-view stereo. In: European Conference on Computer Vision.
pp. 501–518. Springer (2016)
37. Seitz, S.M., Dyer, C.R.: Photorealistic scene reconstruction by voxel coloring. In-
ternational Journal of Computer Vision 35(2), 151–173 (1999)
38. Sitzmann, V., Thies, J., Heide, F., Nießner, M., Wetzstein, G., Zollhofer, M.:
Deepvoxels: Learning persistent 3d feature embeddings. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2437–
2446 (2019)
39. Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Con-
tinuous 3d-structure-aware neural scene representations. Advances in Neural In-
formation Processing Systems 32 (2019)
40. Sun, J., Xie, Y., Chen, L., Zhou, X., Bao, H.: Neuralrecon: Real-time coherent 3d
reconstruction from monocular video. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition. pp. 15598–15607 (2021)
41. Tang, H., Liu, Z., Zhao, S., Lin, Y., Lin, J., Wang, H., Han, S.: Searching efficient
3d architectures with sparse point-voxel convolution. In: European conference on
computer vision. pp. 685–702. Springer (2020)
42. Tola, E., Strecha, C., Fua, P.: Efficient large-scale multi-view stereo for ultra high-
resolution image sets. Machine Vision and Applications 23(5), 903–920 (2012)
43. Trevithick, A., Yang, B.: Grf: Learning a general radiance field for 3d representa-
tion and rendering. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 15182–15192 (2021)
44. Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: Neus: Learn-
ing neural implicit surfaces by volume rendering for multi-view reconstruction.
Advances in Neural Information Processing Systems 34 (2021)
45. Wang, Q., Wang, Z., Genova, K., Srinivasan, P.P., Zhou, H., Barron, J.T., Martin-
Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based
rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 4690–4699 (2021)
46. Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstruc-
tured multi-view stereo. In: Proceedings of the European Conference on Computer
Vision (ECCV). pp. 767–783 (2018)
47. Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent mvsnet for high-
resolution multi-view stereo depth inference. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 5525–5534 (2019)
48. Yao, Y., Luo, Z., Li, S., Zhang, J., Ren, Y., Zhou, L., Fang, T., Quan, L.: Blended-
mvs: A large-scale dataset for generalized multi-view stereo networks. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 1790–1799 (2020)
49. Yariv, L., Gu, J., Kasten, Y., Lipman, Y.: Volume rendering of neural implicit
surfaces. Advances in Neural Information Processing Systems 34 (2021)
50. Yariv, L., Kasten, Y., Moran, D., Galun, M., Atzmon, M., Ronen, B., Lipman, Y.:
Multiview neural surface reconstruction by disentangling geometry and appear-
ance. Advances in Neural Information Processing Systems 33, 2492–2502 (2020)
51. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from
one or few images. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 4578–4587 (2021)
52. Zhang, J., Yao, Y., Quan, L.: Learning signed distance field for multi-view sur-
face reconstruction. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. pp. 6525–6534 (2021)
Supplementary Materials for
SparseNeuS: Fast Generalizable Neural Surface
Reconstruction from Sparse Views
1 The University of Hong Kong   2 Tencent Games   3 Texas A&M University
The local plane (q, n_q) of a sampled point induces a homography between the query (reference) view r and the i-th input view:
H=K\left (R_{i}+\frac {{t}_{i} n_{q}^{T} R_{r}^{T}}{n_{q}^{T}\left (q+R_{r}^{T} t_{r}\right )}\right ) K^{-1}, (1)
sampled in the query ray r to generate the final predicted patch colors of the
ray.
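A sketch of how this homography could be applied is given below; it implements Eq. (1) literally, and the roles of (R_i, t_i) and (R_r, t_r) as the extrinsics of the i-th input view and of the reference/query view, as well as the shared intrinsics K, are assumptions based on the main text.

```python
import torch

def plane_induced_homography(K, R_i, t_i, R_r, t_r, n_q, q):
    """Homography H of Eq. (1) induced by the local plane (q, n_q).

    K:        (3, 3) camera intrinsics (assumed shared by the two views).
    R_i, t_i: (3, 3), (3, 1) extrinsics of the i-th input view.
    R_r, t_r: (3, 3), (3, 1) extrinsics of the reference/query view r.
    n_q, q:   (3, 1) local plane normal and the sampled point.
    """
    denom = n_q.t() @ (q + R_r.t() @ t_r)                      # scalar n_q^T (q + R_r^T t_r)
    H = K @ (R_i + (t_i @ n_q.t() @ R_r.t()) / denom) @ torch.inverse(K)
    return H

def warp_pixels(H, uv):
    """Apply H to (P, 2) pixel coordinates uv (homogeneous warp + dehomogenization)."""
    ones = torch.ones(uv.shape[0], 1, dtype=uv.dtype, device=uv.device)
    uvh = torch.cat([uv, ones], dim=-1) @ H.t()                # (P, 3)
    return uvh[:, :2] / uvh[:, 2:3]
```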
Network details. Feature Pyramid Network [20] is used as the image feature extraction network to extract multi-scale features from the input images. We implement the sparse 3D CNN networks using a U-Net like architecture, and use torchsparse [41] as the implementation of 3D sparse convolution. The signed distance function (SDF) network f_θ is modeled by an MLP consisting of 4 hidden layers with a hidden size of 256. The blending network f_c used in fine-tuning is modeled by an MLP consisting of 3 hidden layers with a hidden size of 256. Positional encoding [30] is applied to 3D locations with 6 frequencies and to view directions with 4 frequencies. Same as NeuS [44], we adopt a hierarchical sampling strategy to sample points along the query ray for volume rendering, where the numbers of coarse and fine samples are both 64.
Training parameters. The loss weights of the total loss (Eq. 4) are set to α = 0.1 and β = 0.02. The SDF scaling parameter τ of the sparseness loss term (Eq. 6) is set to 100. For the consistency-aware color loss term (Eq. 3) used in fine-tuning, λ_0 is set to 0.01 and λ_1 to 0.015 by default. The ratio λ_0/λ_1 sometimes needs to be tuned per scene for better reconstruction results: decreasing the ratio λ_0/λ_1 leads to more regions being kept, while increasing it excludes more regions and yields cleaner surfaces.
Data preparation. We observe that the images of the DTU dataset contain large black backgrounds and that these regions contain considerable image noise. Hence, we utilize a simple threshold-based denoising strategy to clean the images of the training scenes. We first detect the pixels whose intensities are smaller than a threshold τ = 10 as invalid black regions, thereby yielding a mask for each image. The mask is then processed by image dilation and erosion operations to reduce isolated outliers. Finally, we evaluate the areas of the connected components in the masks, and only keep the connected components whose areas are larger than s, where s is set to 10% of the whole image. Given the masks, the detected invalid black regions are set to 0. With this simple denoising operation, the noise in the black background regions of the DTU training images is mostly removed.
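A sketch of this denoising step with OpenCV is shown below; the intensity threshold (10) and the 10% area ratio follow the text, while the morphology kernel size is an assumption.

```python
import cv2
import numpy as np

def denoise_black_background(image, intensity_thresh=10, area_ratio=0.10):
    """Zero out noisy black-background regions of a DTU training image.

    image: (H, W, 3) uint8 RGB image. Pixels darker than the threshold are
    candidate background pixels; the mask is cleaned by dilation and erosion,
    and only connected components covering more than `area_ratio` of the image
    are kept before the detected regions are set to 0.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    mask = (gray < intensity_thresh).astype(np.uint8)

    kernel = np.ones((5, 5), np.uint8)              # kernel size is an assumption
    mask = cv2.dilate(mask, kernel, iterations=1)
    mask = cv2.erode(mask, kernel, iterations=1)

    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    min_area = area_ratio * image.shape[0] * image.shape[1]
    cleaned = np.zeros_like(mask)
    for i in range(1, num):                         # label 0 is the unmasked region
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == i] = 1

    out = image.copy()
    out[cleaned.astype(bool)] = 0                   # set detected background regions to zero
    return out, cleaned
```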
3 More experiments