
2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

A generic and flexible regularization framework for NeRFs



Thibaud Ehret, Roger Marí, Gabriele Facciolo


Université Paris-Saclay, CNRS, ENS Paris-Saclay, Centre Borelli, 91190, Gif-sur-Yvette, France
[email protected]

Abstract

Neural radiance fields, or NeRF, represent a breakthrough in the field of novel view synthesis and 3D modeling of complex scenes from multi-view image collections. Numerous recent works have shown the importance of making NeRF models more robust, by means of regularization, in order to train with possibly inconsistent and/or very sparse data. In this work, we explore how differential geometry can provide elegant regularization tools for robustly training NeRF-like models, which are modified so as to represent continuous and infinitely differentiable functions. In particular, we present a generic framework for regularizing different types of NeRF observations to improve the performance in challenging conditions. We also show how the same formalism can be used to natively encourage the regularity of surfaces by means of Gaussian or mean curvatures.

Figure 1. We propose a generic regularization framework for NeRF that outperforms previous state-of-the-art methods when training with only three input views. We compare here the proposed DiffNeRF with depth regularization (top), DiffNeRF with normals regularization (middle) and RegNeRF [26]. Left to right: RGB prediction, depth map, map of normals.

1. Introduction

Realistic rendering of new views of a 3D scene or a given volume is a long-standing problem in computer graphics. The interest in this problem has been rekindled by the growth of augmented and virtual reality. Traditionally, 3D scenes were estimated from a set of images using classic Structure-from-Motion (SfM) and Multi-View Stereo (MVS) tools such as COLMAP [36] or [12, 25, 34, 39].

Recently, Mildenhall et al. [24] have shown that differentiable volume rendering operations can be plugged into a neural network to learn a neural radiance field (NeRF) volumetric representation of a scene encoding its geometry and appearance. Starting from a sparse, yet nonetheless large, set of views of the scene, NeRF learns in a self-supervised manner, by maximizing the photo-consistency across the predicted renderings corresponding to the available viewpoints. After convergence, the network is able to render realistic novel views by querying the NeRF function at unseen viewpoints.

This breakthrough led to a very active research field focused on pushing back the limits of the initial models. Notably, Martin-Brualla et al. [21] showed that it is possible to learn scenes from unconstrained photos with moving transient objects or different lighting conditions. Other works deal with dynamic or deformable scenes [19, 28–30, 43], complex illumination models [3, 20, 40, 49] or very few training views [7, 16, 18, 47]. In other words, the goal is to make NeRF more robust, so that it can be trained reliably even in the most adverse conditions. For example, imposing regularity constraints on the scene seems to be a promising way to reduce reliance on large datasets [26].

The objective of this work is to show how one can adapt differential geometry concepts to elegantly incorporate regularizers that make NeRF more robust. The advantages are twofold: first, differential geometry is mathematically formalized and its literature is vast, with many suitable tools already available; second, neural representations are perfectly adapted to represent continuous, infinitely differentiable volumetric functions in which differential geometry operators are naturally defined.

To this aim, we present a generic framework based on differential geometry to regularize different types of NeRF observations.

We derive the two specific cases of depth regularization, thus linking to the previously proposed RegNeRF [26], as well as normals regularization, in Section 3. We also show in Section 4 that this approach is not only competitive with the state of the art for the problem of training a NeRF model when few images (for example only three) are available, but also that it produces smoother and more accurate depth maps. Finally, we straightforwardly extend the proposed framework to surface regularization in Section 5, showing the generality of the proposed approach.

2. Related Work

2.1. Fundamentals of Neural Radiance Fields

NeRF was originally introduced as a continuous volumetric function F, learned by a multi-layer perceptron (MLP), to model the geometry and appearance of a 3D scene [24, 42]. Given a 3D point x ∈ R³ of the scene and a viewing direction v ∈ R², NeRF predicts an associated RGB color c ∈ [0, 1]³ and a scalar volume density σ ∈ [0, ∞), i.e.

    F : (x, v) → (c, σ).   (1)

The value of σ defines the geometry of the scene and is learned exclusively from the spatial coordinates x, while the value of c is also dependent on the viewing direction v, which makes it possible to recreate non-Lambertian surface reflectance.

NeRF models are trained based on a classic differentiable volume rendering operation [22], which establishes the resulting color of any ray passing through the scene volume and projected onto a camera system. Each ray r(t) = o + tv with t ∈ R⁺, defined by a point of origin o and a direction vector v, is discretized into N 3D points {x_i}_{i=1}^N, where x_i = o + t_i v with t_i sampled between the minimum and maximum depth. The sampling depends on the method. The rendered color c(r) of a ray r is obtained as

    c(r) = \sum_{i=1}^{N} T_i \alpha_i c_i,   (2)

where

    T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \quad \text{and} \quad \alpha_i = 1 - \exp(-\sigma_i (t_{i+1} - t_i)).   (3)

In (2), c_i and σ_i correspond to the color and volume density output by the MLP at the i-th point of the ray, i.e. F(x_i, v) as per (1). The transmittance T_i and opacity α_i are two factors that weight the contribution of the i-th point to the rendered color according to the geometry described by σ_i and σ_j, j < i, to handle occlusions.

Using the transmittance T_i and opacity α_i, the observed depth d(r) in the direction of a ray r can be rendered in a similar manner to (2) [11, 32] as

    d(r) = \sum_{i=1}^{N} T_i \alpha_i t_i.   (4)

Following this volume rendering logic, the NeRF function F is optimized by minimizing the squared error between the rendered color and the real colors of a batch of rays R that project onto a set of training views of the scene taken from different viewpoints: L_RGB = \sum_{r \in R} \| c(r) - c_{GT}(r) \|_2^2, where c_GT(r) is the observed color of the pixel intersected by the ray r, and c(r) is the NeRF prediction (2). The rays in each batch R are chosen randomly from the available views, encouraging gradient flow where rays cast from different viewpoints intersect with consistent scene content.

mip-NeRF: The original NeRF approach casts a single ray per pixel [24]. When the training images observe the scene at different resolutions, this can lead to blurred or aliased renderings. To prevent such situations, the mip-NeRF formulation [2] can be adopted, which casts a cone per pixel instead. As a result, mip-NeRF is queried in terms of conical frustums and not discrete points, yielding a continuous and natively multiscale representation of regions of the volume space.
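The rendering model of Eqs. (2)–(4) maps almost line by line onto tensor code. The following sketch (PyTorch, with hypothetical per-ray inputs sigma, rgb and sample depths t; it is not the authors' released implementation) shows how the opacities, transmittances, rendered color and rendered depth, as well as the photometric loss L_RGB, are typically assembled.

```python
import torch

def render_color_and_depth(sigma, rgb, t):
    """Composite per-sample predictions into a ray color and depth (Eqs. (2)-(4)).

    sigma: (R, N)    volume densities at the N samples of R rays
    rgb:   (R, N, 3) colors at the samples
    t:     (R, N+1)  sample depths along each ray, so that t[:, i+1] - t[:, i]
                     is the spacing used in alpha_i = 1 - exp(-sigma_i (t_{i+1} - t_i))
    """
    delta = t[:, 1:] - t[:, :-1]                       # (R, N) spacings t_{i+1} - t_i
    alpha = 1.0 - torch.exp(-sigma * delta)            # opacities, Eq. (3)
    # exclusive cumulative product: T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                            # T_i * alpha_i
    color = (weights[..., None] * rgb).sum(dim=1)      # c(r), Eq. (2)
    depth = (weights * t[:, :-1]).sum(dim=1)           # d(r), Eq. (4)
    return color, depth

def photometric_loss(pred_color, gt_color):
    """L_RGB: squared error between rendered and observed colors, averaged over the batch."""
    return ((pred_color - gt_color) ** 2).sum(dim=-1).mean()
```

Because every operation above is differentiable, the same weights T_i α_i can be reused to render depth, normals or any other per-sample quantity, which is what the regularizers of Section 3 rely on.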
2.2. Regularization in Few-shot Neural Rendering

The original NeRF methodology was demonstrated using 20 to 62 views for real-world static scenes, and 100 views or more for synthetic static scenes [24]. In the absence of large datasets, the MLP usually overfits to each training image when only a few are available, resulting in unrealistic interpolation for novel view synthesis and poor geometry estimates.

A number of NeRF variants have been recently proposed to address few-shot neural rendering. The use of regularization techniques is common in these variants, to achieve smoother results in unobserved areas of the scene volume or with radiometrically inconsistent observations [18].

Implicit/indirect regularization methods rely on geometry and appearance priors learned by pre-trained models. PixelNeRF [47] introduced a framework that can be trained across multiple scenes, thus acquiring the ability to generalize to unseen environments. The MLP learns generic operations while keeping the output conditioned to scene-specific content thanks to an additional input feature vector, extracted by a pre-trained convolutional neural network (CNN). Similarly, DietNeRF [16] complements the NeRF loss L_RGB of Section 2.1 with a secondary term that encourages similarity between pre-trained CNN high-level features in renderings of known and unknown viewpoints. Other approaches like GRAF [37], π-GAN [5], Pix2NeRF [4] or LOLNeRF [31] combine NeRF with generative models: latent codes are mapped to an instance of a radiance field of a certain pre-learned category (e.g. faces, cars), thus providing a reasonable 3D model regardless of the number of available views.

Explicit/direct regularization methods can be divided into externally supervised and self-supervised. Self-supervised variants incorporate constraints to enforce smoothness between neighboring samples in space, such as RegNeRF [26] (see Section 3). InfoNeRF [18] prevents inconsistencies due to insufficient viewpoints by minimizing a ray entropy model and the KL-divergence between the normalized ray density obtained from neighbor viewpoints. In contrast, externally supervised regularization methods usually penalize differences with respect to extrinsic geometric cues. Depth-supervised NeRF [11] encourages the rendered depth (4) to be consistent with a sparse set of 3D surface points obtained by structure from motion. A similar strategy is used in [20], based on a set of 3D points refined by bundle adjustment, or [32], where a sparse point cloud is converted into dense depth priors by means of a depth completion network.

3. A generic regularization framework

One of the major challenges when training a NeRF with insufficient data is to learn a consistent scene geometry so that the model extrapolates well to unseen views. In that case, it is common to add additional priors to the model to improve the quality of the learned models.

A classic hypothesis in depth and disparity estimation is that the target is smooth [15, 35]. The same prior can be applied to the scene modeled by the NeRF. Due to the ability of NeRFs to model transparent surfaces and volumes, the predicted weights can be highly irregular. As a consequence, it is easier to regularize across different rendered viewpoints (i.e. after projection onto a given camera) rather than directly regularizing the 3D scene itself. This means that instead of using the depth function d from Eq. (4), it is more appropriate to work with the depth map d̃ produced by the NeRF model from a given viewpoint. This depth map d̃ is then indexed by its 2D coordinates (x, y) instead of a ray in 3D space.

In image processing, a classic way of enforcing smoothness is to add a regularization term in the loss function based on the gradients of the image. For example, penalizing the squared L2 norm of the gradients has the effect of removing high gradients in the depth map, thus enforcing it to be smooth, as desired. In addition, it does not penalize slanted surfaces (since they have null Laplacian), as would happen in the case of a total variation regularization [33]. The proposed regularization term thus reads

    \mathcal{L}_{depth} = \sum_{(x,y)} \mathrm{clip}\big( \|\nabla \tilde{d}(x, y)\|^2,\; g_{\max} \big).   (5)

In practice, we add a differentiable clipping to L_depth, parametrized by g_max, to preserve sharp edges that could otherwise be over-smoothed.

By only changing the ReLU activation function to a Softplus activation function, the MLP used in NeRF can be transformed into a continuous and infinitely differentiable function, similarly to [14]. This allows training directly using the gradient of the model, or even higher order operators as shown later.

Traditionally, NeRFs are defined in terms of rays, which are characterized by an origin and a viewing direction (o, v). Consequently, d from (4) is parameterized by (o, v) instead of the image coordinates (x, y) as d̃ in (5). Let C : R² → R³ be the transformation that converts the image coordinates into the equivalent ray, so that d̃(x, y) = d(o, C(x, y)). Then the corresponding gradients are

    \nabla_{(x,y)} \tilde{d}(x, y) = J_C(x, y)\, \nabla_v d(o, v),   (6)

where v = C(x, y) and J_C is the Jacobian matrix of C. This way, Eq. (5) can be expressed in terms of rays, with the exception of J_C, which could be computed at the same time as the corresponding rays during the dataloading process. In practice, we use a simplified regularization loss that avoids computing J_C (see Eq. (11)).

Link with RegNeRF. In order to improve the robustness of NeRFs when training with few data, Niemeyer et al. [26] proposed RegNeRF, which also uses an additional term in the loss function to regularize the predicted depth map. This work additionally proposed an appearance regularization term using a normalizing flow network trained to estimate the likelihood of a predicted patch compared to normal patches from the JFT-300M dataset [41]. While the latter is not studied here, we show that their depth regularization term is simply an approximation of the more generic differential loss presented in Eq. (5).

Consider the depth map d̃ and the set of coordinates (x, y) that correspond to the pixels of the depth map. RegNeRF regularizes depth by encouraging neighboring pixels (x + i, y + j), for (i, j) ∈ {0, 1}² with i + j = 1, to have the same depth as the pixel (x, y), such that

    \mathcal{L}_{depth} = \sum_{(x,y)} \sum_{\substack{(i,j) \in \{0,1\}^2 \\ i+j=1}} \big( \tilde{d}(x+i, y+j) - \tilde{d}(x, y) \big)^2,   (7)

which is a finite difference expression of the gradient of d̃. Thus the major difference between (7) and our approach is that (7) approximates the gradient with finite differences while we take advantage of automatic differentiation.

In practice, RegNeRF regularization is not done on the entire depth maps but rather by sampling patches. The loss (7) is computed not only for all patches corresponding to a view in the training dataset, but also for rendered patches whose observation is not available. Indeed, all views should verify this depth regularity property, not only those in the training data. As a result, RegNeRF requires modifying the dataloaders to incorporate patch-based sampling and rays corresponding to unseen views. Note that our depth regularization term (5) does not require patches and can therefore be directly applied using single-ray sampling, as traditionally done to train NeRFs. This makes the proposed framework compatible with previous single-ray regularization methods, such as InfoNeRF [18], and with non-uniform ray sampling, which is important when working with 360° images [27]. It also does not regularize unseen views, as explained in Section 4.
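To make the contrast concrete, here is a minimal sketch (all identifiers such as depth_patch, mlp, depth_fn and xy are illustrative; this is not the released code) of the two options: the RegNeRF-style finite-difference penalty of Eq. (7) on a rendered depth patch, and the differential penalty of Eq. (5) obtained by automatic differentiation of a Softplus-activated depth function with respect to the image coordinates.

```python
import torch

def finite_difference_depth_loss(depth_patch):
    """RegNeRF-style regularizer, Eq. (7): squared differences between each pixel
    and its right/bottom neighbors in a rendered depth patch of shape (H, W)."""
    dx = depth_patch[:, 1:] - depth_patch[:, :-1]
    dy = depth_patch[1:, :] - depth_patch[:-1, :]
    return (dx ** 2).sum() + (dy ** 2).sum()

# A Softplus MLP is smooth, so its exact gradient (and higher-order derivatives)
# with respect to its inputs exist everywhere, unlike with ReLU activations [14].
mlp = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Softplus(),
    torch.nn.Linear(64, 64), torch.nn.Softplus(),
    torch.nn.Linear(64, 1),
)

def autodiff_depth_loss(depth_fn, xy, g_max):
    """Differential regularizer, Eq. (5): clipped squared norm of the exact gradient
    of the depth map. depth_fn maps (B, 2) coordinates to (B,) depths; here min()
    stands in for the differentiable clipping parametrized by g_max."""
    xy = xy.clone().requires_grad_(True)
    d = depth_fn(xy)
    (grad_xy,) = torch.autograd.grad(d.sum(), xy, create_graph=True)
    sq_norm = (grad_xy ** 2).sum(dim=-1)
    return torch.minimum(sq_norm, torch.full_like(sq_norm, g_max)).sum()

# Example usage on a toy depth function:
# loss = autodiff_depth_loss(lambda p: mlp(p).squeeze(-1), torch.rand(1024, 2), g_max=20.0)
```

Because the second version queries single coordinates (or rays), it does not require patch-based dataloaders, which is the practical difference with RegNeRF discussed above.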

Normals regularization. The regularization term (5) relies on depth maps. However, differential geometry also allows us to regularize other geometry-related features when training a NeRF. For example, consider n, the function that returns the scene normals for a given ray, whose projection, or map of normals, is denoted ñ. In that case, the regularization of the normals of the scene becomes

    \mathcal{L}_{normals} = \sum_{(x,y)} \| J_{\tilde{n}}(x, y) \|_F^2,   (8)

where J_ñ is the Jacobian of the map of normals. This regularizer was applied to generate one of the results in Fig. 1.

Simplified regularization loss. The main problem with the loss presented in Eq. (5) is that it does not depend only on each individual ray, but also requires additional camera information to compute J_C. Since this can be impractical depending on the camera model, we propose to use a different and fixed local camera model only for the regularization process. Instead of using the usual perspective projection models associated with the training data, it is possible to regularize the scene as if the ray being processed originated from an orthographic projection camera, as illustrated in Fig. 2.

Figure 2. All perspective projection rays originate at the same center of projection o, located at a finite distance from the image plane. The center of projection in orthographic projection is at infinity, which can be represented by using a different origin for each ray, so that the origin points are parallel to the image plane.

Consider a ray defined by its origin o and its direction v. Let (i, j) be a local orthonormal basis of the plane defined by o and v. Using an orthographic projection camera, the direction is fixed and only the origin changes to obtain other rays from the same camera. Therefore C, defined such that d̃(x, y) = d(C(x, y), v), is explicit and C(x, y) = xi + yj. This leads to J_C(x, y) = \begin{pmatrix} i \\ j \end{pmatrix} ∈ R^{2×3}. Therefore

    \nabla_{(x,y)} \tilde{d}(x, y) = \begin{pmatrix} \langle \nabla_o d(o, v), i \rangle \\ \langle \nabla_o d(o, v), j \rangle \end{pmatrix}   (9)

and ||∇d̃(x, y)||² = ⟨∇_o d(o, v), i⟩² + ⟨∇_o d(o, v), j⟩². Since (i, j, v) is, by construction, an orthonormal basis of the space, we also have that ||∇_o d(o, v)||² = ⟨∇_o d(o, v), i⟩² + ⟨∇_o d(o, v), j⟩² + ⟨∇_o d(o, v), v⟩², thus

    \mathcal{L}_{depth} = \sum_{(o,v) \in R} \|\nabla_o d(o, v)\|^2 - \langle \nabla_o d(o, v), v \rangle^2   (10)
                        = \sum_{(o,v) \in R} \big\| \nabla_o d(o, v) - \langle \nabla_o d(o, v), v \rangle\, v \big\|^2.   (11)

Note how Eq. (11) does not depend on the choice of (i, j), is entirely defined by the knowledge of the ray (o, v), and is independent from J_C(x, y).
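As an illustration (a sketch, not the released implementation), Eq. (11) can be evaluated per ray with a single autograd call, by differentiating the rendered depth with respect to the ray origins and removing the component of the gradient along the viewing direction. The callable render_depth is a placeholder for a routine that renders d(o, v) as in Eq. (4) from a Softplus-activated NeRF; the same pattern, applied to a rendered normal map, gives the regularizer of Eq. (8).

```python
import torch

def diffnerf_depth_loss(render_depth, origins, dirs, g_max):
    """Simplified ray-space regularizer of Eq. (11).

    render_depth: callable (origins, dirs) -> (B,) rendered depths d(o, v)
    origins:      (B, 3) ray origins o
    dirs:         (B, 3) unit viewing directions v
    Penalizes || grad_o d(o, v) - <grad_o d(o, v), v> v ||^2, clipped as in Eq. (5).
    """
    origins = origins.clone().requires_grad_(True)
    d = render_depth(origins, dirs)
    (grad_o,) = torch.autograd.grad(d.sum(), origins, create_graph=True)
    along_v = (grad_o * dirs).sum(dim=-1, keepdim=True) * dirs   # <grad_o d, v> v
    ortho = grad_o - along_v                                     # orthogonal part, Eq. (11)
    sq_norm = (ortho ** 2).sum(dim=-1)
    return torch.minimum(sq_norm, torch.full_like(sq_norm, g_max)).mean()

# The regularizer is added to the photometric loss with a small weight; the paper
# reports a weight of 2e-4 and g_max = 20 for the LLFF experiments (Section 4):
# loss = l_rgb + 2e-4 * diffnerf_depth_loss(render_depth, origins, dirs, g_max=20.0)
```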
4. Experimental results

We test the impact of the proposed differential regularization on the task of scene estimation using only three input views. This is an extreme test case and, as such, it is highly reliant on the quality of the regularization to avoid catastrophic collapse, as shown by Niemeyer et al. [26] for mip-NeRF [2]. In order to compare the proposed formalization of RegNeRF [26] to its original version, we modified the code of the authors and replaced their depth loss by the one in (11). We refer to our approach as DiffNeRF. The code to reproduce the results presented in this section is available at https://github.com/tehret/diffnerf.

Results on LLFF [23]. In Table 1, we compare the results of the original RegNeRF (using the models trained by the authors) with our DiffNeRF formalization (11). Since the code released by the authors does not contain the additional appearance loss, we added another comparison that corresponds to RegNeRF without the additional appearance regularization (i.e. training from scratch using the available code). The proposed DiffNeRF not only improves the PSNR of reconstructed unseen views by 1 dB compared to the equivalent RegNeRF version, it also outperforms RegNeRF with appearance regularization by almost 0.5 dB. This is also the case for other metrics such as SSIM and LPIPS.

Visual results on two examples of the LLFF dataset are shown in Fig. 3. In both cases, we compare the proposed version with the models trained by Niemeyer et al. [26]. The horns scene in Fig. 3 shows a first example where our formalization outperforms RegNeRF across all evaluation metrics. The proposed method is able to learn a better geometry of the image, leading to a more complete reconstruction of the triceratops skull (see the horn on the right), but also of the rest of the scene, such as the sign panel in the foreground or the handrails in the background. Similar improvements can be observed in the trex scene.
                          fern   flower fortress horns  leaves orchids room   trex   avg.
PSNR
  PixelNeRF ft [47]         -      -      -       -      -      -       -      -     16.17
  SRF ft [8]                -      -      -       -      -      -       -      -     17.07
  MVSNeRF ft [6]            -      -      -       -      -      -       -      -     17.88
  RegNeRF (w/o app. reg.) 19.85  19.64  22.28   13.05  16.46  15.43   20.62  20.37   18.46
  DiffNeRF (ours)          20.15  20.27  23.68   17.80  16.88  15.50   21.04  20.58   19.49
  RegNeRF [26]             19.87  19.93  23.32   15.65  16.60  15.56   21.53  20.17   19.08
SSIM
  RegNeRF (w/o app. reg.)  0.694  0.677  0.706   0.486  0.599  0.483   0.843  0.774   0.658
  DiffNeRF (ours)          0.703  0.707  0.761   0.680  0.645  0.487   0.864  0.791   0.705
  RegNeRF [26]             0.697  0.688  0.743   0.610  0.613  0.502   0.861  0.766   0.685
LPIPS
  RegNeRF (w/o app. reg.)  0.323  0.243  0.294   0.341  0.229  0.259   0.204  0.197   0.261
  DiffNeRF (ours)          0.290  0.223  0.219   0.293  0.186  0.247   0.171  0.166   0.224
  RegNeRF [26]             0.304  0.234  0.258   0.356  0.222  0.251   0.185  0.197   0.251

Table 1. Quantitative comparison of novel view synthesis for different NeRF regularizations on the LLFF dataset. All models were trained using only three input views. RegNeRF (w/o app. reg.) corresponds to the original RegNeRF without appearance regularization, while the proposed framework is DiffNeRF. The results using RegNeRF with appearance regularization are also provided for reference. The proposed regularization almost systematically achieves the best results across all metrics without requiring any additional appearance regularization. The LPIPS metric is computed using the official implementation provided by Zhang et al. [48].

Scan                       8     21     30     31     34     38     40     41     45     55     63     82    103    110    114    avg
PixelNeRF ft [47]          -      -      -      -      -      -      -      -      -      -      -      -      -      -      -    18.95
SRF ft [8]                 -      -      -      -      -      -      -      -      -      -      -      -      -      -      -    15.68
MVSNeRF ft [6]             -      -      -      -      -      -      -      -      -      -      -      -      -      -      -    18.54
RegNeRF (w/o app. reg.)  19.06  12.42  22.45  16.35  18.13  16.92  18.63  15.97  16.29  17.75  20.57  17.54  22.10  17.97  21.31  18.23
DiffNeRF (ours)          15.47  13.63  23.18  16.74  18.66  17.28  18.57  15.53  16.45  17.94  21.65  15.19  23.69  20.32  21.41  18.38
RegNeRF [26]             19.45  12.76  22.92  16.84  18.24  17.12  19.09  18.41  16.44  17.61  22.91  19.42  22.95  18.06  21.52  18.92

Table 2. Quantitative comparison of novel view synthesis (PSNR) for different NeRF regularizations on the DTU dataset. All models were trained using only three input views. RegNeRF (w/o app. reg.) corresponds to the original RegNeRF without appearance regularization, while the proposed framework is DiffNeRF. The results using RegNeRF with appearance regularization are also provided for reference. The cases of scenes 41 and 82 are discussed in Section 4.

Fig. 1 shows another result, with the room scene of the LLFF dataset, where the PSNR obtained with DiffNeRF is worse than that of RegNeRF with appearance regularization. However, the depth map estimated by our formalism is still much smoother without losing details. In addition, as in the triceratops example, we can see that some details are also better reconstructed, like the audio conferencing system in the middle of the table. The LPIPS metric in Table 1 also seems to confirm that the DiffNeRF results present fewer visual artifacts than RegNeRF. Both Fig. 1 and Fig. 3 show that the DiffNeRF depth maps are better regularized than the original RegNeRF. In DiffNeRF we only use the input views at training time, without regularizing unseen views or requiring patch-based dataloaders with a predefined patch size (as in RegNeRF). This shows that the proposed formalism yields a better generalization. All experiments with LLFF were computed using a weight of 2e−4 for the regularization term with a clipping value g_max = 20.

Results on DTU [17]. Table 2 and Fig. 4 present results on the DTU dataset. Again, DiffNeRF produces results with a smoother scene geometry. All experiments with DTU were computed using a weight of 2e−4 with a clipping value g_max = 5.

Parameter study. We illustrate in Fig. 5 the impact of the two parameters of the proposed regularization: the weight of the regularization term in the loss and the value of the clipping. When the regularization is too weak, the surface exhibits irregular patterns. On the contrary, a regularization that is too strong can make details disappear (for example when parts of the pumpkin are merged with the background). A strong clipping allows recovering some details but can lead to visual artifacts such as staircasing. The proposed set of parameters leads to a smooth surface while keeping details.

Limitations. During the experiments, we observed two main limitations. The first one is the appearance of "floaters", groups of points with a non-zero density and disjoint from the scene. These floaters can hide portions of the scene when synthesizing novel views (see Fig. 6).

Figure 3. Visual examples of novel view synthesis for the horns (top) and trex (bottom) sequences of the LLFF dataset after training with three views. Columns, left to right: RegNeRF [26], RegNeRF (w/o app. reg.), DiffNeRF (ours), ground truth. The depth map produced by the proposed DiffNeRF is more regular than those produced by RegNeRF. It also recovers more details, both in the foreground (see the sign panel on the left or the triceratops' left horn) and in the background (see the glass panels and the handrails).

In the DTU dataset, we find these floating artifacts to be related to the large textureless background regions or to areas observed by a single camera (i.e. when the problem is not well defined; note that these regions are still regularized). We did not observe such floaters in the LLFF dataset. This also explains why regularizing unseen views, i.e. without any data attachment term, is not a good idea, since it encourages the creation of such floaters.

The second limitation is the computational performance. Since the proposed regularization requires computing gradients, it is expected to be slower and to require more memory. Nevertheless, since there is no need to regularize unseen views, the proposed method remains competitive (for the depth regularization). A comparison is shown in Table 3.

                           Rays/s     Batch size
RegNeRF [26]               ~ 6000     ~ 2000
DiffNeRF (depth reg.)      ~ 5000     ~ 1000
DiffNeRF (normals reg.)    ~ 1100     ~ 250

Table 3. Computation speed (in rays per second) for the different methods on an NVIDIA V100 16 GB GPU. Because DiffNeRF does not require sampling additional rays from unseen views, the computation is barely slower than RegNeRF [26] (~ 16%). Higher-order regularization (such as normals regularization) is much slower though.

5. Extension to surface regularization using mean and Gaussian curvatures

Another trend with NeRF-like models is to directly learn a surface model instead of a density function, as shown in Section 2.1. In particular, IDR [46] and VolSDF [45] both learn the surface by means of a signed distance function (SDF). This SDF can then be used in a direct manner or as a guide to sample points, as done in NeRF, to learn the surface.

Figure 4. Visual example of novel view synthesis for scenes 30 and 40 of the DTU dataset after training with three views. Columns, left to right: RegNeRF [26], RegNeRF (w/o app. reg.), DiffNeRF (ours), ground truth; rows: scan30, scan40. The depth map produced by the proposed DiffNeRF is more regular than those produced by RegNeRF. It also separates the object from the background better.

Figure 5. Visual impact of the two parameters of the regularization (reconstructions from three views). (a) strong regularization and clipping, (b) strong regularization and little clipping, (c) medium regularization and clipping, (d) little regularization and clipping.

Figure 6. Failure cases for scenes 41 and 82 of the DTU dataset reconstructed from three views. "Floaters" (groups of points with a non-zero density and disjoint from the scene) hide portions of the scene when synthesizing novel views.

Since this SDF can be seen as an implicit function F defining the surface S as the set of points {x ∈ R³ | F(x) = 0}, it is possible to compute other differential quantities related to surface regularity, such as the curvature. This allows us to directly regularize the surface instead of regularizing the projections of the scene as shown in Section 3. We propose in this section to look at the Gaussian curvature γ_gauss and the mean curvature γ_mean, since they both have an analytical expression that can be easily implemented using existing deep learning frameworks.

These curvatures are respectively defined as

    \gamma_{mean} = -\mathrm{div}\left( \frac{\nabla F}{\|\nabla F\|} \right)   (12)

and

    \gamma_{gauss} = \frac{\nabla F \times H^*(F) \times \nabla F^t}{\|\nabla F\|^4},   (13)

where H^* is the adjoint of the Hessian of F. Derivation details of these two curvatures can be found in [13]. Using (12) and (13), we can define a regularization loss similar to the one presented in Section 3 as

    \mathcal{L}_{curv}(\kappa_{curv}) = \mathbb{E}_{x \in S}\big[ \min(|\gamma(x)|, \kappa_{curv}) \big],   (14)

where γ can be either γ_mean or γ_gauss, depending on the preferred behavior, and κ_curv is a clipping value. The final loss to train a regularized VolSDF model using (14) becomes

    \mathcal{L} = \mathcal{L}_{RGB} + \lambda_{SDF} \mathcal{L}_{SDF} + \lambda_{curv} \mathcal{L}_{curv}(\kappa_{curv})   (15)

with

    \mathcal{L}_{SDF} = \mathbb{E}_{x \in \mathbb{R}^3}\big[ (\|\nabla F(x)\| - 1)^2 \big].   (16)

As in [45], the L_SDF term enforces the Eikonal constraint on the implicit function F, thus learning a signed distance function. Note that (15) makes it possible to regularize the surface directly during training instead of doing it in different stages as in [44].

The regularization is characterized by the same two parameters, the regularization weight and the clipping value, as the regularization presented in Section 3. To understand the impact of these parameters, we refer to the definition of the mean and Gaussian curvatures in terms of the minimum curvature γ_min and maximum curvature γ_max of the surface at a given point:

    \gamma_{mean} = \frac{\gamma_{min} + \gamma_{max}}{2} \quad \text{and} \quad \gamma_{gauss} = \gamma_{min}\, \gamma_{max}.   (17)

Although this is not a practical definition of the curvature, since it does not allow for direct computation, it shows that minimizing the mean curvature leads to surface smoothing [10]. On the other hand, minimizing the Gaussian curvature forces the minimum curvature to be zero, resulting in flat surfaces with sharp straight edges. The visual impact on the reconstructed surfaces is shown in the supplementary material. An example of a regularized reconstruction using Gaussian curvature is shown in Fig. 7.

Figure 7. Visual example of a regularized reconstruction of scene 40 of the DTU dataset. From left to right: regularized reconstruction using Gaussian curvature (13), original VolSDF results and ground truth.

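As a sketch of how these operators translate into code (assuming an SDF network F that maps a batch of 3D points to signed distances; this is not the authors' implementation), the mean curvature (12), the curvature loss (14) and the Eikonal term (16) can be written with nested autograd calls. The Gaussian curvature (13) follows the same pattern but additionally requires the full Hessian (e.g. via torch.autograd.functional.hessian) and is therefore noticeably more expensive.

```python
import torch

def sdf_gradient(F, x):
    """Returns the input as a leaf tensor and grad F(x), keeping the graph
    so that higher-order derivatives remain available. x: (B, 3)."""
    x = x.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(F(x).sum(), x, create_graph=True)
    return x, g

def mean_curvature(F, x, eps=1e-8):
    """gamma_mean = -div(grad F / ||grad F||), Eq. (12), computed with autograd."""
    x, g = sdf_gradient(F, x)
    n = g / (g.norm(dim=-1, keepdim=True) + eps)          # unit normal field
    div = 0.0
    for k in range(3):                                     # div n = sum_k d n_k / d x_k
        (dk,) = torch.autograd.grad(n[:, k].sum(), x, create_graph=True)
        div = div + dk[:, k]
    return -div

def curvature_loss(F, surface_pts, kappa):
    """L_curv of Eq. (14), here with the mean curvature, clipped at kappa.
    surface_pts is assumed to be sampled on (or near) the surface S."""
    gamma = mean_curvature(F, surface_pts).abs()
    return torch.minimum(gamma, torch.full_like(gamma, kappa)).mean()

def eikonal_loss(F, x):
    """L_SDF of Eq. (16): (||grad F(x)|| - 1)^2, enforcing a signed distance function."""
    _, g = sdf_gradient(F, x)
    return ((g.norm(dim=-1) - 1.0) ** 2).mean()

# Total loss of Eq. (15), for some weights lambda_sdf and lambda_curv:
# loss = l_rgb + lambda_sdf * eikonal_loss(F, pts) + lambda_curv * curvature_loss(F, surf_pts, kappa)
```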
6. Conclusions

With DiffNeRF, a variant of NeRF that relies on differential geometry to regularize the depth or the normals of the learned scene, we demonstrated that it is possible to achieve state-of-the-art novel view synthesis and depth estimation in few-shot neural rendering with a simple yet flexible regularization framework. This is made possible by modern deep learning frameworks, which already provide the necessary tools to implement differential geometry operators, thus facilitating their use in practice. However, the use of differential geometry is still subject to certain limitations. Higher-order operators can be costly both in memory and in computation time, so a careful choice of the regularization term is essential. Operators should be chosen differently depending on the problem at hand. For example, a Gaussian curvature regularization may be appropriate for flat surfaces with strong edges, such as buildings, but could fill holes in irregular surfaces. The vast literature on differential geometry opens up many exciting opportunities to define new regularization tools with the appropriate mathematical formalism, which we hope pushes the limits of neural rendering even further. Additional studies to understand the impact of the activation function (such as softplus, squareplus [1], sine [38], Gaussian [9], etc.) on the results are also necessary.

Acknowledgements

Work partly financed by Office of Naval Research grant N00014-17-1-2552, MENRT, and Kayrros. It was also performed using HPC resources from GENCI–IDRIS (grants AD011012453R2 and AD011011801R3) and from the "Mésocentre" computing center of CentraleSupélec and ENS Paris-Saclay supported by CNRS and Région Île-de-France (http://mesocentre.centralesupelec.fr). Centre Borelli is also with Université Paris Cité, SSA and INSERM.

References

[1] Jonathan T. Barron. Squareplus: A Softplus-like algebraic rectifier, 2021.
[2] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5855–5864, 2021.
[3] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik Lensch. NeRD: Neural reflectance decomposition from image collections. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12684–12694, 2021.
[4] Shengqu Cai, Anton Obukhov, Dengxin Dai, and Luc Van Gool. Pix2NeRF: Unsupervised conditional π-GAN for single image to neural radiance fields translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[5] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5799–5809, 2021.
[6] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021.
[7] Di Chen, Yu Liu, Lianghua Huang, Bin Wang, and Pan Pan. GeoAug: Data augmentation for few-shot NeRF with geometry constraints. In Computer Vision – ECCV 2022, Lecture Notes in Computer Science, pages 322–337. Springer Nature Switzerland, 2022.
[8] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (SRF): Learning view synthesis for sparse views of novel scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7911–7920, 2021.
[9] Shin-Fang Chng, Sameera Ramasinghe, Jamie Sherrah, and Simon Lucey. Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation. In European Conference on Computer Vision, pages 264–280. Springer, 2022.
[10] Ulrich Clarenz, Udo Diewald, and Martin Rumpf. Anisotropic geometric diffusion in surface processing. In Proceedings of the IEEE Visualization Conference, 2000.
[11] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[12] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis (PMVS). In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, page 3, 2007.
[13] Ron Goldman. Curvature formulas for implicit curves and surfaces. Computer Aided Geometric Design, 22(7):632–658, 2005.
[14] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In International Conference on Machine Learning, pages 3789–3799. PMLR, 2020.
[15] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2007.
[16] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting NeRF on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5885–5894, 2021.
[17] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014.
[18] Mijeong Kim, Seonguk Seo, and Bohyung Han. InfoNeRF: Ray entropy minimization for few-shot neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[19] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
[20] Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022.
[21] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7210–7219, 2021.
[22] Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108, 1995.
[23] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
[24] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421, 2020.
[25] Pierre Moulon, Pascal Monasse, Romuald Perrot, and Renaud Marlet. OpenMVG: Open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, pages 60–74. Springer, 2016.
[26] Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[27] Takashi Otonari, Satoshi Ikehata, and Kiyoharu Aizawa. Non-uniform sampling strategies for NeRF on 360° images. In 33rd British Machine Vision Conference (BMVC), London, UK, November 21-24, 2022. BMVA Press, 2022.
[28] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5865–5874, 2021.
[29] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics, 40(6), 2021.
[30] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10318–10327, 2021.
[31] Daniel Rebain, Mark Matthews, Kwang Moo Yi, Dmitry Lagun, and Andrea Tagliasacchi. LOLNeRF: Learn from one look. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[32] Barbara Roessle, Jonathan T. Barron, Ben Mildenhall, Pratul P. Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[33] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[34] Ewelina Rupnik, Mehdi Daakir, and Marc Pierrot Deseilligny. MicMac – a free, open-source solution for photogrammetry. Open Geospatial Data, Software and Standards, 2(1):1–9, 2017.
[35] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7–42, 2002.
[36] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
[37] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
[38] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020.
[39] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM SIGGRAPH 2006 Papers, pages 835–846. ACM Press, 2006.
[40] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7495–7504, 2021.
[41] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843–852, 2017.
[42] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. State of the art on neural rendering. Computer Graphics Forum, 39(2):701–727, 2020.
[43] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12959–12970, 2021.
[44] Guandao Yang, Serge Belongie, Bharath Hariharan, and Vladlen Koltun. Geometry processing with neural fields. Advances in Neural Information Processing Systems, 34:22483–22497, 2021.
[45] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34, 2021.
[46] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
[47] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4578–4587, 2021.
[48] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[49] Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, and Jonathan T. Barron. NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Transactions on Graphics (TOG), 40(6):1–18, 2021.
