…density fields contains discernible noise as the 3D representation lacks surface constraints.

To overcome these drawbacks, we propose NeuS2, a new method for fast training of highly-detailed neural implicit surfaces (see Fig. 1). NeuS2 can reconstruct a static object in minutes, and a moving object sequence in up to 20 seconds per frame (excluding the first frame), on a single GPU. To achieve fast reconstruction, we leverage multi-resolution hash tables of learnable feature vectors [36] to enhance a neural network-encoded SDF and implement the whole system in CUDA. Notably, in this design, a surface constraint and the rendering formulation require calculating second-order derivatives. The main challenge is to have a simple and memory-efficient calculation to achieve the highest possible GPU computing performance. Therefore, we derive a simple formula for the second-order derivatives tailored to ReLU-based MLPs, which enables an efficient CUDA implementation with a small memory footprint at a significantly reduced computational cost. To further enforce and accelerate training convergence, we introduce an efficient progressive training strategy for learning multi-resolution hash encodings [36], which updates the hash table features in a coarse-to-fine manner.

We further extend our method to multi-view dynamic scene reconstruction. Instead of training each frame in the sequence separately, we propose a new incremental learning strategy to efficiently learn a neural dynamic representation of objects with large movements and deformations. Specifically, we exploit the similarity of the shape and appearance information shared by two consecutive frames by first training the first frame and sequentially fine-tuning the subsequent frames. While this strategy generally works well, we observed that when the movement between two consecutive frames is relatively large, the predicted SDF of the occluded regions that are not observed in most images may get stuck in the learned SDF of the previous frame. To address this, we predict a global transformation to roughly align the two frames before learning the representation of the new frame.

In summary, our technical contributions are:

• We propose a new method, NeuS2, for fast learning of neural surface representations from multi-view RGB input for both static and dynamic scenes, which achieves a significant speed-up over the state of the art while achieving unprecedented reconstruction quality.

• A simple formulation of the second-order derivatives tailored to ReLU-based MLPs is presented to enable efficient parallelization of GPU computation.

• A progressive training strategy for learning multi-resolution hash encodings from coarse to fine is proposed to enforce better and faster training convergence.

• We design an incremental learning method with a novel global transformation prediction component for reconstructing dynamic scenes with large movements in an efficient and stable manner.

2. Related Work

Multi-view Stereo. Traditional multi-view 3D reconstruction methods can be categorized into depth-based and voxel-based methods. Depth-based methods [5, 13, 14, 51] reconstruct a point cloud by identifying point correspondences across images, but the reconstruction quality is heavily affected by the accuracy of correspondence matching. Voxel-based methods [6, 11, 53] side-step the difficulties of explicit correspondence matching by recovering occupancy and color in a voxel grid from multi-view images with a photometric consistency criterion. However, the reconstruction of these methods is limited to low resolution due to the high memory consumption of voxel grids.

Classical Multi-view 4D Reconstruction. A large body of works [3, 7, 10, 17, 59, 65] in multi-view 4D reconstruction utilizes a precomputed deformable model and deforms the model to fit the multi-view images. Different from these works, our method does not need a precomputed model, can reconstruct detailed results, and can handle topology changes. The most relevant work to ours is [9], which is also a model-free method. [9] first uses RGB and depth inputs to reconstruct high-quality point clouds for each frame, and then tracks over sequences of frames to produce temporally coherent sequences. While [9] can produce impressive results, it requires both RGB and depth as input, and the whole pipeline consists of many sophisticated and time-consuming procedures. In this paper, we only focus on how to efficiently obtain a high-quality reconstruction per frame and do not consider temporal mesh tracking, since it is out of the scope of this paper. Different from [9], we only need RGB as input and can learn high-quality geometry and appearance for each frame in an end-to-end manner in 20 seconds per frame.

Neural Implicit Representations. Neural implicit representations have made remarkable achievements in novel view synthesis [22, 30, 33, 35, 54, 55] and 3D/4D reconstruction [21, 23, 32, 37, 39, 47, 48, 60, 66, 67]. NeRF [35] has shown high-quality results in the novel view synthesis task, but it cannot extract high-quality surfaces since its geometry representation lacks surface constraints. NeuS [60] represents the 3D surface as an SDF for high-quality geometry reconstruction. However, the training of NeuS is very slow, and it only works for static scene reconstruction. In contrast, our method significantly accelerates the training of NeuS by 100 times, and the training can be further accelerated to 20 seconds per frame when applied to dynamic scene reconstruction.
There are other follow-up works of NeRF [35]. Some works [36, 49, 57] introduce voxel-grid features to represent 3D properties for fast training. However, these methods cannot extract high-quality surfaces as they inherit the volume density field as the geometry representation from NeRF [35]. In contrast, our method achieves high-quality surface reconstruction as well as fast training. For dynamic scene modeling, many works [27, 28, 40, 41, 45, 58, 63] propose to disentangle a 4D scene into a shared canonical space and a deformable field per frame. While these works make the training for dynamic scenes more efficient, the training is still time-consuming. Furthermore, these methods are not able to handle large movements and only reconstruct medium-quality surfaces. Some works in human performance modeling [8, 20, 24, 31, 38, 43, 44, 56, 61, 64] can model large movements by introducing a deformable template as a prior. In contrast, our method can handle large movements, does not require a deformable template, and thus is not restricted to a specific dynamic object. Moreover, we can learn high-quality surfaces of dynamic scenes in 20 seconds per frame.

Concurrent Work. [12] represents a 4D scene with a time-aware voxel feature. [29] proposes a static-to-dynamic learning paradigm for fast dynamic scene learning. [26] presents a grid-based method for efficiently reconstructing radiance fields frame by frame. These three methods focus on novel view synthesis and, thus, are not designed to reconstruct high-quality surfaces, which is different from our goal of achieving high-quality surface geometry and appearance models. Voxurf [62] proposes a voxel-based surface representation for fast multi-view 3D reconstruction. While it enables a 20x speedup over the baselines (i.e., NeuS [60]), our proposed method is over 3x faster than Voxurf and achieves better geometry quality when compared with Voxurf's results reported in their paper. Also, Voxurf is not designed for dynamic scene reconstruction. [69] proposes a method for human modeling and rendering. It first reconstructs a neural surface representation for each frame; then it applies non-rigid deformation to obtain a temporally coherent mesh sequence. Our work focuses on the first part, that is, fast reconstruction of dynamic scenes, where we exploit the temporal consistency between two consecutive frames to accelerate the learning of the dynamic representation. Therefore, our work is orthogonal to [69] and can be integrated into [69] as the first step. Last, Anonymous et al. [25] propose a monocular dynamic surface reconstruction approach by extending the NeuS formulation for bending rays. In stark contrast to our approach, their focus lies on proving that unbiasedness also holds in the case of ray bending and the challenging monocular setting, and less on the highest possible quality at the fastest speeds.

3. Background

NeuS. Given calibrated multi-view images of a static scene, NeuS [60] implicitly represents the surface and appearance of the scene as a signed distance field f(x) : R^3 → R and a radiance field c(x, v) : R^3 × S^2 → R^3, where x denotes a 3D position and v ∈ S^2 is a viewing direction. The surface S of the object can be obtained by extracting the zero-level set of the SDF, S = {x ∈ R^3 | f(x) = 0}. To render an object into an image, NeuS leverages volume rendering. Specifically, for each pixel of an image, we sample n points {p(t_i) = o + t_i v | i = 0, 1, ..., n − 1} along its camera ray, where o is the center of the camera and v is the view direction. By accumulating the SDF-based densities and colors of the sample points, we can compute the color Ĉ of the ray. As the rendering process is differentiable, NeuS can learn the signed distance field f and the radiance field c from the multi-view images. However, the training process is very slow, taking about 8 hours on a single GPU.

Multi-resolution Hash Encoding. To overcome the slow training time of deep coordinate-based MLPs, which is also a main reason for the slow performance of NeuS, Instant-NGP [36] recently proposed a multi-resolution hash encoding and has proven its effectiveness. Specifically, Instant-NGP assumes that the object to be reconstructed is bounded in multi-resolution voxel grids. The voxel grids at each resolution are mapped to a hash table with a fixed-size array of learnable feature vectors. For a 3D position x ∈ R^3, it obtains a hash encoding at each level, h_i(x) ∈ R^d (d is the dimension of a feature vector, i = 1, ..., L), by interpolating the feature vectors assigned to the surrounding voxel grid vertices at this level. The hash encodings at all L levels are then concatenated into the multi-resolution hash encoding h(x) = {h_i(x)}_{i=1}^{L} ∈ R^{L×d}. While the runtime is significantly improved, Instant-NGP still does not reach the quality of NeuS in terms of geometry reconstruction accuracy.

4. Static Neural Surface Reconstruction

We first present how our formulation can effectively learn the signed distance field of a static scene from calibrated multi-view images (see Fig. 2 a). To accelerate the training process, we first demonstrate how to incorporate multi-resolution hash encodings [36] for representing the SDF of the scene, and how volume rendering can be applied to render the scene into an image (Sec. 4.1). Next, we derive a simplified expression of the second-order derivatives tailored to ReLU-based MLPs, which can be efficiently parallelized in custom CUDA kernels (Sec. 4.2). Finally, we adopt a progressive training strategy for learning multi-resolution hash encodings, which leads to faster training convergence and better reconstruction quality (Sec. 4.3).
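To make the multi-resolution hash encoding from Sec. 3 concrete, below is a minimal NumPy sketch of the per-level lookup: each level hashes the surrounding voxel corners into a fixed-size feature table and trilinearly interpolates the stored features. The hash function and growth schedule follow the general Instant-NGP [36] recipe, but the table size, feature dimension, and all names here (`hash_encode`, `PRIMES`, etc.) are illustrative assumptions and not our CUDA implementation; the 14 levels and the 16-to-2048 resolution range mirror the configuration reported in Sec. E.2.

```python
import numpy as np

# Illustrative hyper-parameters (assumed, not the paper's exact values except L and the resolutions).
L_LEVELS, FEAT_DIM, TABLE_SIZE = 14, 2, 2 ** 19
BASE_RES, MAX_RES = 16, 2048
GROWTH = np.exp(np.log(MAX_RES / BASE_RES) / (L_LEVELS - 1))
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

# One learnable feature table per level (Omega in the paper), initialized small.
tables = [np.random.randn(TABLE_SIZE, FEAT_DIM).astype(np.float32) * 1e-4
          for _ in range(L_LEVELS)]

def spatial_hash(corner):
    """XOR-based spatial hash of an integer grid corner -> table index."""
    h = corner.astype(np.uint64) * PRIMES
    return (h[0] ^ h[1] ^ h[2]) % TABLE_SIZE

def hash_encode(x):
    """x: (3,) position in [0, 1]^3 -> concatenated encoding h(x) of size L * d."""
    feats = []
    for lvl in range(L_LEVELS):
        res = int(np.floor(BASE_RES * GROWTH ** lvl))
        p = x * res
        p0 = np.floor(p).astype(np.int64)          # lower grid corner of the cell
        w = p - p0                                  # trilinear interpolation weights
        level_feat = np.zeros(FEAT_DIM, dtype=np.float32)
        for dz in (0, 1):
            for dy in (0, 1):
                for dx in (0, 1):
                    corner = p0 + np.array([dx, dy, dz])
                    weight = ((w[0] if dx else 1 - w[0]) *
                              (w[1] if dy else 1 - w[1]) *
                              (w[2] if dz else 1 - w[2]))
                    level_feat += weight * tables[lvl][spatial_hash(corner)]
        feats.append(level_feat)
    return np.concatenate(feats)                    # shape (L * d,)

print(hash_encode(np.array([0.3, 0.5, 0.7])).shape)  # (28,) for 14 levels x 2 features
```

Because only the 8 corner features per level are touched for a query point, gradients during training are sparse, which is what makes the encoding fast to optimize.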
[Figure 2. (a) NeuS2: static scene reconstruction — position and hash encoding, SDF network (with normal n = ∇SDF and a second-order derivative path), RGB network with the ray direction, volume rendering, rendered image and mesh. (b) Incremental training across Frame 0, Frame t−1, and Frame t with per-frame global transformations.]

4.1. Volume Rendering of a Hash-encoded SDF

For each 3D position x, we map it to its multi-resolution hash encoding h_Ω(x) with learnable hash table entries Ω. As h_Ω(x) is an informative encoding of the spatial position, the MLPs for mapping x to its SDF d and color c can be very shallow, which results in more efficient rendering and training without compromising quality.

SDF Network. In more detail, our SDF network

\label{eqn:sdf_net} (d, \mathbf{g}) = f_{\Theta}(\mathbf{e}), \qquad \mathbf{e} = (\mathbf{x}, h_\Omega(\mathbf{x})) (1)

is a shallow MLP with weights Θ, which takes the 3D position x along with its hash encoding h_Ω(x) as input and outputs the SDF value d and a geometry feature vector g ∈ R^15. Here, we concatenate the point position to the input of the SDF network in order to apply a geometry initialization [4] for more stable SDF learning.

Color Network. The normal of x can be computed as

\label{eqn:normal} \mathbf{n} = \nabla_{\mathbf{x}} d, (2)

where ∇_x d denotes the gradient of the SDF with respect to x. We then feed the normal n, the geometry feature g, the SDF value d, the point x, and the ray direction v to our color network

\label{eqn:color_net} \mathbf{c} = c_{\Upsilon}(\mathbf{x}, \mathbf{n}, \mathbf{v}, d, \mathbf{g}), (3)

which outputs the color c of x.

Volume Rendering. To render an image, we apply the unbiased volume rendering of NeuS [60]. Additionally, we adopt a ray marching acceleration strategy used in Instant-NGP [36]. More details are provided in the supplementary document.

Supervision. To train NeuS2, we minimize the color difference between the rendered pixels Ĉ_i, with i ∈ {1, ..., m}, and the corresponding ground-truth pixels C_i, without any 3D supervision. Here, m denotes the batch size during training. We also employ an Eikonal term [15] to regularize the learned signed distance field. Our final loss is defined as

\label{eqn:loss} \mathcal{L} = \mathcal{L}_{\mathrm{color}} + \beta \mathcal{L}_{\mathrm{eikonal}}, (4)

where L_color = (1/m) Σ_i R(Ĉ_i, C_i) and R is the Huber loss [18]; L_eikonal = (1/(mn)) Σ_{k,i} (‖n_{k,i}‖ − 1)^2, where k indexes the k-th sample along the ray with k ∈ {1, ..., n}, n is the number of sampled points, and n_{k,i} is the normal of a sampled point (see Eq. 2).
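As a reference for Eq. 4, the following PyTorch sketch assembles the color and Eikonal terms from a batch of rendered pixels and sampled-point normals. The tensor shapes, the β value, and the Huber threshold are illustrative assumptions; only the loss structure is taken from the paper.

```python
import torch

def neus2_loss(pred_rgb, gt_rgb, normals, beta=0.1, huber_delta=0.1):
    """pred_rgb, gt_rgb: (m, 3) rendered and ground-truth pixel colors.
    normals: (m, n, 3) SDF gradients n_{k,i} at the sampled points.
    Returns L = L_color + beta * L_eikonal (Eq. 4)."""
    # Huber (smooth-L1) color loss R(C_hat, C) [18].
    l_color = torch.nn.functional.smooth_l1_loss(pred_rgb, gt_rgb, beta=huber_delta)
    # Eikonal regularizer: unit-norm constraint on the SDF gradients.
    l_eikonal = ((normals.norm(dim=-1) - 1.0) ** 2).mean()
    return l_color + beta * l_eikonal
```

In the actual system this loss, and in particular its second-order backward pass through the normals, is evaluated inside custom CUDA kernels rather than through PyTorch autograd, as discussed next.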
4.2. Efficient Handling of Second-order Derivatives

To avoid computational overhead during training, we implement our whole system in CUDA.

Second-order Derivatives. In contrast to Instant-NGP [36], which only requires first-order derivatives during the optimization, we must calculate the second-order derivatives for the parameters associated with the normal term n = ∇_x d (Eq. 2), which is used as an input to the color network c_Υ (Eq. 3) and in the Eikonal loss term L_eikonal. To accelerate this computation, we directly calculate them using simplified formulas instead of applying the computational graph of PyTorch [42]. Specifically, we calculate the second-order derivatives of the hash table parameters Ω and the SDF network parameters Θ using the chain rule as

\label{eqn:dloss_denc} \frac{\partial \mathcal{L}}{\partial \Omega} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \Omega} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} \Big) (5)

\label{eqn:dloss_dsdf} \frac{\partial \mathcal{L}}{\partial \Theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \Theta} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Theta} \Big) (6)

In the following, H_l denotes the weight matrix of the l-th layer of a ReLU-based MLP, H_1^{(\_,j)} is the j-th column of H_1, H_L^{(i,\_)} is the i-th row of H_L, and

G_l = \begin{cases} 1, & H_{l-1} \cdots g(H_1 \mathbf{x}) > 0 \\ 0, & \text{otherwise}. \end{cases}

Now the second-order derivatives of a ReLU-based MLP with respect to its input and intermediate layers can be derived.

Theorem 1 (Second-order derivative of a ReLU-based MLP). Given a ReLU-based MLP f with L hidden layers as in Definition 1, the second-order derivative of the MLP f is

\label{eqn:mlp_2ord_derivative_formula} \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial H_l} = (P_l^j S_l^i)^\top, \qquad \frac{\partial^2 y}{\partial \mathbf{x}^2} = 0, (8)

where (∂y/∂x)_{(i,j)} is the matrix element (i, j) of ∂y/∂x, and S_l^i and P_l^j are defined in Definition 1.

Coming back to our original second-order derivatives (Eqs. 5 and 6), by Theorem 1 we obtain ∂(∂d/∂e)/∂e = 0. Since ∂e/∂x is irrelevant to Θ, we have ∂(∂e/∂x)/∂Θ = 0. This results in the following simplified form for the second-order derivatives:

\label{eqn:simplified_dloss_denc} \frac{\partial \mathcal{L}}{\partial \Omega} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} (9)

\label{eqn:simplified_dloss_dmlp} \frac{\partial \mathcal{L}}{\partial \Theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \Theta} (10)

To further simplify ∂L/∂Ω and ∂L/∂Θ, we can substitute the terms ∂(∂e/∂x)/∂Ω and ∂(∂d/∂e)/∂Θ using Eq. 8, respectively. We show the detailed derivations in the supplementary material.

4.3. Progressive Training

We modulate the hash encoding at each level i by a weight

w_i(\lambda) = \begin{cases} 0, & \text{if } i > \lambda \\ 1, & \text{otherwise}, \end{cases} (12)

where the parameter λ modulates the bandwidth of the low-pass filter applied to the multi-resolution hash encodings and gradually increases during the training process.
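A minimal sketch of how the coarse-to-fine schedule of Eq. 12 can be applied to the concatenated hash encoding is shown below. The schedule `lam_at_step` and the way the per-level mask is broadcast over each level's feature dimension are illustrative assumptions, not the exact implementation; the sizes reuse the 14-level configuration of Sec. E.2.

```python
import numpy as np

L_LEVELS, FEAT_DIM = 14, 2   # illustrative sizes (cf. Sec. E.2 / Instant-NGP defaults)

def level_mask(lam):
    """w_i(lambda) from Eq. 12: keep levels i <= lambda, zero out finer ones."""
    i = np.arange(1, L_LEVELS + 1)
    return (i <= lam).astype(np.float32)

def masked_encoding(h, lam):
    """h: (L * d,) multi-resolution hash encoding; returns the low-pass-filtered encoding."""
    w = np.repeat(level_mask(lam), FEAT_DIM)   # broadcast each level's weight over its d features
    return h * w

def lam_at_step(step, total_steps, start=4):
    """lambda grows from coarse to fine over training, e.g. linearly with the step count."""
    return min(L_LEVELS, start + (L_LEVELS - start) * step / total_steps)
```

Masking the fine levels early keeps the optimization focused on the low-frequency geometry first, which is what improves convergence and final quality in Sec. 6.3.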
5. Dynamic Neural Surface Reconstruction

We have explained how NeuS2 can produce highly accurate and fast reconstructions of static scenes. Next, we extend NeuS2 to dynamic scene reconstruction. That is, given multi-view videos of a moving object and the camera parameters of each view, our goal is to learn the neural implicit surfaces of the object in each video frame (see Fig. 2 b).

5.1. Incremental Training

Even though our reconstruction method for static objects achieves promising efficiency and quality, constructing dynamic scenes by training every single frame independently is still time-consuming. However, scene changes from one frame to the next are typically small. Thus, we propose an incremental training strategy to exploit the similarity in geometry and appearance information shared between two consecutive frames, which enables faster convergence of our model. In detail, we train the first frame from scratch as presented in our static scene reconstruction, and then fine-tune the model parameters for the subsequent frames based on the learned hash grid representation of the preceding frame. Using this strategy, the model is able to produce a good initialization of the neural representation of the target frame and, thus, significantly accelerates its convergence.

5.2. Global Transformation Prediction

As we observed during the incremental training process, the predicted SDF may get stuck in the local minima of the learned SDF of the previous frame, especially when the object's movement between adjacent frames is relatively large. For instance, when our model reconstructs a walking sequence from multi-view images, the reconstructed surface appears to have many holes, as shown in Fig. 7. To address this issue, we propose a global transformation prediction to roughly transform the target SDF into a canonical space before the incremental training. Specifically, we predict the rotation R and translation T of the object between two adjacent frames. For any given 3D position x_i in the coordinate space of frame i, it is transformed back to the coordinate space of the previous frame i − 1, denoted as x_{i−1}:

\label{eqn:globalmove_predict} \mathbf{x}_{i-1} = R_i(\mathbf{x}_{i} + T_i). (13)

The transformations can then be accumulated to transform the point x_i back to x_c in the first frame's coordinate space:

\mathbf{x}_c = R_{i-1}^c(\mathbf{x}_{i-1} + T_{i-1}^c) = R_{i}^c(\mathbf{x}_{i} + T_{i}^c), (14)

where R_i^c = R_{i-1}^c R_i and T_i^c = T_i + R_i^{-1} T_{i-1}^c.

The global transformation prediction also allows us to update only the small portion of the scene in the current frame that differs from the previous frame, rather than updating the entire scene in the current frame. In this way, we can obtain a more accurate reconstruction and reduce memory costs.

Notably, our method can handle large movements and deformations, which are challenging for existing dynamic scene reconstruction approaches [58], [45], thanks to the following designs of our approach: (1) global transformation prediction, which accounts for large global movements in the sequence; (2) incremental training, which learns relatively small deformable movements between two adjacent frames rather than relatively large movements from each frame to a common canonical space.

We implement the incremental training strategy combined with global transformation prediction in an end-to-end learning scheme, as illustrated in Fig. 2(b). Specifically, when processing a new frame, we first predict the global transformation, and then fine-tune the model's parameters and the global transformation together to efficiently learn the neural representation.
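To illustrate Eqs. 13 and 14, the sketch below folds the per-frame global transformation into the cumulative transformation that maps a point in frame i directly back to the first frame's coordinate space, and checks numerically that the accumulation matches chaining Eq. 13 frame by frame. The random rotations and the three-frame example are made up purely for the check.

```python
import numpy as np

def to_previous(x_i, R_i, T_i):
    """Eq. 13: map a point in frame i back to frame i-1's coordinate space."""
    return R_i @ (x_i + T_i)

def accumulate(Rc_prev, Tc_prev, R_i, T_i):
    """Eq. 14: fold (R_i, T_i) into the cumulative (R_i^c, T_i^c) mapping frame i
    to the canonical (first-frame) space: R_i^c = R_{i-1}^c R_i, T_i^c = T_i + R_i^{-1} T_{i-1}^c."""
    return Rc_prev @ R_i, T_i + np.linalg.inv(R_i) @ Tc_prev

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.linalg.det(q))            # proper rotation, det = +1

rng = np.random.default_rng(0)
transforms = [(random_rotation(rng), rng.standard_normal(3)) for _ in range(3)]  # frames 1..3

Rc, Tc = np.eye(3), np.zeros(3)                     # frame 0 is the canonical space
for R_i, T_i in transforms:                         # accumulate in frame order (Eq. 14)
    Rc, Tc = accumulate(Rc, Tc, R_i, T_i)

x3 = rng.standard_normal(3)                         # a point expressed in frame 3
x0 = x3
for R_i, T_i in reversed(transforms):               # hop back one frame at a time (Eq. 13)
    x0 = to_previous(x0, R_i, T_i)

assert np.allclose(x0, Rc @ (x3 + Tc))              # x_c = R_i^c (x_i + T_i^c)
```

In training, the predicted (R_i, T_i) are optimized jointly with the network parameters after the first 100 iterations, as detailed in Sec. E.3 of the supplementary material.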
[Figure 3: qualitative comparisons on DTU scan 24, scan 63, and scan 118 (see Sec. 6.1).]

[Figure 4 rows: Lego, Lion, Human.] Figure 4. Qualitative comparisons of synthetic scenes for geometry reconstruction and novel view synthesis. Compared to D-NeRF, which is trained for 20 hours for each scene, our method produces photo-realistic rendering results and accurate geometry reconstructions with only 20 seconds of training time per frame.

[Figure 5 rows: D1, D2, D5.] Figure 5. Qualitative comparisons of real scenes for geometry reconstruction and novel view synthesis. Our method outperforms D-NeRF for both tasks, demonstrating sharp and accurate reconstruction quality, while D-NeRF fails to handle complex transformations in real scenes.

[Figure 6: plot of computing time (ms) for the second-order derivative backward pass (see Sec. 6.3).]
6. Experiments

All experiments are conducted on a single GeForce RTX 3090 GPU. Implementation details, additional results, and video results are provided in the supplementary material.

6.1. Static Scene Reconstruction

For static scene reconstruction, we use 15 scenes of the DTU dataset [19] for evaluation. There are 49 or 64 images with a resolution of 1600 × 1200, and we test each scene with the foreground masks provided by IDR [68]. Tab. 1 summarizes the quantitative comparison. Our method achieves comparable performance on the novel view synthesis task to the baseline methods.

          COLMAP   NeuS    Instant-NGP   Ours
CD ↓      1.36     0.77    1.84          0.70
PSNR ↑    -        28.00   28.86         28.82
Runtime   1 h      8 h     5 min         5 min

Table 1. Quantitative comparison on the DTU dataset. Our method outperforms the other baselines for geometry reconstruction in terms of the Chamfer Distance (CD) and is on par with Instant-NGP for novel view synthesis in terms of PSNR.

A qualitative comparison of geometry reconstruction and novel view synthesis results for all methods is presented in Fig. 3. For the 3D geometry reconstruction results, NeuS [60] exhibits limited performance in terms of reconstructed details, with excessively smooth surfaces. The extracted meshes of Instant-NGP [36] are noisy since its geometry representation, the volume density field, lacks surface constraints. Regarding the novel view synthesis results, our method outperforms NeuS [60], presenting detailed rendering results on par with Instant-NGP. To conclude, our method achieves high-quality geometry and appearance reconstruction without inducing noise; e.g., it can recover the complex structures of the windows and render detailed textures in scan 24 of the DTU dataset.

6.2. Dynamic Scene Reconstruction

We conduct experiments on both synthetic and real dynamic scenes for the tasks of novel view synthesis and geometry reconstruction. We compare our method with the state-of-the-art neural-based method for dynamic scenes, D-NeRF [46], quantitatively and qualitatively. D-NeRF [46] models general scenes for dynamic novel view synthesis by combining NeRF [35] with deformation fields, which are learned from all the frames simultaneously.

Synthetic Scenes. We choose three different types of datasets for synthetic scene reconstruction: a Lego scene shared by NeRF [35] (150 frames), a Lion sequence provided by Artemis [34] (177 frames), and a human character from the RenderPeople [2] dataset (100 frames). For the quantitative evaluation, the Chamfer Distances and PSNR scores are calculated and averaged over all frames and over all testing views of all frames, respectively. As shown in Tab. 2, our method shows significantly improved novel view synthesis and geometry reconstruction results compared to D-NeRF. Notably, our method takes less than 1 hour to complete the learning of a sequence, with 40 seconds (80 seconds for the Lego sequence) of training time for the first frame and 20 seconds for each subsequent frame, while the training time of D-NeRF for each scene is about 20 hours. The qualitative results are provided in Fig. 4, showing that our method outperforms D-NeRF in terms of novel view synthesis and geometry reconstruction.

          D-NeRF            Ours
Dataset   PSNR ↑   CD ↓     PSNR ↑   CD ↓
Lego      24.25    59.0     29.5     17.1
Lion      31.45    -        33.60    -
Human     29.33    5.73     33.20    1.86
Runtime   20 h              20 s per frame

Table 2. Quantitative comparisons on synthetic scenes. The Chamfer Distance of the Lion sequence is omitted since the ground-truth geometry is not provided. Compared to D-NeRF, our method achieves much better appearance and geometry reconstruction results while requiring significantly less training time.

Real Scenes. To further evaluate the effectiveness of our method on real scenes with large and non-rigid movements, we select three sequences from the Dynacap [16] dataset. Each sequence contains 500 frames with about 50 to 100 camera views for training and about 5 to 10 camera views for testing. More details are provided in the supplementary material. Tab. 3 summarizes the quantitative comparisons between our method and D-NeRF [45]. Since the real scene datasets do not have ground-truth geometry, we can only calculate the PSNR and LPIPS scores of the novel view synthesis results to evaluate the rendering quality. For long sequences with 500 frames consisting of challenging movements, D-NeRF struggles to reconstruct the dynamic real scenes. Even when training D-NeRF for 50 hours, our method achieves significantly better scores, taking only 20 seconds per frame. Also, according to the qualitative evaluation results shown in Fig. 5, D-NeRF shows blurred rendering results and inaccurate geometry reconstruction. In contrast, our approach produces photo-realistic renderings and detailed geometry.

          D-NeRF              Ours
Dataset   PSNR ↑   LPIPS ↓   PSNR ↑   LPIPS ↓
D1        17.47    0.154     27.76    0.037
D2        20.80    0.155     27.25    0.042
D5        21.48    0.122     26.41    0.036
mean      19.92    0.144     27.14    0.038
Runtime   50 h               20 s per frame

Table 3. Quantitative comparisons on real scenes. We found that our method outperforms D-NeRF in all metrics.

6.3. Ablation

We first compare our efficient second-order derivative backward computation (Theorem 1) implemented in CUDA with a baseline where we automatically compute the second-order derivative in PyTorch [42] using its computational graph. As shown in Fig. 6, our method achieves a faster speed for second-order derivative backpropagation than the PyTorch implementation.

[Figure 7: (a) reference image, (b) w/o GTP, (c) w/o GTP and w/o PT, (d) full model.] Figure 7. Ablation study for Global Transformation Prediction (GTP) and progressive training (PT). Ablated models perform poorly on both novel view synthesis and geometry reconstruction.

Second, we evaluate the performance of the individual components of NeuS2, Global Transformation Prediction (GTP) and the Progressive Training strategy (PT), in Fig. 7 and Tab. 4. The geometry quality and appearance reconstruction quality of the full model are better than those of the other ablated models, both quantitatively and qualitatively. The ablated model shows noisy holes on the surface and blurred renderings since it gets stuck in local minima during the incremental training, which can be alleviated by the Global Transformation Prediction and Progressive Training.
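For context on the first ablation, the PyTorch baseline can be reproduced with autograd's double backward, as sketched below for a toy SDF MLP. The network and loss here are stand-ins (the real comparison uses the full NeuS2 model), but the `create_graph=True` pattern is the standard way to backpropagate a loss on the normal n = ∇_x d to the network parameters.

```python
import torch

# Toy stand-in for the hash encoding + SDF MLP (ReLU based, as in Sec. 4.2).
sdf = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

x = torch.rand(1024, 3, requires_grad=True)
d = sdf(x)

# Normal n = grad_x d, kept in the graph so a loss on n reaches the weights.
(n,) = torch.autograd.grad(d.sum(), x, create_graph=True)

loss = ((n.norm(dim=-1) - 1.0) ** 2).mean()   # Eikonal term; needs d^2 d / (dx dTheta)
loss.backward()                                # second-order (double) backward pass
print(sdf[0].weight.grad.shape)                # gradients w.r.t. the MLP parameters
```

NeuS2 replaces this graph-based double backward with the closed-form expressions of Theorem 1 evaluated in fused CUDA kernels, which is the speed difference Fig. 6 measures.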
7. Conclusion

Limitations. While our method reconstructs each frame of a dynamic scene in high quality, there is no temporal correspondence across the frames. A possible remedy could be to deform a mesh template to fit our learned neural surfaces for each frame, like the mesh tracking used in [9, 69]. Currently, we also need to save the network parameters of each frame (25M). As future work, a compression of the parameters encoding the dynamic scene could be explored.

We proposed a learning-based method for accurate multi-view reconstruction of both static and dynamic scenes at an unprecedented runtime performance. To achieve this, we integrated multi-resolution hash encodings into a neural SDF and introduced a simple calculation of the second-order derivatives tailored to our dedicated network architecture. To enhance the training convergence, we presented a progressive training strategy to learn multi-resolution hash encodings. For dynamic scene reconstruction, we proposed an incremental training strategy with a global transformation prediction component, which leverages the shared geometry and appearance information in two consecutive frames.

Acknowledgement

Christian Theobalt and Marc Habermann were supported by ERC Consolidator Grant 4DReply (770784). Lingjie Liu was supported by a Lise Meitner Postdoctoral Fellowship.

References

[1] Blender. http://aiweb.techfak.uni-bielefeld.de/content/bworld-robot-control-software/.
[2] Renderpeople. https://renderpeople.com/3d-people/, 2018.
[3] Naveed Ahmed, Christian Theobalt, Petar Dobrev, Hans-Peter Seidel, and Sebastian Thrun. Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
[4] Matan Atzmon and Yaron Lipman. Sal: Sign agnostic learning of shapes from raw data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2565–2574, 2020.
[5] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24, 2009.
[6] Adrian Broadhurst, Tom W Drummond, and Roberto Cipolla. A probabilistic framework for space carving. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, volume 1, pages 388–393. IEEE, 2001.
[7] Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. ACM Trans. Graph., 22(3):569–577, July 2003.
[8] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, and Huchuan Lu. Animatable neural radiance fields from monocular rgb video, 2021.
[9] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph., 34(4), July 2015.
[10] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. ACM Trans. Graph., 27(3):1–10, August 2008.
[11] Jeremy S De Bonet and Paul Viola. Poxels: Probabilistic voxelized volume reconstruction. In Proceedings of International Conference on Computer Vision (ICCV), pages 418–425, 1999.
[12] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. arXiv:2205.15285, 2022.
[13] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2009.
[14] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Gipuma: Massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e.V., 25(361-369):2, 2016.
[15] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
[16] Marc Habermann, Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. Real-time deep dynamic characters. ACM Transactions on Graphics (TOG), 40(4):1–16, 2021.
[17] Marc Habermann, Weipeng Xu, Michael Zollhöfer, Gerard Pons-Moll, and Christian Theobalt. Livecap: Real-time human performance capture from monocular video. ACM Trans. Graph., 38(2):14:1–14:17, March 2019.
[18] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[19] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014.
[20] Zhang Jiakai, Liu Xinhang, Ye Xinyi, Zhao Fuqiang, Zhang Yanshun, Wu Minye, Zhang Yingliang, Xu Lan, and Yu Jingyi. Editable free-viewpoint video using a layered neural representation. In ACM SIGGRAPH, 2021.
[21] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[22] Srinivas Kaza et al. Differentiable volume rendering using signed distance functions. PhD thesis, Massachusetts Institute of Technology, 2019.
[23] Petr Kellnhofer, Lars Jebe, Andrew Jones, Ryan Spicer, Kari Pulli, and Gordon Wetzstein. Neural lumigraph rendering. arXiv preprint arXiv:2103.11571, 2021.
[24] Youngjoong Kwon, Dahun Kim, Duygu Ceylan, and Henry Fuchs. Neural human performer: Learning generalizable radiance fields for human performance rendering. NeurIPS, 2021.
[25] SEE COVER LETTER. See cover letter, SEE COVER LETTER.
[26] Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Ping Tan. Streaming radiance fields for 3d video synthesis. arXiv preprint arXiv:2210.14831, 2022.
[27] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
[28] Z. Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. ArXiv, abs/2011.13084, 2020.
[29] Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Devrf: Fast deformable voxel radiance fields for dynamic scenes. arXiv preprint arXiv:2205.15723, 2022.
[30] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33, 2020.
[31] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. ACM Trans. Graph., 40(6), December 2021.
[32] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2019–2028, 2020.
[33] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):65, 2019.
[34] Haimin Luo, Teng Xu, Yuheng Jiang, Chenglin Zhou, Qiwei Qiu, Yingliang Zhang, Wei Yang, Lan Xu, and Jingyi Yu. Artemis: Articulated neural pets with appearance and motion synthesis. arXiv preprint arXiv:2202.05628, 2022.
[35] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020.
[36] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989, 2022.
[37] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020.
[38] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In International Conference on Computer Vision, 2021.
[39] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In International Conference on Computer Vision (ICCV), 2021.
[40] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In International Conference on Computer Vision (ICCV), 2021.
[41] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), December 2021.
[42] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[43] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Animatable neural radiance fields for human body modeling. ICCV, 2021.
[44] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. CVPR, 1(1):9054–9063, 2021.
[45] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[46] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
[47] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.
[48] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 84–93, 2020.
[49] Sara Fridovich-Keil and Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
[50] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518. Springer, 2016.
[52] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[53] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999.
[54] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
[55] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pages 1119–1130, 2019.
[56] Shih-Yang Su, Frank Yu, Michael Zollhoefer, and Helge Rhodin. A-nerf: Surface-free human 3d pose refinement via neural rendering. arXiv preprint arXiv:2102.06199, 2021.
[57] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, 2022.
[58] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2021.
[59] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph., 27(3):1–9, August 2008.
[60] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
[61] Yiming Wang, Qingzhe Gao, Libin Liu, Lingjie Liu, Christian Theobalt, and Baoquan Chen. Neural novel actor: Learning a generalized animatable neural representation for human actors. 2022.
[62] Tong Wu, Jiaqi Wang, Xingang Pan, Xudong Xu, Christian Theobalt, Ziwei Liu, and Dahua Lin. Voxurf: Voxel-based efficient and accurate neural surface reconstruction, 2022.
[63] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Computer Vision and Pattern Recognition (CVPR), 2021.
[64] Hongyi Xu, Thiemo Alldieck, and Cristian Sminchisescu. H-nerf: Neural radiance fields for rendering and temporal reconstruction of humans in motion, 2021.
[65] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. Monoperfcap: Human performance capture from monocular video. ACM Trans. Graph., 37(2):27:1–27:15, May 2018.
[66] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[67] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
[68] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33:2492–2502, 2020.
[69] Fuqiang Zhao, Yuheng Jiang, Kaixin Yao, Jiakai Zhang, Liao Wang, Haizhao Dai, Yuhui Zhong, Yingliang Zhang, Minye Wu, Lan Xu, et al. Human performance modeling and rendering via neural animated mesh. arXiv preprint arXiv:2209.08468, 2022.
Supplementary Material

In the following, we provide more details about our method. First, we present the derivation of Eqs. 5 and 6 (Sec. A) and the proof of Theorem 1 (Sec. B). Second, we introduce the datasets (Sec. C) we used in the experiments and show more quantitative and qualitative results (Sec. D) to further demonstrate the performance of our model. Finally, we provide implementation details of our method (Sec. E).

A. Derivation of Equation 5 and Equation 6

We calculate the second-order derivatives with respect to the hash table parameters Ω as

\label{eqn:dloss_denc_proof} \begin{aligned} \frac{\partial \mathcal{L}}{\partial \Omega} &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial \frac{\partial d}{\partial \mathbf{x}}}{\partial \Omega} \\ &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial (\frac{\partial d}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \mathbf{x}})}{\partial \Omega} \\ &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \Omega} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} \Big) \\ &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \Omega} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} \Big) \end{aligned} (15)

The derivation of Eq. 6 with respect to the SDF network parameters Θ follows analogously.

B. Proof of Theorem 1

The element (i, j) of ∂y/∂x ∈ R^{n_L} × R^{n_1} is

\frac{\partial y}{\partial x}_{(i,j)} = H_L^{(i,\_)} G_{L} \cdots G_1 H_1^{(\_,j)}. (18)

We can then calculate

\begin{aligned} \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial H_l} = \Big(\frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_l P_l^j)}\Big)^\top \frac{\partial (H_l P_l^j)}{\partial H_l}. \end{aligned} (19)

On the one hand, since P_l^j is independent of H_l, we have

\begin{aligned} \frac{\partial (H_l P_l^j)}{\partial H_l} = (P_l^j)^\top. \end{aligned} (20)

On the other hand, we have

\begin{aligned} \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_l P_l^j)} = \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_{L-1} P_{L-1}^j)} \frac{\partial (H_{L-1} P_{L-1}^j)}{\partial (H_{L-2} P_{L-2}^j)} \cdots \frac{\partial (H_{l+1} P_{l+1}^j)}{\partial (H_{l} P_{l}^j)}. \end{aligned} (21)

We can further derive that

\frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_{L-1} P_{L-1}^j)} = H_L^{(i,\_)} G_{L}, (22)

and

\begin{aligned} \frac{\partial (H_{l} P_{l}^j)}{\partial (H_{l-1} P_{l-1}^j)} &= \frac{\partial (H_{l} G_{l} H_{l-1} P_{l-1}^j)}{\partial (H_{l-1} P_{l-1}^j)} \\ &= H_{l} G_{l} + H_{l} \frac{\partial G_{l}}{\partial (H_{l-1} P_{l-1}^j)} \\ &= H_{l} G_{l}. \end{aligned} (23)

Note that ∂(∂x_1/∂x)/∂x = 0. So we have

\begin{aligned} \frac{\partial^2 y}{\partial \mathbf{x}^2} &= \frac{\partial \frac{\partial y}{\partial x_1}}{\partial x} \frac{\partial x_1}{\partial x} \\ &= \frac{\partial^2 y}{\partial x_{1}^2} \Big(\frac{\partial x_1}{\partial x}\Big)^2 \\ &= \frac{\partial^2 y}{\partial x_{L-1}^2} \Big(\frac{\partial x_{L-1}}{\partial x_{L-2}}\Big)^2 \cdots \Big(\frac{\partial x_1}{\partial x}\Big)^2. \end{aligned} (28)

Since y = H_L x_{L-1}, we have ∂²y/∂x_{L-1}² = 0. Thus we get

\begin{aligned} \frac{\partial^2 y}{\partial \mathbf{x}^2} &= 0. \end{aligned} (29)
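As a numerical sanity check on the derivation above, the snippet below builds a small ReLU MLP, compares its input Jacobian against the alternating product of weight matrices and ReLU masks (cf. Eq. 18), and confirms that the second derivative with respect to the input vanishes. The layer sizes, indexing convention, and finite-difference step are arbitrary choices for the check, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Two-hidden-layer ReLU MLP y = W3 relu(W2 relu(W1 x)) (no biases, as in the derivation).
W1 = rng.standard_normal((16, 3))
W2 = rng.standard_normal((16, 16))
W3 = rng.standard_normal((1, 16))

def forward(x):
    return W3 @ relu(W2 @ relu(W1 @ x))

x = rng.standard_normal(3)
a1 = relu(W1 @ x)
D1 = np.diag((W1 @ x > 0).astype(float))     # ReLU masks (the G matrices)
D2 = np.diag((W2 @ a1 > 0).astype(float))

# Closed-form Jacobian: alternating weight matrices and masks.
J_closed = W3 @ D2 @ W2 @ D1 @ W1

# Finite-difference Jacobian for comparison.
eps = 1e-4
J_fd = np.stack([(forward(x + eps * e) - forward(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)], axis=-1)
assert np.allclose(J_closed, J_fd, atol=1e-4)

# Directional second difference: ~0 because the MLP is piecewise linear (Eq. 29).
v = rng.standard_normal(3)
second = (forward(x + eps * v) - 2 * forward(x) + forward(x - eps * v)) / eps ** 2
print(second)   # approximately 0, up to floating-point error
```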
C. Datasets

Dataset for Static Scene Reconstruction. For static scene reconstruction, we use 15 scenes from the DTU dataset [19], the same as those used in NeuS [60]. These scenes cover a wide variety of materials, appearance, and geometry, and include challenging cases for reconstruction algorithms, such as non-Lambertian surfaces and fine structures. Each scene contains 49 or 64 images with an image resolution of 1600 × 1200. We split each scene into training and testing parts following NeuS [60]. Specifically, we set the images indexed 8, 13, 16, 21, 26, 31, 34, and 56 (if available) aside for testing and use the other images for training. We train and test each scene with the foreground masks provided by IDR [68].

Dataset for Dynamic Synthetic Scene Reconstruction. We use three synthetic scenes with various types of deformations and motions to evaluate our method. The Lego scene is shared by NeRF [35] in the form of a Blender [1] file. We transform the Lego object into different poses and positions and render images at a resolution of 400 × 400 in Blender. The Lego scene contains 150 frames with 40 training camera views and 40 test camera views. The human scene is provided by RenderPeople [2]. We render images at a resolution of 512 × 512 following [47], and the whole sequence contains 100 frames with 48 camera views for training and 12 camera views for testing. The Lion scene is shared by Artemis [34]; it has 177 frames with 30 camera views for training and 6 camera views for testing.

Dataset for Dynamic Real Scene Reconstruction. We use three sequences from the Dynacap dataset [16], denoted as D1, D2, and D5, for real-scene reconstruction. These sequences are captured under a dense camera setup at a resolution of 1285 × 940. We crop the images with a 2D bounding box, which is estimated from the foreground masks, to obtain the target images at a resolution of 512 × 512. For each sequence, we choose 500 frames containing large movements for evaluation to show our advantages (D1: 6,095 to 6,595; D2: 3,450 to 3,950; D5: 17,760 to 18,260). The D1 sequence has 50 camera views, from which we pick 5 camera views (7, 17, 27, 37, 47) for testing and use the rest of the views for training. The D2 sequence has 101 camera views, from which we pick 10 camera views (7, 17, 27, 37, 47, 57, 67, 77, 87, 97) for testing and use the rest of the views for training. The D5 sequence has 94 camera views, from which we pick 9 camera views (7, 17, 27, 37, 47, 57, 67, 77, 87) for testing and use the rest of the views for training. We train and test each scene with the provided foreground masks.

D. Additional Results

Video Results. We provide a supplementary video to better demonstrate the qualitative results of our method. We highly encourage the readers to check our video.

Static Scene Reconstruction. In Tab. 5, we provide the per-scene breakdown of the quantitative comparisons on the DTU dataset presented in the main paper (Tab. 1). We also present additional qualitative comparisons on the DTU dataset in Fig. 8.

E. Implementation Details

E.1. Baselines

COLMAP [50, 52]. We directly refer to COLMAP's results on the DTU dataset reported in NeuS [60].

NeuS [60]. The Chamfer Distance scores of NeuS shown in the paper are taken directly from the results reported in the original paper. The geometry reconstruction results are produced using the officially released pre-trained models (https://github.com/Totoro97/NeuS) with mask supervision. The PSNR scores and novel view synthesis results are obtained by training the officially released code on the DTU training dataset with mask supervision and testing it on the DTU testing dataset.

Instant-NGP [36]. We use the officially released code (https://github.com/NVlabs/instant-ngp) to train the model on the DTU dataset for 50k iterations. The training takes about 5 minutes.

D-NeRF [45]. We use the officially released code to train the model for 800k iterations. The training time of D-NeRF on real scenes is longer than on synthetic scenes, taking about 50 and 20 hours, respectively. This is because the real sequences are longer than the synthetic sequences (around 500 frames and 150 frames, respectively). Moreover, the number of camera views for real scenes is greater than that for synthetic scenes. For long sequences with dense camera views, the model cannot load all the images at once due to the GPU memory limitation, so extra time is needed to load the images during training.

E.2. Network Architecture

As shown in Fig. 9, the network architecture of NeuS2 consists of the following components: (a) a multi-resolution hash grid with 14 levels of different resolutions ranging from 16 to 2048; (b) an SDF network modeled by a 1-layer MLP with 64 hidden units; (c) an RGB network modeled by a 2-layer MLP with 64 hidden units.
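The architecture of Sec. E.2 can be summarized by the shape-level sketch below. The hash encoding is abstracted by a placeholder linear layer, and the exact input splits (28-dimensional encoding, 16-dimensional spherical-harmonics view encoding, 15-dimensional geometry feature) are assumptions based on Fig. 9 and Eqs. 1–3, so this illustrates the data flow rather than the released implementation.

```python
import torch

class NeuS2Sketch(torch.nn.Module):
    """Shape-level sketch of Sec. E.2: hash grid (14 levels, 16 -> 2048),
    a 1-layer SDF MLP and a 2-layer RGB MLP, both with 64 hidden units."""
    def __init__(self, enc_dim=28, sh_dim=16, geo_dim=15):
        super().__init__()
        # Placeholder for the multi-resolution hash encoding h_Omega(x).
        self.hash_encoding = torch.nn.Linear(3, enc_dim)
        # SDF network f_Theta: (x, h(x)) -> (d, g), Eq. 1.
        self.sdf_net = torch.nn.Sequential(
            torch.nn.Linear(3 + enc_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 1 + geo_dim))
        # RGB network c_Upsilon: (x, n, SH(v), d, g) -> c, Eq. 3.
        self.rgb_net = torch.nn.Sequential(
            torch.nn.Linear(3 + 3 + sh_dim + 1 + geo_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 3))

    def forward(self, x, view_sh):
        e = torch.cat([x, self.hash_encoding(x)], dim=-1)
        out = self.sdf_net(e)
        d, g = out[..., :1], out[..., 1:]
        n = torch.autograd.grad(d.sum(), x, create_graph=True)[0]  # normal, Eq. 2
        c = self.rgb_net(torch.cat([x, n, view_sh, d, g], dim=-1))
        return d, c

x = torch.rand(8, 3, requires_grad=True)
d, c = NeuS2Sketch()(x, torch.rand(8, 16))
print(d.shape, c.shape)   # torch.Size([8, 1]) torch.Size([8, 3])
```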
Table 5. The quantitative comparison on the DTU dataset. Our method outperforms the other baselines for geometry reconstruction in terms of the Chamfer Distance (CD) and is on par with Instant-NGP for novel view synthesis in terms of PSNR.

Figure 8. Qualitative comparisons on the DTU dataset for static scene geometry reconstruction and novel view synthesis. Our method demonstrates high rendering quality, superior to NeuS and comparable to Instant-NGP in terms of complex texture reconstruction. In addition, it outperforms all the baselines regarding 3D geometry reconstruction, with fine details and without inducing noise.

[Figure 9: network architecture of NeuS2 — the position (with a global transformation applied) and its hash encoding (28) are fed to the SDF network, which outputs the SDF value and a geometry feature (15); the normal ∇x d, the ray direction encoded with spherical harmonics (16), the position, the SDF value, and the geometry feature are fed to the RGB network, which outputs the RGB color.]

E.3. Training Details

Unbiased Volume Rendering. To render an image, we apply the unbiased volume rendering of NeuS [60]. That is, we first transform the signed distance field into a volume density field ϕ_s(f(x)), where ϕ_s(x) = s e^{-sx} / (1 + e^{-sx})^2 is the logistic density distribution, which is the derivative of the Sigmoid function Φ_s(x) = 1 / (1 + e^{-sx}), and s is a learnable parameter. Next, we construct an unbiased weight function in the volume rendering equation. Specifically, for each pixel of an image, we sample n points {p(t_i) = o + t_i v | i = 0, 1, ..., n − 1} along its camera ray, where o is the center of the camera and v is the view direction. By accumulating the SDF-based densities and colors of the sample points, we can compute the color Ĉ of the ray with the same approximation scheme as used in NeRF [35]:

\label{eqn:neus_volume_rendering} \hat{C}(\mathbf{o}, \mathbf{v}) = \sum_{i=0}^{n-1} T(t_i)\,\alpha(t_i)\,c(\mathbf{p}(t_i), \mathbf{v}), (30)

\label{eqn:neus_rendering_alpha} \alpha(t_i) = \max\Big(\frac{\Phi_s(f(p(t_{i}))) - \Phi_s(f(p(t_{i+1})))}{\Phi_s(f(p(t_{i})))}, 0\Big), (31)

where T(t_i) = ∏_{j=0}^{i−1} (1 − α(t_j)) is the accumulated transmittance.

When processing a new frame of a dynamic sequence, we optimize the global transformation independently for the first 100 iterations, and we fine-tune the network parameters and the global transformation together for the remaining iterations.
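A minimal sketch of the rendering weights of Eqs. 30 and 31 for a single ray, given the SDF values at the sampled points, is shown below. The SDF samples, the color values, and the choice of s are made up for illustration; the accumulated transmittance uses the T(t_i) defined above.

```python
import numpy as np

def render_ray(sdf_vals, colors, s=64.0):
    """sdf_vals: (n,) SDF values f(p(t_i)) along one ray; colors: (n, 3).
    Returns the accumulated ray color C_hat of Eq. 30 with alpha from Eq. 31."""
    phi = 1.0 / (1.0 + np.exp(-s * sdf_vals))            # Sigmoid Phi_s(f(p(t_i)))
    alpha = np.zeros_like(sdf_vals)
    alpha[:-1] = np.maximum((phi[:-1] - phi[1:]) / np.clip(phi[:-1], 1e-6, None), 0.0)
    # Accumulated transmittance T(t_i) = prod_{j<i} (1 - alpha(t_j)).
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = T * alpha                                   # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)

# Toy example: a ray crossing the surface (SDF changes sign) with constant color.
sdf = np.linspace(0.3, -0.3, 32)
rgb = np.tile([0.8, 0.5, 0.2], (32, 1))
print(render_ray(sdf, rgb))   # close to the surface color: the weights peak at the zero crossing
```

The max(·, 0) in Eq. 31 keeps the weights non-negative when the SDF is locally non-monotone along the ray, which is what makes the weight function unbiased around the surface.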