…density fields contains discernible noise as the 3D representation lacks surface constraints.

To overcome these drawbacks, we propose NeuS2, a new method for fast training of highly-detailed neural implicit surfaces (see Fig. 1). NeuS2 can reconstruct a static object in minutes, and a moving object sequence in up to 20 seconds per frame (excluding the first frame), on a single GPU. To achieve fast reconstruction, we leverage multi-resolution hash tables of learnable feature vectors [36] to enhance a neural network-encoded SDF and implement the whole system in CUDA. Notably, in this design, a surface constraint and the rendering formulation require calculating second-order derivatives. The main challenge is to have a simple and memory-efficient calculation to achieve the highest possible GPU computing performance. Therefore, we derive a simple formula for the second-order derivatives tailored to ReLU-based MLPs, which enables an efficient CUDA implementation with a small memory footprint at a significantly reduced computational cost. To further enforce and accelerate training convergence, we introduce an efficient progressive training strategy for learning multi-resolution hash encodings [36], which updates the hash table features in a coarse-to-fine manner.

We further extend our method to multi-view dynamic scene reconstruction. Instead of training each frame in the sequence separately, we propose a new incremental learning strategy to efficiently learn a neural dynamic representation of objects with large movements and deformations. Specifically, we exploit the similarity of the shape and appearance information shared by two consecutive frames by first training the first frame and sequentially fine-tuning the subsequent frames. While this strategy generally works well, we observed that when the movement between two consecutive frames is relatively large, the predicted SDF of the occluded regions that are not observed in most images may get stuck in the learned SDF of the previous frame. To address this, we predict a global transformation to roughly align the two frames before learning the representation of the new frame.

In summary, our technical contributions are:

• We propose a new method, NeuS2, for fast learning of neural surface representations from multi-view RGB input for both static and dynamic scenes, which achieves a significant speed-up over the state of the art while achieving unprecedented reconstruction quality.

• A simple formulation of the second-order derivatives tailored to ReLU-based MLPs is presented to enable efficient parallelization of GPU computation.

• A progressive training strategy for learning multi-resolution hash encodings from coarse to fine is proposed to enforce better and faster training convergence.

• We design an incremental learning method with a novel global transformation prediction component for reconstructing dynamic scenes with large movements in an efficient and stable manner.

2. Related Work

Multi-view Stereo. Traditional multi-view 3D reconstruction methods can be categorized into depth-based and voxel-based methods. Depth-based methods [5, 13, 14, 51] reconstruct a point cloud by identifying point correspondences across images, but the reconstruction quality is heavily affected by the accuracy of correspondence matching. Voxel-based methods [6, 11, 53] side-step the difficulties of explicit correspondence matching by recovering occupancy and color in a voxel grid from multi-view images with a photometric consistency criterion. However, the reconstruction of these methods is limited to low resolution due to the high memory consumption of voxel grids.

Classical Multi-view 4D Reconstruction. A large body of works [3, 7, 10, 17, 59, 65] in multi-view 4D reconstruction utilizes a precomputed deformable model and deforms the model to fit the multi-view images. Different from these works, our method does not need a precomputed model, can reconstruct detailed results, and can handle topology changes. The most relevant work to ours is [9], which is also a model-free method. [9] first uses RGB and depth inputs to reconstruct high-quality point clouds for each frame, and then tracks over sequences of frames to produce temporally coherent sequences. While [9] can produce impressive results, it requires both RGB and depth as input, and the whole pipeline consists of many sophisticated and time-consuming procedures. In this paper, we only focus on how to efficiently obtain a high-quality reconstruction per frame and do not consider temporal mesh tracking, since it is out of the scope of this paper. Different from [9], we only need RGB as input and can learn high-quality geometry and appearance for each frame in an end-to-end manner in 20 seconds per frame.

Neural Implicit Representations. Neural implicit representations have made remarkable achievements in novel view synthesis [22, 30, 33, 35, 54, 55] and 3D/4D reconstruction [21, 23, 32, 37, 39, 47, 48, 60, 66, 67]. NeRF [35] has shown high-quality results in the novel view synthesis task, but it cannot extract high-quality surfaces since its geometry representation lacks surface constraints. NeuS [60] represents the 3D surface as an SDF for high-quality geometry reconstruction. However, the training of NeuS is very slow, and it only works for static scene reconstruction. In contrast, our method significantly accelerates the training of NeuS by 100 times, and the training can be further accelerated to 20 seconds per frame when applied to dynamic scene reconstruction.
There are other follow-up works of NeRF [35]. Some works [36, 49, 57] introduce voxel-grid features to represent 3D properties for fast training. However, these methods cannot extract high-quality surfaces as they inherit the volume density field as the geometry representation from NeRF [35]. In contrast, our method achieves high-quality surface reconstruction as well as fast training. For dynamic scene modeling, many works [27, 28, 40, 41, 45, 58, 63] propose to disentangle a 4D scene into a shared canonical space and a deformable field per frame. While these works make the training for dynamic scenes more efficient, the training is still time-consuming. Furthermore, these methods are not able to handle large movements and only reconstruct medium-quality surfaces. Some works in human performance modeling [8, 20, 24, 31, 38, 43, 44, 56, 61, 64] can model large movements by introducing a deformable template as a prior. In contrast, our method can handle large movements, does not require a deformable template, and thus is not restricted to a specific dynamic object. Moreover, we can learn high-quality surfaces of dynamic scenes in 20 seconds per frame.

Concurrent Work. [12] represents a 4D scene with a time-aware voxel feature. [29] proposes a static-to-dynamic learning paradigm for fast dynamic scene learning. [26] presents a grid-based method for efficiently reconstructing radiance fields frame by frame. These three methods focus on novel view synthesis and, thus, are not designed to reconstruct high-quality surfaces, which is different from our goal of achieving high-quality surface geometry and appearance models. Voxurf [62] proposes a voxel-based surface representation for fast multi-view 3D reconstruction. While it enables a 20x speedup over the baselines (i.e., NeuS [60]), our proposed method is over 3x faster than Voxurf and achieves better geometry quality when compared with Voxurf's results reported in their paper. Also, Voxurf is not designed for dynamic scene reconstruction. [69] proposes a method for human modeling and rendering. It first reconstructs a neural surface representation for each frame; then it applies non-rigid deformation to obtain a temporally coherent mesh sequence. Our work focuses on the first part, that is, fast reconstruction of dynamic scenes, where we exploit the temporal consistency between two consecutive frames to accelerate the learning of the dynamic representation. Therefore, our work is orthogonal to [69] and can be integrated into [69] as the first step. Last, Anonymous et al. [25] propose a monocular dynamic surface reconstruction approach by extending the NeuS formulation for bending rays. In stark contrast to our approach, their focus lies on proving that unbiasedness also holds in the case of ray bending and the challenging monocular setting, and less on the highest possible quality at the fastest speeds.

3. Background

NeuS. Given calibrated multi-view images of a static scene, NeuS [60] implicitly represents the surface and appearance of the scene as a signed distance field f(x) : R^3 → R and a radiance field c(x, v) : R^3 × S^2 → R^3, where x denotes a 3D position and v ∈ S^2 is a viewing direction. The surface S of the object can be obtained by extracting the zero-level set of the SDF, S = {x ∈ R^3 | f(x) = 0}. To render an object into an image, NeuS leverages volume rendering. Specifically, for each pixel of an image, we sample n points {p(t_i) = o + t_i v | i = 0, 1, ..., n − 1} along its camera ray, where o is the center of the camera and v is the view direction. By accumulating the SDF-based densities and colors of the sample points, we can compute the color Ĉ of the ray. As the rendering process is differentiable, NeuS can learn the signed distance field f and the radiance field c from the multi-view images. However, the training process is very slow, taking about 8 hours on a single GPU.

Multi-resolution Hash Encoding. To overcome the slow training time of deep coordinate-based MLPs, which is also a main reason for the slow performance of NeuS, Instant-NGP [36] recently proposed a multi-resolution hash encoding and has proven its effectiveness. Specifically, Instant-NGP assumes that the object to be reconstructed is bounded in multi-resolution voxel grids. The voxel grids at each resolution are mapped to a hash table with a fixed-size array of learnable feature vectors. For a 3D position x ∈ R^3, it obtains a hash encoding at each level, h_i(x) ∈ R^d (d is the dimension of a feature vector, i = 1, ..., L), by interpolating the feature vectors assigned to the surrounding voxel grid vertices at this level. The hash encodings at all L levels are then concatenated into the multi-resolution hash encoding h(x) = {h_i(x)}_{i=1}^{L} ∈ R^{L×d}. While the runtime is significantly improved, Instant-NGP still does not reach the quality of NeuS in terms of geometry reconstruction accuracy.

4. Static Neural Surface Reconstruction

We first present how our formulation can effectively learn the signed distance field of a static scene from calibrated multi-view images (see Fig. 2 a). To accelerate the training process, we first demonstrate how to incorporate multi-resolution hash encodings [36] for representing the SDF of the scene, and how volume rendering can be applied to render the scene into an image (Sec. 4.1). Next, we derive a simplified expression of the second-order derivatives tailored to ReLU-based MLPs, which can be efficiently parallelized in custom CUDA kernels (Sec. 4.2). Finally, we adopt a progressive training strategy for learning multi-resolution hash encodings, which leads to faster training convergence and better reconstruction quality (Sec. 4.3).
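To make the multi-resolution hash encoding from Sec. 3 concrete, below is a minimal NumPy sketch of the per-level lookup: each level hashes the surrounding voxel corners into a fixed-size feature table and trilinearly interpolates the stored features. The hash function and growth schedule follow the general Instant-NGP [36] recipe, but the table size, feature dimension, and all names here (`hash_encode`, `PRIMES`, etc.) are illustrative assumptions and not our CUDA implementation; the 14 levels and the 16-to-2048 resolution range mirror the configuration reported in Sec. E.2.

```python
import numpy as np

# Illustrative hyper-parameters (assumed, not the paper's exact values except L and the resolutions).
L_LEVELS, FEAT_DIM, TABLE_SIZE = 14, 2, 2 ** 19
BASE_RES, MAX_RES = 16, 2048
GROWTH = np.exp(np.log(MAX_RES / BASE_RES) / (L_LEVELS - 1))
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

# One learnable feature table per level (Omega in the paper), initialized small.
tables = [np.random.randn(TABLE_SIZE, FEAT_DIM).astype(np.float32) * 1e-4
          for _ in range(L_LEVELS)]

def spatial_hash(corner):
    """XOR-based spatial hash of an integer grid corner -> table index."""
    h = corner.astype(np.uint64) * PRIMES
    return (h[0] ^ h[1] ^ h[2]) % TABLE_SIZE

def hash_encode(x):
    """x: (3,) position in [0, 1]^3 -> concatenated encoding h(x) of size L * d."""
    feats = []
    for lvl in range(L_LEVELS):
        res = int(np.floor(BASE_RES * GROWTH ** lvl))
        p = x * res
        p0 = np.floor(p).astype(np.int64)          # lower grid corner of the cell
        w = p - p0                                  # trilinear interpolation weights
        level_feat = np.zeros(FEAT_DIM, dtype=np.float32)
        for dz in (0, 1):
            for dy in (0, 1):
                for dx in (0, 1):
                    corner = p0 + np.array([dx, dy, dz])
                    weight = ((w[0] if dx else 1 - w[0]) *
                              (w[1] if dy else 1 - w[1]) *
                              (w[2] if dz else 1 - w[2]))
                    level_feat += weight * tables[lvl][spatial_hash(corner)]
        feats.append(level_feat)
    return np.concatenate(feats)                    # shape (L * d,)

print(hash_encode(np.array([0.3, 0.5, 0.7])).shape)  # (28,) for 14 levels x 2 features
```

Because only the 8 corner features per level are touched for a query point, gradients during training are sparse, which is what makes the encoding fast to optimize.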
[Figure 2. (a) NeuS2: static scene reconstruction — position and hash encoding, SDF network (with normal n = ∇SDF and a second-order derivative path), RGB network with the ray direction, volume rendering, rendered image and mesh. (b) Incremental training across Frame 0, Frame t−1, and Frame t with per-frame global transformations.]

4.1. Volume Rendering of a Hash-encoded SDF

For each 3D position x, we map it to its multi-resolution hash encoding h_Ω(x) with learnable hash table entries Ω. As h_Ω(x) is an informative encoding of the spatial position, the MLPs for mapping x to its SDF d and color c can be very shallow, which results in more efficient rendering and training without compromising quality.

SDF Network. In more detail, our SDF network

\label{eqn:sdf_net} (d, \mathbf{g}) = f_{\Theta}(\mathbf{e}), \qquad \mathbf{e} = (\mathbf{x}, h_\Omega(\mathbf{x})) (1)

is a shallow MLP with weights Θ, which takes the 3D position x along with its hash encoding h_Ω(x) as input and outputs the SDF value d and a geometry feature vector g ∈ R^15. Here, we concatenate the point position to the input of the SDF network in order to apply a geometry initialization [4] for more stable SDF learning.

Color Network. The normal of x can be computed as

\label{eqn:normal} \mathbf{n} = \nabla_{\mathbf{x}} d, (2)

where ∇_x d denotes the gradient of the SDF with respect to x. We then feed the normal n, the geometry feature g, the SDF value d, the point x, and the ray direction v to our color network

\label{eqn:color_net} \mathbf{c} = c_{\Upsilon}(\mathbf{x}, \mathbf{n}, \mathbf{v}, d, \mathbf{g}), (3)

which outputs the color c of x.

Volume Rendering. To render an image, we apply the unbiased volume rendering of NeuS [60]. Additionally, we adopt a ray marching acceleration strategy used in Instant-NGP [36]. More details are provided in the supplementary document.

Supervision. To train NeuS2, we minimize the color difference between the rendered pixels Ĉ_i, with i ∈ {1, ..., m}, and the corresponding ground-truth pixels C_i, without any 3D supervision. Here, m denotes the batch size during training. We also employ an Eikonal term [15] to regularize the learned signed distance field. Our final loss is defined as

\label{eqn:loss} \mathcal{L} = \mathcal{L}_{\mathrm{color}} + \beta \mathcal{L}_{\mathrm{eikonal}}, (4)

where L_color = (1/m) Σ_i R(Ĉ_i, C_i) and R is the Huber loss [18]; L_eikonal = (1/(mn)) Σ_{k,i} (‖n_{k,i}‖ − 1)^2, where k indexes the k-th sample along the ray with k ∈ {1, ..., n}, n is the number of sampled points, and n_{k,i} is the normal of a sampled point (see Eq. 2).
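As a reference for Eq. 4, the following PyTorch sketch assembles the color and Eikonal terms from a batch of rendered pixels and sampled-point normals. The tensor shapes, the β value, and the Huber threshold are illustrative assumptions; only the loss structure is taken from the paper.

```python
import torch

def neus2_loss(pred_rgb, gt_rgb, normals, beta=0.1, huber_delta=0.1):
    """pred_rgb, gt_rgb: (m, 3) rendered and ground-truth pixel colors.
    normals: (m, n, 3) SDF gradients n_{k,i} at the sampled points.
    Returns L = L_color + beta * L_eikonal (Eq. 4)."""
    # Huber (smooth-L1) color loss R(C_hat, C) [18].
    l_color = torch.nn.functional.smooth_l1_loss(pred_rgb, gt_rgb, beta=huber_delta)
    # Eikonal regularizer: unit-norm constraint on the SDF gradients.
    l_eikonal = ((normals.norm(dim=-1) - 1.0) ** 2).mean()
    return l_color + beta * l_eikonal
```

In the actual system this loss, and in particular its second-order backward pass through the normals, is evaluated inside custom CUDA kernels rather than through PyTorch autograd, as discussed next.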
4.2. Efficient Handling of Second-order Derivatives

To avoid computational overhead during training, we implement our whole system in CUDA.

Second-order Derivatives. In contrast to Instant-NGP [36], which only requires first-order derivatives during the optimization, we must calculate the second-order derivatives for the parameters associated with the normal term n = ∇_x d (Eq. 2), which is used as an input to the color network c_Υ (Eq. 3) and in the Eikonal loss term L_eikonal. To accelerate this computation, we directly calculate them using simplified formulas instead of applying the computational graph of PyTorch [42]. Specifically, we calculate the second-order derivatives of the hash table parameters Ω and the SDF network parameters Θ using the chain rule as

\label{eqn:dloss_denc} \frac{\partial \mathcal{L}}{\partial \Omega} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \Omega} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} \Big) (5)

\label{eqn:dloss_dsdf} \frac{\partial \mathcal{L}}{\partial \Theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \Theta} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Theta} \Big) (6)

In the following, H_l denotes the weight matrix of the l-th layer of a ReLU-based MLP, H_1^{(\_,j)} is the j-th column of H_1, H_L^{(i,\_)} is the i-th row of H_L, and

G_l = \begin{cases} 1, & H_{l-1} \cdots g(H_1 \mathbf{x}) > 0 \\ 0, & \text{otherwise}. \end{cases}

Now the second-order derivatives of a ReLU-based MLP with respect to its input and intermediate layers can be derived.

Theorem 1 (Second-order derivative of a ReLU-based MLP). Given a ReLU-based MLP f with L hidden layers as in Definition 1, the second-order derivative of the MLP f is

\label{eqn:mlp_2ord_derivative_formula} \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial H_l} = (P_l^j S_l^i)^\top, \qquad \frac{\partial^2 y}{\partial \mathbf{x}^2} = 0, (8)

where (∂y/∂x)_{(i,j)} is the matrix element (i, j) of ∂y/∂x, and S_l^i and P_l^j are defined in Definition 1.

Coming back to our original second-order derivatives (Eqs. 5 and 6), by Theorem 1 we obtain ∂(∂d/∂e)/∂e = 0. Since ∂e/∂x is irrelevant to Θ, we have ∂(∂e/∂x)/∂Θ = 0. This results in the following simplified form for the second-order derivatives:

\label{eqn:simplified_dloss_denc} \frac{\partial \mathcal{L}}{\partial \Omega} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} (9)

\label{eqn:simplified_dloss_dmlp} \frac{\partial \mathcal{L}}{\partial \Theta} = \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \Theta} (10)

To further simplify ∂L/∂Ω and ∂L/∂Θ, we can substitute the terms ∂(∂e/∂x)/∂Ω and ∂(∂d/∂e)/∂Θ using Eq. 8, respectively. We show the detailed derivations in the supplementary material.

4.3. Progressive Training

We modulate the hash encoding at each level i by a weight

w_i(\lambda) = \begin{cases} 0, & \text{if } i > \lambda \\ 1, & \text{otherwise}, \end{cases} (12)

where the parameter λ modulates the bandwidth of the low-pass filter applied to the multi-resolution hash encodings and gradually increases during the training process.
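A minimal sketch of how the coarse-to-fine schedule of Eq. 12 can be applied to the concatenated hash encoding is shown below. The schedule `lam_at_step` and the way the per-level mask is broadcast over each level's feature dimension are illustrative assumptions, not the exact implementation; the sizes reuse the 14-level configuration of Sec. E.2.

```python
import numpy as np

L_LEVELS, FEAT_DIM = 14, 2   # illustrative sizes (cf. Sec. E.2 / Instant-NGP defaults)

def level_mask(lam):
    """w_i(lambda) from Eq. 12: keep levels i <= lambda, zero out finer ones."""
    i = np.arange(1, L_LEVELS + 1)
    return (i <= lam).astype(np.float32)

def masked_encoding(h, lam):
    """h: (L * d,) multi-resolution hash encoding; returns the low-pass-filtered encoding."""
    w = np.repeat(level_mask(lam), FEAT_DIM)   # broadcast each level's weight over its d features
    return h * w

def lam_at_step(step, total_steps, start=4):
    """lambda grows from coarse to fine over training, e.g. linearly with the step count."""
    return min(L_LEVELS, start + (L_LEVELS - start) * step / total_steps)
```

Masking the fine levels early keeps the optimization focused on the low-frequency geometry first, which is what improves convergence and final quality in Sec. 6.3.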
5. Dynamic Neural Surface Reconstruction

We have explained how NeuS2 can produce highly accurate and fast reconstructions of static scenes. Next, we extend NeuS2 to dynamic scene reconstruction. That is, given multi-view videos of a moving object and the camera parameters of each view, our goal is to learn the neural implicit surfaces of the object in each video frame (see Fig. 2 b).

5.1. Incremental Training

Even though our reconstruction method for static objects achieves promising efficiency and quality, constructing dynamic scenes by training every single frame independently is still time-consuming. However, scene changes from one frame to the next are typically small. Thus, we propose an incremental training strategy to exploit the similarity in geometry and appearance information shared between two consecutive frames, which enables faster convergence of our model. In detail, we train the first frame from scratch as presented in our static scene reconstruction, and then fine-tune the model parameters for the subsequent frames based on the learned hash grid representation of the preceding frame. Using this strategy, the model is able to produce a good initialization of the neural representation of the target frame and, thus, significantly accelerates its convergence.

5.2. Global Transformation Prediction

As we observed during the incremental training process, the predicted SDF may get stuck in the local minima of the learned SDF of the previous frame, especially when the object's movement between adjacent frames is relatively large. For instance, when our model reconstructs a walking sequence from multi-view images, the reconstructed surface appears to have many holes, as shown in Fig. 7. To address this issue, we propose a global transformation prediction to roughly transform the target SDF into a canonical space before the incremental training. Specifically, we predict the rotation R and translation T of the object between two adjacent frames. For any given 3D position x_i in the coordinate space of frame i, it is transformed back to the coordinate space of the previous frame i − 1, denoted as x_{i−1}:

\label{eqn:globalmove_predict} \mathbf{x}_{i-1} = R_i(\mathbf{x}_{i} + T_i). (13)

The transformations can then be accumulated to transform the point x_i back to x_c in the first frame's coordinate space:

\mathbf{x}_c = R_{i-1}^c(\mathbf{x}_{i-1} + T_{i-1}^c) = R_{i}^c(\mathbf{x}_{i} + T_{i}^c), (14)

where R_i^c = R_{i-1}^c R_i and T_i^c = T_i + R_i^{-1} T_{i-1}^c.

The global transformation prediction also allows us to update only the small portion of the scene in the current frame that differs from the previous frame, rather than updating the entire scene in the current frame. In this way, we can obtain a more accurate reconstruction and reduce memory costs.

Notably, our method can handle large movements and deformations, which are challenging for existing dynamic scene reconstruction approaches [58], [45], thanks to the following designs of our approach: (1) global transformation prediction, which accounts for large global movements in the sequence; (2) incremental training, which learns relatively small deformable movements between two adjacent frames rather than relatively large movements from each frame to a common canonical space.

We implement the incremental training strategy combined with global transformation prediction in an end-to-end learning scheme, as illustrated in Fig. 2(b). Specifically, when processing a new frame, we first predict the global transformation, and then fine-tune the model's parameters and the global transformation together to efficiently learn the neural representation.
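To illustrate Eqs. 13 and 14, the sketch below folds the per-frame global transformation into the cumulative transformation that maps a point in frame i directly back to the first frame's coordinate space, and checks numerically that the accumulation matches chaining Eq. 13 frame by frame. The random rotations and the three-frame example are made up purely for the check.

```python
import numpy as np

def to_previous(x_i, R_i, T_i):
    """Eq. 13: map a point in frame i back to frame i-1's coordinate space."""
    return R_i @ (x_i + T_i)

def accumulate(Rc_prev, Tc_prev, R_i, T_i):
    """Eq. 14: fold (R_i, T_i) into the cumulative (R_i^c, T_i^c) mapping frame i
    to the canonical (first-frame) space: R_i^c = R_{i-1}^c R_i, T_i^c = T_i + R_i^{-1} T_{i-1}^c."""
    return Rc_prev @ R_i, T_i + np.linalg.inv(R_i) @ Tc_prev

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.linalg.det(q))            # proper rotation, det = +1

rng = np.random.default_rng(0)
transforms = [(random_rotation(rng), rng.standard_normal(3)) for _ in range(3)]  # frames 1..3

Rc, Tc = np.eye(3), np.zeros(3)                     # frame 0 is the canonical space
for R_i, T_i in transforms:                         # accumulate in frame order (Eq. 14)
    Rc, Tc = accumulate(Rc, Tc, R_i, T_i)

x3 = rng.standard_normal(3)                         # a point expressed in frame 3
x0 = x3
for R_i, T_i in reversed(transforms):               # hop back one frame at a time (Eq. 13)
    x0 = to_previous(x0, R_i, T_i)

assert np.allclose(x0, Rc @ (x3 + Tc))              # x_c = R_i^c (x_i + T_i^c)
```

In training, the predicted (R_i, T_i) are optimized jointly with the network parameters after the first 100 iterations, as detailed in Sec. E.3 of the supplementary material.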
[Figure 3: qualitative comparisons on DTU scan 24, scan 63, and scan 118 (see Sec. 6.1).]

[Figure 4 rows: Lego, Lion, Human.] Figure 4. Qualitative comparisons of synthetic scenes for geometry reconstruction and novel view synthesis. Compared to D-NeRF, which is trained for 20 hours for each scene, our method produces photo-realistic rendering results and accurate geometry reconstructions with only 20 seconds of training time per frame.

[Figure 5 rows: D1, D2, D5.] Figure 5. Qualitative comparisons of real scenes for geometry reconstruction and novel view synthesis. Our method outperforms D-NeRF for both tasks, demonstrating sharp and accurate reconstruction quality, while D-NeRF fails to handle complex transformations in real scenes.

[Figure 6: plot of computing time (ms) for the second-order derivative backward pass (see Sec. 6.3).]
6. Experiments

All experiments are conducted on a single GeForce RTX 3090 GPU. Implementation details, additional results, and video results are provided in the supplementary material.

6.1. Static Scene Reconstruction

For static scene reconstruction, we use 15 scenes of the DTU dataset [19] for evaluation. There are 49 or 64 images with a resolution of 1600 × 1200, and we test each scene with the foreground masks provided by IDR [68]. Tab. 1 summarizes the quantitative comparison. Our method achieves comparable performance on the novel view synthesis task to the baseline methods.

          COLMAP   NeuS    Instant-NGP   Ours
CD ↓      1.36     0.77    1.84          0.70
PSNR ↑    -        28.00   28.86         28.82
Runtime   1 h      8 h     5 min         5 min

Table 1. Quantitative comparison on the DTU dataset. Our method outperforms the other baselines for geometry reconstruction in terms of the Chamfer Distance (CD) and is on par with Instant-NGP for novel view synthesis in terms of PSNR.

A qualitative comparison of geometry reconstruction and novel view synthesis results for all methods is presented in Fig. 3. For the 3D geometry reconstruction results, NeuS [60] exhibits limited performance in terms of reconstructed details, with excessively smooth surfaces. The extracted meshes of Instant-NGP [36] are noisy since its geometry representation, the volume density field, lacks surface constraints. Regarding the novel view synthesis results, our method outperforms NeuS [60], presenting detailed rendering results on par with Instant-NGP. To conclude, our method achieves high-quality geometry and appearance reconstruction without inducing noise; e.g., it can recover the complex structures of the windows and render detailed textures in scan 24 of the DTU dataset.

6.2. Dynamic Scene Reconstruction

We conduct experiments on both synthetic and real dynamic scenes for the tasks of novel view synthesis and geometry reconstruction. We compare our method with the state-of-the-art neural-based method for dynamic scenes, D-NeRF [46], quantitatively and qualitatively. D-NeRF [46] models general scenes for dynamic novel view synthesis by combining NeRF [35] with deformation fields, which are learned from all the frames simultaneously.

Synthetic Scenes. We choose three different types of datasets for synthetic scene reconstruction: a Lego scene shared by NeRF [35] (150 frames), a Lion sequence provided by Artemis [34] (177 frames), and a human character from the RenderPeople [2] dataset (100 frames). For the quantitative evaluation, the Chamfer Distances and PSNR scores are calculated and averaged over all frames and over all testing views of all frames, respectively. As shown in Tab. 2, our method shows significantly improved novel view synthesis and geometry reconstruction results compared to D-NeRF. Notably, our method takes less than 1 hour to complete the learning of a sequence, with 40 seconds (80 seconds for the Lego sequence) of training time for the first frame and 20 seconds for each subsequent frame, while the training time of D-NeRF for each scene is about 20 hours. The qualitative results are provided in Fig. 4, showing that our method outperforms D-NeRF in terms of novel view synthesis and geometry reconstruction.

          D-NeRF            Ours
Dataset   PSNR ↑   CD ↓     PSNR ↑   CD ↓
Lego      24.25    59.0     29.5     17.1
Lion      31.45    -        33.60    -
Human     29.33    5.73     33.20    1.86
Runtime   20 h              20 s per frame

Table 2. Quantitative comparisons on synthetic scenes. The Chamfer Distance of the Lion sequence is omitted since the ground-truth geometry is not provided. Compared to D-NeRF, our method achieves much better appearance and geometry reconstruction results while requiring significantly less training time.

Real Scenes. To further evaluate the effectiveness of our method on real scenes with large and non-rigid movements, we select three sequences from the Dynacap [16] dataset. Each sequence contains 500 frames with about 50 to 100 camera views for training and about 5 to 10 camera views for testing. More details are provided in the supplementary material. Tab. 3 summarizes the quantitative comparisons between our method and D-NeRF [45]. Since the real scene datasets do not have ground-truth geometry, we can only calculate the PSNR and LPIPS scores of the novel view synthesis results to evaluate the rendering quality. For long sequences with 500 frames consisting of challenging movements, D-NeRF struggles to reconstruct the dynamic real scenes. Even when training D-NeRF for 50 hours, our method achieves significantly better scores, taking only 20 seconds per frame. Also, according to the qualitative evaluation results shown in Fig. 5, D-NeRF shows blurred rendering results and inaccurate geometry reconstruction. In contrast, our approach produces photo-realistic renderings and detailed geometry.

          D-NeRF              Ours
Dataset   PSNR ↑   LPIPS ↓   PSNR ↑   LPIPS ↓
D1        17.47    0.154     27.76    0.037
D2        20.80    0.155     27.25    0.042
D5        21.48    0.122     26.41    0.036
mean      19.92    0.144     27.14    0.038
Runtime   50 h               20 s per frame

Table 3. Quantitative comparisons on real scenes. We found that our method outperforms D-NeRF in all metrics.

6.3. Ablation

We first compare our efficient second-order derivative backward computation (Theorem 1) implemented in CUDA with a baseline where we automatically compute the second-order derivative in PyTorch [42] using its computational graph. As shown in Fig. 6, our method achieves a faster speed for second-order derivative backpropagation than the PyTorch implementation.

[Figure 7: (a) reference image, (b) w/o GTP, (c) w/o GTP and w/o PT, (d) full model.] Figure 7. Ablation study for Global Transformation Prediction (GTP) and progressive training (PT). Ablated models perform poorly on both novel view synthesis and geometry reconstruction.

Second, we evaluate the performance of the individual components of NeuS2, Global Transformation Prediction (GTP) and the Progressive Training strategy (PT), in Fig. 7 and Tab. 4. The geometry quality and appearance reconstruction quality of the full model are better than those of the other ablated models, both quantitatively and qualitatively. The ablated model shows noisy holes on the surface and blurred renderings since it gets stuck in local minima during the incremental training, which can be alleviated by the Global Transformation Prediction and Progressive Training.
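For context on the first ablation, the PyTorch baseline can be reproduced with autograd's double backward, as sketched below for a toy SDF MLP. The network and loss here are stand-ins (the real comparison uses the full NeuS2 model), but the `create_graph=True` pattern is the standard way to backpropagate a loss on the normal n = ∇_x d to the network parameters.

```python
import torch

# Toy stand-in for the hash encoding + SDF MLP (ReLU based, as in Sec. 4.2).
sdf = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

x = torch.rand(1024, 3, requires_grad=True)
d = sdf(x)

# Normal n = grad_x d, kept in the graph so a loss on n reaches the weights.
(n,) = torch.autograd.grad(d.sum(), x, create_graph=True)

loss = ((n.norm(dim=-1) - 1.0) ** 2).mean()   # Eikonal term; needs d^2 d / (dx dTheta)
loss.backward()                                # second-order (double) backward pass
print(sdf[0].weight.grad.shape)                # gradients w.r.t. the MLP parameters
```

NeuS2 replaces this graph-based double backward with the closed-form expressions of Theorem 1 evaluated in fused CUDA kernels, which is the speed difference Fig. 6 measures.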
7. Conclusion

Limitations. While our method reconstructs each frame of a dynamic scene in high quality, there is no temporal correspondence across the frames. A possible remedy could be to deform a mesh template to fit our learned neural surfaces for each frame, like the mesh tracking used in [9, 69]. Currently, we also need to save the network parameters of each frame (25M). As future work, a compression of the parameters encoding the dynamic scene could be explored.

We proposed a learning-based method for accurate multi-view reconstruction of both static and dynamic scenes at an unprecedented runtime performance. To achieve this, we integrated multi-resolution hash encodings into a neural SDF and introduced a simple calculation of the second-order derivatives tailored to our dedicated network architecture. To enhance the training convergence, we presented a progressive training strategy to learn multi-resolution hash encodings. For dynamic scene reconstruction, we proposed an incremental training strategy with a global transformation prediction component, which leverages the shared geometry and appearance information in two consecutive frames.

Acknowledgement

Christian Theobalt and Marc Habermann were supported by ERC Consolidator Grant 4DReply (770784). Lingjie Liu was supported by a Lise Meitner Postdoctoral Fellowship.

References

[1] Blender. http://aiweb.techfak.uni-bielefeld.de/content/bworld-robot-control-software/.
[2] Renderpeople. https://renderpeople.com/3d-people/, 2018.
[3] Naveed Ahmed, Christian Theobalt, Petar Dobrev, Hans-Peter Seidel, and Sebastian Thrun. Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
[4] Matan Atzmon and Yaron Lipman. Sal: Sign agnostic learning of shapes from raw data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2565–2574, 2020.
[5] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24, 2009.
[6] Adrian Broadhurst, Tom W Drummond, and Roberto Cipolla. A probabilistic framework for space carving. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, volume 1, pages 388–393. IEEE, 2001.
[7] Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel. Free-viewpoint video of human actors. ACM Trans. Graph., 22(3):569–577, July 2003.
[8] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, and Huchuan Lu. Animatable neural radiance fields from monocular rgb video, 2021.
[9] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph., 34(4), July 2015.
[10] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. ACM Trans. Graph., 27(3):1–10, August 2008.
[11] Jeremy S De Bonet and Paul Viola. Poxels: Probabilistic voxelized volume reconstruction. In Proceedings of International Conference on Computer Vision (ICCV), pages 418–425, 1999.
[12] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. arXiv:2205.15285, 2022.
[13] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2009.
[14] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Gipuma: Massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e.V., 25(361-369):2, 2016.
[15] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
[16] Marc Habermann, Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. Real-time deep dynamic characters. ACM Transactions on Graphics (TOG), 40(4):1–16, 2021.
[17] Marc Habermann, Weipeng Xu, Michael Zollhöfer, Gerard Pons-Moll, and Christian Theobalt. Livecap: Real-time human performance capture from monocular video. ACM Trans. Graph., 38(2):14:1–14:17, March 2019.
[18] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[19] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2014.
[20] Zhang Jiakai, Liu Xinhang, Ye Xinyi, Zhao Fuqiang, Zhang Yanshun, Wu Minye, Zhang Yingliang, Xu Lan, and Yu Jingyi. Editable free-viewpoint video using a layered neural representation. In ACM SIGGRAPH, 2021.
[21] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[22] Srinivas Kaza et al. Differentiable volume rendering using signed distance functions. PhD thesis, Massachusetts Institute of Technology, 2019.
[23] Petr Kellnhofer, Lars Jebe, Andrew Jones, Ryan Spicer, Kari Pulli, and Gordon Wetzstein. Neural lumigraph rendering. arXiv preprint arXiv:2103.11571, 2021.
[24] Youngjoong Kwon, Dahun Kim, Duygu Ceylan, and Henry Fuchs. Neural human performer: Learning generalizable radiance fields for human performance rendering. NeurIPS, 2021.
[25] SEE COVER LETTER. See cover letter, SEE COVER LETTER.
[26] Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Ping Tan. Streaming radiance fields for 3d video synthesis. arXiv preprint arXiv:2210.14831, 2022.
[27] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
[28] Z. Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. ArXiv, abs/2011.13084, 2020.
[29] Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Devrf: Fast deformable voxel radiance fields for dynamic scenes. arXiv preprint arXiv:2205.15723, 2022.
[30] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33, 2020.
[31] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. ACM Trans. Graph., 40(6), December 2021.
[32] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2019–2028, 2020.
[33] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):65, 2019.
[34] Haimin Luo, Teng Xu, Yuheng Jiang, Chenglin Zhou, Qiwei Qiu, Yingliang Zhang, Wei Yang, Lan Xu, and Jingyi Yu. Artemis: Articulated neural pets with appearance and motion synthesis. arXiv preprint arXiv:2202.05628, 2022.
[35] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020.
[36] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989, 2022.
[37] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3504–3515, 2020.
[38] Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. Neural articulated radiance field. In International Conference on Computer Vision, 2021.
[39] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In International Conference on Computer Vision (ICCV), 2021.
[40] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In International Conference on Computer Vision (ICCV), 2021.
[41] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), December 2021.
[42] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[43] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Animatable neural radiance fields for human body modeling. ICCV, 2021.
[44] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. CVPR, 1(1):9054–9063, 2021.
[45] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[46] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
[47] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.
[48] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 84–93, 2020.
[49] Sara Fridovich-Keil and Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
[50] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[51] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518. Springer, 2016.
[52] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[53] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2):151–173, 1999.
[54] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
[55] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pages 1119–1130, 2019.
[56] Shih-Yang Su, Frank Yu, Michael Zollhoefer, and Helge Rhodin. A-nerf: Surface-free human 3d pose refinement via neural rendering. arXiv preprint arXiv:2102.06199, 2021.
[57] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, 2022.
[58] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2021.
[59] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph., 27(3):1–9, August 2008.
[60] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
[61] Yiming Wang, Qingzhe Gao, Libin Liu, Lingjie Liu, Christian Theobalt, and Baoquan Chen. Neural novel actor: Learning a generalized animatable neural representation for human actors. 2022.
[62] Tong Wu, Jiaqi Wang, Xingang Pan, Xudong Xu, Christian Theobalt, Ziwei Liu, and Dahua Lin. Voxurf: Voxel-based efficient and accurate neural surface reconstruction, 2022.
[63] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Computer Vision and Pattern Recognition (CVPR), 2021.
[64] Hongyi Xu, Thiemo Alldieck, and Cristian Sminchisescu. H-nerf: Neural radiance fields for rendering and temporal reconstruction of humans in motion, 2021.
[65] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. Monoperfcap: Human performance capture from monocular video. ACM Trans. Graph., 37(2):27:1–27:15, May 2018.
[66] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[67] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
[68] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33:2492–2502, 2020.
[69] Fuqiang Zhao, Yuheng Jiang, Kaixin Yao, Jiakai Zhang, Liao Wang, Haizhao Dai, Yuhui Zhong, Yingliang Zhang, Minye Wu, Lan Xu, et al. Human performance modeling and rendering via neural animated mesh. arXiv preprint arXiv:2209.08468, 2022.
Supplementary Material

In the following, we provide more details about our method. First, we present the derivation of Eqs. 5 and 6 (Sec. A) and the proof of Theorem 1 (Sec. B). Second, we introduce the datasets (Sec. C) we used in the experiments and show more quantitative and qualitative results (Sec. D) to further demonstrate the performance of our model. Finally, we provide implementation details of our method (Sec. E).

A. Derivation of Equation 5 and Equation 6

We calculate the second-order derivatives with respect to the hash table parameters Ω as

\label{eqn:dloss_denc_proof} \begin{aligned} \frac{\partial \mathcal{L}}{\partial \Omega} &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial \frac{\partial d}{\partial \mathbf{x}}}{\partial \Omega} \\ &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \frac{\partial (\frac{\partial d}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \mathbf{x}})}{\partial \Omega} \\ &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \Omega} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} \Big) \\ &= \frac{\partial \mathcal{L}}{\partial \mathbf{n}} \Big( \frac{\partial \mathbf{e}}{\partial \mathbf{x}} \frac{\partial \frac{\partial d}{\partial \mathbf{e}}}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial \Omega} + \frac{\partial d}{\partial \mathbf{e}} \frac{\partial \frac{\partial \mathbf{e}}{\partial \mathbf{x}}}{\partial \Omega} \Big) \end{aligned} (15)

The derivation of Eq. 6 with respect to the SDF network parameters Θ follows analogously.

B. Proof of Theorem 1

The element (i, j) of ∂y/∂x ∈ R^{n_L} × R^{n_1} is

\frac{\partial y}{\partial x}_{(i,j)} = H_L^{(i,\_)} G_{L} \cdots G_1 H_1^{(\_,j)}. (18)

We can then calculate

\begin{aligned} \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial H_l} = \Big(\frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_l P_l^j)}\Big)^\top \frac{\partial (H_l P_l^j)}{\partial H_l}. \end{aligned} (19)

On the one hand, since P_l^j is independent of H_l, we have

\begin{aligned} \frac{\partial (H_l P_l^j)}{\partial H_l} = (P_l^j)^\top. \end{aligned} (20)

On the other hand, we have

\begin{aligned} \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_l P_l^j)} = \frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_{L-1} P_{L-1}^j)} \frac{\partial (H_{L-1} P_{L-1}^j)}{\partial (H_{L-2} P_{L-2}^j)} \cdots \frac{\partial (H_{l+1} P_{l+1}^j)}{\partial (H_{l} P_{l}^j)}. \end{aligned} (21)

We can further derive that

\frac{\partial \frac{\partial y}{\partial x}_{(i,j)}}{\partial (H_{L-1} P_{L-1}^j)} = H_L^{(i,\_)} G_{L}, (22)

and

\begin{aligned} \frac{\partial (H_{l} P_{l}^j)}{\partial (H_{l-1} P_{l-1}^j)} &= \frac{\partial (H_{l} G_{l} H_{l-1} P_{l-1}^j)}{\partial (H_{l-1} P_{l-1}^j)} \\ &= H_{l} G_{l} + H_{l} \frac{\partial G_{l}}{\partial (H_{l-1} P_{l-1}^j)} \\ &= H_{l} G_{l}. \end{aligned} (23)

Note that ∂(∂x_1/∂x)/∂x = 0. So we have

\begin{aligned} \frac{\partial^2 y}{\partial \mathbf{x}^2} &= \frac{\partial \frac{\partial y}{\partial x_1}}{\partial x} \frac{\partial x_1}{\partial x} \\ &= \frac{\partial^2 y}{\partial x_{1}^2} \Big(\frac{\partial x_1}{\partial x}\Big)^2 \\ &= \frac{\partial^2 y}{\partial x_{L-1}^2} \Big(\frac{\partial x_{L-1}}{\partial x_{L-2}}\Big)^2 \cdots \Big(\frac{\partial x_1}{\partial x}\Big)^2. \end{aligned} (28)

Since y = H_L x_{L-1}, we have ∂²y/∂x_{L-1}² = 0. Thus we get

\begin{aligned} \frac{\partial^2 y}{\partial \mathbf{x}^2} &= 0. \end{aligned} (29)
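As a numerical sanity check on the derivation above, the snippet below builds a small ReLU MLP, compares its input Jacobian against the alternating product of weight matrices and ReLU masks (cf. Eq. 18), and confirms that the second derivative with respect to the input vanishes. The layer sizes, indexing convention, and finite-difference step are arbitrary choices for the check, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Two-hidden-layer ReLU MLP y = W3 relu(W2 relu(W1 x)) (no biases, as in the derivation).
W1 = rng.standard_normal((16, 3))
W2 = rng.standard_normal((16, 16))
W3 = rng.standard_normal((1, 16))

def forward(x):
    return W3 @ relu(W2 @ relu(W1 @ x))

x = rng.standard_normal(3)
a1 = relu(W1 @ x)
D1 = np.diag((W1 @ x > 0).astype(float))     # ReLU masks (the G matrices)
D2 = np.diag((W2 @ a1 > 0).astype(float))

# Closed-form Jacobian: alternating weight matrices and masks.
J_closed = W3 @ D2 @ W2 @ D1 @ W1

# Finite-difference Jacobian for comparison.
eps = 1e-4
J_fd = np.stack([(forward(x + eps * e) - forward(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)], axis=-1)
assert np.allclose(J_closed, J_fd, atol=1e-4)

# Directional second difference: ~0 because the MLP is piecewise linear (Eq. 29).
v = rng.standard_normal(3)
second = (forward(x + eps * v) - 2 * forward(x) + forward(x - eps * v)) / eps ** 2
print(second)   # approximately 0, up to floating-point error
```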
C. Datasets

Dataset for Static Scene Reconstruction. For static scene reconstruction, we use 15 scenes from the DTU dataset [19], the same as those used in NeuS [60]. These scenes cover a wide variety of materials, appearance, and geometry, and include challenging cases for reconstruction algorithms, such as non-Lambertian surfaces and fine structures. Each scene contains 49 or 64 images with an image resolution of 1600 × 1200. We split each scene into training and testing parts following NeuS [60]. Specifically, we set the images indexed 8, 13, 16, 21, 26, 31, 34, and 56 (if available) aside for testing and use the other images for training. We train and test each scene with the foreground masks provided by IDR [68].

Dataset for Dynamic Synthetic Scene Reconstruction. We use three synthetic scenes with various types of deformations and motions to evaluate our method. The Lego scene is shared by NeRF [35] in the form of a Blender [1] file. We transform the Lego object into different poses and positions and render images at a resolution of 400 × 400 in Blender. The Lego scene contains 150 frames with 40 training camera views and 40 test camera views. The human scene is provided by RenderPeople [2]. We render images at a resolution of 512 × 512 following [47], and the whole sequence contains 100 frames with 48 camera views for training and 12 camera views for testing. The Lion scene is shared by Artemis [34]; it has 177 frames with 30 camera views for training and 6 camera views for testing.

Dataset for Dynamic Real Scene Reconstruction. We use three sequences from the Dynacap dataset [16], denoted as D1, D2, and D5, for real-scene reconstruction. These sequences are captured under a dense camera setup at a resolution of 1285 × 940. We crop the images with a 2D bounding box, which is estimated from the foreground masks, to obtain the target images at a resolution of 512 × 512. For each sequence, we choose 500 frames containing large movements for evaluation to show our advantages (D1: 6,095 to 6,595; D2: 3,450 to 3,950; D5: 17,760 to 18,260). The D1 sequence has 50 camera views, from which we pick 5 camera views (7, 17, 27, 37, 47) for testing and use the rest of the views for training. The D2 sequence has 101 camera views, from which we pick 10 camera views (7, 17, 27, 37, 47, 57, 67, 77, 87, 97) for testing and use the rest of the views for training. The D5 sequence has 94 camera views, from which we pick 9 camera views (7, 17, 27, 37, 47, 57, 67, 77, 87) for testing and use the rest of the views for training. We train and test each scene with the provided foreground masks.

D. Additional Results

Video Results. We provide a supplementary video to better demonstrate the qualitative results of our method. We highly encourage the readers to check our video.

Static Scene Reconstruction. In Tab. 5, we provide the per-scene breakdown of the quantitative comparisons on the DTU dataset presented in the main paper (Tab. 1). We also present additional qualitative comparisons on the DTU dataset in Fig. 8.

E. Implementation Details

E.1. Baselines

COLMAP [50, 52]. We directly refer to COLMAP's results on the DTU dataset reported in NeuS [60].

NeuS [60]. The Chamfer Distance scores of NeuS shown in the paper are taken directly from the results reported in the original paper. The geometry reconstruction results are produced using the officially released pre-trained models (https://github.com/Totoro97/NeuS) with mask supervision. The PSNR scores and novel view synthesis results are obtained by training the officially released code on the DTU training dataset with mask supervision and testing it on the DTU testing dataset.

Instant-NGP [36]. We use the officially released code (https://github.com/NVlabs/instant-ngp) to train the model on the DTU dataset for 50k iterations. The training takes about 5 minutes.

D-NeRF [45]. We use the officially released code to train the model for 800k iterations. The training time of D-NeRF on real scenes is longer than on synthetic scenes, taking about 50 and 20 hours, respectively. This is because the real sequences are longer than the synthetic sequences (around 500 frames and 150 frames, respectively). Moreover, the number of camera views for real scenes is greater than that for synthetic scenes. For long sequences with dense camera views, the model cannot load all the images at once due to the GPU memory limitation, so extra time is needed to load the images during training.

E.2. Network Architecture

As shown in Fig. 9, the network architecture of NeuS2 consists of the following components: (a) a multi-resolution hash grid with 14 levels of different resolutions ranging from 16 to 2048; (b) an SDF network modeled by a 1-layer MLP with 64 hidden units; (c) an RGB network modeled by a 2-layer MLP with 64 hidden units.
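The architecture of Sec. E.2 can be summarized by the shape-level sketch below. The hash encoding is abstracted by a placeholder linear layer, and the exact input splits (28-dimensional encoding, 16-dimensional spherical-harmonics view encoding, 15-dimensional geometry feature) are assumptions based on Fig. 9 and Eqs. 1–3, so this illustrates the data flow rather than the released implementation.

```python
import torch

class NeuS2Sketch(torch.nn.Module):
    """Shape-level sketch of Sec. E.2: hash grid (14 levels, 16 -> 2048),
    a 1-layer SDF MLP and a 2-layer RGB MLP, both with 64 hidden units."""
    def __init__(self, enc_dim=28, sh_dim=16, geo_dim=15):
        super().__init__()
        # Placeholder for the multi-resolution hash encoding h_Omega(x).
        self.hash_encoding = torch.nn.Linear(3, enc_dim)
        # SDF network f_Theta: (x, h(x)) -> (d, g), Eq. 1.
        self.sdf_net = torch.nn.Sequential(
            torch.nn.Linear(3 + enc_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 1 + geo_dim))
        # RGB network c_Upsilon: (x, n, SH(v), d, g) -> c, Eq. 3.
        self.rgb_net = torch.nn.Sequential(
            torch.nn.Linear(3 + 3 + sh_dim + 1 + geo_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 3))

    def forward(self, x, view_sh):
        e = torch.cat([x, self.hash_encoding(x)], dim=-1)
        out = self.sdf_net(e)
        d, g = out[..., :1], out[..., 1:]
        n = torch.autograd.grad(d.sum(), x, create_graph=True)[0]  # normal, Eq. 2
        c = self.rgb_net(torch.cat([x, n, view_sh, d, g], dim=-1))
        return d, c

x = torch.rand(8, 3, requires_grad=True)
d, c = NeuS2Sketch()(x, torch.rand(8, 16))
print(d.shape, c.shape)   # torch.Size([8, 1]) torch.Size([8, 3])
```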
Table 5. The quantitative comparison on the DTU dataset. Our method outperforms the other baselines for geometry reconstruction in terms of the Chamfer Distance (CD) and is on par with Instant-NGP for novel view synthesis in terms of PSNR.

Figure 8. Qualitative comparisons on the DTU dataset for static scene geometry reconstruction and novel view synthesis. Our method demonstrates high rendering quality, superior to NeuS and comparable to Instant-NGP in terms of complex texture reconstruction. In addition, it outperforms all the baselines regarding 3D geometry reconstruction, with fine details and without inducing noise.

[Figure 9: network architecture of NeuS2 — the position (with a global transformation applied) and its hash encoding (28) are fed to the SDF network, which outputs the SDF value and a geometry feature (15); the normal ∇x d, the ray direction encoded with spherical harmonics (16), the position, the SDF value, and the geometry feature are fed to the RGB network, which outputs the RGB color.]

E.3. Training Details

Unbiased Volume Rendering. To render an image, we apply the unbiased volume rendering of NeuS [60]. That is, we first transform the signed distance field into a volume density field ϕ_s(f(x)), where ϕ_s(x) = s e^{-sx} / (1 + e^{-sx})^2 is the logistic density distribution, which is the derivative of the Sigmoid function Φ_s(x) = 1 / (1 + e^{-sx}), and s is a learnable parameter. Next, we construct an unbiased weight function in the volume rendering equation. Specifically, for each pixel of an image, we sample n points {p(t_i) = o + t_i v | i = 0, 1, ..., n − 1} along its camera ray, where o is the center of the camera and v is the view direction. By accumulating the SDF-based densities and colors of the sample points, we can compute the color Ĉ of the ray with the same approximation scheme as used in NeRF [35]:

\label{eqn:neus_volume_rendering} \hat{C}(\mathbf{o}, \mathbf{v}) = \sum_{i=0}^{n-1} T(t_i)\,\alpha(t_i)\,c(\mathbf{p}(t_i), \mathbf{v}), (30)

\label{eqn:neus_rendering_alpha} \alpha(t_i) = \max\Big(\frac{\Phi_s(f(p(t_{i}))) - \Phi_s(f(p(t_{i+1})))}{\Phi_s(f(p(t_{i})))}, 0\Big), (31)

where T(t_i) = ∏_{j=0}^{i−1} (1 − α(t_j)) is the accumulated transmittance.

When processing a new frame of a dynamic sequence, we optimize the global transformation independently for the first 100 iterations, and we fine-tune the network parameters and the global transformation together for the remaining iterations.
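A minimal sketch of the rendering weights of Eqs. 30 and 31 for a single ray, given the SDF values at the sampled points, is shown below. The SDF samples, the color values, and the choice of s are made up for illustration; the accumulated transmittance uses the T(t_i) defined above.

```python
import numpy as np

def render_ray(sdf_vals, colors, s=64.0):
    """sdf_vals: (n,) SDF values f(p(t_i)) along one ray; colors: (n, 3).
    Returns the accumulated ray color C_hat of Eq. 30 with alpha from Eq. 31."""
    phi = 1.0 / (1.0 + np.exp(-s * sdf_vals))            # Sigmoid Phi_s(f(p(t_i)))
    alpha = np.zeros_like(sdf_vals)
    alpha[:-1] = np.maximum((phi[:-1] - phi[1:]) / np.clip(phi[:-1], 1e-6, None), 0.0)
    # Accumulated transmittance T(t_i) = prod_{j<i} (1 - alpha(t_j)).
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    weights = T * alpha                                   # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)

# Toy example: a ray crossing the surface (SDF changes sign) with constant color.
sdf = np.linspace(0.3, -0.3, 32)
rgb = np.tile([0.8, 0.5, 0.2], (32, 1))
print(render_ray(sdf, rgb))   # close to the surface color: the weights peak at the zero crossing
```

The max(·, 0) in Eq. 31 keeps the weights non-negative when the SDF is locally non-monotone along the ray, which is what makes the weight function unbiased around the surface.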