Function4D: Real-time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors

Tao Yu1, Zerong Zheng1, Kaiwen Guo2, Pengpeng Liu3, Qionghai Dai1, Yebin Liu1
1 Department of Automation, Tsinghua University, Beijing, China
2 Google, Switzerland
3 Institute of Automation, Chinese Academy of Sciences, Beijing, China

Figure 1: Volumetric capture results of various challenging scenarios using Function4D.

Abstract

Human volumetric capture is a long-standing topic in computer vision and computer graphics. Although high-quality results can be achieved using sophisticated off-line systems, real-time human volumetric capture of complex scenarios, especially using light-weight setups, remains challenging. In this paper, we propose a human volumetric capture method that combines temporal volumetric fusion and deep implicit functions. To achieve high-quality and temporally continuous reconstruction, we propose dynamic sliding fusion to fuse neighboring depth observations together with topology consistency. Moreover, for detailed and complete surface generation, we propose detail-preserving deep implicit functions for RGBD input, which not only preserve the geometric details on the depth inputs but also generate more plausible texturing results. Results and experiments show that our method outperforms existing methods in terms of view sparsity, generalization capacity, reconstruction quality, and run-time efficiency.

1. Introduction

Real-time volumetric capture of human-centric scenarios is the key to a large number of applications ranging from telecommunication to education and entertainment. The underlying technique, volumetric capture, is a challenging and long-standing problem in both computer vision and computer graphics due to the complex shapes, fast motions, and changing topologies (e.g., human-object manipulations and multi-person interactions) that need to be faithfully reconstructed. Although high-end volumetric capture systems [4, 13, 25, 5, 27, 36] based on dense camera rigs (up to 100 cameras [8]) and custom-designed lighting conditions [47, 14] can achieve high-quality reconstruction, they all suffer from complicated system setups and are limited to professional studio usage.

In contrast, light-weight volumetric/performance capture systems are more practical and attractive. Given a pre-scanned template, [22, 59, 15] track dense surface deformations from single-view RGB input [17, 18]. However, the prerequisite of a fixed-topology template restricts their applications for general volumetric capture. In 2015, DynamicFusion [33] proposed the first template-free and single-view dynamic 3D reconstruction system. The following works [53, 54, 51, 42] further improve the reconstruction quality for human performance capture by incorporating semantic body priors. However, it remains challenging for them to handle large topological changes such as dressing or taking off clothes. Recently, a line of research [32, 37, 38, 23] leverages deep implicit functions for textured 3D human reconstruction from only a single RGB image. However, these methods still suffer from off-line reconstruction performance [38, 37] or over-smoothed, temporally discontinuous results [23]. State-of-the-art real-time volumetric capture systems are volumetric fusion methods like Fusion4D [10] and Motion2Fusion [9], but both of them rely on custom-designed high-quality depth sensors (up to 120 fps and 1k resolution) and multiple (up to 9) high-end GPUs, which is infeasible for consumer usage.
In this paper, we propose Function4D, a volumetric capture system using very sparse (as few as three) consumer RGBD sensors. Compared with existing systems, our system is able to handle various challenging scenarios, including human-object manipulations, dressing or taking off clothes, fast motions, and even multi-person interactions, as shown in Fig. 1.

Our key observations are as follows. To generate complete and temporally consistent results, current volumetric fusion methods have to fuse as many temporal depth observations as possible. This results in a heavy dependency on accurate and long-term non-rigid tracking, which is especially challenging under severe topology changes and large occlusions. On the contrary, deep implicit functions are good at completing surfaces, but they cannot recover detailed and temporally continuous results due to the insufficient usage of depth information and the severe noise of consumer RGBD sensors.

To overcome the limitations above, we propose a novel volumetric capture framework that organically combines volumetric fusion with deep implicit functions. By introducing dynamic sliding fusion, we re-design the volumetric fusion pipeline to restrict tracking and fusion to a sliding window, which yields noise-eliminated, topology-consistent, and temporally continuous volumetric fusion results. Based on the sliding fusion results, we propose detail-preserving deep implicit functions for final surface reconstruction to eliminate the heavy dependency on long-term tracking. Moreover, by encoding truncated projective SDF (PSDF) values explicitly and incorporating an attention mechanism into the multi-view feature aggregation stage, our networks not only achieve detailed reconstruction results but are also orders of magnitude faster than existing methods.

Our contributions can be summarized as:

• The first real-time volumetric capture system that combines volumetric fusion with deep implicit functions using very sparse consumer RGBD sensors.

• Dynamic Sliding Fusion for generating noise-eliminated and topology-consistent volumetric fusion results.

• Detail-preserving Implicit Functions specifically designed for sufficient utilization of RGBD information to generate detailed reconstruction results.

• The training and evaluation dataset, which contains 500 high-resolution scans of various poses and clothes, will be made publicly available to stimulate future research.

2. Related Work

In the following, we focus on 3D human volumetric/performance capture and classify existing methods into four categories according to their underlying techniques.

Volumetric capture from multi-view stereo. Multi-view volumetric capture is an active research area in the computer vision and graphics community. Previous works use multi-view images for human model reconstruction [20, 41, 25]. Shape cues like silhouette, stereo, shading, and cloth priors have been integrated to improve reconstruction/rendering performance [41, 25, 49, 48, 47, 4, 31, 36, 50]. State-of-the-art methods build extremely sophisticated systems with up to 100 cameras [8] and even custom-designed gradient lighting [14] for high-quality volumetric capture. In particular, the methods of [8] and [14] first perform multi-view stereo for point cloud generation, followed by mesh construction, simplification, tracking, and post-processing steps such as UV mapping. Although the results are compelling, the reliance on well-controlled multi-camera studios and a huge amount of computational resources prohibits them from being used in living spaces.

Template-based Performance Capture. For performance capture, some previous works leverage pre-scanned templates and exploit multi-view geometry to track the motion of the templates. For instance, the methods in [46, 13, 5] adopted a template with an embedded skeleton driven by multi-view silhouettes and temporal feature constraints. Their methods were then extended to handle multiple interacting characters in [27, 26]. Besides templates with an embedded skeleton, some works adopted non-rigid deformation for template motion tracking. Li et al. [22] utilized the embedded deformation graph of [43] to parameterize the non-rigid deformations of a pre-scanned template. Guo et al. [15] adopted an l0-norm constraint to generate articulated motions for bodies and faces without explicitly constructing a skeleton. Zollhöfer et al. [59] took advantage of the GPU to parallelize the non-rigid registration algorithm and achieved real-time performance for general non-rigid tracking. Recently, capturing 3D dense human body deformation with coarse-to-fine registration from a single RGB camera has been enabled [52] and improved for real-time performance [17]. DeepCap [18] introduced a deep learning method that jointly infers the articulated and non-rigid 3D deformation parameters in a single feed-forward pass. Although template-based approaches require less input than multi-view stereo methods, they are incapable of handling topological changes due to the prerequisite of a template with fixed topology.

Volumetric Fusion for Dynamic 3D Reconstruction. To get rid of template priors and realize convenient deployment, researchers have turned to one or a few sparse depth sensors for 3D reconstruction. In 2015, DynamicFusion [33] proposed the first template-free, single-view, real-time dynamic 3D reconstruction system, which integrates multiple frames into a canonical model to reconstruct a complete surface model. However, it only handles controlled and relatively slow motions due to the challenges of real-time non-rigid tracking.
Figure 2: Overview. We focus on geometry reconstruction in this figure; for color inference, please refer to Fig. 5.

In order to improve the robustness of DynamicFusion, the following works incorporated Killing/Sobolev constraints [39, 40], articulated priors [21], skeleton tracking [53], parametric body models [54], sparse inertial measurements [55], data-driven priors [42], or learned correspondences [3] into the non-rigid fusion pipeline. However, all of these methods are prone to tracking failure in invisible areas, which is an inherent drawback of single-view systems. To overcome this limitation, Fusion4D [10] and Motion2Fusion [9] focused on real-time multi-view setups using high-end custom-designed sensors, with the notion of key volume updating and learning-based surface matching. Even though the pipelines were carefully designed in [10] and [9], they still suffer from incomplete and noisy reconstructions when severe topological changes occur, especially under very sparse system setups.

Learning-based 3D Human Reconstruction. Fueled by the rapid developments in neural 3D representations (e.g., [34, 35, 6, 29]), many data-driven methods for 3D human reconstruction have been proposed in recent years. The methods in [58, 1] proposed to deform a parametric body model to fit image observations including keypoints, silhouettes, and shading. DeepHuman [57] combined the parametric body model with a coarse-scale volumetric reconstruction network to reconstruct 3D human models from a single RGB image. Some methods infer human shapes in the 2D image domain using multi-view silhouettes [32] or front-back depth pairs [12]. PIFu [37] proposed to regress an implicit function using pixel-aligned image features. Unlike voxel-based methods, PIFu is able to reconstruct high-resolution results thanks to the compactness of the implicit representation. PIFuHD [38] extended PIFu to capture more local details. However, both PIFu [37] and PIFuHD [38] fail to reconstruct plausible models in cases of challenging poses and self-occlusions. PaMIR [56] resolved this challenge by using the SMPL model as a prior but suffers from run-time inefficiency since it requires a post-processing optimization step. IFNet [7] and IPNet [2] can recover impressive 3D humans from partial point clouds, but the dependency on multi-scale 3D convolutions and parametric body models blocks real-time reconstruction performance.

3. Overview

As shown in Fig. 2, the proposed volumetric capture pipeline mainly contains two steps: Dynamic Sliding Fusion and Deep Implicit Surface Reconstruction. Given a group of synchronized multi-view RGBD inputs, we first perform dynamic sliding fusion, fusing the neighboring frames of the current frame to generate noise-eliminated and temporally continuous fusion results. After that, we re-render multi-view RGBD images from the sliding fusion results in the original viewpoints. Finally, in the deep implicit surface reconstruction step, we apply detail-preserving implicit functions (which consist of multi-view image encoders, a feature aggregation module, and an SDF/RGB decoder) to generate detailed and complete reconstruction results.

4. Dynamic Sliding Fusion (DSF)

Different from previous volumetric fusion methods, the proposed DSF method aims at augmenting the current observations rather than completing surfaces. We therefore re-design the fusion pipeline to obtain topologically consistent and noise-eliminated results for the current observations. The proposed DSF mainly contains three steps: topology-aware node graph initialization, non-rigid surface tracking, and observation-consistent truncated SDF (TSDF) fusion. To eliminate the heavy dependency on long-term tracking, instead of fusing all frames as in previous methods, we only perform fusion within a sliding window. Specifically, we allow a one-frame delay in the reconstruction pipeline and fuse the current frame (indexed by t) only with its preceding frame (t-1) and its succeeding frame (t+1) to minimize topological changes and tracking error accumulation. Note that we only need to perform non-rigid surface tracking for the succeeding frame, since the deformation between the current frame and the preceding frame has already been tracked. Regarding the TSDF fusion stage, we propose observation-consistent TSDF fusion to fuse the multi-view observations of frame t+1 into frame t.
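To make the frame scheduling concrete, here is a minimal sketch (not the authors' code) of the one-frame-delay sliding window. The helper names `track_nonrigid`, `fuse_tsdf`, and `reconstruct_surface` are hypothetical stand-ins for the tracking, fusion, and implicit-reconstruction stages described in this paper:

```python
from collections import deque

def run_sliding_fusion(rgbd_stream, track_nonrigid, fuse_tsdf, reconstruct_surface):
    """Sketch of the one-frame-delay sliding window used by DSF.

    rgbd_stream yields synchronized multi-view RGBD frames.  Frame t is
    reconstructed only after frame t+1 has arrived, so tracking and fusion
    never look beyond the (t-1, t, t+1) window.
    """
    window = deque(maxlen=3)       # frames t-1, t, t+1
    prev_warp = None               # deformation t-1 -> t from the previous step
    for frame in rgbd_stream:
        window.append(frame)
        if len(window) < 3:
            continue               # one-frame delay: wait until t+1 arrives
        prev_f, cur_f, next_f = window
        # Only the succeeding frame needs new tracking; the deformation
        # between t-1 and t was already estimated in the previous iteration.
        next_warp = track_nonrigid(reference=cur_f, target=next_f)
        volume = fuse_tsdf(center=cur_f,
                           neighbors=[(prev_f, prev_warp), (next_f, next_warp)])
        prev_warp = next_warp      # becomes the t-1 -> t warp after the window slides
        yield reconstruct_surface(volume)   # result for frame t
```

In this sketch `prev_warp` is None for the very first window; a real implementation would simply skip the missing neighbor in that case.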
Figure 3: Topology-aware node graph initialization. (a) and (b) are the preceding and the current frame, respectively. (c) is the overlay of the current frame and the warped node graph of the preceding frame; erroneous nodes can be found around the pillow and arms, where large topological changes occur. (d), (e) and (f) are the cleared node graph, the additional nodes (green) for newly observed surfaces, and the final node graph for the current frame, respectively.

Figure 4: Evaluation of dynamic sliding fusion. From (a) to (d) are multi-view RGB references and depth masks, multi-view depth input rendered in a side viewpoint, and results without and with dynamic sliding fusion, respectively.

4.1. Topology-aware Node Graph Initialization

Previous fusion methods initialize the embedded deformation graph (ED graph) [43] in the canonical frame [33]. However, such an ED graph cannot well describe the topological changes in live frames. Different from previous methods, we have to initialize the ED graph exactly for the current frame to guarantee that the topology of the node graph is consistent with the current observations. However, it is inefficient to initialize the node graph from scratch for every frame because of the complexity of node selection, graph connection, and volume KNN-field calculation. To overcome this limitation, we propose topology-aware node graph initialization, which not only leverages the node graph of the previous frame for fast initialization but also has the ability to generate a topologically consistent node graph for the current observations.

As shown in Fig. 3, we first initialize the current node graph using the live node graph from the preceding frame. This is achieved by warping the node graph of the preceding frame to the current frame directly. Due to tracking errors and topological changes, the live node graph may not be well aligned with the current observations (Fig. 3(c)). So we clear those nodes that are located far from the current observations by constraining their TSDF values in the current TSDF volume (Fig. 3(d)). Specifically, if the magnitude of the normalized TSDF value corresponding to a node is greater than δt, we suppose that it is relatively far from the observations and delete this node to maintain the current mesh topology. Finally, considering that there may still exist newly observed surfaces that are not covered by the cleared node graph, we further refine it based on the current observations by sampling additional nodes (Fig. 3(e)) as in DynamicFusion [33] to make sure that all of the surfaces are covered by the final node graph (Fig. 3(f)).
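The following is a minimal sketch of this prune-then-resample initialization. It assumes the current TSDF volume exposes a sampling function and that new nodes are drawn from uncovered surface vertices as in DynamicFusion; the helper `sample_tsdf` and the `node_radius` value are assumptions, while `delta_t = 0.5` follows Sec. 6.1:

```python
import numpy as np

def init_node_graph(warped_nodes, sample_tsdf, surface_vertices,
                    delta_t=0.5, node_radius=0.05):
    """Topology-aware node graph initialization (sketch).

    warped_nodes:     (N, 3) node positions warped from the preceding frame.
    sample_tsdf:      callable mapping (M, 3) points to normalized TSDF values
                      in the *current* frame's volume.
    surface_vertices: (V, 3) vertices extracted from the current observations.
    """
    # 1. Prune nodes whose normalized TSDF magnitude exceeds delta_t, i.e.
    #    nodes that drifted away from the current observations.
    tsdf = sample_tsdf(warped_nodes)
    kept = warped_nodes[np.abs(tsdf) <= delta_t]

    # 2. Add nodes for newly observed surfaces not yet covered, similar to the
    #    node sampling of DynamicFusion: greedily pick vertices farther than
    #    node_radius from every existing node (written for clarity, not speed).
    nodes = list(kept)
    for v in surface_vertices:
        if nodes:
            dists = np.linalg.norm(np.asarray(nodes) - v, axis=1)
            if dists.min() <= node_radius:
                continue
        nodes.append(v)
    return np.asarray(nodes)        # final node graph for the current frame
```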
4.2. Non-rigid Surface Tracking

For non-rigid surface registration, we follow previous methods and search projective point-to-plane correspondences between the surface at frame t and the multi-view depth observations at frame t+1. The non-rigid tracking energy is defined as:

$$E_{\mathrm{tracking}} = \lambda_{\mathrm{data}} E_{\mathrm{data}} + \lambda_{\mathrm{reg}} E_{\mathrm{reg}}, \quad (1)$$

where $E_{\mathrm{data}}$ and $E_{\mathrm{reg}}$ are the energies of the data term and the regularization term, respectively. The data term measures the fitting error between the deformed surface and the depth observations, while the regularization term enforces locally as-rigid-as-possible surface deformations. A Gauss-Newton solver with the Preconditioned Conjugate Gradient (PCG) algorithm is used to solve this non-linear optimization problem efficiently on the GPU. Please refer to [33] for more details.
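As a reading aid for Eq. (1), the sketch below evaluates the point-to-plane data term and a simple as-rigid-as-possible regularizer over ED-graph edges. It is not the authors' GPU solver; the particular form of the regularizer (Sumner-style embedded deformation smoothness) and the weights are assumptions:

```python
import numpy as np

def tracking_energy(warped_verts, target_pts, target_normals,
                    node_positions, node_edges, node_transforms,
                    lambda_data=1.0, lambda_reg=10.0):
    """Evaluate E_tracking = lambda_data * E_data + lambda_reg * E_reg (sketch).

    warped_verts:    (N, 3) surface vertices deformed by the current warp field.
    target_pts:      (N, 3) corresponding points on the live depth maps.
    target_normals:  (N, 3) normals of the corresponding depth points.
    node_positions:  (K, 3) ED-graph node positions.
    node_edges:      list of (i, j) index pairs of connected nodes.
    node_transforms: list of (R, t) per node, R (3, 3) rotation, t (3,) translation.
    """
    # Point-to-plane data term: squared distance along the target normal.
    residual = np.einsum('ij,ij->i', target_normals, warped_verts - target_pts)
    e_data = np.sum(residual ** 2)

    # As-rigid-as-possible regularizer: the transform of node i, applied to its
    # neighbor j, should agree with where node j's own transform puts it.
    e_reg = 0.0
    for i, j in node_edges:
        Ri, ti = node_transforms[i]
        Rj, tj = node_transforms[j]
        gi, gj = node_positions[i], node_positions[j]
        pred_by_i = Ri @ (gj - gi) + gi + ti
        pred_by_j = gj + tj
        e_reg += np.sum((pred_by_i - pred_by_j) ** 2)

    return lambda_data * e_data + lambda_reg * e_reg
```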
4.3. Observation-consistent TSDF Fusion

After non-rigid surface tracking, we warp the volume of frame t to the depth observations at frame t+1 for TSDF fusion. Since we focus on augmenting the current observations rather than completing surfaces, we propose an aggressive TSDF fusion strategy to eliminate the impact of tracking errors and severe topological changes. Specifically, in our fusion method, a voxel in the volume of frame t will update its TSDF value if and only if: i) the tracking error corresponding to this voxel is lower than a threshold δe, and ii) there exist valid voxels (voxels in the truncated band of the current observations) around this voxel (the searching radius is set to 3 in our implementation).
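A minimal sketch of this voxel update rule is shown below, assuming the per-voxel tracking error defined further down (Eqs. 2 and 3) has already been interpolated onto the grid and that `valid_mask` marks voxels lying in the truncated band of the current observations. The weighted running-average update is the standard TSDF fusion rule and is an assumption here:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def observation_consistent_update(tsdf, weights, new_tsdf, new_weights,
                                  voxel_error, valid_mask,
                                  delta_e=0.1, radius=3):
    """Update rule of observation-consistent TSDF fusion (sketch).

    A voxel of frame t is updated only if (i) its interpolated tracking error
    is below delta_e and (ii) at least one valid voxel of the current
    observations lies within `radius` voxels of it.
    """
    # (ii) dilate the valid band so each voxel can check its neighborhood.
    has_valid_neighbor = maximum_filter(valid_mask.astype(np.uint8),
                                        size=2 * radius + 1).astype(bool)
    # (i) low tracking error AND (ii) supported by the current observations.
    update = (voxel_error < delta_e) & has_valid_neighbor

    # Standard weighted running average, applied to the selected voxels only.
    w_sum = weights[update] + new_weights[update]
    tsdf[update] = (tsdf[update] * weights[update]
                    + new_tsdf[update] * new_weights[update]) / np.maximum(w_sum, 1e-6)
    weights[update] = w_sum
    return tsdf, weights
```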
Calculating the tracking error for a specific voxel is not straightforward, since non-rigid tracking is established between the reference surface and the depth maps [9]. So we first calculate the tracking error for each node on the ED graph and then calculate the tracking error for each voxel by interpolation. Suppose $C = \{(v_i, p_i)\}$ is the correspondence set after the last iteration of non-rigid tracking, where $v_i$ is the $i$-th vertex on the reference surface and $p_i$ is the corresponding point of $v_i$ on the live depth inputs. The tracking error corresponding to the $j$-th node can be calculated as:

$$e(n_j) = \frac{\sum_{v_i \in C_j} r(v_i', p_i)}{\sum_{v_i \in C_j} w(v_i, n_j) + \epsilon}, \quad (2)$$

where $r(v_i', p_i) = \| p_{n_i}^{T} (v_i' - p_i) \|_2^2$ is the residual of $v_i$ after non-rigid tracking, $v_i'$ is the warped position of $v_i$, $p_{n_i}$ is the surface normal of $p_i$, $C_j$ is the subset of $C$ that includes all the reference vertices controlled by node $j$, and $w(v_i, n_j) = \exp(-\|v_i - n_j\|_2^2 / (2d^2))$ is the blending weight of node $j$ on $v_i$, in which $d$ is the influence radius of $n_j$ and $\epsilon = 10^{-6}$ is used to avoid division by zero.

For a voxel $x_k$, its tracking error is then interpolated using its K nearest neighbors on the node graph $\mathcal{N}(x_k)$ (where $K = 4$) as:

$$e(x_k) = \sum_{j \in \mathcal{N}(x_k)} w(x_k, n_j) \cdot e(n_j). \quad (3)$$
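The two equations translate into a few lines of array code. The sketch below mirrors Eqs. (2) and (3) but simplifies the bookkeeping by assigning each correspondence to a single controlling node (in practice a vertex contributes to every node that controls it):

```python
import numpy as np

def node_tracking_error(warped_verts, target_pts, target_normals,
                        verts, node_pos, vert_node_id, d, eps=1e-6):
    """Per-node tracking error e(n_j) of Eq. (2) (sketch).

    vert_node_id[i] gives the node j controlling reference vertex i, so the
    sets C_j are encoded as one integer label per correspondence.
    """
    # Residual r(v_i', p_i): squared point-to-plane distance after tracking.
    r = np.einsum('ij,ij->i', target_normals, warped_verts - target_pts) ** 2
    # Blending weight w(v_i, n_j) of the controlling node.
    w = np.exp(-np.sum((verts - node_pos[vert_node_id]) ** 2, axis=1) / (2 * d ** 2))

    num_nodes = len(node_pos)
    num = np.bincount(vert_node_id, weights=r, minlength=num_nodes)
    den = np.bincount(vert_node_id, weights=w, minlength=num_nodes) + eps
    return num / den

def voxel_tracking_error(voxel_pos, node_pos, node_err, d, k=4):
    """Interpolate e(x_k) from the K nearest nodes, Eq. (3) (sketch)."""
    dist2 = np.sum((node_pos[None, :, :] - voxel_pos[:, None, :]) ** 2, axis=2)
    knn = np.argsort(dist2, axis=1)[:, :k]                  # (V, k) nearest node ids
    w = np.exp(-np.take_along_axis(dist2, knn, axis=1) / (2 * d ** 2))
    return np.sum(w * node_err[knn], axis=1)
```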

5. Deep Implicit Surface Reconstruction

After dynamic sliding fusion, we obtain noise-eliminated surfaces. However, the surfaces are by no means complete due to the very sparse inputs and occlusions. The goal of the deep implicit surface reconstruction step is to generate complete and detailed surface reconstruction results using deep implicit functions. Since we have already fused a 3D TSDF volume in dynamic sliding fusion, a straightforward idea is to use a 3D convolution-based encoder-decoder network to "inpaint" the volume, and the methods in [7, 2] have achieved complete 3D surface reconstruction results by proposing multi-scale 3D convolution networks. However, the dependency on inefficient 3D convolutions limits their application in real-time systems, and the huge memory consumption restricts them from generating high-resolution results. In contrast, real-time implicit surface reconstruction can be achieved using 2D pixel-aligned local features combined with positional encoding, as shown in [23]. However, that method was designed for RGB-only input and can only generate over-smoothed results. Finally, for the RGBD-based implicit functions proposed in [24], simply adding depth as an additional input channel still cannot preserve the geometric details of the depth inputs.

To resolve the limitations above, we propose a new deep implicit surface reconstruction method that is specifically designed for RGBD input. The implicit surface reconstruction contains two steps: first, we re-render multi-view RGBD images from the fused surface after dynamic sliding fusion; then, given the multi-view RGBD inputs, we apply detail-preserving implicit functions to reconstruct a complete surface with texture for the current frame.

Figure 5: Network structures of GeoNet and ColorNet.

5.1. Multi-view RGBD Re-rendering

In this step, we re-render multi-view RGBD images from the fused surfaces using the input camera viewpoints. The re-rendered RGBD images contain much less noise than the original inputs thanks to the dynamic sliding fusion step. Note that another benefit of multi-view RGBD re-rendering is that we can manually fix the perspective projection parameters for all the rendered RGBD images to make sure they are consistent with the projection parameters that were used for rendering the training dataset.
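For intuition, the sketch below re-renders a depth map from the fused surface by point splatting with a z-buffer under fixed perspective intrinsics. The real system rasterizes the fused mesh in a single render pass (Sec. 6.1), so this is only an illustrative stand-in:

```python
import numpy as np

def rerender_depth(vertices, K, R, t, height, width):
    """Render a depth map of the fused surface in a given camera (sketch).

    vertices: (N, 3) points of the fused surface in world coordinates.
    K:        (3, 3) fixed perspective intrinsics shared with the training data.
    R, t:     world-to-camera rotation (3, 3) and translation (3,).
    """
    cam = vertices @ R.T + t                 # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                         # keep points in front of the camera
    uvw = cam[front] @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = z[front]

    depth = np.full((height, width), np.inf, dtype=np.float32)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Z-buffer: keep the closest point that lands in each pixel.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    depth[np.isinf(depth)] = 0.0             # 0 marks empty pixels
    return depth
```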
5.2. Detail-preserving Implicit Functions

We propose two networks, GeoNet and ColorNet, for inferring detailed and complete geometry with color from multi-view RGBD images. As shown in Fig. 5, GeoNet and ColorNet share similar network architectures. Different from [37, 23], we explicitly calculate the truncated PSDF feature in GeoNet to preserve the geometric details of the depth maps. Moreover, we use a multi-head transformer network for multi-view feature aggregation in ColorNet to generate more plausible color inference results. Empirically, we found that using only depth images is enough for training GeoNet, so we discard the RGB information in geometry reconstruction for efficiency.
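The following sketch shows how a single query point could be evaluated by a pixel-aligned network of this kind: the point is projected into each view, a local image feature is bilinearly sampled, the truncated PSDF value is appended, and an MLP decodes occupancy. The layer sizes follow Sec. 6.1, but this module is an assumption-laden reading aid rather than the released GeoNet (skip connections and the exact multi-view aggregation are omitted or simplified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedDecoder(nn.Module):
    """Sketch of a GeoNet-style query: per-view image feature + truncated PSDF -> occupancy."""

    def __init__(self, feat_dim=32, hidden=(128, 128, 128, 128, 128)):
        super().__init__()
        dims = [feat_dim + 1] + list(hidden) + [1]   # +1 for the truncated PSDF value
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, feat_maps, uv, psdf):
        """feat_maps: (V, C, H, W) encoder outputs for V views.
        uv:        (V, Q, 2) projected query coordinates, normalized to [-1, 1].
        psdf:      (V, Q, 1) truncated PSDF values of the queries in each view.
        Returns occupancy logits of shape (Q, 1), averaged over views.
        """
        # Bilinearly sample a pixel-aligned feature for every query in every view.
        sampled = F.grid_sample(feat_maps, uv.unsqueeze(2), align_corners=True)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)        # (V, Q, C)
        x = torch.cat([sampled, psdf], dim=-1)                # (V, Q, C + 1)
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        logits = self.layers[-1](x)                           # (V, Q, 1)
        return logits.mean(dim=0)                             # simple multi-view average
```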
Figure 6: Illustration of the truncated PSDF feature. The green line represents the depth input. Note that we visualize the absolute PSDF values here for simplicity; the darker the grid, the larger the absolute PSDF value.

Figure 7: Evaluation of the truncated PSDF. (a) is the single-view depth input; (b), (c) and (d) are results generated without PSDF, with PSDF, and with truncated PSDF, respectively.

5.2.1 Truncated PSDF Feature

The feature used for decoding occupancy values in pixel-aligned implicit functions can be decomposed into a 2D image feature and a positional encoding (the z value in [37] or the one-hot mapping in [23]). The previous method [24] augments the 2D image feature by enhancing it with RGBD images as input. Although this can successfully guide the network to resolve the z-ambiguity of using only RGB images, it does not preserve the geometric details of the depth inputs. This is due to the fact that the variation of the geometric details in the depth inputs is too subtle (compared with the global range of the depth values) for the networks to "sense" through 2D convolutions. To fully utilize the depth information, we propose to use truncated PSDF values as an additional feature dimension. The truncated PSDF value is calculated by:

$$f(q) = \mathrm{T}\big(q_z - D(\Pi(q))\big), \quad (4)$$

where $q$ is the coordinate of the query point, $\Pi(\cdot)$ is the perspective projection function, $D(\cdot)$ is a bilinear sampling function used for fetching depth values from the depth image, and $\mathrm{T}(\cdot)$ truncates the PSDF values to $[-\delta_p, \delta_p]$. As shown in Fig. 6, the truncated PSDF value is a strong signal corresponding to the observed depth inputs. Moreover, it also eliminates the ambiguities of using global depth values. Fig. 7(b) demonstrates that without the PSDF values, we can only get over-smoothed results even for the visible regions with detailed observations. Moreover, without truncation, the depth variations of the visible regions (the arms on top of the body) are misleadingly transferred to the invisible regions (Fig. 6) and finally lead to ghost artifacts (the ghost arm in the red circle of Fig. 7(c)).
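Eq. (4) is straightforward to compute for a batch of query points. A minimal sketch, assuming standard pinhole intrinsics K and query points already expressed in the camera frame:

```python
import numpy as np

def truncated_psdf(points_cam, depth_map, K, delta_p=0.01):
    """Truncated PSDF feature f(q) = T(q_z - D(Pi(q))) of Eq. (4) (sketch).

    points_cam: (N, 3) query points in the camera coordinate frame.
    depth_map:  (H, W) re-rendered depth image of the same view.
    K:          (3, 3) perspective intrinsics (same fixed parameters as training).
    delta_p:    truncation band (0.01 m in the paper).
    """
    h, w = depth_map.shape
    uvw = points_cam @ K.T
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]

    # D(Pi(q)): bilinear sampling of the depth map at the projected location.
    u0 = np.clip(np.floor(u), 0, w - 2).astype(int)
    v0 = np.clip(np.floor(v), 0, h - 2).astype(int)
    au = np.clip(u - u0, 0.0, 1.0)
    av = np.clip(v - v0, 0.0, 1.0)
    d = ((1 - av) * ((1 - au) * depth_map[v0, u0] + au * depth_map[v0, u0 + 1])
         + av * ((1 - au) * depth_map[v0 + 1, u0] + au * depth_map[v0 + 1, u0 + 1]))

    # T(.): truncate the projective SDF to [-delta_p, delta_p].
    return np.clip(points_cam[:, 2] - d, -delta_p, delta_p)
```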
5.2.2 Multi-view Feature Aggregation

Although PIFu [37] has demonstrated multi-view reconstruction results by averaging the intermediate features of the SDF/RGB decoders, we argue that the average pooling operation has limited capacity and cannot capture the difference in inference confidence between different viewpoints. For color inference, the network should have the ability to "sense" the geometric structure and also the visibility in different viewpoints for a query point.

To fulfill this goal, we propose to leverage the attention mechanism of [45] for multi-view feature aggregation in ColorNet. Compared with direct averaging, the attention mechanism has the advantage of incorporating the inter-feature correlations between different viewpoints, which is necessary for multi-view feature aggregation. Intuitively, for a query point that is visible in view 0 but fully occluded in the other views, the feature from view 0 should play a leading role in the final decoding stage. As shown in Fig. 10, direct averaging of the multi-view features may lead to erroneous texturing results. On the contrary, attention-based feature aggregation enables more effective feature merging and thus generates more plausible color inference results. In practice, we follow [45] and use multi-head self-attention with 8 heads and 2 layers, without positional encoding. The input is the concatenation of multi-view geometry features, color features, and RGB values as in Fig. 5, and we fuse the output multi-head features through a two-layer FC and a weighted summation. Moreover, we found that the attention mechanism brings limited improvement for geometry reconstruction, since the visibility has already been encoded by the truncated PSDF feature in GeoNet.
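The sketch below shows one way such an aggregation module could be wired up with standard multi-head self-attention over the V per-view feature tokens of a query point (8 heads, 2 layers, no positional encoding). The paper only describes the fusion head at a high level, so the final FC-plus-weighted-sum here is an assumption:

```python
import torch
import torch.nn as nn

class MultiViewAttentionFusion(nn.Module):
    """Sketch of attention-based multi-view feature aggregation for ColorNet."""

    def __init__(self, feat_dim, num_heads=8, num_layers=2):
        super().__init__()
        # feat_dim must be divisible by num_heads.
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
            for _ in range(num_layers))
        # Assumed fusion head: a small FC stack plus learned per-view weights.
        self.fc = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                nn.Linear(feat_dim, feat_dim))
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, view_feats):
        """view_feats: (Q, V, C) concatenated geometry/color/RGB features of a
        query point in each of the V views.  Returns a fused (Q, C) feature."""
        x = view_feats
        for attn in self.attn_layers:            # self-attention across the views,
            x, _ = attn(x, x, x)                 # no positional encoding
        x = self.fc(x)                           # (Q, V, C)
        w = torch.softmax(self.score(x), dim=1)  # per-view weights, (Q, V, 1)
        return (w * x).sum(dim=1)                # weighted summation over views
```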
6. Results

The results of our system are shown in Fig. 1 and Fig. 8. Note that these temporally continuous results are reconstructed by our system under various challenging scenarios, including severe topological changes, human-object manipulations, and multi-person interactions.

6.1. Real-time Implementation
Figure 8: Temporal reconstruction results of a fast-dancing girl. Our system can generate temporally continuous and high-quality reconstruction results under challenging deformations (of the skirt) and severe topological changes (of the hair).

In order to achieve real-time performance, we implement our run-time pipeline fully on the GPU. Specifically, for deep implicit surface reconstruction, we use TensorRT with mixed precision for fast inference. After that, the major efficiency bottleneck of geometry inference lies in the excessive number of voxels to be evaluated when querying every voxel in the volume. Since we already have multi-view depth maps as input, we can leverage the depth information directly for acceleration, without using the surface localization method in [23]. Specifically, we first use the depth images to filter out empty voxels. Then we follow the octree-based reconstruction algorithm [30] to perform inference for the remaining voxels in a coarse-to-fine manner, starting from a resolution of 64³ up to the final resolution of 256³. To further improve the run-time efficiency, we simplify the network architectures as follows. For the image encoders, we follow [23] and use HRNetV2-W18-Small-v2 [44] as the backbone, setting its output resolution to 64 × 64 and its channel dimension to 32. For the SDF/color decoders, we use MLPs with skip connections and hidden layers of (128, 128, 128, 128, 128) neurons. For dynamic sliding fusion, we set δt = 0.5, δe = 0.1, and δp = 0.01 m for all cases. We refer readers to [19, 16] for real-time implementation details. For multi-view RGBD re-rendering, we render multi-view RGBD images in a single render pass with the original color images as textures to improve efficiency. Finally, our system achieves reconstruction at 25 fps, with 21 ms for dynamic sliding fusion, 17 ms for deep implicit surface reconstruction (using 3 viewpoints), and 2 ms for surface extraction using the Marching Cubes algorithm [28].
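A simplified sketch of this depth-based culling and coarse-to-fine evaluation is shown below. `query_network` stands in for the TensorRT-accelerated GeoNet query, and the |SDF|-band subdivision test is an assumption about how an octree refinement in the spirit of [30] would be wired in here:

```python
import numpy as np

def coarse_to_fine_occupancy(query_network, occupied_mask_64,
                             levels=(64, 128, 256), band=0.02):
    """Evaluate the implicit function only where needed (sketch).

    occupied_mask_64: (64, 64, 64) bool mask of voxels that survive the
                      depth-map culling at the coarsest level.
    query_network:    callable mapping (N, 3) normalized points to SDF values.
    """
    active = np.argwhere(occupied_mask_64)               # coarse candidate voxels
    res = levels[0]
    sdf_volume = np.full((levels[-1],) * 3, 1.0, dtype=np.float32)

    for next_res in levels[1:]:
        centers = (active + 0.5) / res                    # voxel centers in [0, 1]^3
        sdf = query_network(centers)
        near_surface = np.abs(sdf) < band                 # refine only near the surface
        # Subdivide each near-surface voxel into its 8 children at the next level.
        parents = active[near_surface] * (next_res // res)
        offsets = np.stack(np.meshgrid([0, 1], [0, 1], [0, 1], indexing='ij'),
                           axis=-1).reshape(-1, 3)
        active = (parents[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
        res = next_res

    # Final fine-level evaluation; untouched voxels keep a positive (empty) SDF.
    centers = (active + 0.5) / res
    sdf_volume[tuple(active.T)] = query_network(centers)
    return sdf_volume
```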
6.2. Network Training Details

We use 500 high-quality scans, covering various poses, clothes, and human-object interactions, for training GeoNet and ColorNet. We rotate each scan around the yaw axis, apply random shifts, and render 60 views of the scan at an image resolution of 512×512. For color image rendering, we use PRT-based rendering as in [37]. For depth rendering, we first render ground-truth depth maps and then synthesize the sensor noise of ToF depth sensors on top of the depth maps according to [11]. Note that we render all the RGBD images using perspective projection to keep consistent with real-world sensors. During network training, gradient-based adaptive sampling (in which we use discrete Gaussian curvature and the RGB gradient as references for query-point sampling in GeoNet and ColorNet, respectively) is used for more effective sampling around detailed regions. We randomly select 3 views from the rendered 60 views of a subject for multi-view training.
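As an illustration of what such importance-driven sampling could look like, the sketch below draws query points around surface vertices with probability proportional to a per-vertex saliency score (discrete Gaussian curvature for GeoNet, RGB gradient magnitude for ColorNet). The Gaussian perturbation scale is an assumed hyper-parameter, not a value from the paper:

```python
import numpy as np

def adaptive_query_sampling(vertices, saliency, num_samples, sigma=0.01, rng=None):
    """Importance-driven query sampling around a training scan (sketch).

    vertices:  (V, 3) surface vertices of the ground-truth scan.
    saliency:  (V,) non-negative per-vertex score, e.g. discrete Gaussian
               curvature (GeoNet) or RGB gradient magnitude (ColorNet).
    Returns (num_samples, 3) query points concentrated around detailed regions.
    """
    rng = np.random.default_rng() if rng is None else rng
    prob = saliency + 1e-8                     # keep a small floor everywhere
    prob = prob / prob.sum()
    idx = rng.choice(len(vertices), size=num_samples, p=prob)
    # Perturb the picked surface points so queries fall on both sides of the surface.
    return vertices[idx] + rng.normal(scale=sigma, size=(num_samples, 3))
```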
Figure 9: Qualitative comparison. For each subject, from left to right are the results of our method, Motion2Fusion [9], and Multi-view PIFu [37], respectively.

Method               | P2S ×10⁻³ ↓ | Chamfer ×10⁻³ ↓ | Normal-Consis ↑
Multi-view PIFu [37] | 4.594       | 4.657           | 0.862
IPNet [2]            | 3.935       | 3.858           | 0.902
GeoNet               | 1.678       | 1.719           | 0.941

Table 1: Quantitative comparison of geometry reconstruction with Multi-view PIFu and IPNet.

6.3. Comparisons

Qualitative Comparison. The qualitative comparison with Motion2Fusion [9] and Multi-view PIFu [37] is shown in Fig. 9. Given very sparse and low-frame-rate depth inputs from consumer RGBD sensors, Motion2Fusion generates noisy results in regions with severe topological changes and under fast motions due to its deteriorated non-rigid tracking performance. Moreover, the lack of depth information in Multi-view PIFu leads to over-smoothed results.

Quantitative Comparison. We compare with two state-of-the-art deep implicit surface reconstruction methods, Multi-view PIFu [37] (with RGB images as input) and IPNet [2] (with voxelized point clouds as input).
We re-train their networks using our training dataset and perform the evaluation on a testing dataset that contains 116 high-quality scans with various poses, clothes, and human-object interactions. Tab. 1 shows the quantitative comparison results. We can see that the lack of depth information deteriorates the reconstruction accuracy of Multi-view PIFu. Moreover, even with multi-view depth images as input, the heavy dependency on SMPL initialization (which is difficult to obtain for large poses and human-object interactions) prevents IPNet from generating highly accurate results. Finally, by explicitly encoding the depth observations using truncated PSDF values, the proposed GeoNet not only achieves accurate reconstruction results but is also orders of magnitude faster than IPNet (which needs approximately 80 seconds per reconstruction). For a detailed description of the comparison, please refer to the supplementary material.

Method                  | P2S ×10⁻³ ↓ | Chamfer ×10⁻³ ↓ | Normal-Consis ↑
w/o PSDF, w. RGBD       | 2.36        | 2.458           | 0.916
w/o PSDF, w. depth only | 2.264       | 2.359           | 0.918
w. PSDF, w. depth only  | 1.678       | 1.719           | 0.941

Table 2: Ablation study on the truncated PSDF feature.
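For reference, the sketch below computes the two distance metrics reported in Tables 1 and 2 in their common form: the average point-to-surface (P2S) distance from scan points to the reconstruction and the symmetric Chamfer distance. The exact evaluation protocol (sampling density, normal-consistency definition) is not spelled out here, so treat these as the standard definitions rather than the paper's script:

```python
import numpy as np
from scipy.spatial import cKDTree

def p2s_and_chamfer(gt_points, pred_points):
    """Average point-to-surface and Chamfer distances between point sets (sketch).

    gt_points:   (N, 3) points sampled from the ground-truth scan.
    pred_points: (M, 3) points sampled from the reconstructed surface.
    """
    d_gt_to_pred, _ = cKDTree(pred_points).query(gt_points)   # scan -> reconstruction
    d_pred_to_gt, _ = cKDTree(gt_points).query(pred_points)   # reconstruction -> scan
    p2s = d_gt_to_pred.mean()
    chamfer = 0.5 * (d_gt_to_pred.mean() + d_pred_to_gt.mean())
    return p2s, chamfer
```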
6.4. Ablation Studies

Dynamic Sliding Fusion. As shown in Fig. 4(a) and (b), the depth inputs in different views are not consistent with each other due to the challenging hair motion. This results in incomplete reconstructions (the orange circle). More importantly, without dynamic sliding fusion, the result is much noisier (the red circle). With dynamic sliding fusion, we obtain more complete and noise-eliminated reconstruction results, as shown in Fig. 4(d). Please refer to the supplementary video for clearer evaluations.

Truncated PSDF Feature. The qualitative evaluation of the truncated PSDF feature is shown in Fig. 7. Tab. 2 also provides quantitative evaluation results for the networks with and without truncated PSDF values. We conduct two experiments, with RGBD images and with depth-mask images as input, respectively. We can see that without the truncated PSDF feature, the depth-only model and the RGBD model produce similar results. Benefiting from the truncated PSDF feature, our GeoNet achieves much more accurate results, which demonstrates the effectiveness of our method.

Attention-based Feature Aggregation. In Fig. 10, we qualitatively compare against models that do not use the multi-view self-attention mechanism for color inference. Benefiting from the multi-view self-attention mechanism, the color inference results become much sharper and more plausible, especially around observation boundaries. This is because self-attention enables dynamic feature aggregation rather than simple average-based feature aggregation, which encourages the MLP-based decoder to learn how multi-view features (including geometric features and texture features) are correlated with each other in 3D space.

Figure 10: Evaluation of the attention mechanism in ColorNet. From left to right are: input RGB images, and texture results with (green) and without (red) attention, respectively.

7. Conclusion

In this paper, we propose Function4D, a real-time volumetric capture system using very sparse consumer RGBD sensors. By proposing dynamic sliding fusion for topology-consistent volumetric fusion and detail-preserving deep implicit functions for high-quality surface reconstruction, our system achieves detailed and temporally continuous volumetric capture even under various extremely challenging scenarios. We believe that such a light-weight, high-fidelity, and real-time volumetric capture system will enable many applications, especially consumer-level holographic communication, online education, and gaming.

Limitations and Future Work. Although we can preserve the geometric details in the visible regions, generating accurate and detailed surfaces and textures for the fully occluded regions remains challenging. This is because current deep implicit functions mainly focus on per-frame independent reconstruction; extending deep implicit functions to use temporal observations may resolve this problem in the future. Moreover, specific materials like black hair may cause missing observations from the depth sensors and therefore severely deteriorate the current system; incorporating RGB information into geometry reconstruction may resolve this limitation, and we leave it as future work.

Acknowledgements. This work is supported by the National Key Research and Development Program of China No. 2018YFB2100500; the NSFC No. 61827805 and No. 61861166002; and the China Postdoctoral Science Foundation No. 2020M670340.
References

[1] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2shape: Detailed full human body geometry from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pages 2293–2303, 2019.
[2] Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Combining implicit function learning and parametric models for 3d human reconstruction. In European Conference on Computer Vision (ECCV). Springer, August 2020.
[3] Aljaž Božič, Michael Zollhöfer, Christian Theobalt, and Matthias Nießner. Deepdeform: Learning non-rigid rgb-d reconstruction with semi-supervised data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2020.
[4] Derek Bradley, Tiberiu Popa, Alla Sheffer, Wolfgang Heidrich, and Tamy Boubekeur. Markerless garment capture. In ACM SIGGRAPH 2008 Papers, pages 1–9, 2008.
[5] Thomas Brox, Bodo Rosenhahn, Juergen Gall, and Daniel Cremers. Combined region and motion-based 3d tracking of rigid and articulated objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):402–415, 2009.
[6] Rohan Chabra, Jan E. Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. Deep local shapes: Learning local sdf priors for detailed 3d reconstruction. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 608–625, Cham, 2020. Springer International Publishing.
[7] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Implicit functions in feature space for 3d shape reconstruction and completion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[8] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph., 34(4):69, 2015.
[9] Mingsong Dou, Philip L. Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. Motion2fusion: Real-time volumetric performance capture. ACM Trans. Graph., 36(6):246:1–246:16, 2017.
[10] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):1–13, 2016.
[11] P. Fankhauser, M. Bloesch, D. Rodriguez, R. Kaestner, M. Hutter, and R. Siegwart. Kinect v2 for mobile robot navigation: Evaluation and modeling. In 2015 International Conference on Advanced Robotics (ICAR), pages 388–394, 2015.
[12] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3d human shape estimation from single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 2232–2241, 2019.
[13] Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1753. IEEE, 2009.
[14] Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, Danhang Tang, Anastasia Tkach, Adarsh Kowdle, Emily Cooper, Mingsong Dou, Sean Fanello, Graham Fyffe, Christoph Rhemann, Jonathan Taylor, Paul Debevec, and Shahram Izadi. The relightables: Volumetric performance capture of humans with realistic relighting. ACM Trans. Graph., 38(6), Nov. 2019.
[15] Kaiwen Guo, Feng Xu, Yangang Wang, Yebin Liu, and Qionghai Dai. Robust non-rigid motion tracking and surface reconstruction using l0 regularization. In Proceedings of the IEEE International Conference on Computer Vision, pages 3083–3091, 2015.
[16] Kaiwen Guo, Feng Xu, Tao Yu, Xiaoyang Liu, Qionghai Dai, and Yebin Liu. Real-time geometry, albedo and motion reconstruction using a single rgbd camera. ACM Transactions on Graphics, 36(3):32:1–32:13, 2017.
[17] Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. Livecap: Real-time human performance capture from monocular video. ACM Transactions on Graphics (TOG), 38(2):1–17, 2019.
[18] Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, and Christian Theobalt. Deepcap: Monocular human performance capture using weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5052–5063, 2020.
[19] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pages 559–568, 2011.
[20] Takeo Kanade, Peter Rander, and P. J. Narayanan. Virtualized reality: Constructing virtual worlds from real scenes. IEEE MultiMedia, 4(1):34–47, 1997.
[21] Chao Li, Zheheng Zhang, and Xiaohu Guo. Articulatedfusion: Real-time reconstruction of motion, geometry and segmentation using a single depth camera. In European Conference on Computer Vision (ECCV), pages 324–340, Munich, 2018. Springer.
[22] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics, 28(5):1–10, 2009.
[23] Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olszewski, and Hao Li. Monocular real-time volumetric performance capture. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[24] Zhe Li, Tao Yu, Chuanyu Pan, Zerong Zheng, and Yebin Liu. Robust 3d self-portraits in seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1344–1353, 2020.
[25] Yebin Liu, Qionghai Dai, and Wenli Xu. A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Transactions on Visualization and Computer Graphics, 16(3):407–418, 2009.
[26] Yebin Liu, Juergen Gall, Carsten Stoll, Qionghai Dai, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of multiple characters using multiview image segmentation. IEEE T-PAMI, 35(11):2720–2735, 2013.
[27] Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of interacting characters using multi-view image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1249–1256. IEEE, 2011.
[28] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.
[29] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[30] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] A. Mustafa, H. Kim, J.-Y. Guillemaut, and A. Hilton. Temporally coherent 4d reconstruction of complex dynamic scenes. In CVPR, 2016.
[32] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. Siclope: Silhouette-based clothed people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4490, 2019.
[33] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 343–352, 2015.
[34] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[35] Songyou Peng, Michael Niemeyer, et al. Convolutional occupancy networks. In ECCV, 2020.
[36] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J. Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):1–15, 2017.
[37] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2304–2314, 2019.
[38] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2020.
[39] Miroslava Slavcheva, Maximilian Baust, Daniel Cremers, and Slobodan Ilic. Killingfusion: Non-rigid 3d reconstruction without correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1395, 2017.
[40] Miroslava Slavcheva, Maximilian Baust, and Slobodan Ilic. Sobolevfusion: 3d reconstruction of scenes undergoing free non-rigid motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2646–2655, Salt Lake City, June 2018. IEEE.
[41] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007.
[42] Zhuo Su, Lan Xu, Zerong Zheng, Tao Yu, Yebin Liu, and Lu Fang. Robustfusion: Human volumetric capture with data-driven visual cues using a rgbd camera. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[43] Robert W. Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. ACM Transactions on Graphics, 26(3), July 2007.
[44] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc., 2017.
[46] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popovic. Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph., 27(3):97:1–97:9, 2008.
[47] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul E. Debevec, Jovan Popovic, Szymon Rusinkiewicz, and Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. ACM Trans. Graph., 28(5):174:1–174:11, 2009.
[48] Michael Waschbüsch, Stephan Würmlin, Daniel Cotting, Filip Sadlo, and Markus H. Gross. Scalable 3d video of dynamic scenes. The Visual Computer, 21(8-10):629–638, 2005.
[49] Chenglei Wu, Kiran Varanasi, Yebin Liu, Hans-Peter Seidel, and Christian Theobalt. Shading-based dynamic shape refinement from multi-view video under general illumination. In IEEE ICCV, pages 1108–1115, 2011.
[50] Minye Wu, Yuehao Wang, Qiang Hu, and Jingyi Yu. Multi-view neural human rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[51] Lan Xu, Zhuo Su, Lei Han, Tao Yu, Yebin Liu, and Lu Fang. Unstructuredfusion: Realtime 4d geometry and texture reconstruction using commercial RGBD cameras. IEEE Trans. Pattern Anal. Mach. Intell., 42(10):2508–2522, 2020.
[52] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. Monoperfcap: Human performance capture from monocular video. ACM Transactions on Graphics (ToG), 37(2):1–15, 2018.
[53] Tao Yu, Kaiwen Guo, Feng Xu, Yuan Dong, Zhaoqi Su, Jianhui Zhao, Jianguo Li, Qionghai Dai, and Yebin Liu. Bodyfusion: Real-time capture of human motion and surface geometry using a single depth camera. In IEEE International Conference on Computer Vision (ICCV), pages 910–919, Venice, 2017. IEEE.
[54] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7287–7296, Salt Lake City, June 2018. IEEE.
[55] Zerong Zheng, Tao Yu, Hao Li, Kaiwen Guo, Qionghai Dai, Lu Fang, and Yebin Liu. Hybridfusion: Real-time performance capture using a single depth sensor and sparse imus. In European Conference on Computer Vision (ECCV), pages 389–406, Munich, Sept 2018. Springer.
[56] Zerong Zheng, Tao Yu, Yebin Liu, and Qionghai Dai. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. arXiv, 2020.
[57] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. Deephuman: 3d human reconstruction from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pages 7739–7749, 2019.
[58] Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4491–4500, 2019.
[59] Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, et al. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (ToG), 33(4):1–12, 2014.
