Efficient Geometry-Aware 3D Generative Adversarial Networks
Eric R. Chan*†1,2, Connor Z. Lin*1, Matthew A. Chan*1, Koki Nagano*2, Boxiao Pan1, Shalini De Mello2, Orazio Gallo2, Leonidas Guibas1, Jonathan Tremblay2, Sameh Khamis2, Tero Karras2, and Gordon Wetzstein1

1 Stanford University    2 NVIDIA
Abstract

Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.

Figure 1. Our 3D GAN enables synthesis of scenes, producing high-quality, multi-view-consistent renderings and detailed geometry. Our approach trains from a collection of 2D images without target-specific shape priors, ground truth 3D scans, or multi-view supervision. Please see the accompanying video for more results.

1. Introduction

Generative adversarial networks (GANs) have seen immense progress, with recent models capable of generating high-resolution, photorealistic images indistinguishable from real photographs [27–29]. Current state-of-the-art GANs, however, operate in 2D only and do not explicitly model the underlying 3D scenes.

Recent work on 3D-aware GANs has begun to tackle the problem of multi-view-consistent image synthesis and, to a lesser extent, extraction of 3D shapes without being supervised on geometry or multi-view image collections. However, the image quality and resolution of existing 3D GANs have lagged far behind those of 2D GANs. Furthermore, their 3D reconstruction quality, so far, leaves much to be desired. One of the primary reasons for this gap is the computational inefficiency of previously employed 3D generators and neural rendering architectures.

In contrast to 2D GANs, 3D GANs rely on a combination of a 3D-structure-aware inductive bias in the generator network architecture and a neural rendering engine that aims at providing view-consistent results. The inductive bias can be modeled using explicit voxel grids [14, 21, 47, 48, 68, 74] or neural implicit representations [4, 47, 49, 58]. While successful in single-scene "overfitting" scenarios, neither of these representations is suitable for training a high-resolution 3D GAN because they are simply too memory-inefficient or slow. Training a 3D GAN requires rendering tens of millions of images, but state-of-the-art neural volume rendering [45] at high resolutions with these representations is computationally infeasible. CNN-based image upsampling networks have been proposed to remedy this [49], but such an approach sacrifices view consistency and impairs the quality of the learned 3D geometry.

* Equal contribution.
† Part of the work was done during an internship at NVIDIA.
Project page: https://github.com/NVlabs/eg3d
We introduce a novel generator architecture for unsupervised 3D representation learning from a collection of single-view 2D photographs that seeks to improve the computational efficiency of rendering while remaining true to 3D-grounded neural rendering. We achieve this goal with a two-pronged approach. First, we improve the computational efficiency of 3D-grounded rendering with a hybrid explicit-implicit 3D representation that offers significant speed and memory benefits over fully implicit or explicit approaches without compromising on expressiveness. These advantages enable our method to skirt the computational constraints that have limited the rendering resolutions and quality of previous approaches [4, 58] and forced over-reliance on image-space convolutional upsampling [49]. Second, although we use some image-space approximations that stray from 3D-grounded rendering, we introduce a dual-discrimination strategy that maintains consistency between the neural rendering and our final output to regularize their undesirable view-inconsistent tendencies. Moreover, we introduce pose-based conditioning to our generator, which decouples pose-correlated attributes (e.g., facial expressions) for a multi-view-consistent output during inference while faithfully modeling the joint distributions of pose-correlated attributes inherent in the training data.

As an additional benefit, our framework decouples feature generation from neural rendering, enabling it to directly leverage state-of-the-art 2D CNN-based feature generators, such as StyleGAN2, to generalize over spaces of 3D scenes while also benefiting from 3D multi-view-consistent neural volume rendering. Our approach not only achieves state-of-the-art qualitative and quantitative results for view-consistent 3D-aware image synthesis, but also generates high-quality 3D shapes of the synthesized scenes due to its strong 3D-structure-aware inductive bias (see Fig. 1).

Our contributions are the following:

• We introduce a tri-plane-based 3D GAN framework, which is both efficient and expressive, to enable high-resolution geometry-aware image synthesis.
• We develop a 3D GAN training strategy that promotes multi-view consistency via dual discrimination and generator pose conditioning while faithfully modeling pose-correlated attribute distributions (e.g., expressions) present in real-world datasets.
• We demonstrate state-of-the-art results for unconditional 3D-aware image synthesis on the FFHQ and AFHQ Cats datasets along with high-quality 3D geometry learned entirely from 2D in-the-wild images.

Figure 2. Neural implicit representations use fully connected layers (FC) with positional encoding (PE) to represent a scene, which can be slow to query (a). Explicit voxel grids or hybrid variants using small implicit decoders are fast to query, but scale poorly with resolution (b). Our hybrid explicit-implicit tri-plane representation (c) is fast and scales efficiently with resolution, enabling greater detail for equal capacity.

2. Related work

Neural scene representation and rendering. Emerging neural scene representations use differentiable 3D-aware representations [1, 3, 6, 8, 13, 17, 43, 44, 52, 65] that can be optimized using 2D multi-view images via neural rendering [15, 20, 24, 30, 34–37, 40, 45, 46, 50, 51, 54, 62, 63, 70–72]. Explicit representations, such as discrete voxel grids (Fig. 2b), are fast to evaluate but often incur heavy memory overheads, making them difficult to scale to high resolutions or complex scenes [38, 61]. Implicit representations, or coordinate networks (Fig. 2a), offer potential advantages in memory efficiency and scene complexity compared to discrete voxel grids by representing a scene as a continuous function (e.g., [43, 45, 52, 60, 66]). In practice, these implicit architectures use large fully connected networks that are slow to evaluate, as each query requires a full pass through the network. Therefore, fully explicit and implicit representations provide complementary benefits.

Local implicit representations [3, 5, 23, 56] and hybrid explicit-implicit representations [11, 35, 39, 53] combine the benefits of both types of representations by offering computationally and memory-efficient architectures. Inspired by these ideas, we design a new hybrid explicit-implicit 3D-aware network that uses a memory-efficient tri-plane representation to explicitly store features on axis-aligned planes that are aggregated by a lightweight implicit feature decoder for efficient volume rendering (Fig. 2c). Our representation bears some resemblance to previous plane-based hybrid architectures [11, 53], but it is unique in its specific design. Our representation is key to enabling the high 3D GAN image quality that we demonstrate through efficient training comparable (in time scales) to modern 2D GANs [27].

Generative 3D-aware image synthesis. Generative adversarial networks [16] have recently achieved photorealistic image quality for 2D image synthesis [25, 28, 29, 55]. Extending these capabilities to 3D settings has started to gain momentum as well. Mesh-based approaches build on the most popular primitives used in computer graphics, but lack the expressiveness needed for high-fidelity image generation [33, 64]. Voxel-based GANs directly extend the CNN generators used in 2D settings to 3D [14, 21, 47, 48, 68, 74]. The high memory requirements of voxel grids and the computational burden of 3D convolutions, however, make high-resolution 3D GAN training difficult. Low-resolution 3D volume generation can be remedied with 2D CNN-based image upsampling layers [49], but without an inductive 3D bias the results often lack view consistency. Block-based sparse volume representations overcome some of these issues, but are applicable to mostly empty scenes [19, 35] and difficult to generalize across scenes. As an alternative, fully implicit representation networks have been proposed for 3D scene generation [4, 58], but these architectures are slow to query, which makes GAN training inefficient, limiting the quality and resolution of generated images.

One of the primary insights of our work is that an efficient 3D GAN architecture with 3D-grounded inductive biases is crucial for successfully generating high-resolution view-consistent images and high-quality 3D shapes. Our framework achieves this in several ways. First, unlike most existing 3D GANs, we directly leverage a 2D CNN-based feature generator, i.e., StyleGAN2 [29], removing the need for inefficient 3D convolutions on explicit voxel grids. Second, our tri-plane representation allows us to leverage neural volume rendering as an inductive bias, but in a much more computationally efficient way than fully implicit 3D networks [4, 45, 58]. Similar to [49], we also employ 2D CNN-based upsampling after neural rendering, but our method introduces dual discrimination to avoid view inconsistencies introduced by the upsampling layers. Unlike existing StyleGAN2-based 2.5D GANs, which generate images and depth maps [59], our method works naturally for steep camera angles and in 360° viewing conditions.

The concurrently developed 3D-aware GANs StyleNeRF [18] and CIPS-3D [73] demonstrate impressive image quality. The central distinction between these and ours is that while StyleNeRF and CIPS-3D operate primarily in image space, with less emphasis on the 3D representation, our method operates primarily in 3D. Our approach demonstrates greater view consistency and is capable of generating high-quality 3D shapes. Furthermore, our experiments report superior FID image scores on FFHQ and AFHQ.

Figure 3. A synthesized view of the multi-view Family scene, comparing a fully implicit Mip-NeRF representation (left), a dense voxel grid (center), and our tri-plane representation (right). Even though neither voxels nor tri-planes model view-dependent effects, they achieve high quality.

                 MLP       Rel. Speed ↑   Rel. Mem. ↓
Mip-NeRF [2]     8 × 256   1×             1×
Voxels (hybrid)  4 × 128   3.5×           0.33×
Tri-plane (SSO)  4 × 128   2.9×           0.32×
Tri-plane (GAN)  1 × 64    7.8×           0.06×

Table 1. Relative speedups and memory consumption compared to Mip-NeRF. The proposed tri-plane representation is 3–8× faster than a fully implicit Mip-NeRF network and only requires a fraction of its memory. In this example, both the voxel grid and the tri-plane representation use an MLP-based decoder, as indicated. The number of voxels is chosen to match the total parameters of the tri-plane representation, thus the resolution is relatively low and the memory footprint lower than Mip-NeRF. In the SSO experiment (Fig. 3), we used a larger decoder for the tri-plane representation than for the GAN experiments discussed in Sec. 4 to optimize expressiveness over speed.

3. Tri-plane hybrid 3D representation

Training a high-resolution GAN requires a 3D representation that is both efficient and expressive. In this section, we introduce a new hybrid explicit-implicit tri-plane representation that offers both of these advantages. We introduce the representation in this section for a single-scene overfitting (SSO) experiment, before discussing how it is integrated in our GAN framework in the next section.

In the tri-plane formulation, we align our explicit features along three axis-aligned orthogonal feature planes, each with a resolution of N × N × C (Fig. 2c), with N being the spatial resolution and C the number of channels. We query any 3D position x ∈ R³ by projecting it onto each of the three feature planes, retrieving the corresponding feature vectors (F_xy, F_xz, F_yz) via bilinear interpolation, and aggregating the three feature vectors via summation. An additional lightweight decoder network, implemented as a small MLP, interprets the aggregated 3D features F as color and density. These quantities are rendered into RGB images using (neural) volume rendering [41, 45].

The primary advantage of this hybrid representation is efficiency—by keeping the decoder small and shifting the bulk of the expressive power into the explicit features, we reduce the computational cost of neural rendering compared to fully implicit MLP architectures [2, 45] without losing expressiveness.
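To make the query path concrete, the following is a minimal PyTorch sketch of tri-plane sampling as described above (projection onto the three planes, bilinear interpolation, and summation). The coordinate convention, tensor shapes, and function name are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, coords):
    """planes: (3, C, N, N) feature planes; coords: (M, 3) points in [-1, 1]^3.
    Returns (M, C) features aggregated by summation."""
    xy = coords[:, [0, 1]]   # projection onto the xy plane
    xz = coords[:, [0, 2]]   # projection onto the xz plane
    yz = coords[:, [1, 2]]   # projection onto the yz plane
    feats = []
    for plane, proj in zip(planes, (xy, xz, yz)):
        grid = proj.reshape(1, -1, 1, 2)            # grid_sample expects (B, H, W, 2)
        f = F.grid_sample(plane.unsqueeze(0), grid,
                          mode='bilinear', align_corners=False)  # (1, C, M, 1)
        feats.append(f.reshape(plane.shape[0], -1).t())           # (M, C)
    return sum(feats)  # aggregate the three plane features by summation
```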
Figure 4. Our 3D GAN framework comprises several parts: a pose-conditioned StyleGAN2-based feature generator and mapping network, a tri-plane 3D representation with a lightweight feature decoder, a neural volume renderer, a super-resolution module, and a pose-conditioned StyleGAN2 discriminator with dual discrimination. This architecture elegantly decouples feature generation and neural rendering, allowing the use of a powerful StyleGAN2 generator for 3D scene generalization. Moreover, the lightweight 3D tri-plane representation is both expressive and efficient in enabling high-quality 3D-aware view synthesis in real time.

To validate that the tri-plane representation is compact yet sufficiently expressive, we evaluate it with a common novel-view synthesis setup. For this purpose, we directly optimize the features of the planes and the weights of the decoder to fit 360° views of a scene from the Tanks & Temples dataset [31] (Fig. 3). In this experiment, we use feature planes of resolution N = 512 and C = 48 channels, paired with an MLP of four layers of 128 hidden units each and a Fourier feature encoding [66]. We compare the results against a dense feature volume of equal capacity. For reference, we include comparisons to a state-of-the-art fully implicit 3D representation [2]. Fig. 3 and Tab. 1 demonstrate that the tri-plane representation is capable of representing this complex scene, albeit without view-dependent effects, outperforming dense feature volume representations [38, 61] and fully implicit representations [45] in terms of PSNR and SSIM, while offering considerable advantages in computation and memory efficiency. For a side length of N features, tri-planes scale with O(N²) rather than O(N³) as dense voxels do, which means that for equal capacity and memory, the tri-plane representation can use higher-resolution features and capture greater detail. Finally, our tri-plane representation has one other key advantage over these alternatives: the feature planes can be generated with an off-the-shelf 2D CNN-based generator, enabling generalization across 3D representations using the GAN framework discussed next.
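For a concrete sense of this scaling advantage, a back-of-the-envelope comparison using the SSO settings above (N = 512, C = 48) and the equal-capacity constraint of Tab. 1: the tri-planes store 3N²C = 3 · 512² · 48 ≈ 3.8 × 10⁷ explicit features, whereas a dense voxel grid with the same feature budget and channel count is limited to a side length of (3N²)^(1/3) = (3 · 512²)^(1/3) ≈ 92, i.e., roughly 92³ resolution versus 512² planes.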
Our 3D GAN framework uses the tri-plane representation to efficiently render images through neural volume rendering, but makes a number of modifications to adapt this representation to the 3D GAN setting. Unlike in the SSO experiment, where the features of the planes were directly optimized from the multiple input views, for the GAN setting we generate the tri-plane features, each containing 32 channels, with the help of a 2D convolutional StyleGAN2 backbone (Sec. 4.1). Instead of producing an RGB image, in the GAN setting our neural renderer aggregates features from each of the 32-channel tri-planes and predicts 32-channel feature images from a given camera pose. This is followed by a "super-resolution" module to upsample and refine these raw neurally rendered images (Sec. 4.2). The generated images are critiqued by a slightly modified StyleGAN2 discriminator (Sec. 4.3). The entire pipeline is trained end-to-end from random initialization, using the non-saturating GAN loss function [16] with R1 regularization [42], following the training scheme in StyleGAN2 [29]. To speed training, we use a two-stage training strategy in which we train with a reduced (64²) neural rendering resolution, followed by a short fine-tuning period at full (128²) neural rendering resolution. Additional experiments found that regularization to encourage smoothness of the density field helped reduce artifacts in 3D shapes. The following sections discuss major components of our framework in detail. For additional descriptions, implementation details, and hyperparameters, please see the supplement.
Figure 5. Dual discrimination ensures that the raw neural rendering I_RGB and the super-resolved output I⁺_RGB maintain consistency, enabling high-resolution and multi-view-consistent rendering.

The backbone's 256 × 256 × 96 output feature image is split channel-wise and reshaped to form three 32-channel planes (see Fig. 4). We choose StyleGAN2 for predicting the tri-plane features because it is a well-understood and efficient architecture achieving state-of-the-art results for 2D image synthesis. Furthermore, our model inherits many of the desirable properties of StyleGAN: a well-behaved latent space that enables style mixing and latent-space interpolation (see Sec. 5 and supplement).

We sample features from the tri-planes, aggregate by summation, and process the aggregated features with a lightweight decoder, as described in Sec. 3. Our decoder is a multi-layer perceptron with a single hidden layer of 64 units and softplus activation functions. The MLP does not use a positional encoding, coordinate inputs, or view-direction inputs. This hybrid representation can be queried at continuous coordinates and outputs a scalar density σ as well as a 32-channel feature, both of which are then processed by a neural volume renderer to project the 3D feature volume into a 2D feature image.
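A minimal sketch of this decoder under the layer sizes just stated; the ordering of the density and feature channels in the output vector is our assumption, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        # Single 64-unit hidden layer with softplus; no positional encoding,
        # coordinate inputs, or view-direction inputs.
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Softplus(),
            nn.Linear(hidden, 1 + feat_dim),  # scalar density + 32-channel feature
        )

    def forward(self, agg_feat):              # (M, 32) aggregated tri-plane features
        out = self.net(agg_feat)
        sigma = out[:, :1]                    # density fed to the volume renderer
        color_feat = out[:, 1:]               # 32-channel feature carried through rendering
        return sigma, color_feat
```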
Volume rendering [41] is implemented using two-pass importance sampling as in [45]. Following [49], volume rendering in our GAN framework produces feature images, rather than RGB images, because feature images contain more information that can be effectively utilized for the image-space refinement described next. For the majority of the experiments reported in this manuscript, we render 32-channel feature images I_F at a resolution of 128², with 96 total depth samples per ray.

4.3. Dual discrimination

As in standard 2D GAN training, the resulting renderings are critiqued by a 2D convolutional discriminator. We use a StyleGAN2 discriminator with two modifications.

First, we introduce dual discrimination as a method to avoid multi-view inconsistency issues observed in prior work [47, 49]. For this purpose, we interpret the first three feature channels of a neurally rendered feature image I_F as a low-resolution RGB image I_RGB. Intuitively, dual discrimination then ensures consistency between I_RGB and the super-resolved image I⁺_RGB. This is achieved by bilinearly upsampling I_RGB to the same resolution as I⁺_RGB and concatenating the results to form a six-channel image (see Fig. 4). The real images fed into the discriminator are also processed by concatenating each of them with an appropriately blurred copy of itself. We discriminate over these six-channel images instead of the three-channel images traditionally seen in GAN discriminators.

Dual discrimination not only encourages the final output to match the distribution of real images, but also offers additional effects: it encourages the neural rendering to match the distribution of downsampled real images, and it encourages the super-resolved images to be consistent with the neural rendering (see Fig. 5). The second point importantly allows us to leverage effective image-space super-resolution layers without introducing view-inconsistency artifacts.

Second, we make the discriminator aware of the camera poses from which the generated images are rendered. Specifically, following the conditional strategy from StyleGAN2-ADA [26], we pass the rendering camera intrinsics and extrinsics matrices (collectively P) to the discriminator as a conditioning label. We find that this conditioning introduces additional information that guides the generator to learn correct 3D priors. We provide additional studies in the supplement showing the effect of this discriminator conditioning and the robustness of our framework to high levels of noise in the input camera poses.
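A sketch of how the six-channel discriminator inputs described above could be assembled. The blur kernel size for real images is an assumption; the paper only specifies an "appropriately blurred" copy.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def fake_disc_input(I_rgb_raw, I_rgb_super):
    # I_rgb_raw: (B, 3, 128, 128) first three channels of the neural rendering
    # I_rgb_super: (B, 3, 512, 512) super-resolved output
    up = F.interpolate(I_rgb_raw, size=I_rgb_super.shape[-2:],
                       mode='bilinear', align_corners=False)
    return torch.cat([I_rgb_super, up], dim=1)       # (B, 6, 512, 512)

def real_disc_input(I_real):
    blurred = gaussian_blur(I_real, kernel_size=9)   # stand-in for the paper's blur
    return torch.cat([I_real, blurred], dim=1)       # (B, 6, H, W)
```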
Figure 6. Curated examples at 512², synthesized by models trained with FFHQ [28] and AFHQv2 Cats [7].

Figure 7. Qualitative comparison between GIRAFFE, π-GAN, Lifting StyleGAN, and ours, with FFHQ at 256². Shapes are iso-surfaces extracted from the density field using marching cubes. We inspected the underlying 3D representations of GIRAFFE and found that its over-reliance on image-space approximations significantly harms the learning of the 3D geometry.

Real-world datasets such as FFHQ contain attributes, for example facial expressions, that are correlated with camera pose; to reproduce the training data faithfully, our generator must model the pose-correlated attributes observed in the training images. To this end, we provide the backbone mapping network not only a latent code vector z, but also the camera parameters P as input, following the conditional generation strategy in [26]. By giving the backbone knowledge of the rendering camera position, we allow the target view to influence scene synthesis.

During training, pose conditioning allows the generator to model pose-dependent biases implicit to the dataset, allowing our model to faithfully reproduce the image distributions in the dataset. To prevent the scene from shifting with camera pose during inference, we condition the generator on a fixed camera pose when rendering from a moving camera trajectory. We noticed that always conditioning the generator with the rendering camera pose can lead to degenerate solutions where the GAN produces 2D billboards angled towards the camera (see supplement). To prevent this, we randomly swap the conditioning pose in P with another random pose with 50% probability during training.
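A minimal sketch of the 50% conditioning-pose swap described above; the pose parameterization and function interfaces are illustrative assumptions.

```python
import torch

def conditioning_pose(render_pose, pose_sampler, training=True, p_swap=0.5):
    """render_pose: (B, D) flattened camera intrinsics and extrinsics P.
    pose_sampler(n) returns n poses drawn from the dataset pose distribution."""
    if not training:
        # At inference, hold the conditioning pose fixed over a camera trajectory.
        return render_pose
    swap = torch.rand(render_pose.shape[0], device=render_pose.device) < p_swap
    random_pose = pose_sampler(render_pose.shape[0])
    return torch.where(swap.unsqueeze(1), random_pose, render_pose)
```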
5. Experiments and results

Datasets. We compare methods on the task of unconditional 3D-aware generation with FFHQ [28], a real-world human face dataset, and AFHQv2 Cats [7, 27], a small, real-world cat face dataset. We augment both datasets with horizontal flips and use off-the-shelf pose estimators [10, 32] to extract approximate camera extrinsics. For all methods on AFHQv2, we apply transfer learning [26] from corresponding FFHQ checkpoints; for our method on AFHQv2 512², we additionally use adaptive data augmentation [26]. For more results, please see the accompanying video.
                 FFHQ                               Cats
                 FID↓   ID↑    Depth↓   Pose↓       FID↓
GIRAFFE 256²     31.5   0.64   0.94     .089        16.1
π-GAN 128²       29.9   0.67   0.44     .021        16.0
Lift. SG 256²    29.8   0.58   0.40     .023        —
Ours 256²        4.8    0.76   0.31     .005        3.88
Ours 512²        4.7    0.77   0.39     .005        2.77†

Table 2. Quantitative evaluation using FID, identity consistency (ID), depth accuracy, and pose accuracy for FFHQ and AFHQ Cats. Labelled is the image resolution of training and evaluation. † Trained with adaptive data augmentation [26].

5.1. Comparisons

Baselines. We compare our method against three state-of-the-art methods for 3D-aware image synthesis: π-GAN [4], GIRAFFE [49], and Lifting StyleGAN [59].

Qualitative results. Fig. 6 presents selected examples synthesized by our model with FFHQ and AFHQ at a resolution of 512², highlighting the image quality, view-consistency, and diversity of outputs produced by our method. Fig. 7 provides a qualitative comparison against baselines. While GIRAFFE synthesizes high-quality images, its reliance on view-inconsistent convolutions produces poor-quality shapes and identity shift—note the hairline inconsistency between rendered views. π-GAN and Lifting StyleGAN generate adequate shapes and images, but both struggle with photorealism and with capturing detailed shapes. Our method synthesizes not only images that are higher quality and more view-consistent, but also higher-fidelity 3D geometry, as seen in the detailed glasses and hair strands.

Quantitative evaluations. Table 2 provides quantitative metrics comparing the proposed approach against baselines. We measure image quality with Fréchet Inception Distance (FID) [22] between 50k generated images and all available real images. We evaluate shape quality by calculating MSE against pseudo-ground-truth depth maps (Depth) and poses (Pose) estimated from synthesized images by [10]; a similar evaluation was introduced by [59]. We assess multi-view facial identity consistency (ID) by calculating the mean ArcFace [9] cosine similarity score between pairs of views of the same synthesized face rendered from random camera poses. Additional evaluation details are provided in the supplement. Our model demonstrates significant improvements in FID across both datasets, bringing the 3D GAN to near the same level as StyleGAN2 at 512² (2.97 for FFHQ [29] and 2.99 for Cats [26]), while also maintaining state-of-the-art view consistency, geometry quality, and pose accuracy.
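A sketch of how the ID metric could be computed; the generator and ArcFace interfaces below are placeholders, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def id_consistency(generator, arcface_embed, z_batch, sample_pose):
    """Mean ArcFace cosine similarity between two renderings of the same
    latent codes from independently sampled random camera poses."""
    img_a = generator(z_batch, sample_pose(len(z_batch)))
    img_b = generator(z_batch, sample_pose(len(z_batch)))
    emb_a = F.normalize(arcface_embed(img_a), dim=1)
    emb_b = F.normalize(arcface_embed(img_b), dim=1)
    return (emb_a * emb_b).sum(dim=1).mean()   # mean cosine similarity
```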
Runtime. Table 3 compares rendering speed at inference, running on a single NVIDIA RTX 3090 GPU. Our end-to-end approach achieves real-time framerates at 512² final resolution with 128² neural rendering resolution and 96 total depth samples per ray, suitable for applications such as real-time visualization. When rendering consecutive frames of a static scene, we need not regenerate the tri-plane features every frame; caching the generated features is a simple tweak that improves render speed. The proposed approach is significantly faster than fully implicit methods like π-GAN [4]. Although it is not as fast as Lifting StyleGAN [59] and GIRAFFE [49], we believe major improvements in image quality, geometry quality, and view-consistency outweigh the increased compute cost.

Res.    GIRAFFE   π-GAN   Lift. SG   Ours   Ours + TC
256²    181       5       51         27     36
512²    161       1       —          26     35

Table 3. Runtime in frames per second at different rendering resolutions. We compare variants of our approach with and without tri-plane caching (TC). Run on a single RTX 3090 GPU.
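A sketch of the tri-plane caching (TC) variant measured in Table 3: for a static scene the backbone only runs once, and each subsequent frame reuses the cached planes. Class and method names are assumptions.

```python
import torch

class CachedRenderer:
    def __init__(self, backbone, render_from_planes):
        self.backbone = backbone              # latent -> tri-plane features
        self.render = render_from_planes      # (planes, camera) -> image
        self._planes = None

    @torch.no_grad()
    def frame(self, z, camera):
        if self._planes is None:              # generate planes once per scene
            self._planes = self.backbone(z)
        return self.render(self._planes, camera)  # only rendering runs per frame
```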
5.2. Ablation study

Without dual discrimination, generated images can include multi-view inconsistencies due to the unconstrained image-space super-resolution layers. We measure this effect quantitatively by extracting smile-related Facial Action Coding System (FACS) [12] coefficients from videos produced by models with and without dual discrimination, using a proprietary facial tracker. We measure the standard deviation of smile coefficients for the same scene across video frames. A view-consistent scene should exhibit little expression shift and thus produce little variation in smile coefficients. This is validated in Table 4, which shows that introducing dual discrimination (second row) reduces the smile coefficient variation versus the naive model (first row), indicating improved expression consistency. However, dual discrimination also reduces image quality, as seen by the slightly worse FID score, perhaps because the model is restricted from reproducing the pose-correlated attribute biases in the FFHQ dataset. By adding generator pose conditioning (third row), we allow the generator to faithfully model pose-correlated attributes while decoupling them at inference, leading to both the best FID score and view-consistent results.

                    FID↓    FACS Smile Std.↓
Naive model         5.5     0.069
+ DD                6.5     0.054
+ DD, GPC (ours)    4.7     0.031

Table 4. Dual discrimination (DD) improves multi-view expression consistency but hurts the model's ability to capture pose-correlated attributes, reducing image quality. Adding generator pose conditioning (GPC) allows the model to improve upon both aspects. Reported at 512², with FFHQ.
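A sketch of the expression-consistency measurement in Table 4: the standard deviation of a smile-related FACS coefficient over the frames of one scene's camera sweep, averaged over scenes. The `smile_coefficient` function stands in for the proprietary facial tracker used in the paper.

```python
import numpy as np

def smile_std(videos, smile_coefficient):
    per_scene = []
    for frames in videos:                                   # one video per scene
        coeffs = np.array([smile_coefficient(f) for f in frames])
        per_scene.append(coeffs.std())                      # variation across views
    return float(np.mean(per_scene))
```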
Figure 8. Style-mixing [27–29] with FFHQ 512².

Figure 9. We use PTI [57] to fit a target image and recover the underlying 3D shape. Target (left); reconstructed image (center); reconstructed shape (right). From a model trained on FFHQ 512².
References

[1] Matan Atzmon and Yaron Lipman. SAL: Sign agnostic learning of shapes from raw data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[2] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[3] Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. Deep local shapes: Learning local SDF priors for detailed 3D reconstruction. In European Conference on Computer Vision (ECCV), 2020.
[4] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[5] Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[6] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[7] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] Thomas Davies, Derek Nowrouzezahrai, and Alec Jacobson. Overfit neural networks as a compact shape representation. arXiv preprint arXiv:2009.09808, 2020.
[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[10] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In IEEE Computer Vision and Pattern Recognition Workshops, 2019.
[11] Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, and Joshua M. Susskind. Unconstrained scene generation with locally conditioned radiance fields. arXiv preprint arXiv:2104.00670, 2021.
[12] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
[13] S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 2018.
[14] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3D shape induction from 2D views of multiple objects. In International Conference on 3D Vision, 2017.
[15] Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. FastNeRF: High-fidelity neural rendering at 200fps. arXiv preprint arXiv:2103.10380, 2021.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[17] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In International Conference on Machine Learning (ICML), 2020.
[18] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
[19] Zekun Hao, Arun Mallya, Serge Belongie, and Ming-Yu Liu. GANcraft: Unsupervised 3D neural rendering of Minecraft worlds. In IEEE International Conference on Computer Vision (ICCV), 2021.
[20] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. In IEEE International Conference on Computer Vision (ICCV), 2021.
[21] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato's cave: 3D shape from adversarial rendering. In IEEE International Conference on Computer Vision (ICCV), 2019.
[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[23] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. Local implicit grid representations for 3D scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[24] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. SDFDiff: Differentiable rendering of signed distance fields for 3D shape optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[25] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[26] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[27] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[30] Petr Kellnhofer, Lars Jebe, Andrew Jones, Ryan Spicer, Kari Pulli, and Gordon Wetzstein. Neural lumigraph rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[31] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[32] Taehee Brad Lee. Cat Hipsterizer, 2018. https://github.com/kairess/cat_hipsterizer.
[33] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas Geiger. Towards unsupervised learning of generative models for 3D controllable image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[34] David B. Lindell, Julien N. P. Martel, and Gordon Wetzstein. AutoInt: Automatic integration for fast neural volume rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[35] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[36] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3D supervision. arXiv preprint arXiv:1911.00767, 2019.
[37] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. DIST: Rendering deep implicit signed distance function with differentiable sphere tracing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[38] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (SIGGRAPH), 2019.
[39] Julien N. P. Martel, David B. Lindell, Connor Z. Lin, Eric R. Chan, Marco Monteiro, and Gordon Wetzstein. ACORN: Adaptive coordinate networks for neural representation. ACM Transactions on Graphics (SIGGRAPH), 2021.
[40] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[41] N. Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics (TVCG), 1995.
[42] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[43] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[44] Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. Implicit surface representations as layers in neural networks. In IEEE International Conference on Computer Vision (ICCV), 2019.
[45] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[46] Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H. Mueller, Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, and Markus Steinberger. DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks. Computer Graphics Forum, 40(4), 2021.
[47] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In IEEE International Conference on Computer Vision (ICCV), 2019.
[48] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. BlockGAN: Learning 3D object-aware scene representations from unlabelled images. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[49] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[50] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[51] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In IEEE International Conference on Computer Vision (ICCV), 2021.
[52] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[53] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In European Conference on Computer Vision (ECCV), 2020.
[54] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. arXiv preprint arXiv:2011.13961, 2020.
[55] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
[56] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs. In IEEE International Conference on Computer Vision (ICCV), 2021.
[57] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
[58] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[59] Yichun Shi, Divyansh Aggarwal, and Anil K. Jain. Lifting 2D StyleGAN for 3D-aware face generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[60] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[61] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[62] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[63] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[64] Attila Szabó, Givi Meishvili, and Paolo Favaro. Unsupervised generative 3D shape learning from natural images. arXiv preprint arXiv:1910.00287, 2019.
[65] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3D shapes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[66] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[67] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[68] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[69] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. arXiv preprint arXiv:2106.12052, 2021.
[70] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[71] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[72] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[73] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.
[74] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, and William T. Freeman. Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Supplemental Material
Efficient Geometry-aware 3D Generative Adversarial Networks
Eric R. Chan*†1,2, Connor Z. Lin*1, Matthew A. Chan*1, Koki Nagano*2, Boxiao Pan1, Shalini De Mello2, Orazio Gallo2, Leonidas Guibas1, Jonathan Tremblay2, Sameh Khamis2, Tero Karras2, and Gordon Wetzstein1

1 Stanford University    2 NVIDIA
(Sec. 4), such as datasets and baselines, and further explanations for experiments such as inversion. Lastly, we consider artifacts (Sec. 5) that may be targets of future work. We encourage readers to view the accompanying supplemental video, which contains additional visual results, including a live demonstration of real-time synthesis.

1. Additional experiments

1.1. Analyzing pose/facial expression correlation in FFHQ

Fig. 1 plots the likelihood a subject from FFHQ [17] is smiling (measured by [38]), against head yaw (computed by [9]). The plot indicates that individuals facing towards the camera are more likely to be smiling than are individuals who are facing away from the camera. An intuitive explanation for this phenomenon is that people who are knowingly being photographed, as in portrait images, are more likely to be smiling than people who are photographed candidly.

Left uncompensated for, this correlation between pose and facial expressions incentivizes "expression warping", where the expressions of synthesized faces shift as we move the camera. We propose dual discrimination (Section 4.3 of the main paper) and generator pose conditioning (Section 4.4 of the main paper) to reduce such expression warping.

Figure 1. We plot the probability of smiling against head yaw angle, as measured by [38]. People looking at the camera are more likely to be smiling than people angled away, indicating a correlation between scene appearance and camera pose.

1.2. COLMAP reconstruction

To further validate the multi-view consistency of our method, we employ COLMAP [32, 33] to reconstruct a point cloud of a synthesized video sequence (Fig. 2). We reconstruct a video sequence of 128 frames, taken from an oval trajectory similar to the camera paths shown in the supplemental video. We use COLMAP's "automatic" reconstruction, without specifying camera parameters. The resulting point cloud is dense and well-defined, indicating that our 3D GAN produces highly multi-view-consistent renderings.

Figure 2. COLMAP [32, 33] reconstruction of 128 frames of synthesized video (top) which followed an oval trajectory. The resulting dense, well-defined point cloud (bottom) is indicative of highly multi-view-consistent rendering.
1.3. Regularizing generator pose conditioning

              FFHQ                                    Cats               Cars
              FID↓   KID↓    ID↑    Depth↓   Pose↓    FID↓   KID↓        FID↓   KID↓
GIRAFFE 128²  —      —       —      —        —        —      —           27.3   1.703
GIRAFFE 256²  31.5   1.992   0.64   0.94     .089     16.1   2.723       —      —
π-GAN 128²    29.9   3.573   0.67   0.44     .021     16.0   1.492       17.3   0.932
Lift. SG 256² 29.8   —       0.58   0.40     .023     —      —           —      —
Ours 128²     —      —       —      —        —        —      —           2.75   0.097
Ours 256²     4.8    0.149   0.76   0.31     .005     3.88   0.091       —      —
Ours 512²     4.7    0.132   0.77   0.39     .005     2.77†  0.041†      —      —

Table 1. Quantitative evaluation using FID, KID×100, identity consistency (ID), depth accuracy, and pose accuracy for FFHQ [17] and FID, KID×100 for AFHQv2 Cats [7, 16] and ShapeNet Cars [6, 35]. Labeled is the image resolution of training and evaluation. † Trained with adaptive discriminator augmentation [15].

Without conditioning the discriminator on camera poses, the generator is prone to a degenerate solution in which it renders textures on a flat plane, without properly capturing the 3D shape of scenes. Providing even very imprecise camera poses is enough to break this tendency; conditioning the discriminator on camera poses distorted by three standard deviations of Gaussian noise still produces accurate 3D shapes. With extreme noise (e.g. four standard deviations), some scenes maintain the correct 3D structure while others are flattened onto the plane. Our results indicate that while our method requires additional information to prevent collapse, only very weak supervision is necessary. Future work may examine this tendency further and discover ways to prevent this undesirable behavior without requiring images to be labelled with poses.

1.5. Extrapolation to steep camera angles

Fig. 5 provides a visual comparison of our method against baselines for generating views from steep camera poses. We note that the FFHQ [17] dataset is primarily composed of front-facing images—few images depict faces from extreme yaw angles, and even fewer images depict faces from extreme pitch angles. Nevertheless, reasonable extrapolation to the edges of the pose distribution is a desirable quality and indicates reliance on a robust 3D representation.

Lifting StyleGAN [34], which represents scenes as a textured mesh, demonstrates consistent rendering quality. However, the steep camera angles reveal inaccurate 3D geometry (e.g. foreshortened faces) learned by the method. π-GAN [5] reasonably extrapolates to steep angles but exhibits visible quality degradation at the edges of the pose distribution. GIRAFFE [29], being highly reliant on view-inconsistent convolutions, has difficulty reproducing angles that are rarely seen in the dataset. If we force GIRAFFE to extrapolate beyond the camera poses sampled at training (e.g. the leftmost and rightmost images of Fig. 5b), we receive degraded, view-inconsistent images rather than renderings from steeper angles. The problem is amplified for pitch (Fig. 5a) because the dataset's pitch range is even narrower.

Our method, despite also using 2D convolutions, is less reliant on view-inconsistent convolutions for considering the placement of features in the final image. By utilizing an expressive 3D representation as a "scaffold", our method provides more reasonable extrapolation to rare views in both pitch and yaw than methods that more strongly depend on image-space convolutions for image synthesis, such as GIRAFFE [29].

Figure 5. We compare methods in their extrapolation to steep camera viewing angles. Labelled is the percentile for camera pitch or yaw. A yaw angle in the 96th percentile means 96% of training poses are less steep, i.e. 4% of training poses are beyond the given pose. (a) Extrapolation to steep pitch angles.

1.6. Additional quantitative results

Table 1 is an expanded version of Table 2 of the main manuscript that provides additional quantitative metrics, including Kernel Inception Distance [2] for all datasets and image quality evaluations for ShapeNet Cars. Strong relative performance on Cars, a dataset in which camera poses are distributed uniformly about the sphere, is evidence that our method is not restricted to face-forward datasets like FFHQ [17] and AFHQv2 [7, 16].

2. Additional visual results

Style mixing, in shapes. Fig. 6 shows the underlying shapes of the style mixing [17] examples in Fig. 8 of the main manuscript. While mixed examples inherit most of their shape structure from the modulations of the backbone's low-resolution layers, the modulations of the high-resolution layers can influence fine details in the shape, such as eye regions and hair patterns. The results were obtained from a model trained without style-mixing regularization.

Additional single image 3D reconstructions. Fig. 7 provides additional 3D reconstructions of single test images through Pivotal Tuning Inversion (PTI) [31] of a model trained on FFHQ 512². A pipeline for high-fidelity, single-image reconstruction of faces that does not require explicit 3D ground-truth training data opens the door for many promising applications, such as photo-to-avatar creation.

Figure 7. Additional single-view 3D reconstructions of test images demonstrate a use for our generator's learned prior over facial features.
Shapenet Cars. Fig. 8 contains uncurated renderings from random camera poses for models trained with ShapeNet Cars [6, 35]. This experiment serves as a demonstration that our method is capable of operating successfully on datasets that include camera poses that span the entire 360° camera azimuth and 180° camera elevation distributions, unlike 2.5D GANs [34], which are intended for face-forward datasets.

Figure 8. Qualitative comparison of uncurated examples of cars. All methods are sampled with truncation [4, 17, 25], using ψ = 0.7.

Additional selected examples synthesized with AFHQv2 Cats. Fig. 9 shows renderings and shapes for selected examples, synthesized by our method trained on AFHQv2 Cats [7, 16] 512².

Figure 9. Curated examples from a model trained on AFHQv2 [7, 16] 512².

3. Implementation details

We implemented our 3D GAN framework on top of the official PyTorch implementation of StyleGAN2, an updated version of which is available at https://github.com/NVlabs/stylegan3. Most of our training parameters are identical to those of StyleGAN2 [18], including the use of equalized learning rates for the trainable parameters [14], a minibatch standard deviation layer at the end of the discriminator [14], exponential moving average of the generator weights, and a non-saturating logistic loss [12] with R1 regularization [26].

Two-stage training. In order to save computational resources, we perform the majority of the training at a neural rendering resolution of 64², before gradually stepping the resolution up to 128². Note that the final image resolution remains fixed throughout training (e.g. 256² or 512²). We implement this simply by bilinearly resizing the raw neural rendering I_RGB to 128² before it is operated on by the super-resolution module. Thus, the super-resolution module always receives a 128²-sized feature map as an input, regardless of the actual neural rendering resolution. In contrast to previous progressive growing strategies [5, 14] that double the resolution in a single step, we gradually increase the neural rendering resolution, pixel-by-pixel, over 1 million images, i.e., (64², 65², 66², ..., 126², 127², 128²). We continue training with the resolution fixed at 128² for an additional 1.5 million images, for a total of 2.5M iterations of fine-tuning. This two-stage training procedure provides a roughly 2× speed-up versus training from scratch at full resolution and produces similar results to training at full neural rendering resolution from scratch.
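A sketch of the pixel-by-pixel resolution ramp described above; the linear spacing of the steps is an assumption (the text only states that the ramp covers the first 1M images of fine-tuning before the resolution is held at 128²).

```python
def neural_rendering_resolution(images_seen, ramp_images=1_000_000,
                                start_res=64, end_res=128):
    """Resolution of the raw neural rendering as a function of images seen."""
    if images_seen >= ramp_images:
        return end_res                       # held fixed after the ramp
    frac = images_seen / ramp_images
    return start_res + int(frac * (end_res - start_res))  # 64, 65, ..., 128
```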
Backbone. Our backbone (i.e., StyleGAN2 generator) follows the implementation of [18], with a mapping network of 8 hidden layers. For all of our experiments (regardless of final image resolution), the backbone operates at a resolution of 256². We modify the output convolutions such that they produce a 96-channel output feature image, which we reshape into three planes, each of shape 256 × 256 × 32. Unlike approaches that require pre-trained 2D image GANs [34], we do not utilize pre-trained StyleGAN2 checkpoints for the backbone; the entire pipeline is trained end-to-end. For large datasets, such as FFHQ [17] and ShapeNet Cars [6, 35], we train from scratch with random initialization; for small datasets, such as AFHQv2 [7, 16], we follow prevailing methodology [15] by fine-tuning from a checkpoint trained on a larger dataset.

Volume rendering is implemented using two-pass importance sampling. For FFHQ [17] and AFHQv2 [7, 16], we use 48 uniformly-spaced and 48 importance samples per ray; for ShapeNet Cars, we use 64 uniformly-spaced and 64 importance samples per ray. When rendering videos that feature thin surfaces, we found it beneficial to increase the samples per ray during inference to reduce flicker.
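A sketch of how the backbone's 96-channel output could be reshaped into the three 32-channel tri-planes described above; the exact tensor layout is illustrative.

```python
import torch

def split_backbone_output(feat):
    # feat: (B, 96, 256, 256) feature image from the StyleGAN2 backbone
    B, C, H, W = feat.shape
    planes = feat.view(B, 3, C // 3, H, W)   # (B, 3, 32, 256, 256)
    return planes                             # one 32-channel plane per axis-aligned plane
```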
Figure 10. Uncurated examples of cats, for GIRAFFE [29] 2562 , π-GAN 1282 , and our method 5122 . All methods are sampled with
truncation [4, 17, 25], using ψ = 0.7.
Discriminator. Our discriminator is a StyleGAN2 [18] discriminator with two modifications. First, to enable dual discrimination, we adjust the input layer to accept six-channel input images rather than three-channel input images. Fig. 14 provides a diagram that illustrates the creation of these six-channel inputs for both real and generated images. Second, we condition the discriminator on the camera parameters of the incoming image to help prevent degenerate shape solutions; we follow the class-conditional discriminator modifications of [15] to inject this information.

Mixed Precision. To speed up training, we use a mixed-precision methodology similar to [15]. We use FP16 in the four highest-resolution blocks of the discriminator and in both blocks of our super-resolution module. We do not use FP16 in our generator backbone.

R1 Regularization. We use R1 regularization [26] with γ = 1 for all datasets and resolutions, except for ShapeNet Cars, where we use γ = 0.1. Regularization strengths were informally chosen based on values that have shown success with previous methods [15, 18].

Density Regularization. Further experiments, conducted after our initial submission, suggested that additional regularization over the estimated density field reduces the prevalence of undesirable seams and other shape artifacts. Similar to the total variation regularization used in previous work [23], our density regularization encourages smoothness of the density field. For each generated scene in the batch, we randomly sample points x in the volume, along with additional 'perturbed' points that are offset by a small amount of Gaussian noise, δx. Our density regularization loss is an L1 loss that minimizes the difference between the estimated densities σ(x) and σ(x + δx). We apply our
density regularization over 1000 pairs of randomly sampled points every four training iterations.

Figure 11. Images and geometry for seeds 0–31, synthesized using a model trained on FFHQ [17] at 512². Sampled with truncation [17], using ψ = 0.5.

Training. We train all models with a batch size of 32. We use a discriminator learning rate of 0.002 and a generator learning rate of 0.0025. Following [16], we blur images as they enter the discriminator, gradually reducing the blur amount over the first 200K images. Unlike [18], we train without style-mixing regularization.

Using the two-stage training discussed previously, we train at a neural rendering resolution of 64² for 25M images and at 128² for an additional 2.5M images. At a neural rendering resolution of 64², our 3D GAN framework takes ∼24 seconds to train on 1000 images (24 s/kimg) on 8 Tesla V100 GPUs; this increases to 46 s/kimg at a neural rendering resolution of 128². For reference, StyleGAN3-R [16] achieves training rates of 20 s/kimg on similar hardware. Our total training time on 8 Tesla V100 GPUs is on the order of 8.5 days (7 days of 64² training, plus 1.5 days of 128² fine-tuning), compared to 6 days on similar hardware for StyleGAN3-R.

Inference-time depth samples. We use neural volume rendering [27] with two-pass importance sampling to render feature images from our tri-plane representation. We found that increasing the number of samples per ray at inference time can reduce unwanted flickering when rendering videos that feature thin objects such as eyeglasses. For clips shown in the supplemental video, we double both the number of coarse samples (from 48 to 96) and the number of fine samples (from 48 to 96), bringing the total number of depth samples per ray to 192. Increasing the number of samples per ray incurs a penalty to rendering speed: with 96 depth samples per pass, frame rates are reduced to approximately 24 frames per second with tri-plane caching, down from 36 frames per second when using the default 48 samples per pass. Images shown in the main manuscript were synthesized without increasing the number of depth samples along each ray.

Figure 12. Linear interpolations between latent codes, showing renderings and shapes.

AFHQv2. Following [15], we fine-tune from FFHQ-trained models to achieve optimum performance on Cats. Beginning from a checkpoint trained on FFHQ, we train for 6.2M images at a neural rendering resolution of 64², and for an additional 2.6M images while fine-tuning the neural rendering resolution up to 128². Because π-GAN and GIRAFFE were not designed with the benefits of adaptive discriminator augmentation (ADA) [15], we also do not use ADA for our method at 256², in an effort to keep comparisons across methods fair. We use adaptive discriminator augmentation
with its default settings, for our method only at 512².

4. Experiment details

4.1. Baselines

π-GAN [5] is a 3D-aware GAN that relies upon a FiLM-conditioned MLP with periodic activation functions for camera-controllable synthesis. We utilized the official code (https://github.com/marcoamonteiro/pi-GAN) and trained until convergence with the parameters recommended for analogous datasets.

GIRAFFE [29] is a 3D-aware GAN that incorporates a compositional 3D scene representation to enable controllable synthesis. We utilized the official code (https://github.com/autonomousvision/giraffe) and trained until convergence with the parameters recommended for analogous datasets.

Lifting StyleGAN [34] is a method for disentangling and lifting a pre-trained StyleGAN2 image generator for 3D-aware face generation. The original Lifting StyleGAN manuscript reports results on a slightly tighter crop of FFHQ than we used. Because we had difficulty matching the quality of Lifting StyleGAN's pre-trained model when we trained it from scratch on our less-cropped dataset, we instead used their official pre-trained model for their tighter crops, along with the FID score reported in their manuscript. We utilized the official code (https://github.com/seasonSH/LiftedGAN).

Figure 13. Additional selected examples, from a model trained on FFHQ [17] at 512².

StyleGAN2 is a style-based GAN that achieves state-
of-the-art image quality for 2D image synthesis and features a well-behaved latent space that enables image manipulation. We obtained a pre-trained checkpoint for StyleGAN2 on FFHQ 512² from the collection of official models (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/research/models/stylegan2). Following the recommended tuning of [15], we trained both StyleGAN2 config F and the 512 × 512 config from [15], sweeping the R1 [26] regularization strength γ ∈ {0.2, 0.5, 1, 2, 5, 10, 20}. The best result for AFHQv2 was obtained with StyleGAN2 config F, after training for 10M images at γ = 1.

4.2. Dataset Details

FFHQ. We prepare our dataset by starting with the "in-the-wild" version of the FFHQ dataset [17], which is composed of uncropped, original PNG images of people sourced from Flickr. We use an off-the-shelf face detection and pose-extraction pipeline [9] to both identify the face region and label the image with a pose. We crop the images to roughly the same size as the original FFHQ dataset.

We assume fixed camera intrinsics across the entire dataset, with a focal length of 4.26 × the image width, equivalent to a standard portrait lens. We prune a small number of images that resisted face detection; our final dataset contains 69,957 images. We augment the dataset with horizontal flips.
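As an illustration of the fixed-intrinsics assumption above, the following sketch builds an intrinsics matrix from the stated focal length. The principal point at the image center is our assumption here, as is the pixel-unit convention; the function name is illustrative.

    import numpy as np

    def ffhq_intrinsics(image_width=512, image_height=512):
        # Focal length is assumed fixed at 4.26 x image width for every image.
        fx = fy = 4.26 * image_width
        # Assumed: principal point at the image center.
        cx, cy = image_width / 2.0, image_height / 2.0
        return np.array([[fx, 0.0, cx],
                         [0.0, fy, cy],
                         [0.0, 0.0, 1.0]])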
Figure 14. In dual discrimination, we discriminate on a six-channel concatenation of the final image and the raw neural rendering, in order to maintain consistency between high-resolution final images and view-consistent (but low-resolution) neural renderings. This diagram illustrates how we obtain a six-channel discriminator input tensor for both real and fake images. Our generator produces both a 512² final rendering (I⁺RGB) as well as the 128² raw neural rendering (IRGB). The raw rendering, IRGB, is the first three channels of the 32-channel rendered features, IF. We create a six-channel discriminator input by upsampling the raw image to 512² and concatenating it with the final image to form a 512 × 512 × 6 discriminator input tensor. For real images, we extract a 512² real image from the dataset and downsample it to the same size as IRGB (128²) to obtain an analogue of the raw rendering. We then upsample this image back to 512² and concatenate it with the original image to form a 512 × 512 × 6 discriminator input tensor. The downsample-then-upsample operation has the effect of blurring the original image.
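A minimal sketch of this six-channel construction is shown below. Bilinear resizing is assumed here for both the fake and real branches; the actual implementation may use a different (e.g., antialiased) filter, and the function names are illustrative.

    import torch
    import torch.nn.functional as F

    def fake_disc_input(final_img, raw_rgb):
        # final_img: (N, 3, 512, 512) super-resolved output (I+RGB).
        # raw_rgb:   (N, 3, 128, 128) first three channels of the rendered features IF.
        raw_up = F.interpolate(raw_rgb, size=final_img.shape[-2:],
                               mode='bilinear', align_corners=False)
        return torch.cat([final_img, raw_up], dim=1)        # (N, 6, 512, 512)

    def real_disc_input(real_img, raw_size=128):
        # real_img: (N, 3, 512, 512) image from the dataset.
        down = F.interpolate(real_img, size=(raw_size, raw_size),
                             mode='bilinear', align_corners=False)
        blurred = F.interpolate(down, size=real_img.shape[-2:],
                                mode='bilinear', align_corners=False)
        return torch.cat([real_img, blurred], dim=1)        # (N, 6, 512, 512)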
4.3. Single scene overfitting

To illustrate the effectiveness of our architecture, we evaluate the relative performance of the tri-plane 3D representation against a comparable voxel-based hybrid representation and Mip-NeRF [1] on the Family scene of the Tanks & Temples dataset [19], as described in Section 3 of the main manuscript. We use the pre-processed images, as well as the training/test split, of [22]. We use 512 uniformly-spaced depth samples and 256 importance samples per ray, with a ray batch size of 6400. The tri-planes are treated as learnable parameters of shape 3×48×512×512. The dense voxel parameters were chosen to optimize quality at a parameter count comparable to that of the tri-planes; the voxel features are of shape 18×128×128×128. Both the voxel and tri-plane hybrid representations are coupled with two-layer, 128-hidden-unit decoders with Fourier feature embeddings [36]. We train the voxel and tri-plane representations for 200K iterations; we train Mip-NeRF for the recommended 1M iterations.

4.4. Pivotal tuning inversion

We use off-the-shelf face detection [9] to extract appropriately-sized crops and camera extrinsics from test images, and we resize each cropped image to 512². We follow Pivotal Tuning Inversion (PTI) [31], optimizing the latent code for 500 iterations, followed by fine-tuning the generator weights for an additional 500 iterations.

For inversion of grayscale images, we convert the generator's 3-channel RGB renderings to perceived luminance, Y, before computing the image distance loss during optimization. This allows the generator's prior to colorize the renderings. To compute single-channel luminance from 3-channel RGB images, we use Y = 0.299R + 0.587G + 0.114B. For grayscale optimization, we use 400 latent code inversion steps and 250 generator fine-tuning steps.

4.5. Evaluation Metrics

FID and KID. We compute Fréchet Inception Distance (FID) [13] and Kernel Inception Distance (KID) [2] image quality metrics between 50k generated images and all training images, using the implementation provided in the StyleGAN3 [16] codebase.

Geometry. We follow a similar procedure to [34] in the evaluation of geometry. We generate 1024 images and depth maps from random poses that match the dataset pose distribution. Using a pre-trained 3D face reconstruction model [9], we generate a "pseudo" ground-truth depth map for each generated image. Next, we limit both the generated depth maps and "pseudo" ground-truth depth maps to the facial regions defined by the reconstruction model. Finally, we normalize all depth maps to zero mean, unit variance and calculate the L2 distance between them.

Multi-view consistency. We evaluate multi-view consistency and face identity preservation for models trained on FFHQ [17] by measuring ArcFace [8] cosine similarity. For each method, we generate 1024 random faces and render two views of each face from poses randomly selected from the training dataset pose distribution. For each image pair, we measure facial identity similarity [8] and compute the mean score.

Pose accuracy. We evaluate pose accuracy with the help of a pre-trained face reconstruction model [9]. With [9], we detect pitch, yaw, and roll from 1024 generated images, then compute the L2 error against the ground-truth poses to determine each model's pose drift.

Runtime. We evaluate runtime for each model by calculating the average framerate over a 400-frame sequence. We process frames consecutively, i.e., with batch size 1. In order to give each method a best-case scenario, we ignore operations such as copying rendered frames from GPU to CPU and saving files to disk.

FACS estimation. In Section 5.2 of the main paper, we quantitatively measure the effect of dual discrimination and generator pose conditioning at preserving facial expressions across multi-view face videos. To evaluate facial expressions, we employ a proprietary facial tracker that measures detailed movement of sub-regions of the face in terms of Facial Action Coding System (FACS) [10] coefficients. Specifically, our facial tracker measures all 53 FACS blendshape coefficients defined in Li et al. [21], and we compare the variability in the 'mouthSmile_L' and 'mouthSmile_R' blendshape coefficients across the different videos.

4.6. Visualization of Geometry

To visualize shapes, we sample the volume to obtain a 512³ cube of density values and extract the surface of the scene as a mesh using Marching Cubes [24]. We found that a levelset between 0 and 10 generally yielded visually appealing results. Renderings of shapes shown in this manuscript were generated using ChimeraX [11].

5. Discussion

5.1. Shape artifacts

Despite significant improvements in the quality of the 3D geometry compared to previous methods, our synthesized shapes are not free from artifacts, which are visible in geometry renderings throughout the main paper and supplement
(e.g., Fig. 11, Fig. 13). Sunken eye sockets create the illusion of eyes that follow the viewing camera, even when the geometry and neural renderings are view-consistent; the well-known "hollow-face illusion" produces a similar effect in the physical world. Similarly, deep creases near the corners of the mouth can create apparently view-inconsistent effects that are in fact faithful to the underlying shapes. Future work that incorporates stronger dataset priors, e.g., that eyeballs are convex, may help resolve these artifacts.
While our method produces more detailed eyeglasses than previous methods, it tends to produce "goggles": the sides of the eyeglasses are opaque where there should be empty space. Future neural rendering methods that can accurately model lens refraction may enable more faithful reconstruction of eyeglasses and other objects that contain transparent elements.
In some shapes and renderings generated by our method,
a seam is visible between the face and the rest of the head.
While we find the optional density regularization in Sec.
3 helps reduce such artifacts, we hypothesize that recent
hybrid-SDF rendering solutions [30, 37, 39], which have
shown promising results in robust geometry recovery from
images, may yield improved shapes with fewer artifacts.
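For reference, the density regularization mentioned above can be summarized with a short sketch. This is illustrative only: the noise scale, sampling volume, and function names shown here are placeholders rather than our actual hyperparameters, and only the L1 penalty between the densities of nearby point pairs is taken from the description in Sec. 3.

    import torch

    def density_smoothness_loss(sigma_fn, num_pairs=1000, noise_std=0.01):
        # sigma_fn maps (M, 3) points to (M,) densities for the current generated scene.
        # Sample random points in the volume and Gaussian-perturbed counterparts,
        # then penalize the L1 difference between their estimated densities.
        x = torch.rand(num_pairs, 3) * 2.0 - 1.0      # assumed [-1, 1]^3 volume
        dx = torch.randn_like(x) * noise_std          # small Gaussian offset
        return torch.abs(sigma_fn(x) - sigma_fn(x + dx)).mean()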
In the interests of simplicity, we model the scene with a
single 3D representation, without any explicit background
handling. Consequently, the generator learns to represent
backgrounds of images with textured surfaces fused to fore-
ground objects. Future work that models backgrounds with
a separate 3D representation [28, 29, 40] may enable isola-
tion of foreground objects.
References

[1] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[2] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
[3] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
[5] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[6] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[7] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[9] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[10] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
[11] Thomas D. Goddard, Conrad C. Huang, Elaine C. Meng, Eric F. Pettersen, Gregory S. Couch, John H. Morris, and Thomas E. Ferrin. UCSF ChimeraX: Meeting modern challenges in visualization and analysis. Protein Science, 27(1):14–25, 2018.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[15] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[16] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[18] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[19] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[20] Taehee Brad Lee. Cat Hipsterizer, 2018. https://github.com/kairess/cat_hipsterizer.
[21] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, et al. Learning formation of physically-based face attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[22] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[23] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (SIGGRAPH), 2019.
[24] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM Transactions on Graphics (ToG), 1987.
[25] Marco Marchesi. Megapixel size image creation using generative adversarial networks, 2017.
[26] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[28] Michael Niemeyer and Andreas Geiger. CAMPARI: Camera-aware decomposed generative neural radiance fields. arXiv preprint arXiv:2103.17269, 2021.
[29] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[30] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In IEEE International Conference on Computer Vision (ICCV), 2021.
[31] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
[32] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[34] Yichun Shi, Divyansh Aggarwal, and Anil K. Jain. Lifting 2D StyleGAN for 3D-aware face generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[35] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[36] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[37] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[38] Jie Wu. Facial Expression Recognition PyTorch, 2018. https://github.com/WuJie1010/Facial-Expression-Recognition.Pytorch.
[39] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. arXiv preprint arXiv:2106.12052, 2021.
[40] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.