Efficient Geometry-Aware 3D Generative Adversarial Networks
Eric R. Chan*†1,2, Connor Z. Lin*1, Matthew A. Chan*1, Koki Nagano*2, Boxiao Pan1, Shalini De Mello2, Orazio Gallo2, Leonidas Guibas1, Jonathan Tremblay2, Sameh Khamis2, Tero Karras2, and Gordon Wetzstein1

1 Stanford University    2 NVIDIA
Abstract

Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.

Figure 1. Our 3D GAN enables synthesis of scenes, producing high-quality, multi-view-consistent renderings and detailed geometry. Our approach trains from a collection of 2D images without target-specific shape priors, ground truth 3D scans, or multi-view supervision. Please see the accompanying video for more results.

1. Introduction

Generative adversarial networks (GANs) have seen immense progress, with recent models capable of generating high-resolution, photorealistic images indistinguishable from real photographs [27–29]. Current state-of-the-art GANs, however, operate in 2D only and do not explicitly model the underlying 3D scenes.

Recent work on 3D-aware GANs has begun to tackle the problem of multi-view-consistent image synthesis and, to a lesser extent, extraction of 3D shapes without being supervised on geometry or multi-view image collections. However, the image quality and resolution of existing 3D GANs have lagged far behind those of 2D GANs. Furthermore, their 3D reconstruction quality, so far, leaves much to be desired. One of the primary reasons for this gap is the computational inefficiency of previously employed 3D generators and neural rendering architectures.

In contrast to 2D GANs, 3D GANs rely on a combination of a 3D-structure-aware inductive bias in the generator network architecture and a neural rendering engine that aims at providing view-consistent results. The inductive bias can be modeled using explicit voxel grids [14, 21, 47, 48, 68, 74] or neural implicit representations [4, 47, 49, 58]. While successful in single-scene "overfitting" scenarios, neither of these representations is suitable for training a high-resolution 3D GAN because they are simply too memory-inefficient or slow. Training a 3D GAN requires rendering tens of millions of images, but state-of-the-art neural volume rendering [45] at high resolutions with these representations is computationally infeasible. CNN-based image upsampling networks have been proposed to remedy this [49], but such an approach sacrifices view consistency and impairs the quality of the learned 3D geometry.

* Equal contribution.
† Part of the work was done during an internship at NVIDIA.
Project page: https://github.com/NVlabs/eg3d
We introduce a novel generator architecture for unsupervised 3D representation learning from a collection of single-view 2D photographs that seeks to improve the computational efficiency of rendering while remaining true to 3D-grounded neural rendering. We achieve this goal with a two-pronged approach. First, we improve the computational efficiency of 3D-grounded rendering with a hybrid explicit-implicit 3D representation that offers significant speed and memory benefits over fully implicit or explicit approaches without compromising on expressiveness. These advantages enable our method to skirt the computational constraints that have limited the rendering resolutions and quality of previous approaches [4, 58] and forced over-reliance on image-space convolutional upsampling [49]. Second, although we use some image-space approximations that stray from 3D-grounded rendering, we introduce a dual-discrimination strategy that maintains consistency between the neural rendering and our final output to regularize their undesirable view-inconsistent tendencies. Moreover, we introduce pose-based conditioning to our generator, which decouples pose-correlated attributes (e.g., facial expressions) for a multi-view-consistent output during inference while faithfully modeling the joint distributions of pose-correlated attributes inherent in the training data.

As an additional benefit, our framework decouples feature generation from neural rendering, enabling it to directly leverage state-of-the-art 2D CNN-based feature generators, such as StyleGAN2, to generalize over spaces of 3D scenes while also benefiting from 3D multi-view-consistent neural volume rendering. Our approach not only achieves state-of-the-art qualitative and quantitative results for view-consistent 3D-aware image synthesis, but also generates high-quality 3D shapes of the synthesized scenes due to its strong 3D-structure-aware inductive bias (see Fig. 1).

Our contributions are the following:

• We introduce a tri-plane-based 3D GAN framework, which is both efficient and expressive, to enable high-resolution geometry-aware image synthesis.
• We develop a 3D GAN training strategy that promotes multi-view consistency via dual discrimination and generator pose conditioning while faithfully modeling pose-correlated attribute distributions (e.g., expressions) present in real-world datasets.
• We demonstrate state-of-the-art results for unconditional 3D-aware image synthesis on the FFHQ and AFHQ Cats datasets along with high-quality 3D geometry learned entirely from 2D in-the-wild images.

Figure 2. Neural implicit representations use fully connected layers (FC) with positional encoding (PE) to represent a scene, which can be slow to query (a). Explicit voxel grids or hybrid variants using small implicit decoders are fast to query, but scale poorly with resolution (b). Our hybrid explicit-implicit tri-plane representation (c) is fast and scales efficiently with resolution, enabling greater detail for equal capacity.

2. Related work

Neural scene representation and rendering. Emerging neural scene representations use differentiable 3D-aware representations [1, 3, 6, 8, 13, 17, 43, 44, 52, 65] that can be optimized using 2D multi-view images via neural rendering [15, 20, 24, 30, 34–37, 40, 45, 46, 50, 51, 54, 62, 63, 70–72]. Explicit representations, such as discrete voxel grids (Fig. 2b), are fast to evaluate but often incur heavy memory overheads, making them difficult to scale to high resolutions or complex scenes [38, 61]. Implicit representations, or coordinate networks (Fig. 2a), offer potential advantages in memory efficiency and scene complexity compared to discrete voxel grids by representing a scene as a continuous function (e.g., [43, 45, 52, 60, 66]). In practice, these implicit architectures use large fully connected networks that are slow to evaluate, as each query requires a full pass through the network. Therefore, fully explicit and implicit representations provide complementary benefits.

Local implicit representations [3, 5, 23, 56] and hybrid explicit-implicit representations [11, 35, 39, 53] combine the benefits of both types of representations by offering computationally and memory-efficient architectures. Inspired by these ideas, we design a new hybrid explicit-implicit 3D-aware network that uses a memory-efficient tri-plane representation to explicitly store features on axis-aligned planes that are aggregated by a lightweight implicit feature decoder for efficient volume rendering (Fig. 2c). Our representation bears some resemblance to previous plane-based hybrid architectures [11, 53], but it is unique in its specific design. Our representation is key to enabling the high 3D GAN image quality that we demonstrate through efficient training comparable (in time scales) to modern 2D GANs [27].

Generative 3D-aware image synthesis. Generative adversarial networks [16] have recently achieved photorealistic image quality for 2D image synthesis [25, 28, 29, 55]. Extending these capabilities to 3D settings has started to gain momentum as well. Mesh-based approaches build on the most popular primitives used in computer graphics, but lack the expressiveness needed for high-fidelity image generation [33, 64]. Voxel-based GANs directly extend the CNN generators used in 2D settings to 3D [14, 21, 47, 48, 68, 74]. The high memory requirements of voxel grids and the computational burden of 3D convolutions, however, make high-resolution 3D GAN training difficult. Low-resolution 3D volume generation can be remedied with 2D CNN-based image upsampling layers [49], but without an inductive 3D bias the results often lack view consistency. Block-based sparse volume representations overcome some of these issues, but are applicable to mostly empty scenes [19, 35] and difficult to generalize across scenes. As an alternative, fully implicit representation networks have been proposed for 3D scene generation [4, 58], but these architectures are slow to query, which makes GAN training inefficient, limiting the quality and resolution of generated images.

One of the primary insights of our work is that an efficient 3D GAN architecture with 3D-grounded inductive biases is crucial for successfully generating high-resolution view-consistent images and high-quality 3D shapes. Our framework achieves this in several ways. First, unlike most existing 3D GANs, we directly leverage a 2D CNN-based feature generator, i.e., StyleGAN2 [29], removing the need for inefficient 3D convolutions on explicit voxel grids. Second, our tri-plane representation allows us to leverage neural volume rendering as an inductive bias, but in a much more computationally efficient way than fully implicit 3D networks [4, 45, 58]. Similar to [49], we also employ 2D CNN-based upsampling after neural rendering, but our method introduces dual discrimination to avoid view inconsistencies introduced by the upsampling layers. Unlike existing StyleGAN2-based 2.5D GANs, which generate images and depth maps [59], our method works naturally for steep camera angles and in 360° viewing conditions.

The concurrently developed 3D-aware GANs StyleNeRF [18] and CIPS-3D [73] demonstrate impressive image quality. The central distinction between these and ours is that while StyleNeRF and CIPS-3D operate primarily in image space, with less emphasis on the 3D representation, our method operates primarily in 3D. Our approach demonstrates greater view consistency and is capable of generating high-quality 3D shapes. Furthermore, our experiments report superior FID image scores on FFHQ and AFHQ.

Figure 3. A synthesized view of the multi-view Family scene, comparing a fully implicit Mip-NeRF representation (left), a dense voxel grid (center), and our tri-plane representation (right). Even though neither voxels nor tri-planes model view-dependent effects, they achieve high quality.

                 MLP       Rel. Speed ↑   Rel. Mem. ↓
Mip-NeRF [2]     8 × 256   1×             1×
Voxels (hybrid)  4 × 128   3.5×           0.33×
Tri-plane (SSO)  4 × 128   2.9×           0.32×
Tri-plane (GAN)  1 × 64    7.8×           0.06×

Table 1. Relative speedups and memory consumption compared to Mip-NeRF. The proposed tri-plane representation is 3–8× faster than a fully implicit Mip-NeRF network and only requires a fraction of its memory. In this example, both the voxel grid and the tri-plane representation use an MLP-based decoder, as indicated. The number of voxels is chosen to match the total parameters of the tri-plane representation, thus the resolution is relatively low and the memory footprint lower than Mip-NeRF. In the SSO experiment (Fig. 3), we used a larger decoder for the tri-plane representation than for the GAN experiments discussed in Sec. 4 to optimize expressiveness over speed.

3. Tri-plane hybrid 3D representation

Training a high-resolution GAN requires a 3D representation that is both efficient and expressive. In this section, we introduce a new hybrid explicit-implicit tri-plane representation that offers both of these advantages. We introduce the representation in this section for a single-scene overfitting (SSO) experiment, before discussing how it is integrated in our GAN framework in the next section.

In the tri-plane formulation, we align our explicit features along three axis-aligned orthogonal feature planes, each with a resolution of N × N × C (Fig. 2c), with N being the spatial resolution and C the number of channels. We query any 3D position x ∈ R³ by projecting it onto each of the three feature planes, retrieving the corresponding feature vectors (F_xy, F_xz, F_yz) via bilinear interpolation, and aggregating the three feature vectors via summation. An additional lightweight decoder network, implemented as a small MLP, interprets the aggregated 3D features F as color and density. These quantities are rendered into RGB images using (neural) volume rendering [41, 45].

The primary advantage of this hybrid representation is efficiency—by keeping the decoder small and shifting the bulk of the expressive power into the explicit features, we reduce the computational cost of neural rendering compared to fully implicit MLP architectures [2, 45] without losing expressiveness.
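To make the query path concrete, the following is a minimal PyTorch sketch of tri-plane sampling as described above (projection onto the three planes, bilinear interpolation, and summation). The coordinate convention, tensor shapes, and function name are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, coords):
    """planes: (3, C, N, N) feature planes; coords: (M, 3) points in [-1, 1]^3.
    Returns (M, C) features aggregated by summation."""
    xy = coords[:, [0, 1]]   # projection onto the xy plane
    xz = coords[:, [0, 2]]   # projection onto the xz plane
    yz = coords[:, [1, 2]]   # projection onto the yz plane
    feats = []
    for plane, proj in zip(planes, (xy, xz, yz)):
        grid = proj.reshape(1, -1, 1, 2)            # grid_sample expects (B, H, W, 2)
        f = F.grid_sample(plane.unsqueeze(0), grid,
                          mode='bilinear', align_corners=False)  # (1, C, M, 1)
        feats.append(f.reshape(plane.shape[0], -1).t())           # (M, C)
    return sum(feats)  # aggregate the three plane features by summation
```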
Figure 4. Our 3D GAN framework comprises several parts: a pose-conditioned StyleGAN2-based feature generator and mapping network, a tri-plane 3D representation with a lightweight feature decoder, a neural volume renderer, a super-resolution module, and a pose-conditioned StyleGAN2 discriminator with dual discrimination. This architecture elegantly decouples feature generation and neural rendering, allowing the use of a powerful StyleGAN2 generator for 3D scene generalization. Moreover, the lightweight 3D tri-plane representation is both expressive and efficient in enabling high-quality 3D-aware view synthesis in real time.

To validate that the tri-plane representation is compact yet sufficiently expressive, we evaluate it with a common novel-view synthesis setup. For this purpose, we directly optimize the features of the planes and the weights of the decoder to fit 360° views of a scene from the Tanks & Temples dataset [31] (Fig. 3). In this experiment, we use feature planes of resolution N = 512 and C = 48 channels, paired with an MLP of four layers of 128 hidden units each and a Fourier feature encoding [66]. We compare the results against a dense feature volume of equal capacity. For reference, we include comparisons to a state-of-the-art fully implicit 3D representation [2]. Fig. 3 and Tab. 1 demonstrate that the tri-plane representation is capable of representing this complex scene, albeit without view-dependent effects, outperforming dense feature volume representations [38, 61] and fully implicit representations [45] in terms of PSNR and SSIM, while offering considerable advantages in computation and memory efficiency. For a side length of N features, tri-planes scale with O(N²) rather than O(N³) as dense voxels do, which means that for equal capacity and memory, the tri-plane representation can use higher-resolution features and capture greater detail. Finally, our tri-plane representation has one other key advantage over these alternatives: the feature planes can be generated with an off-the-shelf 2D CNN-based generator, enabling generalization across 3D representations using the GAN framework discussed next.
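For a concrete sense of this scaling advantage, a back-of-the-envelope comparison using the SSO settings above (N = 512, C = 48) and the equal-capacity constraint of Tab. 1: the tri-planes store 3N²C = 3 · 512² · 48 ≈ 3.8 × 10⁷ explicit features, whereas a dense voxel grid with the same feature budget and channel count is limited to a side length of (3N²)^(1/3) = (3 · 512²)^(1/3) ≈ 92, i.e., roughly 92³ resolution versus 512² planes.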
Our 3D GAN framework uses the tri-plane representation to efficiently render images through neural volume rendering, but makes a number of modifications to adapt this representation to the 3D GAN setting. Unlike in the SSO experiment, where the features of the planes were directly optimized from the multiple input views, for the GAN setting we generate the tri-plane features, each containing 32 channels, with the help of a 2D convolutional StyleGAN2 backbone (Sec. 4.1). Instead of producing an RGB image, in the GAN setting our neural renderer aggregates features from each of the 32-channel tri-planes and predicts 32-channel feature images from a given camera pose. This is followed by a "super-resolution" module to upsample and refine these raw neurally rendered images (Sec. 4.2). The generated images are critiqued by a slightly modified StyleGAN2 discriminator (Sec. 4.3). The entire pipeline is trained end-to-end from random initialization, using the non-saturating GAN loss function [16] with R1 regularization [42], following the training scheme in StyleGAN2 [29]. To speed training, we use a two-stage training strategy in which we train with a reduced (64²) neural rendering resolution, followed by a short fine-tuning period at full (128²) neural rendering resolution. Additional experiments found that regularization to encourage smoothness of the density field helped reduce artifacts in 3D shapes. The following sections discuss major components of our framework in detail. For additional descriptions, implementation details, and hyperparameters, please see the supplement.
Figure 5. Dual discrimination ensures that the raw neural rendering I_RGB and the super-resolved output I⁺_RGB maintain consistency, enabling high-resolution and multi-view-consistent rendering.

The backbone's 256 × 256 × 96 output feature image is split channel-wise and reshaped to form three 32-channel planes (see Fig. 4). We choose StyleGAN2 for predicting the tri-plane features because it is a well-understood and efficient architecture achieving state-of-the-art results for 2D image synthesis. Furthermore, our model inherits many of the desirable properties of StyleGAN: a well-behaved latent space that enables style mixing and latent-space interpolation (see Sec. 5 and supplement).

We sample features from the tri-planes, aggregate by summation, and process the aggregated features with a lightweight decoder, as described in Sec. 3. Our decoder is a multi-layer perceptron with a single hidden layer of 64 units and softplus activation functions. The MLP does not use a positional encoding, coordinate inputs, or view-direction inputs. This hybrid representation can be queried at continuous coordinates and outputs a scalar density σ as well as a 32-channel feature, both of which are then processed by a neural volume renderer to project the 3D feature volume into a 2D feature image.
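A minimal sketch of this decoder under the layer sizes just stated; the ordering of the density and feature channels in the output vector is our assumption, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        # Single 64-unit hidden layer with softplus; no positional encoding,
        # coordinate inputs, or view-direction inputs.
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Softplus(),
            nn.Linear(hidden, 1 + feat_dim),  # scalar density + 32-channel feature
        )

    def forward(self, agg_feat):              # (M, 32) aggregated tri-plane features
        out = self.net(agg_feat)
        sigma = out[:, :1]                    # density fed to the volume renderer
        color_feat = out[:, 1:]               # 32-channel feature carried through rendering
        return sigma, color_feat
```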
Volume rendering [41] is implemented using two-pass importance sampling as in [45]. Following [49], volume rendering in our GAN framework produces feature images, rather than RGB images, because feature images contain more information that can be effectively utilized for the image-space refinement described next. For the majority of the experiments reported in this manuscript, we render 32-channel feature images I_F at a resolution of 128², with 96 total depth samples per ray.

4.3. Dual discrimination

As in standard 2D GAN training, the resulting renderings are critiqued by a 2D convolutional discriminator. We use a StyleGAN2 discriminator with two modifications.

First, we introduce dual discrimination as a method to avoid multi-view inconsistency issues observed in prior work [47, 49]. For this purpose, we interpret the first three feature channels of a neurally rendered feature image I_F as a low-resolution RGB image I_RGB. Intuitively, dual discrimination then ensures consistency between I_RGB and the super-resolved image I⁺_RGB. This is achieved by bilinearly upsampling I_RGB to the same resolution as I⁺_RGB and concatenating the results to form a six-channel image (see Fig. 4). The real images fed into the discriminator are also processed by concatenating each of them with an appropriately blurred copy of itself. We discriminate over these six-channel images instead of the three-channel images traditionally seen in GAN discriminators.

Dual discrimination not only encourages the final output to match the distribution of real images, but also offers additional effects: it encourages the neural rendering to match the distribution of downsampled real images, and it encourages the super-resolved images to be consistent with the neural rendering (see Fig. 5). The second point importantly allows us to leverage effective image-space super-resolution layers without introducing view-inconsistency artifacts.

Second, we make the discriminator aware of the camera poses from which the generated images are rendered. Specifically, following the conditional strategy from StyleGAN2-ADA [26], we pass the rendering camera intrinsics and extrinsics matrices (collectively P) to the discriminator as a conditioning label. We find that this conditioning introduces additional information that guides the generator to learn correct 3D priors. We provide additional studies in the supplement showing the effect of this discriminator conditioning and the robustness of our framework to high levels of noise in the input camera poses.
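A sketch of how the six-channel discriminator inputs described above could be assembled. The blur kernel size for real images is an assumption; the paper only specifies an "appropriately blurred" copy.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def fake_disc_input(I_rgb_raw, I_rgb_super):
    # I_rgb_raw: (B, 3, 128, 128) first three channels of the neural rendering
    # I_rgb_super: (B, 3, 512, 512) super-resolved output
    up = F.interpolate(I_rgb_raw, size=I_rgb_super.shape[-2:],
                       mode='bilinear', align_corners=False)
    return torch.cat([I_rgb_super, up], dim=1)       # (B, 6, 512, 512)

def real_disc_input(I_real):
    blurred = gaussian_blur(I_real, kernel_size=9)   # stand-in for the paper's blur
    return torch.cat([I_real, blurred], dim=1)       # (B, 6, H, W)
```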
Figure 6. Curated examples at 512², synthesized by models trained with FFHQ [28] and AFHQv2 Cats [7].

Figure 7. Qualitative comparison between GIRAFFE, π-GAN, Lifting StyleGAN, and ours, with FFHQ at 256². Shapes are iso-surfaces extracted from the density field using marching cubes. We inspected the underlying 3D representations of GIRAFFE and found that its over-reliance on image-space approximations significantly harms the learning of the 3D geometry.

Real-world datasets such as FFHQ contain attributes, for example facial expressions, that are correlated with camera pose; to reproduce the training data faithfully, our generator must model the pose-correlated attributes observed in the training images. To this end, we provide the backbone mapping network not only a latent code vector z, but also the camera parameters P as input, following the conditional generation strategy in [26]. By giving the backbone knowledge of the rendering camera position, we allow the target view to influence scene synthesis.

During training, pose conditioning allows the generator to model pose-dependent biases implicit to the dataset, allowing our model to faithfully reproduce the image distributions in the dataset. To prevent the scene from shifting with camera pose during inference, we condition the generator on a fixed camera pose when rendering from a moving camera trajectory. We noticed that always conditioning the generator with the rendering camera pose can lead to degenerate solutions where the GAN produces 2D billboards angled towards the camera (see supplement). To prevent this, we randomly swap the conditioning pose in P with another random pose with 50% probability during training.
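A minimal sketch of the 50% conditioning-pose swap described above; the pose parameterization and function interfaces are illustrative assumptions.

```python
import torch

def conditioning_pose(render_pose, pose_sampler, training=True, p_swap=0.5):
    """render_pose: (B, D) flattened camera intrinsics and extrinsics P.
    pose_sampler(n) returns n poses drawn from the dataset pose distribution."""
    if not training:
        # At inference, hold the conditioning pose fixed over a camera trajectory.
        return render_pose
    swap = torch.rand(render_pose.shape[0], device=render_pose.device) < p_swap
    random_pose = pose_sampler(render_pose.shape[0])
    return torch.where(swap.unsqueeze(1), random_pose, render_pose)
```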
5. Experiments and results

Datasets. We compare methods on the task of unconditional 3D-aware generation with FFHQ [28], a real-world human face dataset, and AFHQv2 Cats [7, 27], a small, real-world cat face dataset. We augment both datasets with horizontal flips and use off-the-shelf pose estimators [10, 32] to extract approximate camera extrinsics. For all methods on AFHQv2, we apply transfer learning [26] from corresponding FFHQ checkpoints; for our method on AFHQv2 512², we additionally use adaptive data augmentation [26]. For more results, please see the accompanying video.
                 FFHQ                               Cats
                 FID↓   ID↑    Depth↓   Pose↓       FID↓
GIRAFFE 256²     31.5   0.64   0.94     .089        16.1
π-GAN 128²       29.9   0.67   0.44     .021        16.0
Lift. SG 256²    29.8   0.58   0.40     .023        —
Ours 256²        4.8    0.76   0.31     .005        3.88
Ours 512²        4.7    0.77   0.39     .005        2.77†

Table 2. Quantitative evaluation using FID, identity consistency (ID), depth accuracy, and pose accuracy for FFHQ and AFHQ Cats. Labelled is the image resolution of training and evaluation. † Trained with adaptive data augmentation [26].

5.1. Comparisons

Baselines. We compare our method against three state-of-the-art methods for 3D-aware image synthesis: π-GAN [4], GIRAFFE [49], and Lifting StyleGAN [59].

Qualitative results. Fig. 6 presents selected examples synthesized by our model with FFHQ and AFHQ at a resolution of 512², highlighting the image quality, view-consistency, and diversity of outputs produced by our method. Fig. 7 provides a qualitative comparison against baselines. While GIRAFFE synthesizes high-quality images, its reliance on view-inconsistent convolutions produces poor-quality shapes and identity shift—note the hairline inconsistency between rendered views. π-GAN and Lifting StyleGAN generate adequate shapes and images, but both struggle with photorealism and with capturing detailed shapes. Our method synthesizes not only images that are higher quality and more view-consistent, but also higher-fidelity 3D geometry, as seen in the detailed glasses and hair strands.

Quantitative evaluations. Table 2 provides quantitative metrics comparing the proposed approach against baselines. We measure image quality with Fréchet Inception Distance (FID) [22] between 50k generated images and all available real images. We evaluate shape quality by calculating MSE against pseudo-ground-truth depth maps (Depth) and poses (Pose) estimated from synthesized images by [10]; a similar evaluation was introduced by [59]. We assess multi-view facial identity consistency (ID) by calculating the mean ArcFace [9] cosine similarity score between pairs of views of the same synthesized face rendered from random camera poses. Additional evaluation details are provided in the supplement. Our model demonstrates significant improvements in FID across both datasets, bringing the 3D GAN to near the same level as StyleGAN2 at 512² (2.97 for FFHQ [29] and 2.99 for Cats [26]), while also maintaining state-of-the-art view consistency, geometry quality, and pose accuracy.
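A sketch of how the ID metric could be computed; the generator and ArcFace interfaces below are placeholders, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def id_consistency(generator, arcface_embed, z_batch, sample_pose):
    """Mean ArcFace cosine similarity between two renderings of the same
    latent codes from independently sampled random camera poses."""
    img_a = generator(z_batch, sample_pose(len(z_batch)))
    img_b = generator(z_batch, sample_pose(len(z_batch)))
    emb_a = F.normalize(arcface_embed(img_a), dim=1)
    emb_b = F.normalize(arcface_embed(img_b), dim=1)
    return (emb_a * emb_b).sum(dim=1).mean()   # mean cosine similarity
```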
Runtime. Table 3 compares rendering speed at inference, running on a single NVIDIA RTX 3090 GPU. Our end-to-end approach achieves real-time framerates at 512² final resolution with 128² neural rendering resolution and 96 total depth samples per ray, suitable for applications such as real-time visualization. When rendering consecutive frames of a static scene, we need not regenerate the tri-plane features every frame; caching the generated features is a simple tweak that improves render speed. The proposed approach is significantly faster than fully implicit methods like π-GAN [4]. Although it is not as fast as Lifting StyleGAN [59] and GIRAFFE [49], we believe major improvements in image quality, geometry quality, and view-consistency outweigh the increased compute cost.

Res.    GIRAFFE   π-GAN   Lift. SG   Ours   Ours + TC
256²    181       5       51         27     36
512²    161       1       —          26     35

Table 3. Runtime in frames per second at different rendering resolutions. We compare variants of our approach with and without tri-plane caching (TC). Run on a single RTX 3090 GPU.
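A sketch of the tri-plane caching (TC) variant measured in Table 3: for a static scene the backbone only runs once, and each subsequent frame reuses the cached planes. Class and method names are assumptions.

```python
import torch

class CachedRenderer:
    def __init__(self, backbone, render_from_planes):
        self.backbone = backbone              # latent -> tri-plane features
        self.render = render_from_planes      # (planes, camera) -> image
        self._planes = None

    @torch.no_grad()
    def frame(self, z, camera):
        if self._planes is None:              # generate planes once per scene
            self._planes = self.backbone(z)
        return self.render(self._planes, camera)  # only rendering runs per frame
```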
5.2. Ablation study

Without dual discrimination, generated images can include multi-view inconsistencies due to the unconstrained image-space super-resolution layers. We measure this effect quantitatively by extracting smile-related Facial Action Coding System (FACS) [12] coefficients from videos produced by models with and without dual discrimination, using a proprietary facial tracker. We measure the standard deviation of smile coefficients for the same scene across video frames. A view-consistent scene should exhibit little expression shift and thus produce little variation in smile coefficients. This is validated in Table 4, which shows that introducing dual discrimination (second row) reduces the smile coefficient variation versus the naive model (first row), indicating improved expression consistency. However, dual discrimination also reduces image quality, as seen by the slightly worse FID score, perhaps because the model is restricted from reproducing the pose-correlated attribute biases in the FFHQ dataset. By adding generator pose conditioning (third row), we allow the generator to faithfully model pose-correlated attributes while decoupling them at inference, leading to both the best FID score and view-consistent results.

                    FID↓    FACS Smile Std.↓
Naive model         5.5     0.069
+ DD                6.5     0.054
+ DD, GPC (ours)    4.7     0.031

Table 4. Dual discrimination (DD) improves multi-view expression consistency but hurts the model's ability to capture pose-correlated attributes, reducing image quality. Adding generator pose conditioning (GPC) allows the model to improve upon both aspects. Reported at 512², with FFHQ.
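A sketch of the expression-consistency measurement in Table 4: the standard deviation of a smile-related FACS coefficient over the frames of one scene's camera sweep, averaged over scenes. The `smile_coefficient` function stands in for the proprietary facial tracker used in the paper.

```python
import numpy as np

def smile_std(videos, smile_coefficient):
    per_scene = []
    for frames in videos:                                   # one video per scene
        coeffs = np.array([smile_coefficient(f) for f in frames])
        per_scene.append(coeffs.std())                      # variation across views
    return float(np.mean(per_scene))
```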
Figure 8. Style-mixing [27–29] with FFHQ 512².

Figure 9. We use PTI [57] to fit a target image and recover the underlying 3D shape. Target (left); reconstructed image (center); reconstructed shape (right). From a model trained on FFHQ 512².
References

[1] Matan Atzmon and Yaron Lipman. SAL: Sign agnostic learning of shapes from raw data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[2] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[3] Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard Newcombe. Deep local shapes: Learning local SDF priors for detailed 3D reconstruction. In European Conference on Computer Vision (ECCV), 2020.
[4] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[5] Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[6] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[7] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] Thomas Davies, Derek Nowrouzezahrai, and Alec Jacobson. Overfit neural networks as a compact shape representation. arXiv preprint arXiv:2009.09808, 2020.
[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[10] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In IEEE Computer Vision and Pattern Recognition Workshops, 2019.
[11] Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, and Joshua M. Susskind. Unconstrained scene generation with locally conditioned radiance fields. arXiv preprint arXiv:2104.00670, 2021.
[12] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
[13] S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 2018.
[14] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3D shape induction from 2D views of multiple objects. In International Conference on 3D Vision, 2017.
[15] Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. FastNeRF: High-fidelity neural rendering at 200fps. arXiv preprint arXiv:2103.10380, 2021.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[17] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In International Conference on Machine Learning (ICML), 2020.
[18] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
[19] Zekun Hao, Arun Mallya, Serge Belongie, and Ming-Yu Liu. GANcraft: Unsupervised 3D neural rendering of Minecraft worlds. In IEEE International Conference on Computer Vision (ICCV), 2021.
[20] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. In IEEE International Conference on Computer Vision (ICCV), 2021.
[21] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato's cave: 3D shape from adversarial rendering. In IEEE International Conference on Computer Vision (ICCV), 2019.
[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[23] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser. Local implicit grid representations for 3D scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[24] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. SDFDiff: Differentiable rendering of signed distance fields for 3D shape optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[25] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[26] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[27] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[30] Petr Kellnhofer, Lars Jebe, Andrew Jones, Ryan Spicer, Kari Pulli, and Gordon Wetzstein. Neural lumigraph rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[31] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[32] Taehee Brad Lee. Cat Hipsterizer, 2018. https://github.com/kairess/cat_hipsterizer.
[33] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas Geiger. Towards unsupervised learning of generative models for 3D controllable image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[34] David B. Lindell, Julien N. P. Martel, and Gordon Wetzstein. AutoInt: Automatic integration for fast neural volume rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[35] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[36] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3D supervision. arXiv preprint arXiv:1911.00767, 2019.
[37] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. DIST: Rendering deep implicit signed distance function with differentiable sphere tracing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[38] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (SIGGRAPH), 2019.
[39] Julien N. P. Martel, David B. Lindell, Connor Z. Lin, Eric R. Chan, Marco Monteiro, and Gordon Wetzstein. ACORN: Adaptive coordinate networks for neural representation. ACM Transactions on Graphics (SIGGRAPH), 2021.
[40] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[41] N. Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics (TVCG), 1995.
[42] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[43] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[44] Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. Implicit surface representations as layers in neural networks. In IEEE International Conference on Computer Vision (ICCV), 2019.
[45] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[46] Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H. Mueller, Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, and Markus Steinberger. DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks. Computer Graphics Forum, 40(4), 2021.
[47] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In IEEE International Conference on Computer Vision (ICCV), 2019.
[48] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. BlockGAN: Learning 3D object-aware scene representations from unlabelled images. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[49] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[50] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[51] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In IEEE International Conference on Computer Vision (ICCV), 2021.
[52] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[53] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In European Conference on Computer Vision (ECCV), 2020.
[54] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. arXiv preprint arXiv:2011.13961, 2020.
[55] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
[56] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs. In IEEE International Conference on Computer Vision (ICCV), 2021.
[57] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
[58] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[59] Yichun Shi, Divyansh Aggarwal, and Anil K. Jain. Lifting 2D StyleGAN for 3D-aware face generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[60] Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[61] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[62] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[63] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[64] Attila Szabó, Givi Meishvili, and Paolo Favaro. Unsupervised generative 3D shape learning from natural images. arXiv preprint arXiv:1910.00287, 2019.
[65] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3D shapes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[66] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[67] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[68] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[69] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. arXiv preprint arXiv:2106.12052, 2021.
[70] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[71] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[72] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[73] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.
[74] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, and William T. Freeman. Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Supplemental Material
Efficient Geometry-aware 3D Generative Adversarial Networks
Eric R. Chan*†1,2, Connor Z. Lin*1, Matthew A. Chan*1, Koki Nagano*2, Boxiao Pan1, Shalini De Mello2, Orazio Gallo2, Leonidas Guibas1, Jonathan Tremblay2, Sameh Khamis2, Tero Karras2, and Gordon Wetzstein1

1 Stanford University    2 NVIDIA
(Sec. 4), such as datasets and baselines, and further explanations for experiments such as inversion. Lastly, we consider artifacts (Sec. 5) that may be targets of future work. We encourage readers to view the accompanying supplemental video, which contains additional visual results, including a live demonstration of real-time synthesis.

1. Additional experiments

1.1. Analyzing pose/facial expression correlation in FFHQ

Fig. 1 plots the likelihood a subject from FFHQ [17] is smiling (measured by [38]), against head yaw (computed by [9]). The plot indicates that individuals facing towards the camera are more likely to be smiling than are individuals who are facing away from the camera. An intuitive explanation for this phenomenon is that people who are knowingly being photographed, as in portrait images, are more likely to be smiling than people who are photographed candidly.

Left uncompensated for, this correlation between pose and facial expressions incentivizes "expression warping", where the expressions of synthesized faces shift as we move the camera. We propose dual discrimination (Section 4.3 of the main paper) and generator pose conditioning (Section 4.4 of the main paper) to reduce such expression warping.

Figure 1. We plot the probability of smiling against head yaw angle, as measured by [38]. People looking at the camera are more likely to be smiling than people angled away, indicating a correlation between scene appearance and camera pose.

1.2. COLMAP reconstruction

To further validate the multi-view consistency of our method, we employ COLMAP [32, 33] to reconstruct a point cloud of a synthesized video sequence (Fig. 2). We reconstruct a video sequence of 128 frames, taken from an oval trajectory similar to the camera paths shown in the supplemental video. We use COLMAP's "automatic" reconstruction, without specifying camera parameters. The resulting point cloud is dense and well-defined, indicating that our 3D GAN produces highly multi-view-consistent renderings.

Figure 2. COLMAP [32, 33] reconstruction of 128 frames of synthesized video (top) which followed an oval trajectory. The resulting dense, well-defined point cloud (bottom) is indicative of highly multi-view-consistent rendering.
1.3. Regularizing generator pose conditioning

              FFHQ                                    Cats               Cars
              FID↓   KID↓    ID↑    Depth↓   Pose↓    FID↓   KID↓        FID↓   KID↓
GIRAFFE 128²  —      —       —      —        —        —      —           27.3   1.703
GIRAFFE 256²  31.5   1.992   0.64   0.94     .089     16.1   2.723       —      —
π-GAN 128²    29.9   3.573   0.67   0.44     .021     16.0   1.492       17.3   0.932
Lift. SG 256² 29.8   —       0.58   0.40     .023     —      —           —      —
Ours 128²     —      —       —      —        —        —      —           2.75   0.097
Ours 256²     4.8    0.149   0.76   0.31     .005     3.88   0.091       —      —
Ours 512²     4.7    0.132   0.77   0.39     .005     2.77†  0.041†      —      —

Table 1. Quantitative evaluation using FID, KID×100, identity consistency (ID), depth accuracy, and pose accuracy for FFHQ [17] and FID, KID×100 for AFHQv2 Cats [7, 16] and ShapeNet Cars [6, 35]. Labeled is the image resolution of training and evaluation. † Trained with adaptive discriminator augmentation [15].

Without conditioning the discriminator on camera poses, the generator is prone to a degenerate solution in which it renders textures on a flat plane, without properly capturing the 3D shape of scenes. Providing even very imprecise camera poses is enough to break this tendency; conditioning the discriminator on camera poses distorted by three standard deviations of Gaussian noise still produces accurate 3D shapes. With extreme noise (e.g. four standard deviations), some scenes maintain the correct 3D structure while others are flattened onto the plane. Our results indicate that while our method requires additional information to prevent collapse, only very weak supervision is necessary. Future work may examine this tendency further and discover ways to prevent this undesirable behavior without requiring images to be labelled with poses.

1.5. Extrapolation to steep camera angles

Fig. 5 provides a visual comparison of our method against baselines for generating views from steep camera poses. We note that the FFHQ [17] dataset is primarily composed of front-facing images—few images depict faces from extreme yaw angles, and even fewer images depict faces from extreme pitch angles. Nevertheless, reasonable extrapolation to the edges of the pose distribution is a desirable quality and indicates reliance on a robust 3D representation.

Lifting StyleGAN [34], which represents scenes as a textured mesh, demonstrates consistent rendering quality. However, the steep camera angles reveal inaccurate 3D geometry (e.g. foreshortened faces) learned by the method. π-GAN [5] reasonably extrapolates to steep angles but exhibits visible quality degradation at the edges of the pose distribution. GIRAFFE [29], being highly reliant on view-inconsistent convolutions, has difficulty reproducing angles that are rarely seen in the dataset. If we force GIRAFFE to extrapolate beyond the camera poses sampled at training (e.g. the leftmost and rightmost images of Fig. 5b), we receive degraded, view-inconsistent images rather than renderings from steeper angles. The problem is amplified for pitch (Fig. 5a) because the dataset's pitch range is even narrower.

Our method, despite also using 2D convolutions, is less reliant on view-inconsistent convolutions for considering the placement of features in the final image. By utilizing an expressive 3D representation as a "scaffold", our method provides more reasonable extrapolation to rare views in both pitch and yaw than methods that more strongly depend on image-space convolutions for image synthesis, such as GIRAFFE [29].

Figure 5. We compare methods in their extrapolation to steep camera viewing angles. Labelled is the percentile for camera pitch or yaw. A yaw angle in the 96th percentile means 96% of training poses are less steep, i.e. 4% of training poses are beyond the given pose. (a) Extrapolation to steep pitch angles.

1.6. Additional quantitative results

Table 1 is an expanded version of Table 2 of the main manuscript that provides additional quantitative metrics, including Kernel Inception Distance [2] for all datasets and image quality evaluations for ShapeNet Cars. Strong relative performance on Cars, a dataset in which camera poses are distributed uniformly about the sphere, is evidence that our method is not restricted to face-forward datasets like FFHQ [17] and AFHQv2 [7, 16].

2. Additional visual results

Style mixing, in shapes. Fig. 6 shows the underlying shapes of the style mixing [17] examples in Fig. 8 of the main manuscript. While mixed examples inherit most of their shape structure from the modulations of the backbone's low-resolution layers, the modulations of the high-resolution layers can influence fine details in the shape, such as eye regions and hair patterns. The results were obtained from a model trained without style-mixing regularization.

Additional single image 3D reconstructions. Fig. 7 provides additional 3D reconstructions of single test images through Pivotal Tuning Inversion (PTI) [31] of a model trained on FFHQ 512². A pipeline for high-fidelity, single-image reconstruction of faces that does not require explicit 3D ground-truth training data opens the door for many promising applications, such as photo-to-avatar creation.

Figure 7. Additional single-view 3D reconstructions of test images demonstrate a use for our generator's learned prior over facial features.
Shapenet Cars. Fig. 8 contains uncurated renderings from random camera poses for models trained with ShapeNet Cars [6, 35]. This experiment serves as a demonstration that our method is capable of operating successfully on datasets that include camera poses that span the entire 360° camera azimuth and 180° camera elevation distributions, unlike 2.5D GANs [34], which are intended for face-forward datasets.

Figure 8. Qualitative comparison of uncurated examples of cars. All methods are sampled with truncation [4, 17, 25], using ψ = 0.7.

Additional selected examples synthesized with AFHQv2 Cats. Fig. 9 shows renderings and shapes for selected examples, synthesized by our method trained on AFHQv2 Cats [7, 16] 512².

Figure 9. Curated examples from a model trained on AFHQv2 [7, 16] 512².

3. Implementation details

We implemented our 3D GAN framework on top of the official PyTorch implementation of StyleGAN2, an updated version of which is available at https://github.com/NVlabs/stylegan3. Most of our training parameters are identical to those of StyleGAN2 [18], including the use of equalized learning rates for the trainable parameters [14], a minibatch standard deviation layer at the end of the discriminator [14], exponential moving average of the generator weights, and a non-saturating logistic loss [12] with R1 regularization [26].

Two-stage training. In order to save computational resources, we perform the majority of the training at a neural rendering resolution of 64², before gradually stepping the resolution up to 128². Note that the final image resolution remains fixed throughout training (e.g. 256² or 512²). We implement this simply by bilinearly resizing the raw neural rendering I_RGB to 128² before it is operated on by the super-resolution module. Thus, the super-resolution module always receives a 128²-sized feature map as an input, regardless of the actual neural rendering resolution. In contrast to previous progressive growing strategies [5, 14] that double the resolution in a single step, we gradually increase the neural rendering resolution, pixel-by-pixel, over 1 million images, i.e., (64², 65², 66², ..., 126², 127², 128²). We continue training with the resolution fixed at 128² for an additional 1.5 million images, for a total of 2.5M iterations of fine-tuning. This two-stage training procedure provides a roughly 2× speed-up versus training from scratch at full resolution and produces similar results to training at full neural rendering resolution from scratch.
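A sketch of the pixel-by-pixel resolution ramp described above; the linear spacing of the steps is an assumption (the text only states that the ramp covers the first 1M images of fine-tuning before the resolution is held at 128²).

```python
def neural_rendering_resolution(images_seen, ramp_images=1_000_000,
                                start_res=64, end_res=128):
    """Resolution of the raw neural rendering as a function of images seen."""
    if images_seen >= ramp_images:
        return end_res                       # held fixed after the ramp
    frac = images_seen / ramp_images
    return start_res + int(frac * (end_res - start_res))  # 64, 65, ..., 128
```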
Backbone. Our backbone (i.e., StyleGAN2 generator) follows the implementation of [18], with a mapping network of 8 hidden layers. For all of our experiments (regardless of final image resolution), the backbone operates at a resolution of 256². We modify the output convolutions such that they produce a 96-channel output feature image, which we reshape into three planes, each of shape 256 × 256 × 32. Unlike approaches that require pre-trained 2D image GANs [34], we do not utilize pre-trained StyleGAN2 checkpoints for the backbone; the entire pipeline is trained end-to-end. For large datasets, such as FFHQ [17] and ShapeNet Cars [6, 35], we train from scratch with random initialization; for small datasets, such as AFHQv2 [7, 16], we follow prevailing methodology [15] by fine-tuning from a checkpoint trained on a larger dataset.

Volume rendering is implemented using two-pass importance sampling. For FFHQ [17] and AFHQv2 [7, 16], we use 48 uniformly-spaced and 48 importance samples per ray; for ShapeNet Cars, we use 64 uniformly-spaced and 64 importance samples per ray. When rendering videos that feature thin surfaces, we found it beneficial to increase the samples per ray during inference to reduce flicker.
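A sketch of how the backbone's 96-channel output could be reshaped into the three 32-channel tri-planes described above; the exact tensor layout is illustrative.

```python
import torch

def split_backbone_output(feat):
    # feat: (B, 96, 256, 256) feature image from the StyleGAN2 backbone
    B, C, H, W = feat.shape
    planes = feat.view(B, 3, C // 3, H, W)   # (B, 3, 32, 256, 256)
    return planes                             # one 32-channel plane per axis-aligned plane
```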
Figure 10. Uncurated examples of cats, for GIRAFFE [29] 2562 , π-GAN 1282 , and our method 5122 . All methods are sampled with
truncation [4, 17, 25], using ψ = 0.7.
Discriminator. Our discriminator is a StyleGAN2 [18] discriminator with two modifications. First, to enable dual discrimination, we adjust the input layer to accept six-channel input images rather than three-channel input images. Fig. 14 provides a diagram that illustrates the creation of these six-channel inputs for both real and generated images. Second, we condition the discriminator on the camera parameters of the incoming image to help prevent degenerate shape solutions; we follow the class-conditional discriminator modifications of [15] to inject this information.

Mixed Precision. To speed up training, we use a mixed-precision methodology similar to [15]. We use FP16 in the four highest-resolution blocks of the discriminator and in both blocks of our super-resolution module. We do not use FP16 in our generator backbone.

R1 Regularization. We use R1 regularization [26] with γ = 1 for all datasets and resolutions, except for ShapeNet Cars, where we use γ = 0.1. Regularization strengths were informally chosen based on values that have shown success with previous methods [15, 18].

Density Regularization. Further experiments, conducted after our initial submission, suggested that additional regularization over the estimated density field reduces the prevalence of undesirable seams and other shape artifacts. Similar to the total variation regularization used in previous work [23], our density regularization encourages smoothness of the density field. For each generated scene in the batch, we randomly sample points x in the volume, along with additional 'perturbed' points that are offset by a small amount of Gaussian noise, δx. Our density regularization loss is an L1 loss that minimizes the difference between the estimated densities σ(x) and σ(x + δx). We apply our
density regularization over 1000 pairs of randomly sampled points every four training iterations.

Figure 11. Images and geometry for seeds 0–31, synthesized using a model trained on FFHQ [17] at 512². Sampled with truncation [17], using ψ = 0.5.

Training. We train all models with a batch size of 32. We use a discriminator learning rate of 0.002 and a generator learning rate of 0.0025. Following [16], we blur images as they enter the discriminator, gradually reducing the blur amount over the first 200K images. Unlike [18], we train without style-mixing regularization.

Using the two-stage training discussed previously, we train at a neural rendering resolution of 64² for 25M images and at 128² for an additional 2.5M images. At a neural rendering resolution of 64², our 3D GAN framework takes ∼24 seconds to train on 1000 images (24 s/kimg) on 8 Tesla V100 GPUs; this increases to 46 s/kimg at a neural rendering resolution of 128². For reference, StyleGAN3-R [16] achieves training rates of 20 s/kimg on similar hardware. Our total training time on 8 Tesla V100 GPUs is on the order of 8.5 days (7 days of 64² training, plus 1.5 days of 128² fine-tuning), compared to 6 days on similar hardware for StyleGAN3-R.

Inference-time depth samples. We use neural volume rendering [27] with two-pass importance sampling to render feature images from our tri-plane representation. We found that increasing the number of samples per ray at inference time can reduce unwanted flickering when rendering videos that feature thin objects such as eyeglasses. For clips shown in the supplemental video, we double both the number of coarse samples (from 48 to 96) and the number of fine samples (from 48 to 96), bringing the total number of depth samples per ray to 192. Increasing the number of samples per ray incurs a penalty to rendering speed: with 96 depth samples per pass, frame rates are reduced to approximately 24 frames per second with tri-plane caching, down from 36 frames per second when using the default 48 samples per pass. Images shown in the main manuscript were synthesized without increasing the number of depth samples along each ray.

Figure 12. Linear interpolations between latent codes, showing renderings and shapes.

AFHQv2. Following [15], we fine-tune from FFHQ-trained models to achieve optimum performance on Cats. Beginning from a checkpoint trained on FFHQ, we train for 6.2M images at a neural rendering resolution of 64², and for an additional 2.6M images while fine-tuning the neural rendering resolution up to 128². Because π-GAN and GIRAFFE were not designed with the benefits of adaptive discriminator augmentation (ADA) [15], we also do not use ADA for our method at 256², in an effort to keep comparisons across methods fair. We use adaptive discriminator augmentation
with its default settings, for our method only at 512².

4. Experiment details

4.1. Baselines

π-GAN [5] is a 3D-aware GAN that relies upon a FiLM-conditioned MLP with periodic activation functions for camera-controllable synthesis. We utilized the official code (https://github.com/marcoamonteiro/pi-GAN) and trained until convergence with the parameters recommended for analogous datasets.

GIRAFFE [29] is a 3D-aware GAN that incorporates a compositional 3D scene representation to enable controllable synthesis. We utilized the official code (https://github.com/autonomousvision/giraffe) and trained until convergence with the parameters recommended for analogous datasets.

Lifting StyleGAN [34] is a method for disentangling and lifting a pre-trained StyleGAN2 image generator for 3D-aware face generation. The original Lifting StyleGAN manuscript reports results on a slightly tighter crop of FFHQ than we used. Because we had difficulty matching the quality of Lifting StyleGAN's pre-trained model when we trained it from scratch on our less-cropped dataset, we instead used their official pre-trained model for their tighter crops, along with the FID score reported in their manuscript. We utilized the official code (https://github.com/seasonSH/LiftedGAN).

Figure 13. Additional selected examples, from a model trained on FFHQ [17] at 512².

StyleGAN2 is a style-based GAN that achieves state-
of-the-art image quality for 2D image synthesis and features a well-behaved latent space that enables image manipulation. We obtained a pre-trained checkpoint for StyleGAN2 on FFHQ 512² from the collection of official models (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/research/models/stylegan2). Following the recommended tuning of [15], we trained both StyleGAN2 config F and the 512 × 512 config from [15], sweeping the R1 [26] regularization strength γ ∈ {0.2, 0.5, 1, 2, 5, 10, 20}. The best result for AFHQv2 was obtained with StyleGAN2 config F, after training for 10M images at γ = 1.

4.2. Dataset Details

FFHQ. We prepare our dataset by starting with the "in-the-wild" version of the FFHQ dataset [17], which is composed of uncropped, original PNG images of people sourced from Flickr. We use an off-the-shelf face detection and pose-extraction pipeline [9] to both identify the face region and label the image with a pose. We crop the images to roughly the same size as the original FFHQ dataset.

We assume fixed camera intrinsics across the entire dataset, with a focal length of 4.26 × the image width, equivalent to a standard portrait lens. We prune a small number of images that resisted face detection; our final dataset contains 69,957 images. We augment the dataset with horizontal flips.
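As an illustration of the fixed-intrinsics assumption above, the following sketch builds an intrinsics matrix from the stated focal length. The principal point at the image center is our assumption here, as is the pixel-unit convention; the function name is illustrative.

    import numpy as np

    def ffhq_intrinsics(image_width=512, image_height=512):
        # Focal length is assumed fixed at 4.26 x image width for every image.
        fx = fy = 4.26 * image_width
        # Assumed: principal point at the image center.
        cx, cy = image_width / 2.0, image_height / 2.0
        return np.array([[fx, 0.0, cx],
                         [0.0, fy, cy],
                         [0.0, 0.0, 1.0]])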
Figure 14. In dual discrimination, we discriminate on a six-channel concatenation of the final image and the raw neural rendering, in order to maintain consistency between high-resolution final images and view-consistent (but low-resolution) neural renderings. This diagram illustrates how we obtain a six-channel discriminator input tensor for both real and fake images. Our generator produces both a 512² final rendering (I⁺RGB) as well as the 128² raw neural rendering (IRGB). The raw rendering, IRGB, is the first three channels of the 32-channel rendered features, IF. We create a six-channel discriminator input by upsampling the raw image to 512² and concatenating it with the final image to form a 512 × 512 × 6 discriminator input tensor. For real images, we extract a 512² real image from the dataset and downsample it to the same size as IRGB (128²) to obtain an analogue of the raw rendering. We then upsample this image back to 512² and concatenate it with the original image to form a 512 × 512 × 6 discriminator input tensor. The downsample-then-upsample operation has the effect of blurring the original image.
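A minimal sketch of this six-channel construction is shown below. Bilinear resizing is assumed here for both the fake and real branches; the actual implementation may use a different (e.g., antialiased) filter, and the function names are illustrative.

    import torch
    import torch.nn.functional as F

    def fake_disc_input(final_img, raw_rgb):
        # final_img: (N, 3, 512, 512) super-resolved output (I+RGB).
        # raw_rgb:   (N, 3, 128, 128) first three channels of the rendered features IF.
        raw_up = F.interpolate(raw_rgb, size=final_img.shape[-2:],
                               mode='bilinear', align_corners=False)
        return torch.cat([final_img, raw_up], dim=1)        # (N, 6, 512, 512)

    def real_disc_input(real_img, raw_size=128):
        # real_img: (N, 3, 512, 512) image from the dataset.
        down = F.interpolate(real_img, size=(raw_size, raw_size),
                             mode='bilinear', align_corners=False)
        blurred = F.interpolate(down, size=real_img.shape[-2:],
                                mode='bilinear', align_corners=False)
        return torch.cat([real_img, blurred], dim=1)        # (N, 6, 512, 512)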
4.3. Single scene overfitting

To illustrate the effectiveness of our architecture, we evaluate the relative performance of the tri-plane 3D representation against a comparable voxel-based hybrid representation and Mip-NeRF [1] on the Family scene of the Tanks & Temples dataset [19], as described in Section 3 of the main manuscript. We use the pre-processed images, as well as the training/test split, of [22]. We use 512 uniformly-spaced depth samples and 256 importance samples per ray, with a ray batch size of 6400. The tri-planes are treated as learnable parameters of shape 3×48×512×512. The dense voxel parameters were chosen to optimize quality at a parameter count comparable to that of the tri-planes; the voxel features are of shape 18×128×128×128. Both the voxel and tri-plane hybrid representations are coupled with two-layer, 128-hidden-unit decoders with Fourier feature embeddings [36]. We train the voxel and tri-plane representations for 200K iterations; we train Mip-NeRF for the recommended 1M iterations.

4.4. Pivotal tuning inversion

We use off-the-shelf face detection [9] to extract appropriately-sized crops and camera extrinsics from test images, and we resize each cropped image to 512². We follow Pivotal Tuning Inversion (PTI) [31], optimizing the latent code for 500 iterations, followed by fine-tuning the generator weights for an additional 500 iterations.

For inversion of grayscale images, we convert the generator's 3-channel RGB renderings to perceived luminance, Y, before computing the image distance loss during optimization. This allows the generator's prior to colorize the renderings. To compute single-channel luminance from 3-channel RGB images, we use Y = 0.299R + 0.587G + 0.114B. For grayscale optimization, we use 400 latent code inversion steps and 250 generator fine-tuning steps.

4.5. Evaluation Metrics

FID and KID. We compute Fréchet Inception Distance (FID) [13] and Kernel Inception Distance (KID) [2] image quality metrics between 50k generated images and all training images, using the implementation provided in the StyleGAN3 [16] codebase.

Geometry. We follow a similar procedure to [34] in the evaluation of geometry. We generate 1024 images and depth maps from random poses that match the dataset pose distribution. Using a pre-trained 3D face reconstruction model [9], we generate a "pseudo" ground-truth depth map for each generated image. Next, we limit both the generated depth maps and "pseudo" ground-truth depth maps to the facial regions defined by the reconstruction model. Finally, we normalize all depth maps to zero mean, unit variance and calculate the L2 distance between them.

Multi-view consistency. We evaluate multi-view consistency and face identity preservation for models trained on FFHQ [17] by measuring ArcFace [8] cosine similarity. For each method, we generate 1024 random faces and render two views of each face from poses randomly selected from the training dataset pose distribution. For each image pair, we measure facial identity similarity [8] and compute the mean score.

Pose accuracy. We evaluate pose accuracy with the help of a pre-trained face reconstruction model [9]. With [9], we detect pitch, yaw, and roll from 1024 generated images, then compute the L2 error against the ground-truth poses to determine each model's pose drift.

Runtime. We evaluate runtime for each model by calculating the average framerate over a 400-frame sequence. We process frames consecutively, i.e., with batch size 1. In order to give each method a best-case scenario, we ignore operations such as copying rendered frames from GPU to CPU and saving files to disk.

FACS estimation. In Section 5.2 of the main paper, we quantitatively measure the effect of dual discrimination and generator pose conditioning at preserving facial expressions across multi-view face videos. To evaluate facial expressions, we employ a proprietary facial tracker that measures detailed movement of sub-regions of the face in terms of Facial Action Coding System (FACS) [10] coefficients. Specifically, our facial tracker measures all 53 FACS blendshape coefficients defined in Li et al. [21], and we compare the variability in the 'mouthSmile_L' and 'mouthSmile_R' blendshape coefficients across the different videos.

4.6. Visualization of Geometry

To visualize shapes, we sample the volume to obtain a 512³ cube of density values and extract the surface of the scene as a mesh using Marching Cubes [24]. We found that a levelset between 0 and 10 generally yielded visually appealing results. Renderings of shapes shown in this manuscript were generated using ChimeraX [11].

5. Discussion

5.1. Shape artifacts

Despite significant improvements in the quality of the 3D geometry compared to previous methods, our synthesized shapes are not free from artifacts, which are visible in geometry renderings throughout the main paper and supplement
(e.g., Fig. 11, Fig. 13). Sunken eye sockets create the illusion of eyes that follow the viewing camera, even when the geometry and neural renderings are view-consistent; the well-known "hollow-face illusion" produces a similar effect in the physical world. Similarly, deep creases near the corners of the mouth can create apparently view-inconsistent effects that are in fact faithful to the underlying shapes. Future work that incorporates stronger dataset priors, e.g., that eyeballs are convex, may help resolve these artifacts.
While our method produces more detailed eyeglasses than previous methods, it tends to produce "goggles": the sides of the eyeglasses are opaque where there should be empty space. Future neural rendering methods that can accurately model lens refraction may enable more faithful reconstruction of eyeglasses and other objects that contain transparent elements.
In some shapes and renderings generated by our method,
a seam is visible between the face and the rest of the head.
While we find the optional density regularization in Sec.
3 helps reduce such artifacts, we hypothesize that recent
hybrid-SDF rendering solutions [30, 37, 39], which have
shown promising results in robust geometry recovery from
images, may yield improved shapes with fewer artifacts.
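For reference, the density regularization mentioned above can be summarized with a short sketch. This is illustrative only: the noise scale, sampling volume, and function names shown here are placeholders rather than our actual hyperparameters, and only the L1 penalty between the densities of nearby point pairs is taken from the description in Sec. 3.

    import torch

    def density_smoothness_loss(sigma_fn, num_pairs=1000, noise_std=0.01):
        # sigma_fn maps (M, 3) points to (M,) densities for the current generated scene.
        # Sample random points in the volume and Gaussian-perturbed counterparts,
        # then penalize the L1 difference between their estimated densities.
        x = torch.rand(num_pairs, 3) * 2.0 - 1.0      # assumed [-1, 1]^3 volume
        dx = torch.randn_like(x) * noise_std          # small Gaussian offset
        return torch.abs(sigma_fn(x) - sigma_fn(x + dx)).mean()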
In the interests of simplicity, we model the scene with a
single 3D representation, without any explicit background
handling. Consequently, the generator learns to represent
backgrounds of images with textured surfaces fused to fore-
ground objects. Future work that models backgrounds with
a separate 3D representation [28, 29, 40] may enable isola-
tion of foreground objects.
References

[1] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
[2] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
[3] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
[5] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[6] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[7] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[9] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[10] Paul Ekman and Wallace V. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
[11] Thomas D. Goddard, Conrad C. Huang, Elaine C. Meng, Eric F. Pettersen, Gregory S. Couch, John H. Morris, and Thomas E. Ferrin. UCSF ChimeraX: Meeting modern challenges in visualization and analysis. Protein Science, 27(1):14–25, 2018.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[14] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
[15] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[16] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[18] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[19] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[20] Taehee Brad Lee. Cat Hipsterizer, 2018. https://github.com/kairess/cat_hipsterizer.
[21] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, et al. Learning formation of physically-based face attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[22] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[23] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (SIGGRAPH), 2019.
[24] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM Transactions on Graphics (ToG), 1987.
[25] Marco Marchesi. Megapixel size image creation using generative adversarial networks, 2017.
[26] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
[27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
[28] Michael Niemeyer and Andreas Geiger. CAMPARI: Camera-aware decomposed generative neural radiance fields. arXiv preprint arXiv:2103.17269, 2021.
[29] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[30] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In IEEE International Conference on Computer Vision (ICCV), 2021.
[31] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744, 2021.
[32] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
[34] Yichun Shi, Divyansh Aggarwal, and Anil K. Jain. Lifting 2D StyleGAN for 3D-aware face generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[35] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[36] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[37] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[38] Jie Wu. Facial Expression Recognition PyTorch, 2018. https://github.com/WuJie1010/Facial-Expression-Recognition.Pytorch.
[39] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. arXiv preprint arXiv:2106.12052, 2021.
[40] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.