
Pivotal Tuning for Latent-based Editing of Real Images

Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or
The Blavatnik School of Computer Science, Tel Aviv University

arXiv:2106.05744v1 [cs.CV] 10 Jun 2021

Abstract

Recently, a surge of advanced facial editing techniques have been proposed that leverage the generative power of a pre-trained StyleGAN. To successfully edit an image this way, one must first project (or invert) the image into the pre-trained generator's domain. As it turns out, however, StyleGAN's latent space induces an inherent tradeoff between distortion and editability, i.e. between maintaining the original appearance and convincingly altering some of its attributes. Practically, this means it is still challenging to apply ID-preserving facial latent-space editing to faces which are out of the generator's domain. In this paper, we present an approach to bridge this gap. Our technique slightly alters the generator, so that an out-of-domain image is faithfully mapped into an in-domain latent code. The key idea is pivotal tuning — a brief training process that preserves the editing quality of an in-domain latent region, while changing its portrayed identity and appearance. In Pivotal Tuning Inversion (PTI), an initial inverted latent code serves as a pivot, around which the generator is fine-tuned. At the same time, a regularization term keeps nearby identities intact, to locally contain the effect. This surgical training process ends up altering appearance features that represent mostly identity, without affecting editing capabilities. To supplement this, we further show that pivotal tuning can also adjust the generator to accommodate a multitude of faces, while introducing negligible distortion on the rest of the domain. We validate our technique through inversion and editing metrics, and show preferable scores to state-of-the-art methods. We further qualitatively demonstrate our technique by applying advanced edits (such as pose, age, or expression) to numerous images of well-known and recognizable identities. Finally, we demonstrate resilience to harder cases, including heavy make-up, elaborate hairstyles and/or headwear, which otherwise could not have been successfully inverted and edited by state-of-the-art methods. Source code can be found at: https://github.com/danielroich/PTI.

Figure 1. Pivotal Tuning Inversion (PTI) enables employing off-the-shelf latent-based semantic editing techniques on real images using StyleGAN. PTI excels in identity-preserving edits, portrayed through recognizable figures — Serena Williams and Robert Downey Jr. (top), and in handling faces which are clearly out-of-domain, e.g., due to heavy makeup (bottom). The depicted edits include smile, afro, pose, age, mouth, and beard removal.

1. Introduction

In recent years, unconditional image synthesis has made huge progress with the emergence of Generative Adversarial Networks (GANs) [12]. In essence, GANs learn the domain (or manifold) of the desired image set and produce new samples from the same distribution. In particular, StyleGAN [17–19] is one of the most popular choices for this task.
Not only does it achieve state-of-the-art visual fidelity and diversity, but it also demonstrates fantastic editing capabilities due to an organically formed disentangled latent space. Using this property, many methods demonstrate realistic editing abilities over StyleGAN's latent space [3, 7, 14, 27, 34, 36, 42], such as changing facial orientations, expressions, or age, by traversing the learned manifold.

Figure 2. An illustration of the PTI method. StyleGAN's latent space is portrayed in two dimensions (see Tov et al. [38]), where the warmer colors indicate higher densities of W, i.e. regions of higher editability. On the left, we illustrate the generated samples before pivotal tuning ("Before PTI"). We can see the editability-distortion trade-off: a choice must be made between Identity "A" and Identity "B". "A" resides in a more editable region but does not resemble the "Real" image. "B" resides in a less editable region, which causes artifacts, but induces less distortion. On the right, after the pivotal tuning procedure ("After PTI"), "C" maintains the same high editing capabilities as "A", while achieving even better similarity to "Real" than "B".

While impressive, these edits are performed strictly in the generator's latent space, and cannot be applied to real images that are out of its domain. Hence, editing a real image starts with finding its latent representation. This process, called GAN inversion, has recently drawn considerable attention [1, 4, 24, 32, 38, 44]. Early attempts inverted the image to W — StyleGAN's native latent space. However, Abdal et al. [1] have shown that inverting real images to this space results in distortion, i.e. a dissimilarity between the given and generated images, causing artifacts such as identity loss or an unnatural appearance. Therefore, current inversion methods employ an extended latent space, often denoted as W+, which is more expressive and induces significantly less distortion [1].

However, even though employing codes from W+ potentially produces great visual quality even for out-of-domain images, these codes suffer from weaker editability, since they are not from the generator's trained domain. Tov et al. [38] define this conflict as the distortion-editability tradeoff, and show that the closer the codes are to W, the better their editability is. Indeed, recent works [4, 38, 46] suggest a compromise between editability and distortion, by picking latent codes in W+ which are more editable.

In this paper, we introduce a novel approach to mitigate the distortion-editability trade-off, allowing convincing edits on real images that are out-of-distribution. Instead of projecting the input image into the learned manifold, we augment the manifold to include the image by slightly altering the generator, in a process we call pivotal tuning. This adjustment is analogous to shooting a dart and then shifting the board itself to compensate for a near hit.

Since StyleGAN training is expensive and the generator achieves unprecedented visual quality, the popular approach is to keep the generator frozen. In contrast, we propose producing a personalized version of the generator that accommodates the desired input image or images. Our approach consists of two main steps. First, we invert the input image to an editable latent code, using off-the-shelf inversion techniques. This, of course, yields an image that is similar to the original, but not necessarily identical. In the second step, we perform Pivotal Tuning — we lightly tune the pretrained StyleGAN, such that the input image is generated when using the pivot latent code found in the previous step (see Figure 2 for an illustration). The key idea is that even though the generator is slightly modified, the latent code keeps its editing qualities. As can be seen in our experiments, the modified generator retains the editing capabilities of the pivot code, while achieving unprecedented reconstruction quality.
As we demonstrate, the pivotal tuning is a local operation in the latent space, shifting the identity of the pivotal region to the desired one with minimal repercussions. To minimize side-effects even further, we introduce a regularization term, enforcing only a surgical adaptation of the latent space. This yields a version of the StyleGAN generator that can edit multiple target identities without interference.

In essence, our method extends the high quality editing capabilities of the pretrained StyleGAN to images that are out of its distribution, as demonstrated in Figure 1. We validate our approach through quantitative and qualitative results, and demonstrate that our method achieves state-of-the-art results for the task of StyleGAN inversion and real image editing. In Section 4, we show that not only do we achieve better reconstruction, but also superior editability. We show this through the utilization of several existing editing techniques, and achieve realistic editing even on challenging images. Furthermore, we confirm that using our regularization restricts the pivotal tuning side effect to be local, with negligible effect on distant latent codes, and that pivotal tuning can be applied for multiple images simultaneously to incorporate several identities into the same model (see Figure 3). Finally, we show through numerous challenging examples that our pivotal tuning-based inversion approach achieves completely automatic, fast, faithful, and powerful editing capabilities.

2. Related Work

2.1. Latent Space Manipulation

Most real-life applications require control over the generated image. Such control can be obtained in the unconditional setting by first learning the manifold, and then realizing image editing through latent space traversal. Many works have examined semantic directions in the latent spaces of pre-trained GANs. Some use full supervision in the form of semantic labels [10, 11, 34], others [15, 30, 35] find meaningful directions in a self-supervised fashion, and finally recent works present unsupervised methods to achieve the same goal [14, 39, 40], requiring no manual annotations.

More specifically for StyleGAN, Shen et al. [34] use supervision in the form of facial attribute labels to find meaningful linear directions in the latent space. Similar labels are used by Abdal et al. [3] to train a mapping network conditioned on these labels. Harkonen et al. [14] identify latent directions based on Principal Component Analysis (PCA). Shen et al. [33] perform eigenvector decomposition on the generator's weights to find edit directions without additional supervision. Collins et al. [7] borrow parts of the latent code of other samples to produce local and semantically aware edits. Wu et al. [42] discover disentangled editing controls in the space of channel-wise style parameters. Other works [36, 37] focus on facial editing, as they utilize a prior in the form of a 3D morphable face model. Most recently, Patashnik et al. [27] utilize a contrastive language-image pre-training (CLIP) model [31] to explore new editing capabilities. In this paper, we demonstrate our inversion approach by utilizing these editing methods as downstream tasks. As seen in Section 4, our PTI process induces higher visual quality for several of these popular approaches.

2.2. GAN inversion

As previously mentioned, in order to edit a real image using latent manipulation, one must perform GAN inversion [45], meaning one must find a latent vector from which the generator would generate the input image. Inversion methods can typically be divided into optimization-based ones — which directly optimize the latent code using a single sample [1, 8, 19, 21] — and encoder-based ones, which train an encoder over a large number of samples [13, 23, 28]. Many works consider specifically the task of StyleGAN inversion, aiming at leveraging the high visual quality and editability of this generator. Abdal et al. [2] demonstrate that it is not feasible to invert images to StyleGAN's native latent space W without significant artifacts. Instead, it has been shown that the extended W+ is much more expressive, and enables better image preservation. Menon et al. [24] use direct optimization for the task of super-resolution by inverting a low-resolution image to W+ space. Zhu et al. [44] use a hybrid approach: first, an encoder is trained, then a direct optimization is performed. Richardson et al. [32] were the first to train an encoder for W+ inversion which was demonstrated to solve a variety of image-to-image translation tasks.

2.3. Distortion-editability tradeoff

Even though W+ inversion achieves minimal distortion, it has been shown that the results of latent manipulations over W+ inversions are inferior compared to the same manipulations over latent codes from StyleGAN's native space W. Tov et al. [38] define this as the distortion-editability tradeoff, and design an encoder that attempts to find a "sweet-spot" in this trade-off.

Similarly, the tradeoff was also demonstrated by Zhu et al. [46], who suggest an improved embedding algorithm using a novel regularization method. StyleFlow [3] also concludes that real image editing produces significant artifacts compared to images generated by StyleGAN. Both Zhu et al. and Tov et al. achieve better editability compared to previous methods, but also suffer from more distortion. In contrast, our method combines the editing quality of W inversions with highly accurate reconstructions, thus mitigating the distortion-editability tradeoff.
2.4. Generator Tuning

Typically, editing methods avoid altering StyleGAN, in order to preserve its excellent performance. Some works, however, do take the approach we adopt as well, and tune the generator. Pidhorskyi et al. [29] train both the encoder and the StyleGAN generator, but their reconstruction results suffer from significant distortion, as the StyleGAN tuning step is too extensive. Bau et al. [5] propose a method for interactive editing which tunes the generator proposed by Karras et al. [16] to reconstruct the input image. They claim, however, that directly updating the weights results in sensitivity to small changes in the input, which induces unrealistic artifacts. In contrast, we show that after directly updating the weights, our generator keeps its editing capabilities, and demonstrate this over a variety of editing techniques. Pan et al. [26] invert images to BigGAN's [6] latent space by optimizing a random noise vector and tuning the generator simultaneously. Nonetheless, as we demonstrate in Section 4, optimizing a random vector decreases reconstruction and editability quality significantly for StyleGAN.

3. Method

Our method seeks to provide high quality editing for a real image using StyleGAN. The key idea of our approach is that due to StyleGAN's disentangled nature, slight and local changes to its produced appearance can be applied without damaging its powerful editing capabilities. Hence, given an image, possibly out-of-distribution in terms of appearance (e.g., real identities, extreme lighting conditions, heavy makeup, and/or extravagant hair and headwear), we propose finding its closest editable point within the generator's domain. This pivotal point can then be pulled toward the target, with only minimal effect in its neighborhood, and negligible effect elsewhere. In this section, we present a two-step method for inverting real images to highly editable latent codes. First, we invert the given input to w_p in the native latent space of StyleGAN, W. Then, we apply Pivotal Tuning on this pivot code w_p to tune the pretrained StyleGAN to produce the desired image for input w_p. The driving intuition here is that since w_p is close enough, training the generator to produce the input image from the pivot can be achieved through augmenting the appearance-related weights only, without affecting the well-behaved structure of StyleGAN's latent space.

3.1. Inversion

The purpose of the inversion step is to provide a convenient starting point for the Pivotal Tuning one (Section 3.2). As previously stated, StyleGAN's native latent space W provides the best editability. Due to this, and since the distortion is diminished during Pivotal Tuning, we opted to invert the given input image x to this space, instead of the more popular W+ extension. We use an off-the-shelf inversion method, as proposed by Karras et al. [19]. In essence, a direct optimization is applied to optimize both latent code w and noise vector n to reconstruct the input image x, measured by the LPIPS perceptual loss function [43]. As described in [19], optimizing the noise vector n using a noise regularization term improves the inversion significantly, as the noise regularization prevents the noise vector from containing vital information. This means that once w_p has been determined, the n values play a minor role in the final visual appearance. Overall, the optimization is defined as the following objective:

w_p, n = \arg\min_{w,n} L_{LPIPS}(x, G(w, n; \theta)) + \lambda_n L_n(n),    (1)

where G(w, n; θ) is the generated image using a generator G with weights θ. Note that we do not use StyleGAN's mapping network (converting from Z to W). L_{LPIPS} denotes the perceptual loss, L_n is a noise regularization term, and λ_n is a hyperparameter. At this step, the generator remains frozen.
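The following is a minimal PyTorch sketch of this first step. It assumes a StyleGAN2-style synthesis callable `generator(w, noise_maps)` that maps a single W code plus per-layer noise maps to an image in [-1, 1], uses the `lpips` package for the perceptual term, and implements a simplified variant of the multi-scale noise autocorrelation penalty of Karras et al. [19]; all names and default values are illustrative rather than taken from the official implementation.

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual metric of Zhang et al. [43] (pip package "lpips")

def noise_regularization(noise_maps):
    # Simplified multi-scale autocorrelation penalty, in the spirit of [19]:
    # pushes each noise map toward having no exploitable spatial structure.
    loss = 0.0
    for n in noise_maps:
        while True:
            loss = loss + (n * torch.roll(n, 1, dims=3)).mean() ** 2
            loss = loss + (n * torch.roll(n, 1, dims=2)).mean() ** 2
            if n.shape[2] <= 8:
                break
            n = F.avg_pool2d(n, kernel_size=2)
    return loss

def invert_to_w(generator, x, w_init, noise_init, steps=450, lr=5e-3, lambda_n=1e5):
    # Direct optimization of Eq. (1): the generator stays frozen, only the W-space
    # code and the noise maps are updated. w_init is typically the average W code.
    lpips_fn = lpips.LPIPS(net='vgg').to(x.device)
    w = w_init.detach().clone().requires_grad_(True)          # single W code, not W+
    noise_maps = [n.detach().clone().requires_grad_(True) for n in noise_init]
    opt = torch.optim.Adam([w] + noise_maps, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_hat = generator(w, noise_maps)                       # image in [-1, 1]
        loss = lpips_fn(x_hat, x).mean() + lambda_n * noise_regularization(noise_maps)
        loss.backward()
        opt.step()
    return w.detach(), [n.detach() for n in noise_maps]        # w is the pivot code w_p
```

The reference projector of Karras et al. additionally ramps the learning rate and re-normalizes the noise buffers at every step; those details are omitted here for brevity, and λ_n is left as a hyperparameter exactly as in Eq. (1).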
3.2. Pivotal Tuning

Applying the latent code w obtained in the inversion produces an image that is similar to the original one x, but may yet exhibit significant distortion. Therefore, in the second step, we unfreeze the generator and tune it to reconstruct the input image x given the latent code w obtained in the first step, which we refer to as the pivot code w_p. As we demonstrate in Section 4, it is crucial to use the pivot code, since using random or mean latent codes leads to unsuccessful convergence. Let x^p = G(w_p; θ*) be the generated image using w_p and the tuned weights θ*. We fine-tune the generator using the following loss term:

L_{pt} = L_{LPIPS}(x, x^p) + \lambda_{L2} L_{L2}(x, x^p),    (2)

where the generator is initialized with the pretrained weights θ. At this step, w_p is constant. The pivotal tuning can trivially be extended to N images {x_i}_{i=1}^{N}, given N inversion latent codes {w_i}_{i=1}^{N}:

L_{pt} = \frac{1}{N} \sum_{i=1}^{N} \left( L_{LPIPS}(x_i, x_i^p) + \lambda_{L2} L_{L2}(x_i, x_i^p) \right),    (3)

where x_i^p = G(w_i; θ*).

Once the generator is tuned, we can edit the input image using any choice of latent-space editing techniques, such as those proposed by Shen et al. [34] or Harkonen et al. [14]. Numerous results are demonstrated in Section 4.
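A minimal sketch of the tuning step follows, covering both the single-image objective (Eq. 2) and the multi-image extension (Eq. 3). It reuses the assumed `generator` interface and an `lpips_fn` perceptual loss from the previous sketch; the fixed noise maps are omitted for brevity, the `regularizer` hook anticipates the locality term of Section 3.3, and none of the names come from the official code.

```python
import torch
import torch.nn.functional as F

def pivotal_tuning(generator, targets, pivots, lpips_fn,
                   steps=350, lr=3e-4, lambda_l2=1.0, regularizer=None):
    # Fine-tune the generator weights around frozen pivot codes (Eqs. 2-3).
    # `targets` and `pivots` are matching lists of images and W codes; a single
    # image corresponds to N = 1. `regularizer`, if given, returns the locality
    # term of Eq. 7 for the current weights and is simply added to the loss.
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for x, w_p in zip(targets, pivots):
            x_p = generator(w_p)                 # w_p stays constant throughout;
                                                 # pass the fixed noise maps here if needed
            loss = loss + lpips_fn(x_p, x).mean() + lambda_l2 * F.mse_loss(x_p, x)
        loss = loss / len(targets)               # Eq. 3 averages over the N images
        if regularizer is not None:
            loss = loss + regularizer(generator)
        loss.backward()
        opt.step()
    return generator                              # the personalized generator
```

After tuning, the pivot code itself and any pre-computed latent editing directions are left untouched; only the weights change.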

Figure 3. Real image editing example using a Multi-ID Personalized StyleGAN. All depicted images are generated by the same model, fine-tuned on political and industrial world leaders; the rows show the real image, its inversion, and ±age, ±smile, and rotation edits. As can be seen, applying various edit operations on these newly introduced, highly recognizable identities preserves them well.

3.3. Locality Regularization

As we demonstrate in Section 4, applying pivotal tuning on a latent code indeed brings the generator to reconstruct the input image with high accuracy, and even enables successful edits around it. At the same time, as we demonstrate in Section 4.3, pivotal tuning induces a ripple effect — the visual quality of images generated by non-local latent codes is compromised. This is especially true when tuning for a multitude of identities (see Figure 14). To alleviate this side effect, we introduce a regularization term that is designed to restrict the PTI changes to a local region in the latent space. In each iteration, we sample a normally distributed random vector z and use StyleGAN's mapping network f to produce a corresponding latent code w_z = f(z). Then, we interpolate between w_z and the pivotal latent code w_p using the interpolation parameter α, to obtain the interpolated code w_r:

w_r = w_p + \alpha \frac{w_z - w_p}{\lVert w_z - w_p \rVert_2}.    (4)

Finally, we minimize the distance between the image generated by feeding w_r as input using the original weights, x_r = G(w_r; θ), and the image generated using the currently tuned ones, x_r^* = G(w_r; θ*):

L_R = L_{LPIPS}(x_r, x_r^*) + \lambda^R_{L2} L_{L2}(x_r, x_r^*).    (5)

This can be trivially extended to N_r random latent codes:

L_R = \frac{1}{N_r} \sum_{i=1}^{N_r} \left( L_{LPIPS}(x_{r,i}, x_{r,i}^*) + \lambda^R_{L2} L_{L2}(x_{r,i}, x_{r,i}^*) \right).    (6)

The new optimization is defined as:

\theta^* = \arg\min_{\theta^*} L_{pt} + \lambda_R L_R,    (7)

where λ^R_{L2}, λ_R, and N_r are constant positive hyperparameters. Additional discussion regarding the effects of different α values can be found in the Supplementary Materials.
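A sketch of this regularizer, under the same assumptions as the previous snippets, is given below. Here `mapping` stands in for StyleGAN's mapping network f (its `z_dim` attribute is likewise an assumption), `frozen_generator` is a copy of the generator taken before tuning, and the defaults mirror the values later reported in Section 4.5 (α = 30, N_r = 1, λ^R_{L2} = 1, λ_R = 0.1).

```python
import torch
import torch.nn.functional as F

def locality_regularizer(tuned_generator, frozen_generator, mapping, w_p, lpips_fn,
                         alpha=30.0, n_r=1, lambda_l2_r=1.0):
    # Eqs. 4-6: sample random codes, step a distance alpha from the pivot toward
    # them, and penalize any drift of the tuned generator from the original one
    # at those interpolated codes.
    loss = 0.0
    for _ in range(n_r):
        z = torch.randn(1, mapping.z_dim, device=w_p.device)      # z ~ N(0, I)
        w_z = mapping(z)
        w_r = w_p + alpha * (w_z - w_p) / (w_z - w_p).norm(p=2)   # Eq. 4
        with torch.no_grad():
            x_r = frozen_generator(w_r)                           # original weights theta
        x_r_star = tuned_generator(w_r)                           # current weights theta*
        loss = loss + lpips_fn(x_r_star, x_r).mean() \
                    + lambda_l2_r * F.mse_loss(x_r_star, x_r)
    return loss / n_r

# During pivotal tuning (Eq. 7), this term is scaled by lambda_R and added to the
# reconstruction loss, e.g. through the `regularizer` hook of the previous sketch:
#   reg = lambda G: 0.1 * locality_regularizer(G, frozen_generator, mapping, w_p, lpips_fn)
```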
Table 1. Quantitative reconstruction quality. Using a StyleGAN2 generator trained over the FFHQ dataset, we invert images from the CelebA-HQ test set and measure their reconstruction using four different metrics. All metrics indicate superior reconstruction for our method.

Measure            Ours     e4e     SG2     SG2 W+
LPIPS ↓            0.09     0.4     0.4     0.34
MSE ↓              0.014    0.05    0.08    0.043
MS-SSIM ↓          0.21     0.38    0.38    0.3
ID Similarity ↑    0.9      0.75    0.8     0.85

Figure 4. Reconstruction of out-of-domain samples. Our method (right) reconstructs out-of-domain visual details (left), such as face paintings or hands, significantly better than state-of-the-art methods (middle).

Figure 5. Reconstruction quality comparison using examples from the CelebA-HQ dataset (columns: Original, SG2 W+, e4e, SG2, Ours). As can be seen, even for less challenging inputs, our method offers higher level reconstruction for unseen identities compared to the state-of-the-art. Zoom-in recommended.

4. Experiments

In this section, we justify the design choices made and evaluate our method. For all experiments we use the StyleGAN2 generator [19]. For facial images, we use a generator pre-trained over the FFHQ dataset [17], and we use the CelebA-HQ dataset [16, 22] for evaluation. In addition, we have also collected a handful of images of out-of-domain and famous figures, to highlight our identity preservation capabilities, and the unprecedented extent of images we can handle that could not be edited until now.

We start by qualitatively and quantitatively comparing our approach to current inversion methods, both in terms of reconstruction quality and the quality of downstream editing. We use the direct optimization scheme proposed by Karras et al. [19] to invert real images to W space, which we denote by SG2. A similar optimization is used to invert to the extended W+ space [1], denoted by SG2 W+. We also compare to e4e, the encoder designed by Tov et al. [38], which uses the W+ space but seeks to remain relatively close to W. Each baseline inverts to a different part of the latent space, demonstrating the different aspects of the distortion-editability trade-off. Note that we do not include Richardson et al. [32] in our comparisons, since Tov et al. have convincingly shown editing superiority, rendering this comparison redundant.

4.1. Reconstruction Quality

Qualitative evaluation. Figures 4 and 5 present a qualitative comparison of visual quality of inverted images. As can be seen, even before considering editability, our method achieves superior reconstruction results for all examples, especially for out-of-domain ones, as our method is the only one to successfully reconstruct challenging details such as face painting or hands (Figure 4). Our method is also capable of reconstructing fine details which most people are sensitive to, such as the make-up, lighting, wrinkles, and more (Figure 5). For more visual results, see the Supplementary Materials.

Quantitative evaluation. For quantitative evaluation, we employ the following metrics: pixel-wise distance using MSE, perceptual similarity using LPIPS [43], structural similarity using MS-SSIM [41], and identity similarity by employing a pretrained face recognition network [9]. The results are shown in Table 1. As can be seen, the results align with our qualitative evaluation, as we achieve the best score for each metric by a substantial margin.
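The snippet below illustrates how these four metrics can be computed. It assumes the `lpips` and `pytorch_msssim` packages and an `id_embedder` standing in for a pretrained face-recognition network such as ArcFace [9]; since Table 1 lists MS-SSIM as lower-is-better, the snippet reports it as a dissimilarity (1 − MS-SSIM), which is one plausible reading rather than the authors' exact protocol.

```python
import torch
import torch.nn.functional as F
import lpips
from pytorch_msssim import ms_ssim

lpips_fn = lpips.LPIPS(net='alex')

def reconstruction_metrics(x, x_hat, id_embedder):
    # x, x_hat: aligned face batches in [-1, 1], shape (B, 3, H, W).
    mse = F.mse_loss(x_hat, x).item()
    perceptual = lpips_fn(x_hat, x).mean().item()
    # MS-SSIM expects inputs in [0, 1]; reported as a dissimilarity here.
    msssim_dist = 1.0 - ms_ssim((x_hat + 1) / 2, (x + 1) / 2, data_range=1.0).item()
    emb_real, emb_fake = id_embedder(x), id_embedder(x_hat)   # identity embeddings
    id_sim = F.cosine_similarity(emb_real, emb_fake, dim=-1).mean().item()
    return {'MSE': mse, 'LPIPS': perceptual, 'MS-SSIM': msssim_dist, 'ID similarity': id_sim}
```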
Figure 6. Editing comparison of images from the CelebA-HQ dataset (columns: Original, SG2 W+, e4e, SG2, Ours). We demonstrate the pose (top) and smile removal (bottom) edits. The edits over SG2 W+ do not create the desired effect, e.g., the mouth is not closed in the bottom row. SG2 and e4e achieve better editing, but lose the original identity. PTI achieves high quality editing while preserving the identity. For more uncurated examples, see the Supplementary Materials. Zoom-in recommended.

Figure 7. StyleClip editing demonstration. Using StyleClip [27] to perform the "bowl cut" and "mohawk" edits (middle column), a clear improvement in identity preservation can be seen when first employing PTI (right).

Figure 8. Sequential editing of StyleClip and InterfaceGAN edits with and without pivotal tuning inversion (PTI). Top row: "Bob cut hair", smile, and rotation. Middle row: "bowl cut hair" and older. Bottom row: "curly hair", younger and rotation.

4.2. Editing Quality

Editing a facial image should preserve the original identity while performing a meaningful and visually plausible modification. However, it has been shown [38, 46] that using less editable embedding spaces, such as W+, results in better reconstruction, but also in less meaningful editing compared to the native W space. For example, using the same latent edit, rotating a face in W space results in a higher rotation angle compared to W+. Hence, in cases of minimal effective editing, the identity may seem to be preserved rather well. Therefore, we evaluate editing quality on two axes: identity preservation and editing magnitude.

Qualitative evaluation. We use the popular GANSpace [14] and InterfaceGAN [34] methods for latent-based editing. These approaches are orthogonal to ours, as they require the use of an inversion algorithm to edit real images. As can be expected, the W+-based method preserves the identity rather well, but fails to perform significant edits; the W-based one is able to perform the edit, but loses the identity; and e4e provides a compromise between the two. In all cases, our method preserves identity the best and displays the same editing quality as for W-based inversions. Figure 6 presents an editing comparison over the CelebA-HQ dataset. We also investigate our performance using images of other iconic characters (Figures 1 and 9) and more challenging out-of-domain facial images (Figure 10). The ability to perform sequential editing is presented in Figures 12 and 13. In addition, we demonstrate our ability to invert multiple identities using the same generator in Figures 3 and 11. For more visual and uncurated results, see the Supplementary Materials. As can be seen, our method successfully performs meaningful edits while preserving the original identity.
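Nothing PTI-specific is required at edit time: the edit is applied to the unchanged pivot code, and only the generator weights differ from the vanilla pipeline. The sketch below illustrates this with a linear direction such as those produced by InterfaceGAN [34] or GANSpace [14]; `pose_direction` and the generator interface are assumptions, not artifacts shipped with the paper.

```python
import torch

def edit_with_tuned_generator(tuned_generator, w_p, direction, strength):
    # Linear latent-space edit around the pivot, decoded by the personalized generator.
    w_edit = w_p + strength * direction          # e.g. a pose, smile, or age direction
    with torch.no_grad():
        return tuned_generator(w_edit)

# Hypothetical usage:
#   edited = edit_with_tuned_generator(G_tuned, w_p, pose_direction, strength=3.0)
```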
Figure 9. Editing comparison of famous figures collected from the web (columns: Original, SG2 W+, e4e, SG2, Ours). We demonstrate the following edits (top to bottom): pose, mouth closing, and smile. Similar to Figure 6, we again witness how SG2 W+ does not induce significant edits, and the others do not preserve identity, in contrast to our approach, which achieves both.

The recent work of StyleClip [27] demonstrates unique edits, driven by natural language. In Figures 7 and 8 we demonstrate editing results using this model, and demonstrate substantial identity preservation improvement, thus extending StyleClip's scope to more challenging images. We use the mapper-based variant proposed by the paper, where the edits are achieved by training a mapper network to edit input latent codes. Note that the StyleClip model is trained to handle codes returned by the e4e method. Hence, to employ this model, our PTI process uses e4e-based pivots instead of W ones. As can be expected, we observe that the editing capabilities of the e4e codes are preserved, while the inherent distortion caused by e4e is diminished using PTI. More results for this experiment can be found in the Supplementary Materials.

Quantitative evaluation results are summarized in Table 2. To measure the two aforementioned axes, we compare the effects of the same latent editing operation between the various aforementioned baselines, and the effects of editing operations that yield the same editing magnitude. To evaluate editing magnitude, we apply a single pose editing operation and measure the rotation angle using the Microsoft Face API [25], as proposed by Zhu et al. [46]. As the editability increases, the magnitude of the editing effect increases as well. As expected, W-based inversion induces a more significant edit compared to W+ inversion for the same editing operation. As can be seen, our approach yields a magnitude that is almost identical to W's, surpassing e4e and W+ inversions, which indicates we achieve high editability (first row).

In addition, we report the identity preservation for several edits. We evaluate the identity change using a pretrained facial recognition network [9]; the edits we report are smile, pose, and age. We report both the mean identity preservation induced by each of these edits (second row), and the one induced by performing them sequentially one after the other (third row). Results indeed validate that our method obtains better identity similarity compared to the baselines.
Table 2. Editing evaluation. Top: we compare the edit magnitude for the same latent edit over the different baselines, as proposed by Zhu et al. [46]. The conjecture is that more editable codes yield more significant change from the same latent edit. Middle rows: ID preservation is measured after editing. We have used rotation, smile, and age. We report the mean ID correlation for the different edits (single edits), as well as the ID preservation after applying all three edits sequentially (sequential edits). Bottom: ID preservation when applying an edit that yields the same effect for all baselines. The yaw angle change is measured by the Microsoft Face API [25]. As can be seen, our editability is similar to W-based inversion, while our identity preservation is better even than W+-based inversions.

                                Ours     e4e      SG2      SG2 W+
Edit magnitude
  Pose                          14.86    14.6     15       11.15
ID similarity, same edit
  single edits                  0.9      0.79     0.82     0.85
  sequential edits              0.82     0.73     0.78     0.81
ID similarity, same magnitude
  ±5 rotation                   0.84     0.77     0.79     0.82
  ±10 rotation                  0.78     0.72     0.75     0.77

Since the lack of editability might increase identity similarity, as previously mentioned, we also measure the identity similarity while performing rotation of the same magnitude. Expectedly, the identity similarity for W+ inversion decreases significantly when using a fixed rotation angle edit, demonstrating it is less editable compared to other inversion methods. Overall, the quantitative results demonstrate the distortion-editability tradeoff, as W+ inversion achieves better ID similarity but lower edit magnitude, and W inversion achieves inferior ID similarity but higher edit magnitude. In contrast, our method preserves the identity well and provides highly editable embeddings; in other words, we alleviate the distortion-editability trade-off.

4.3. Regularization

Our locality regularization restricts the pivotal tuning side effects, causing diminishing disturbance to distant latent codes. We evaluate this effect by sampling random latent codes and comparing their generated images between the original and tuned generators. Visual results, presented in Figure 14, demonstrate that the regularization significantly minimizes the change. The images generated without regularization suffer from artifacts and ID shifting, while the images generated while employing the regularization are almost identical to the original ones. We perform the regularization evaluation using a model tuned to invert 12 identities, as the side effects are more substantial in the multiple identities case. In addition, Figure 15 presents quantitative results. We measure the reconstruction of random latent codes with and without the regularization, compared to using the original pretrained generator. To demonstrate that our regularization does not decrease the pivotal tuning results, we also measure the reconstruction of the target image. As can be seen, our regularization reduces the side effects significantly while obtaining similar reconstruction for the target image.

4.4. Ablation study

An ablation analysis is presented in Figure 16. First, we show that using a pivot latent code from W+ space rather than W ((B)) results in less editability, as the editing is less meaningful for both smile and pose. Skipping the initial inversion step and using the mean latent code ((C)) or a random latent code ((E)) results in substantially more distortion compared to ours. Similar results were obtained by optimizing the pivot latent code in addition to the generator, initialized to the mean latent code ((D)) or a random latent code ((F)), similar to Pan et al. [26]. In addition, we demonstrate that optimizing the pivot latent code is not necessary even when starting from an inverted pivot code. To do this, we start from an inverted code w_p and perform PTI while allowing w_p to change. We then feed the resulting code w̃_p back to the original StyleGAN. Inspecting the two images, produced by w_p and w̃_p over the same generator, we see negligible change: 0.015 ± 5e−6 for LPIPS and 0.0012 ± 1e−6 for MSE. We conclude that our choice of pivot code is almost optimal, and hence we can lighten the optimization process by keeping the code fixed.

4.5. Implementation details

For the initial inversion step, we use the same hyperparameters as described by Karras et al. [19], except for the learning rate, which is changed to 5e−3. We run the inversion for 450 iterations. Then, for pivotal tuning, we further optimize for 350 iterations with a learning rate of 3e−4 using the Adam [20] optimizer. For reconstruction, we use λ_{L2} = 1 and λ_{LPIPS} = 1, and for the regularization we use α = 30, λ^R_{L2} = 1, λ_R = 0.1, and N_r = 1.

All quantitative experiments were performed on the first 1000 samples from the CelebA-HQ test set.

Our two-step inversion takes less than 3 minutes on a single Nvidia GeForce RTX 2080. The initial W-space inversion step takes approximately one minute, just like the SG2 inversion does. The pivotal tuning takes less than a minute without regularization, and less than two with it. This training time grows linearly with the number of inverted identities. The SG2 W+ inversion takes 4.5 minutes for 2100 iterations. The inversion time of e4e is less than a second, as it is encoder-based and does not require optimization at inference.
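Tying the earlier sketches to these reported settings, a minimal end-to-end driver could look as follows. The configuration values come from this section, while the function and object names (`invert_to_w`, `pivotal_tuning`, `locality_regularizer`, the generator and mapping objects) are the illustrative ones defined in the sketches above, not the published implementation.

```python
import copy

PTI_CONFIG = {
    'inversion_steps': 450, 'inversion_lr': 5e-3,   # first step (Eq. 1)
    'tuning_steps': 350, 'tuning_lr': 3e-4,         # second step (Eqs. 2-3), Adam [20]
    'lambda_l2': 1.0, 'lambda_lpips': 1.0,          # reconstruction weights (LPIPS term unscaled)
    'alpha': 30.0, 'lambda_l2_r': 1.0,              # locality regularization (Eqs. 4-7)
    'lambda_r': 0.1, 'n_r': 1,
}

def run_pti(generator, mapping, x, w_init, noise_init, lpips_fn, cfg=PTI_CONFIG):
    # Step 1: invert to the native W space (generator frozen).
    w_p, _ = invert_to_w(generator, x, w_init, noise_init,
                         steps=cfg['inversion_steps'], lr=cfg['inversion_lr'])
    # Step 2: pivotal tuning around the frozen pivot. A frozen copy of the
    # generator anchors the locality regularizer of Section 3.3.
    frozen = copy.deepcopy(generator)
    reg = lambda G: cfg['lambda_r'] * locality_regularizer(
        G, frozen, mapping, w_p, lpips_fn,
        alpha=cfg['alpha'], n_r=cfg['n_r'], lambda_l2_r=cfg['lambda_l2_r'])
    tuned = pivotal_tuning(generator, [x], [w_p], lpips_fn,
                           steps=cfg['tuning_steps'], lr=cfg['tuning_lr'],
                           lambda_l2=cfg['lambda_l2'], regularizer=reg)
    return w_p, tuned
```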
Figure 10. Editing of smile, age, and beard removal (top to bottom), compared over out-of-domain images collected from the web (columns: Original, SG2 W+, e4e, SG2, Ours). The collected images portray unique hairstyles, hair colors, and apparel, along with unique facial features, such as heavy make-up and scars. Even in these challenging cases, our results retain the original identity while enabling meaningful edits.

5. Conclusions

We have presented Pivotal Tuning Inversion — an inversion method that allows using latent-based editing techniques on practical, real-life facial images. In a sense, we break the notorious trade-off between reconstruction and editability through personalization, or in other words, through surgical adjustments to the generator that address the desired image specifically well. This is achieved by leveraging the disentanglement between appearance and geometry that naturally emerges from StyleGAN's behavior.

In other words, we have demonstrated increased quality at the cost of additional computation. As it turns out, this deal is quite lucrative: our PTI optimization boosts performance considerably, while entailing a computation cost of around three minutes to incorporate a new identity — similar to what some of the current optimization-based inversion methods require. Furthermore, we have shown that PTI can be successfully applied to several individuals. We envision this mode of editing sessions to apply, for example, to a casting team of a movie.

Nevertheless, it is still desirable to develop a trainable mapper that approximates the PTI in a short forward pass. This would further diminish the already low computational cost that real image editing currently entails, situating StyleGAN as a practical and accessible facial editing tool for the masses. In addition to a single-pass PTI process, in the future we plan also to consider using a set of photographs of the individual for PTI. This would extend and stabilize the notion of personalization of the target individual, compared to seeing just a single example. Another research direction is to take PTI beyond the architecture of StyleGAN, for example to BigGAN [6] or other novel generative models.

In general, we believe the presented approach of ad-hoc fine-tuning a pretrained generator potentially bears merits for many other applications in editing and manipulation of specific images, or other generation-based tasks in Machine Learning.

Acknowledgements

We thank Or Patashnik, Rinon Gal and Dani Lischinski for their help and useful suggestions.
Figure 11. "Friends" StyleGAN. We simultaneously invert multiple identities into the StyleGAN latent space, while retaining high editability and identity similarity; the rows show the input, its inversion, and +age and ±smile edits.

Figure 12. Sequential editing. We perform pivotal tuning inversion followed by two edits applied sequentially: rotation and smile.

Figure 13. Additional sequential editing examples. Left: hair, pose and smile edits. Middle: hair and age. Right: pose and smile.

Figure 14. Ablation of the locality regularization over random latent codes, for pivotal tuning applied on multiple identities (columns: Original, w/o regularization, w/ regularization). As can be seen, without regularization the generated images suffer from artifacts and ID shifts. These are almost completely removed by employing our regularization.

Figure 15. Quantitative evaluation of the locality regularization. To measure the magnitude of the side effect caused by pivotal tuning, we sample random latent codes, denoted In-Domain, and evaluate the reconstruction, measured by MSE and LPIPS, compared to the original pretrained generator. Our regularization reaches similar target reconstruction scores while reducing the artifacts caused to the entire domain substantially.

Figure 16. Ablation study. We apply the same edits of smile (top) and pose (bottom). (A) Full approach. (B) We invert to W+ space in the first step instead of W, which yields inferior editability. (C) The pivot latent code w_p is replaced with the mean latent code µ_w. (D) The pivot w_p is replaced with a random latent code. (E) We optimize the pivot latent code w_p along with the generator, initialized to the mean latent code µ_w. (F) We optimize the pivot latent code w_p along with the generator, initialized to a random latent code. As can be seen, (C)–(F) result in significantly more distortion.

References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE International Conference on Computer Vision, pages 4432–4441, 2019.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296–8305, 2020.
[3] Rameen Abdal, Peihao Zhu, Niloy Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows, 2020.
[4] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. CoRR, abs/2104.02699, 2021. URL https://arxiv.org/abs/2104.02699.
[5] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020.
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
[7] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in style: Uncovering the local semantics of gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5771–5780, 2020.
[8] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2018.
[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[10] Emily Denton, Ben Hutchinson, Margaret Mitchell, and Timnit Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[11] Lore Goetschalckx, Alex Andonian, Aude Oliva, and Phillip Isola. Ganalyze: Toward visual definitions of cognitive image properties, 2019.
[12] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
[13] Shanyan Guan, Ying Tai, Bingbing Ni, Feida Zhu, Feiyue Huang, and Xiaokang Yang. Collaborative learning for faster stylegan embedding. arXiv preprint arXiv:2007.01758, 2020.
[14] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. arXiv preprint arXiv:2004.02546, 2020.
[15] Ali Jahanian, Lucy Chai, and Phillip Isola. On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171, 2019.
[16] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[18] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
[19] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
[21] Zachary C. Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
[22] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[23] Junyu Luo, Yong Xu, Chenwei Tang, and Jiancheng Lv. Learning inverse mapping by autoencoder based generative adversarial nets. In International Conference on Neural Information Processing, pages 207–216. Springer, 2017.
[24] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2445, 2020.
[25] Microsoft. Azure Face, 2020.
[26] Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, and Ping Luo. Exploiting deep generative prior for versatile image restoration and manipulation. In European Conference on Computer Vision, pages 262–277. Springer, 2020.
[27] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021.
[28] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M. Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
[29] Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104–14113, 2020.
[30] Antoine Plumerault, Hervé Le Borgne, and Céline Hudelot. Controlling generative models with continuous factors of variations. arXiv preprint arXiv:2001.10238, 2020.
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[32] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: A stylegan encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.
[33] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. arXiv preprint arXiv:2007.06600, 2020.
[34] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2020.
[35] Nurit Spingarn-Eliezer, Ron Banner, and Tomer Michaeli. Gan steerability without optimization. arXiv preprint arXiv:2012.05328, 2020.
[36] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. Stylerig: Rigging stylegan for 3d control over portrait images. arXiv preprint arXiv:2004.00121, 2020.
[37] Ayush Tewari, Mohamed Elgharib, Mallikarjun B R., Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. Pie: Portrait image embedding for semantic control, 2020.
[38] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. arXiv preprint arXiv:2102.02766, 2021.
[39] Andrey Voynov and Artem Babenko. Unsupervised discovery of interpretable directions in the gan latent space. arXiv preprint arXiv:2002.03754, 2020.
[40] Binxu Wang and Carlos R. Ponce. A geometric analysis of deep generative image models and its applications. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=GH7QRzUDdXG.
[41] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
[42] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation, 2020.
[43] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
[44] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. arXiv preprint arXiv:2004.00049, 2020.
[45] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[46] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Improved stylegan embedding: Where are the good latents?, 2020.
Appendix

A. Locality Regularization

We show the effect of different α values when using the locality regularization. Figure 17 presents a quantitative evaluation and Figure 18 visually demonstrates the interpolated code w_r. We measure both the reconstruction of the target image and the reconstruction of sampled random latent codes, denoted as in-domain, before and after the tuning. For small α values (e.g., α = 8), the image generated by the interpolated code w_r is very similar to the image generated by the pivot code w_p. Therefore, the regularization is limited and less effective. For high α values (e.g., α = 60), the interpolated code image is more or less equivalent to simply using a random latent w_z. Hence, the interpolated code is less affected by the pivotal tuning, which decreases the regularization constraint. Extremely high α values (e.g., α = 120) result in extremely unrealistic images, as can be seen in Figure 18, which causes deterioration of the target image reconstruction. Overall, we get the most effective regularization using an interpolated image which is highly similar to both the pivot and the random images, e.g., α = 30 in Figure 18.

Figure 17. Quantitative evaluation of different α values for the locality regularization. We sample random latent codes, denoted In-Domain, and evaluate the reconstruction, measured by MSE and LPIPS, compared to the original pretrained generator. Similarly, we measure the target image reconstruction.

B. Visual Results

Figures 19 and 20 demonstrate the inversion of multiple identities. To prevent the suspicion of cherry-picking, we provide uncurated editing comparison results of the first 18 images from the CelebA-HQ test set, provided in Figures 21 to 23. To further avoid picking, we perform the same three edits recurrently. Figures 24 to 27 present further comparisons of editing quality over real images of recognizable characters, and Figure 28 depicts reconstruction results. Finally, additional StyleClip editing results can be found in Figure 29. All additional results show that our method achieves higher reconstruction and editing quality, even for challenging images. This enables us to preserve the original identity successfully, while still maintaining high editability.
Figure 18. Visual demonstration of the image generated by interpolating the pivot image and a random image for different α values (columns: pivot, random, and α = 2, 8, 16, 30, 60, 120). As α increases, the image generated by the interpolated code becomes less similar to the pivot image and more similar to the random image, until reaching extremely high α, where the generated image is no longer realistic.

Figure 19. FCB-StyleGAN. We invert multiple images of Barcelona Football Club players into a single StyleGAN latent space, and demonstrate both high reconstruction and editing quality for the inverted identities (rows: real image, inversion, age, smile, rotation).

Figure 20. Modern-Family StyleGAN. We invert multiple images of the television show "Modern Family" cast into a single StyleGAN latent space, and demonstrate both high reconstruction and editing quality for the inverted identities (rows: real image, inversion, age, smile, rotation, hair style).

Figure 21. Uncurated CelebA-HQ editing results (columns: Original, SG2 W+, e4e, SG2, Ours). We evaluate the first identities in the test dataset using recurrent edits: smile, age, and rotation. Here identities 0–5 are depicted.

Figure 22. Uncurated CelebA-HQ editing results (columns: Original, SG2 W+, e4e, SG2, Ours). We evaluate the first identities in the test dataset using recurrent edits: smile, age, and rotation. Here identities 6–11 are depicted.

Figure 23. Uncurated CelebA-HQ editing results (columns: Original, SG2 W+, e4e, SG2, Ours). We evaluate the first identities in the test dataset using recurrent edits: smile, age, and rotation. Here identities 12–17 are depicted.

Figure 24. Additional editing comparison over real images (columns: Original, SG2 W+, e4e, SG2, Ours).

Figure 25. Additional editing comparison over real images (columns: Original, SG2 W+, e4e, SG2, Ours).

Figure 26. Additional editing comparison over real images (columns: Original, SG2 W+, e4e, SG2, Ours).

Figure 27. Comparison of various edits applied on the same challenging image (columns: Original, SG2 W+, e4e, SG2, Ours; rows: inversion, old, young, eyes closed, no beard, rotation). As can be seen, our method performs meaningful editing while surpassing other methods in preserving fine details.

Figure 28. Additional reconstruction results over real images (columns: Original, SG2 W+, e4e, SG2, Ours). Our method preserves out-of-distribution details, such as earrings or complicated make-up.

Figure 29. Additional StyleClip comparison results (columns: Original, StyleClip, PTI+StyleClip). We perform various hair edits using StyleClip with and without pivotal tuning inversion.
