Pivotal Tuning Inversion
[Figure 2 panels: Before PTI / After PTI]
Figure 2. An illustration of the PTI method. StyleGAN's latent space is portrayed in two dimensions (see Tov et al. [38]), where the warmer colors indicate higher densities of W, i.e., regions of higher editability. On the left, we illustrate the generated samples before pivotal tuning, where the editability-distortion trade-off is visible: a choice must be made between identity "A" and identity "B". "A" resides in a more editable region but does not resemble the "Real" image. "B" resides in a less editable region, which causes artifacts, but induces less distortion. On the right, after the pivotal tuning procedure, "C" maintains the same high editing capabilities of "A", while achieving even better similarity to "Real" compared to "B".
fidelity and diversity, but it also demonstrates fantastic editing capabilities due to an organically formed disentangled latent space. Using this property, many methods demonstrate realistic editing abilities over StyleGAN's latent space [3, 7, 14, 27, 34, 36, 42], such as changing facial orientations, expressions, or age, by traversing the learned manifold.

While impressive, these edits are performed strictly in the generator's latent space, and cannot be applied to real images that are out of its domain. Hence, editing a real image starts with finding its latent representation. This process, called GAN inversion, has recently drawn considerable attention [1, 4, 24, 32, 38, 44]. Early attempts inverted the image to W, StyleGAN's native latent space. However, Abdal et al. [1] have shown that inverting real images to this space results in distortion, i.e., a dissimilarity between the given and generated images, causing artifacts such as identity loss or an unnatural appearance. Therefore, current inversion methods employ an extended latent space, often denoted W+, which is more expressive and induces significantly less distortion [1].

However, even though employing codes from W+ potentially produces great visual quality even for out-of-domain images, these codes suffer from weaker editability, since they are not from the generator's trained domain. Tov et al. [38] define this conflict as the distortion-editability tradeoff, and show that the closer the codes are to W, the better their editability is. Indeed, recent works [4, 38, 46] suggest a compromise between editability and distortion, by picking latent codes in W+ which are more editable.

In this paper, we introduce a novel approach to mitigate the distortion-editability trade-off, allowing convincing edits on real images that are out-of-distribution. Instead of projecting the input image into the learned manifold, we augment the manifold to include the image by slightly altering the generator, in a process we call pivotal tuning. This adjustment is analogous to shooting a dart and then shifting the board itself to compensate for a near hit.

Since StyleGAN training is expensive and the generator achieves unprecedented visual quality, the popular approach is to keep the generator frozen. In contrast, we propose producing a personalized version of the generator that accommodates the desired input image or images. Our approach consists of two main steps. First, we invert the input image to an editable latent code, using off-the-shelf inversion techniques. This, of course, yields an image that is similar to the original, but not necessarily identical. In the second step, we perform Pivotal Tuning: we lightly tune the pretrained StyleGAN, such that the input image is generated when using the pivot latent code found in the previous step (see Figure 2 for an illustration). The key idea is that even though the generator is slightly modified, the latent code keeps its editing qualities. As can be seen in our experiments, the modified generator retains the editing
capabilities of the pivot code, while achieving unprecedented reconstruction quality. As we demonstrate, the pivotal tuning is a local operation in the latent space, shifting the identity of the pivotal region to the desired one with minimal repercussions. To minimize side effects even further, we introduce a regularization term, enforcing only a surgical adaptation of the latent space. This yields a version of the StyleGAN generator that can edit multiple target identities without interference.

In essence, our method extends the high-quality editing capabilities of the pretrained StyleGAN to images that are out of its distribution, as demonstrated in Figure 1. We validate our approach through quantitative and qualitative results, and demonstrate that our method achieves state-of-the-art results for the task of StyleGAN inversion and real image editing. In Section 4, we show that not only do we achieve better reconstruction, but also superior editability. We show this through the utilization of several existing editing techniques, and achieve realistic editing even on challenging images. Furthermore, we confirm that our regularization restricts the pivotal tuning side effects to be local, with negligible effect on distant latent codes, and that pivotal tuning can be applied to multiple images simultaneously to incorporate several identities into the same model (see Figure 3). Finally, we show through numerous challenging examples that our pivotal tuning-based inversion approach achieves completely automatic, fast, faithful, and powerful editing capabilities.

2. Related Work

2.1. Latent Space Manipulation

Most real-life applications require control over the generated image. Such control can be obtained in the unconditional setting by first learning the manifold, and then realizing image editing through latent space traversal. Many works have examined semantic directions in the latent spaces of pretrained GANs: some use full supervision in the form of semantic labels [10, 11, 34], others [15, 30, 35] find meaningful directions in a self-supervised fashion, and finally, recent works present unsupervised methods to achieve the same goal [14, 39, 40], requiring no manual annotations.

More specifically for StyleGAN, Shen et al. [34] use supervision in the form of facial attribute labels to find meaningful linear directions in the latent space. Similar labels are used by Abdal et al. [3] to train a mapping network conditioned on these labels. Harkonen et al. [14] identify latent directions based on Principal Component Analysis (PCA). Shen et al. [33] perform eigenvector decomposition on the generator's weights to find edit directions without additional supervision. Collins et al. [7] borrow parts of the latent code of other samples to produce local and semantically aware edits. Wu et al. [42] discover disentangled editing controls in the space of channel-wise style parameters. Other works [36, 37] focus on facial editing, as they utilize a prior in the form of a 3D morphable face model. Most recently, Patashnik et al. [27] utilize a contrastive language-image pre-training (CLIP) model [31] to explore new editing capabilities. In this paper, we demonstrate our inversion approach by utilizing these editing methods as downstream tasks. As seen in Section 4, our PTI process induces higher visual quality for several of these popular approaches.

2.2. GAN inversion

As previously mentioned, in order to edit a real image using latent manipulation, one must perform GAN inversion [45], meaning one must find a latent vector from which the generator would generate the input image. Inversion methods can typically be divided into optimization-based ones, which directly optimize the latent code using a single sample [1, 8, 19, 21], and encoder-based ones, which train an encoder over a large number of samples [13, 23, 28]. Many works consider specifically the task of StyleGAN inversion, aiming at leveraging the high visual quality and editability of this generator. Abdal et al. [2] demonstrate that it is not feasible to invert images to StyleGAN's native latent space W without significant artifacts. Instead, it has been shown that the extended W+ is much more expressive, and enables better image preservation. Menon et al. [24] use direct optimization for the task of super-resolution by inverting a low-resolution image to the W+ space. Zhu et al. [44] use a hybrid approach: first, an encoder is trained, then a direct optimization is performed. Richardson et al. [32] were the first to train an encoder for W+ inversion, which was demonstrated to solve a variety of image-to-image translation tasks.

2.3. Distortion-editability tradeoff

Even though W+ inversion achieves minimal distortion, it has been shown that the results of latent manipulations over W+ inversions are inferior compared to the same manipulations over latent codes from StyleGAN's native space W. Tov et al. [38] define this as the distortion-editability tradeoff, and design an encoder that attempts to find a "sweet spot" in this trade-off.

Similarly, the tradeoff was also demonstrated by Zhu et al. [46], who suggest an improved embedding algorithm using a novel regularization method. StyleFlow [3] also concludes that real image editing produces significant artifacts compared to images generated by StyleGAN. Both Zhu et al. and Tov et al. achieve better editability compared to previous methods, but also suffer from more distortion. In contrast, our method combines the editing quality of W inversions with highly accurate reconstructions, thus mitigating the distortion-editability tradeoff.
2.4. Generator Tuning

Typically, editing methods avoid altering StyleGAN, in order to preserve its excellent performance. Some works, however, do take the approach we adopt as well, and tune the generator. Pidhorskyi et al. [29] train both the encoder and the StyleGAN generator, but their reconstruction results suffer from significant distortion, as the StyleGAN tuning step is too extensive. Bau et al. [5] propose a method for interactive editing which tunes the generator proposed by Karras et al. [16] to reconstruct the input image. They claim, however, that directly updating the weights results in sensitivity to small changes in the input, which induces unrealistic artifacts. In contrast, we show that after directly updating the weights, our generator keeps its editing capabilities, and we demonstrate this over a variety of editing techniques. Pan et al. [26] invert images to BigGAN's [6] latent space by optimizing a random noise vector and tuning the generator simultaneously. Nonetheless, as we demonstrate in Section 4, optimizing a random vector decreases reconstruction and editability quality significantly for StyleGAN.

3. Method

Our method seeks to provide high-quality editing for a real image using StyleGAN. The key idea of our approach is that due to StyleGAN's disentangled nature, slight and local changes to its produced appearance can be applied without damaging its powerful editing capabilities. Hence, given an image, possibly out-of-distribution in terms of appearance (e.g., real identities, extreme lighting conditions, heavy makeup, and/or extravagant hair and headwear), we propose finding its closest editable point within the generator's domain. This pivotal point can then be pulled toward the target, with only minimal effect in its neighborhood, and negligible effect elsewhere. In this section, we present a two-step method for inverting real images to highly editable latent codes. First, we invert the given input to wp in the native latent space of StyleGAN, W. Then, we apply Pivotal Tuning on this pivot code wp to tune the pretrained StyleGAN to produce the desired image for the input wp. The driving intuition here is that since wp is close enough, training the generator to produce the input image from the pivot can be achieved through augmenting the appearance-related weights only, without affecting the well-behaved structure of StyleGAN's latent space.

3.1. Inversion

The purpose of the inversion step is to provide a convenient starting point for the Pivotal Tuning one (Section 3.2). As previously stated, StyleGAN's native latent space W provides the best editability. Due to this, and since the distortion is diminished during Pivotal Tuning, we opted to invert the given input image x to this space, instead of the more popular W+ extension. We use an off-the-shelf inversion method, as proposed by Karras et al. [19]. In essence, a direct optimization is applied to optimize both the latent code w and a noise vector n to reconstruct the input image x, measured by the LPIPS perceptual loss function [43]. As described in [19], optimizing the noise vector n using a noise regularization term improves the inversion significantly, as the noise regularization prevents the noise vector from containing vital information. This means that once wp has been determined, the n values play a minor role in the final visual appearance. Overall, the optimization is defined as the following objective:

$w_p, n = \arg\min_{w,n} \mathcal{L}_{\text{LPIPS}}(x, G(w, n; \theta)) + \lambda_n \mathcal{L}_n(n),$    (1)

where G(w, n; θ) is the generated image using a generator G with weights θ. Note that we do not use StyleGAN's mapping network (converting from Z to W). $\mathcal{L}_{\text{LPIPS}}$ denotes the perceptual loss, $\mathcal{L}_n$ is a noise regularization term, and $\lambda_n$ is a hyperparameter. At this step, the generator remains frozen.

3.2. Pivotal Tuning

Applying the latent code w obtained in the inversion produces an image that is similar to the original one x, but may yet exhibit significant distortion. Therefore, in the second step, we unfreeze the generator and tune it to reconstruct the input image x given the latent code w obtained in the first step, which we refer to as the pivot code wp. As we demonstrate in Section 4, it is crucial to use the pivot code, since using random or mean latent codes leads to unsuccessful convergence. Let x^p = G(wp; θ*) be the generated image using wp and the tuned weights θ*. We fine-tune the generator using the following loss term:

$\mathcal{L}_{pt} = \mathcal{L}_{\text{LPIPS}}(x, x^p) + \lambda_{L2} \mathcal{L}_{L2}(x, x^p),$    (2)

where the generator is initialized with the pretrained weights θ. At this step, wp is constant. The pivotal tuning can trivially be extended to N images $\{x_i\}_{i=0}^{N}$, given N inversion latent codes $\{w_i\}_{i=0}^{N}$:

$\mathcal{L}_{pt} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathcal{L}_{\text{LPIPS}}(x_i, x_i^p) + \lambda_{L2} \mathcal{L}_{L2}(x_i, x_i^p) \right),$    (3)

where $x_i^p = G(w_i; \theta^*)$.

Once the generator is tuned, we can edit the input image using any choice of latent-space editing techniques, such as those proposed by Shen et al. [34] or Harkonen et al. [14]. Numerous results are demonstrated in Section 4.
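The inversion step of Eq. (1) can be summarized by the following minimal sketch. It assumes a PyTorch StyleGAN2-like generator object G that exposes a mean latent code, per-layer noise maps, and a synthesis call; these attribute names, the simplified noise_regularization helper, and the weight lambda_n are illustrative assumptions, not the exact implementation of Karras et al. [19].

import torch
import lpips  # pip package implementing the LPIPS perceptual loss [43]

def invert_to_w(G, x, num_steps=450, lr=5e-3, lambda_n=1e5, device="cuda"):
    """Optimize a W-space pivot code w_p and noise maps n to reconstruct x (Eq. 1)."""
    percep = lpips.LPIPS(net="vgg").to(device)
    # Start from the average latent code; noise maps start from the generator's buffers.
    w = G.mean_latent.clone().detach().requires_grad_(True)                    # assumed attribute
    noises = [n.clone().detach().requires_grad_(True) for n in G.noise_maps]   # assumed attribute
    opt = torch.optim.Adam([w] + noises, lr=lr)
    for _ in range(num_steps):
        x_hat = G.synthesis(w, noises)   # assumed call; the generator weights stay frozen here
        loss = percep(x_hat, x).mean() + lambda_n * noise_regularization(noises)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach(), [n.detach() for n in noises]

def noise_regularization(noises):
    """Simplified self-correlation penalty so the noise maps cannot encode image content (cf. [19])."""
    reg = 0.0
    for n in noises:
        reg = reg + n.mean() ** 2 + (n * torch.roll(n, shifts=1, dims=-1)).mean() ** 2
    return reg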
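Similarly, the sketch below illustrates the pivotal tuning step of Eq. (2), under the same hypothetical generator interface; extending it to Eq. (3) amounts to averaging the same loss over the N pairs (x_i, w_i) in every iteration.

import torch
import lpips

def pivotal_tuning(G, x, w_pivot, num_steps=350, lr=3e-4, lambda_l2=1.0, device="cuda"):
    """Fine-tune the generator so that G(w_pivot) reproduces x (Eq. 2); the pivot code stays fixed."""
    percep = lpips.LPIPS(net="vgg").to(device)
    mse = torch.nn.MSELoss()
    w_pivot = w_pivot.detach()                     # wp is constant during this step
    opt = torch.optim.Adam(G.parameters(), lr=lr)  # now the generator weights are unfrozen
    for _ in range(num_steps):
        x_p = G.synthesis(w_pivot)                 # assumed call; noise inputs omitted for brevity
        loss = percep(x_p, x).mean() + lambda_l2 * mse(x_p, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G                                       # the personalized, pivotally tuned generator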
[Figure 3 rows: Real Image / Inversion / ±Age / ±Smile / Rotation]
Figure 3. Real image editing examples using a Multi-ID Personalized StyleGAN. All depicted images are generated by the same model, fine-tuned on political and industrial world leaders. As can be seen, applying various edit operations on these newly introduced, highly recognizable identities preserves them well.
3.3. Locality Regularization

As we demonstrate in Section 4, applying pivotal tuning on a latent code indeed brings the generator to reconstruct the input image with high accuracy, and even enables successful edits around it. At the same time, as we demonstrate in Section 4.3, pivotal tuning induces a ripple effect: the visual quality of images generated by non-local latent codes is compromised. This is especially true when tuning for a multitude of identities (see Figure 14). To alleviate this side effect, we introduce a regularization term that is designed to restrict the PTI changes to a local region in the latent space. In each iteration, we sample a normally distributed random vector z and use StyleGAN's mapping network f to produce a corresponding latent code wz = f(z). Then, we interpolate between wz and the pivotal latent code wp using the interpolation parameter α, to obtain the interpolated code wr:

$w_r = w_p + \alpha \frac{w_z - w_p}{\lVert w_z - w_p \rVert_2}.$    (4)

Finally, we minimize the distance between the image generated by feeding wr as input using the original weights, xr = G(wr; θ), and the image generated using the currently tuned ones, x*r = G(wr; θ*):

$\mathcal{L}_R = \mathcal{L}_{\text{LPIPS}}(x_r, x_r^*) + \lambda_{L2}^{R} \mathcal{L}_{L2}(x_r, x_r^*).$    (5)

This can be trivially extended to Nr random latent codes:

$\mathcal{L}_R = \frac{1}{N_r} \sum_{i=1}^{N_r} \left( \mathcal{L}_{\text{LPIPS}}(x_{r,i}, x_{r,i}^*) + \lambda_{L2}^{R} \mathcal{L}_{L2}(x_{r,i}, x_{r,i}^*) \right).$    (6)

The new optimization is defined as:

$\theta^* = \arg\min_{\theta^*} \mathcal{L}_{pt} + \lambda_R \mathcal{L}_R,$    (7)

where $\lambda_{L2}^{R}$, $\lambda_R$, and Nr are constant positive hyperparameters. Additional discussion regarding the effects of different α values can be found in the Supplementary Materials.
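A minimal sketch of the locality regularization of Eqs. (4)-(6) follows, reusing the perceptual and MSE loss objects from the sketches above and the same hypothetical generator interface (a mapping network f and a synthesis call); G_original stands for a frozen copy of the pretrained weights, e.g., made with copy.deepcopy before tuning starts.

import torch

def locality_regularization(G, G_original, w_pivot, percep, mse,
                            alpha=30.0, lambda_r_l2=1.0, n_r=1, z_dim=512, device="cuda"):
    """Compute L_R (Eq. 6): keep images of interpolated codes close to the original generator."""
    reg = 0.0
    for _ in range(n_r):
        z = torch.randn(1, z_dim, device=device)
        w_z = G.mapping(z)                                     # assumed call implementing f(z)
        direction = (w_z - w_pivot) / (w_z - w_pivot).norm(p=2)
        w_r = w_pivot + alpha * direction                      # interpolated code, Eq. (4)
        with torch.no_grad():
            x_r = G_original.synthesis(w_r)                    # image under the original weights
        x_r_star = G.synthesis(w_r)                            # image under the currently tuned weights
        reg = reg + percep(x_r_star, x_r).mean() + lambda_r_l2 * mse(x_r_star, x_r)
    return reg / n_r

# Inside each pivotal tuning iteration (Eq. 7), with lambda_R = 0.1:
#   loss = loss_pt + 0.1 * locality_regularization(G, G_original, w_pivot, percep, mse)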
4. Experiments

We compare our approach against three baselines: direct optimization into the native W space (SG2) [19], direct optimization into the extended space (SG2 W+), and the e4e encoder of Tov et al. [38], which uses the W+ space but seeks to remain relatively close to W. Each baseline inverts to a different part of the latent space, demonstrating the different aspects of the distortion-editability trade-off. Note that we do not include Richardson et al. [32] in our comparisons, since Tov et al. have convincingly shown editing superiority, rendering this comparison redundant.

[Figure 4 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 4. Reconstruction of out-of-domain samples. Our method (right) reconstructs out-of-domain visual details (left), such as face paintings or hands, significantly better than state-of-the-art methods (middle).

Table 1. Reconstruction quality comparison.
Measure            Ours    e4e    SG2    SG2 W+
LPIPS ↓            0.09    0.4    0.4    0.34
MSE ↓              0.014   0.05   0.08   0.043
MS-SSIM ↓          0.21    0.38   0.38   0.3
ID Similarity ↑    0.9     0.75   0.8    0.85
[Figure 6 columns: Original / SG2 W+ / e4e / SG2 / Ours]

We evaluate editing quality on two axes: identity preservation and editing magnitude.

Qualitative evaluation. We use the popular GANSpace [14] and InterfaceGAN [34] methods for latent-based editing. These approaches are orthogonal to ours, as they require the use of an inversion algorithm to edit real images. As can be expected, the W+-based method preserves the identity rather well but fails to perform significant edits; the W-based one is able to perform the edit but loses the identity; and e4e provides a compromise between the two. In all cases, our method preserves identity the best and displays the same editing quality as for W-based inversions. Figure 6 presents an editing comparison over the CelebA-HQ dataset. We also investigate our performance using images of other iconic characters (Figures 1 and 9) and more challenging out-of-domain facial images (Figure 10). The ability to perform sequential editing is presented in Figures 12 and 13. In addition, we demonstrate our ability to invert multiple identities using the same generator in Figures 3 and 11. For more visual and uncurated results, see the Supplementary Materials. As can be seen, our method successfully performs meaningful edits while preserving the original identity.

[Figure 7 columns: Original / StyleClip / PTI+StyleClip]
Figure 7. StyleClip editing demonstration. Using StyleClip [27] to perform the "bowl cut" and "mohawk" edits (middle column), a clear improvement in identity preservation can be seen when first employing PTI (right).

The recent work of StyleClip [27] demonstrates unique edits, driven by natural language. In Figures 7 and 8, we demonstrate editing results using this model, and show substantial identity preservation improvement, thus extending StyleClip's scope to more challenging images. We use the mapper-based variant proposed by the paper, where the edits are achieved by training a mapper network to edit input latent codes. Note that the StyleClip model is trained to handle codes returned by the e4e method. Hence, to employ this model, our PTI process uses e4e-based pivots instead of W ones. As can be expected, we observe
that the editing capabilities of the e4e codes are preserved, while the inherent distortion caused by e4e is diminished using PTI. More results for this experiment can be found in the supplementary materials.

[Figure 9 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 9. Editing comparison of famous figures collected from the web. We demonstrate the following edits (top to bottom): pose, mouth closing, and smile. Similar to Figure 6, we again witness how SG2 W+ does not induce significant edits, and the others do not preserve identity, in contrast to our approach, which achieves both.

Quantitative evaluation. Results are summarized in Table 2. To measure the two aforementioned axes, we compare the effects of the same latent editing operation between the various aforementioned baselines, and the effects of editing operations that yield the same editing magnitude. To evaluate editing magnitude, we apply a single pose editing operation and measure the rotation angle using the Microsoft Face API [25], as proposed by Zhu et al. [46]. As the editability increases, the magnitude of the editing effect increases as well. As expected, W-based inversion induces a more significant edit compared to W+ inversion for the same editing operation. As can be seen, our approach yields a magnitude that is almost identical to W's, surpassing e4e and W+ inversions, which indicates we achieve high editability (first row).

In addition, we report the identity preservation for several edits. We evaluate the identity change using a pretrained facial recognition network [9], and the edits we report are smile, pose, and age. We report both the mean identity preservation induced by each of these edits (second row), and the one induced by performing them sequentially one after the other (third row). Results indeed validate that our method obtains better identity similarity compared to the baselines.
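As a concrete illustration of this identity-preservation metric, the sketch below compares face-recognition embeddings of the original and edited images via cosine similarity; get_face_embedding is a hypothetical stand-in for the pretrained recognition network [9], not the exact evaluation code.

import torch.nn.functional as F

def identity_similarity(x_original, x_edited, get_face_embedding):
    """Cosine similarity between face-recognition embeddings of two images."""
    e1 = get_face_embedding(x_original)   # hypothetical helper returning a 1-D embedding tensor
    e2 = get_face_embedding(x_edited)
    return F.cosine_similarity(e1.flatten(), e2.flatten(), dim=0).item()

# Mean ID similarity of an edit over a set of (original, edited) image pairs:
#   scores = [identity_similarity(x, x_e, get_face_embedding) for x, x_e in pairs]
#   mean_id_similarity = sum(scores) / len(scores)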
Table 2. Editing evaluation. Top: we compare the edit magnitude for the same latent edit over the different baselines, as proposed by Zhu et al. [46]. The conjecture is that more editable codes yield a more significant change from the same latent edit. Middle rows: ID preservation is measured after editing. We have used rotation, smile, and age. We report the mean ID correlation for the different edits (single edits), as well as the ID preservation after applying all three edits sequentially (sequential edits). Bottom: ID preservation when applying an edit that yields the same effect for all baselines. The yaw angle change is measured by the Microsoft Face API [25]. As can be seen, our editability is similar to W-based inversion, while our identity preservation is better even than W+-based inversions.

Edit magnitude                  Ours    e4e     SG2     SG2 W+
Pose                            14.86   14.6    15      11.15

ID similarity, same edit
  single edits                  0.9     0.79    0.82    0.85
  sequential edits              0.82    0.73    0.78    0.81

ID similarity, same magnitude
  ±5 rotation                   0.84    0.77    0.79    0.82
  ±10 rotation                  0.78    0.72    0.75    0.77

Since the lack of editability might increase identity similarity, as previously mentioned, we also measure the identity similarity while performing a rotation of the same magnitude. Expectedly, the identity similarity for W+ inversion decreases significantly when using a fixed rotation angle edit, demonstrating it is less editable compared to other inversion methods. Overall, the quantitative results demonstrate the distortion-editability tradeoff, as W+ inversion achieves better ID similarity but lower edit magnitude, and W inversion achieves inferior ID similarity but higher edit magnitude. In contrast, our method preserves the identity well and provides highly editable embeddings; in other words, we alleviate the distortion-editability trade-off.

4.3. Regularization

Our locality regularization restricts the pivotal tuning side effects, causing diminishing disturbance to distant latent codes. We evaluate this effect by sampling random latent codes and comparing their generated images between the original and tuned generators. Visual results, presented in Figure 14, demonstrate that the regularization significantly minimizes the change. The images generated without regularization suffer from artifacts and ID shifting, while the images generated while employing the regularization are almost identical to the original ones. We perform the regularization evaluation using a model tuned to invert 12 identities, as the side effects are more substantial in the multiple-identities case. In addition, Figure 15 presents quantitative results. We measure the reconstruction of random latent codes with and without the regularization, compared to using the original pretrained generator. To demonstrate that our regularization does not decrease the pivotal tuning results, we also measure the reconstruction of the target image. As can be seen, our regularization reduces the side effects significantly while obtaining similar reconstruction for the target image.

4.4. Ablation study

An ablation analysis is presented in Figure 16. First, we show that using a pivot latent code from the W+ space rather than W (B) results in less editability, as the editing is less meaningful for both smile and pose. Skipping the initial inversion step and using the mean latent code (C) or a random latent code (E) results in substantially more distortion compared to ours. Similar results were obtained by optimizing the pivot latent code in addition to the generator, initialized to the mean latent code (D) or a random latent code (F), similar to Pan et al. [26]. In addition, we demonstrate that optimizing the pivot latent code is not necessary even when starting from an inverted pivot code. To do this, we start from an inverted code wp and perform PTI while allowing wp to change. We then feed the resulting code w̃p back to the original StyleGAN. Inspecting the two images, produced by wp and w̃p over the same generator, we see negligible change: 0.015 ± 5e−6 for LPIPS and 0.0012 ± 1e−6 for MSE. We conclude that our choice of pivot code is almost optimal, and hence we can lighten the optimization process by keeping the code fixed.

4.5. Implementation details

For the initial inversion step, we use the same hyperparameters as described by Karras et al. [19], except for the learning rate, which is changed to 5e−3. We run the inversion for 450 iterations. Then, for pivotal tuning, we further optimize for 350 iterations with a learning rate of 3e−4 using the Adam [20] optimizer. For reconstruction, we use λL2 = 1 and λLPIPS = 1, and for the regularization we use α = 30, λR_L2 = 1, λR = 0.1, and Nr = 1.

All quantitative experiments were performed on the first 1000 samples from the CelebA-HQ test set.

Our two-step inversion takes less than 3 minutes on a single Nvidia GeForce RTX 2080. The initial W-space inversion step takes approximately one minute, just like the SG2 inversion does. The pivotal tuning takes less than a minute without regularization, and less than two with it. This training time grows linearly with the number of inverted identities. The SG2 W+ inversion takes 4.5 minutes for 2100 iterations. The inversion time of e4e is less than a second, as it is encoder-based and does not require optimization at inference.
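For convenience, the hyperparameters reported above can be gathered into a single configuration; the field names below are illustrative and not taken from any released code.

# Hyperparameters reported in Section 4.5 (field names are illustrative).
PTI_CONFIG = dict(
    inversion_steps=450,  inversion_lr=5e-3,   # step 1: W-space inversion
    tuning_steps=350,     tuning_lr=3e-4,      # step 2: pivotal tuning (Adam)
    lambda_lpips=1.0,     lambda_l2=1.0,       # reconstruction losses
    reg_alpha=30.0,       reg_lambda_l2=1.0,   # locality regularization (alpha, lambda^R_L2)
    reg_lambda=0.1,       reg_num_samples=1,   # lambda_R and N_r
)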
[Figure 10 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 10. Editing of smile, age, and beard removal (top to bottom) comparison over out-of-domain images collected from the web. The
collected images portray unique hairstyles, hair colors, and apparel, along with unique facial features, such as heavy make-up and scars.
Even in these challenging cases, our results retain the original identity while enabling meaningful edits.
[Figure 11 rows: Input / Inversion / +Age / ±Smile]
Figure 11. "Friends" StyleGAN. We simultaneously invert multiple identities into StyleGAN latent space, while retaining high editability and identity similarity.
[Figure 12 rows: Original / ±Smile, Pose]
Figure 12. Sequential editing. We perform pivotal tuning inversion followed by two edits sequentially: rotation and smile.
[Figure 16 columns: Original / (A) / (B) / (C) / (D) / (E) / (F)]
Figure 16. Ablation study. We apply the same edits of smile (top) and pose (bottom). (A) Full approach. (B) We invert to the W+ space in the first step instead of W, which yields inferior editability. (C) The pivot latent code wp is replaced with the mean latent code µw. (D) The pivot wp is replaced with a random latent code. (E) We optimize the pivot latent code wp along with the generator, initialized to the mean latent code µw. (F) We optimize the pivot latent code wp along with the generator, initialized to a random latent code. As can be seen, (C)-(F) result in significantly more distortion.
[16] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019.
[18] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020.
[19] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110-8119, 2020.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
[21] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
[22] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[23] Junyu Luo, Yong Xu, Chenwei Tang, and Jiancheng Lv. Learning inverse mapping by autoencoder based generative adversarial nets. In International Conference on Neural Information Processing, pages 207-216. Springer, 2017.
[24] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437-2445, 2020.
[25] Microsoft. Azure face, 2020.
[26] Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, and Ping Luo. Exploiting deep generative prior for versatile image restoration and manipulation. In European Conference on Computer Vision, pages 262-277. Springer, 2020.
[27] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021.
[28] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
[29] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104-14113, 2020.
[30] Antoine Plumerault, Hervé Le Borgne, and Céline Hudelot. Controlling generative models with continuous factors of variations. arXiv preprint arXiv:2001.10238, 2020.
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[32] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.
[33] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. arXiv preprint arXiv:2007.06600, 2020.
[34] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243-9252, 2020.
[44] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. arXiv preprint arXiv:2004.00049, 2020.
[46] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Improved stylegan embedding: Where are the good latents?, 2020.
Appendix

A. Locality Regularization

We show the effect of different α values when using the locality regularization. Figure 17 presents a quantitative evaluation, and Figure 18 visually demonstrates the interpolated code wr. We measure both the reconstruction of the target image and the reconstruction of sampled random latent codes, denoted as in-domain, before and after the tuning. For small α values (e.g., α = 8), the image generated by the interpolated code wr is very similar to the image generated by the pivot code wp. Therefore, the regularization is limited and less effective. For high α values (e.g., α = 60), the interpolated code image is more or less equivalent to simply using a random latent wz. Hence, the interpolated code is less affected by the pivotal tuning, which decreases the regularization constraint. Extremely high α values (e.g., α = 120) result in extremely unrealistic images, as can be seen in Figure 18, which causes deterioration of the target image reconstruction. Overall, we get the most effective regularization using an interpolated image which is highly similar to both the pivot and the random images, e.g., α = 30 in Figure 18.

B. Visual Results

Figures 19 and 20 demonstrate the inversion of multiple identities. To prevent the suspicion of cherry-picking, we provide uncurated editing comparison results of the first 18 images from the CelebA-HQ test set in Figures 21 to 23. To further avoid picking, we perform the same three edits recurrently. Figures 24 to 27 present further comparisons of editing quality over real images of recognizable characters, and Figure 28 depicts reconstruction results. Finally, additional StyleClip editing results can be found in Figure 29. All additional results show that our method achieves higher reconstruction and editing quality, even for challenging images. This enables us to preserve the original identity successfully, while still maintaining high editability.
[Figure 18 columns: Pivot / Random / α = 2 / α = 8 / α = 16 / α = 30 / α = 60 / α = 120]
Figure 18. Visual demonstration of the image generated by interpolating the pivot image and a random image for different α values. As α increases, the image generated by the interpolated code is less similar to the pivot image and more similar to the random image, until reaching extremely high α, where the generated image is no longer realistic.
[Figure 19 rows: Real Image / Inversion / Age / Smile / Rotation]
Figure 19. FCB-StyleGAN. We invert multiple images of Barcelona Football Club players into a single StyleGAN latent space, and
demonstrate both high reconstruction and editing quality for the inverted identities.
[Figure 20 rows: Real Image / Inversion / Age / Smile / Rotation / Hair style]
Figure 20. Modern-Family StyleGAN. We invert multiple images of the television show "Modern Family" cast into a single StyleGAN latent space, and demonstrate both high reconstruction and editing quality for the inverted identities.
[Figure 21 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 21. Uncurated CelebA-HQ editing results. We evaluate the first identities in the test dataset using recurrent edits: smile, age, and rotation. Here identities 0-5 are depicted.
[Figure 22 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 22. Uncurated CelebA-HQ editing results. We evaluate the first identities in the test dataset using recurrent edits: smile, age, and rotation. Here identities 6-11 are depicted.
[Figure 23 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 23. Uncurated CelebA-HQ editing results. We evaluate the first identities in the test dataset using recurrent edits: smile, age, and rotation. Here identities 12-17 are depicted.
[Figure 27 columns: Original / SG2 W+ / e4e / SG2 / Ours; rows: Inversion / Old / Young / Eyes Closed / No beard / Rotation]
Figure 27. Comparison of various edits applied on the same challenging image. As can be seen, our method performs meaningful editing
while surpassing other methods in preserving fine details.
[Figure 28 columns: Original / SG2 W+ / e4e / SG2 / Ours]
Figure 28. Additional reconstruction results over real images. Our method preserves out-of-distribution details, such as earrings or complicated make-up.
[Figure 29 columns (two panels): Original / StyleClip / PTI+StyleClip]
Figure 29. Additional StyleClip comparison results. We perform various hair edits using StyleClip with and without pivotal tuning inversion.