Diffusion-driven GAN Inversion for Multi-modal Face Image Generation
Jihyun Kim1,2, Changjae Oh3, Hoseok Do2, Soohyun Kim2, Kwanghoon Sohn1,4*
1 Yonsei University   2 AI Lab, CTO Division, LG Electronics
3 Queen Mary University of London   4 Korea Institute of Science and Technology (KIST)
{hyunys21, khsohn}@yonsei.ac.kr [email protected] {hoseok.do, soohyun1.kim}@lge.com
Abstract
Figure 2. Overview of our method. We use a diffusion-based encoder E (the middle and decoder blocks of a denoising U-Net) that extracts the semantic features h_t, intermediate features f_t, and cross-attention maps a_t at denoising step t. We present the mapping network M (Sec. 3.2) and the attention-based style modulation network (AbSMNet) T (Sec. 3.3), which are trained across t (Sec. 3.4). M converts h_t into the mapped latent code w_t^m, and T uses f_t and a_t to control the facial attributes from the text prompt c and visual input x. The modulation codes w_t^γ and w_t^β are then used to scale and shift w_t^m to produce the final latent code w_t, which is fed to the pre-trained GAN G. We obtain the generation output I_t′ from our model Y, and we use the image I_0^d from the U-Net after the entire denoising process for training T (Sec. 3.4). Note that only the networks marked with a dashed outline are trainable, while the others are frozen.
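To make the data flow in Figure 2 concrete, the following is a minimal PyTorch-style sketch of the wiring only, not the authors' implementation: the number of style layers L, the channel sizes, and the internal design of the map2style block are assumptions, and the diffusion encoder E and the frozen generator G are stand-ins passed in from outside.

```python
import torch
import torch.nn as nn

L, D = 14, 512  # number of style layers and latent width (assumed values)

class Map2Style(nn.Module):
    """Convolutions that downsample a feature map to 1x1, then an FC layer
    producing an (L x D) latent code; loosely modeled on pSp's map2style [42]."""
    def __init__(self, in_ch, in_res):
        super().__init__()
        layers, ch, res = [], in_ch, in_res
        while res > 1:
            layers += [nn.Conv2d(ch, D, 3, stride=2, padding=1), nn.LeakyReLU(0.2)]
            ch, res = D, res // 2
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Linear(D, L * D)

    def forward(self, h_t):                 # h_t: (B, in_ch, in_res, in_res)
        return self.fc(self.convs(h_t).flatten(1)).view(-1, L, D)

class DiffusionDrivenInversion(nn.Module):
    """Wiring of Figure 2: M maps h_t to w_t^m, T turns (f_t, a_t) into scale/shift
    codes that modulate w_t^m, and the frozen GAN G renders the image."""
    def __init__(self, mapping_net, absm_net, generator):
        super().__init__()
        self.M, self.T, self.G = mapping_net, absm_net, generator
        for p in self.G.parameters():       # G stays frozen; only M and T are trained
            p.requires_grad_(False)

    def forward(self, h_t, f_t, a_t):
        w_m = self.M(h_t)                   # mapped latent code w_t^m (Sec. 3.2)
        w_gamma, w_beta = self.T(f_t, a_t)  # modulation codes w_t^gamma, w_t^beta (Sec. 3.3)
        w_t = w_m * w_gamma + w_beta        # scale-and-shift modulation
        return self.G(w_t)                  # generated image I_t'
```

Across denoising steps t, the same M and T are reused; only the extracted features h_t, f_t, and a_t change.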
the intermediate features f and the cross-attention maps a from the decoder blocks. h is then fed into the mapping network M, which transforms the rich semantic feature into a latent code w^m. The Attention-based Style Modulation Network (AbSMNet), T, takes f and a as input to generate the modulation latent codes, w^γ and w^β, that determine facial attributes related to the inputs. The latent code w is then forwarded to the pre-trained GAN G that generates the output image I′. Our model is trained across multiple denoising steps, and we use the denoising step t to indicate the features and images obtained at each denoising step. With this pipeline, we aim to estimate the latent code, w_t^*, that is used as input to G to render a GT image, I^gt:

w_t^* = arg min_{w_t} L(I^gt, G(w_t)),    (1)

where L(·,·) measures the distance between I^gt and the rendered image, I′ = G(w_t). We employ learning-based GAN inversion that estimates the latent code from an encoder to reconstruct an image according to the given inputs.

3.2. Mapping Network

Our mapping network M aims to build a bridge between the latent space of the diffusion-based encoder E and that of the pre-trained GAN G. E uses a text prompt and a visual input, and these textual and image embeddings are aligned by the cross-attention layers [62]. The feature maps h from the middle block of the denoising U-Net particularly contain rich semantics that resemble the latent space of the generator [28]. Here we establish the link between the latent spaces of E and G by using h_t across the denoising steps t. Given h_t, we design M to produce a 512-dimensional latent code w_t^m ∈ R^{L×512} that can be mapped to the latent space of G:

w_t^m = M(h_t).    (2)

M is designed based on the structure of the map2style block in pSp [42], as seen in Figure 2. This network consists of convolutional layers that downsample the feature maps and a fully connected layer that produces the latent code w_t^m.

3.3. Attention-based Style Modulation Network

By training M with learning-based GAN inversion, we can obtain w_t^m and use it as input to the pre-trained GAN for image generation. However, we observe that h_t shows limitations in capturing fine details of the facial attributes due to its limited spatial resolution and data loss during the encoding. Conversely, the feature maps of the DM's decoder blocks show rich semantic representations [53], benefiting from aggregating features from the DM's encoder blocks via skip connections. We hence propose a novel Attention-based Style Modulation Network (AbSMNet), T, that produces style modulation latent codes, w_t^γ, w_t^β ∈ R^{L×512}, using f_t and a_t from E. To better reflect the multi-modal representations in the final latent code w_t, we modulate w_t^m from M using w_t^γ and w_t^β, as shown in Figure 2. We extract intermediate features, f_t = {f_t^n}_{n=1}^{N}, from N different blocks, and cross-attention maps, a_t = {a_t^k}_{k=1}^{K}, from K different cross-attention layers of the n-th block, in E, which is the decoder stage of the denoising U-Net. The discrim-
[Figure 3 (not reproduced): for the input text prompt "The person has arched eyebrows, wavy hair, and mouth slightly open.", the figure shows (b) cross-attention maps A_t^0, A_t^1, A_t^2, and Ā_t for individual denoising steps from t = T to t = 0, and (c) an example of an intermediate feature map.]

Figure 4. Style modulation network in T. The refined intermediate feature maps F̂_t and F̄̂_t are used to capture local and global semantic representations, respectively. They are fed into the scale and shift networks, respectively. The weighted summations of these outputs are used as input to the map2style network, which finally generates the scale and shift modulation latent codes, w_t^γ and w_t^β.
each intermediate feature map of f_t to same-size intermediate feature maps F_t = {F_t^n}_{n=1}^{N}, where F_t^n ∈ R^{H×W×C_n} has H, W, and C_n as height, width, and depth.

Moreover, a_t is used to amplify the controlled facial attributes, as it incorporates semantically related information in the text and visual input. To match the dimension with F_t, we convert a_t to A_t = {A_t^n}_{n=1}^{N}, where A_t^n ∈ R^{H×W×C_n}, by max-pooling the output of the cross-attention layers in each decoder block and upsampling the max-pooling outputs. To capture the global representations, we additionally compute Ā_t ∈ R^{H×W×1} by depth-wise averaging the max-pooling output of a_t over each word in the text prompt and upsampling it. As illustrated in Figures 3 (a) and (b), A_t and Ā_t represent the specific regions aligned with the input text prompt and visual input, such as a semantic mask, across denoising steps t. By a pixel-wise multiplication between F_t and A_t, we can obtain the refined intermediate feature maps F̂_t that emphasize the representations related to multi- [...] are estimated by the weighted summation:

F̂_t^γ = α_t^γ F̂_t^{γ,l} + (1 − α_t^γ) F̂_t^{γ,g},    (3)
F̂_t^β = α_t^β F̂_t^{β,l} + (1 − α_t^β) F̂_t^{β,g},

where α_t^γ and α_t^β are learnable weight parameters. Through the map2style module, we then convert F̂_t^γ and F̂_t^β into the final scale, w_t^γ ∈ R^{L×512}, and shift, w_t^β ∈ R^{L×512}, latent codes. With these modulation latent codes, we achieve more precise control over facial details while corresponding to the multi-modal inputs at the pixel level.

Finally, the mapped latent code w_t^m from M is modulated by w_t^γ and w_t^β from T to obtain the final latent code w_t, which is used to generate the image I_t′ as follows:

w_t = w_t^m ⊙ w_t^γ ⊕ w_t^β,    (4)
I_t′ = G(w_t).    (5)
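Putting the pieces of Sec. 3.3 together, the sketch below illustrates one plausible single-scale realization of the attention-based modulation; it is not the released code. The pixel-wise products F_t ⊙ A_t and F_t ⊙ Ā_t stand in for the local and global refined features, the learnable weights follow Eq. (3), and separate map2style heads produce w_t^γ and w_t^β, which then scale and shift w_t^m as in Eqs. (4)-(5). Channel counts, resolutions, and the Scale/Shift Net bodies are assumptions.

```python
import torch
import torch.nn as nn

class AbSMSketch(nn.Module):
    """Single-scale sketch of the attention-based style modulation network T."""
    def __init__(self, ch, res, L=14, D=512):
        super().__init__()
        # Stand-ins for the Scale Net / Shift Net of Figure 4 (assumed 3x3 conv bodies).
        self.scale_l, self.scale_g = nn.Conv2d(ch, ch, 3, padding=1), nn.Conv2d(ch, ch, 3, padding=1)
        self.shift_l, self.shift_g = nn.Conv2d(ch, ch, 3, padding=1), nn.Conv2d(ch, ch, 3, padding=1)
        # Learnable blending weights alpha_t^gamma and alpha_t^beta of Eq. (3).
        self.alpha_gamma = nn.Parameter(torch.tensor(0.5))
        self.alpha_beta = nn.Parameter(torch.tensor(0.5))

        def map2style():  # convs down to 1x1 followed by an FC, as in Sec. 3.2
            layers, c, r = [], ch, res
            while r > 1:
                layers += [nn.Conv2d(c, D, 3, 2, 1), nn.LeakyReLU(0.2)]
                c, r = D, r // 2
            return nn.Sequential(*layers, nn.Flatten(1), nn.Linear(D, L * D))
        self.to_gamma, self.to_beta = map2style(), map2style()
        self.L, self.D = L, D

    def forward(self, F_t, A_t, A_bar_t):
        # Refined features: pixel-wise multiplication with per-word (local) and
        # word-averaged (global) cross-attention maps.
        F_local, F_global = F_t * A_t, F_t * A_bar_t
        # Eq. (3): weighted sums of the local and global branches.
        F_gamma = self.alpha_gamma * self.scale_l(F_local) + (1 - self.alpha_gamma) * self.scale_g(F_global)
        F_beta = self.alpha_beta * self.shift_l(F_local) + (1 - self.alpha_beta) * self.shift_g(F_global)
        w_gamma = self.to_gamma(F_gamma).view(-1, self.L, self.D)
        w_beta = self.to_beta(F_beta).view(-1, self.L, self.D)
        return w_gamma, w_beta

def modulate(w_m, w_gamma, w_beta):
    """Eqs. (4)-(5): element-wise scale and shift of the mapped code; feed G(w_t) afterwards."""
    return w_m * w_gamma + w_beta
```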
[Figure 5 (not reproduced): qualitative comparison for text + semantic-mask inputs. For each input mask, three text prompts (a)-(c) are used, e.g., (a) "She has high cheekbones, straight hair, black hair." (b) "She has high cheekbones, straight hair, blond hair." (c) "He has blond hair, sideburns."; rows show Inputs, TediGAN, UaC, ControlNet, Collaborative, and Ours.]
3.4. Loss Functions

To optimize M and T, we use a reconstruction loss, a perceptual loss, and an identity loss for image generation, and a regularization loss [42] that encourages the latent codes to be closer to the average latent code w̄.

For training M, we use the GT image I^gt as reference to encourage the latent code w_t^m to generate a photo-realistic image as follows:

L_M = λ_0^m ||I^gt − G(w_t^m)||_2 + λ_1^m ||F(I^gt) − F(G(w_t^m))||_2
    + λ_2^m (1 − cos(R(I^gt), R(G(w_t^m)))) + λ_3^m ||E(z_t, t, x, c) − w̄||_2,    (6)

where the hyper-parameters λ_(·) guide the effect of the losses. Similar to Equation (6), we freeze M while training T.

We further introduce a multi-step training strategy that considers the evolution of the feature representation in E over the denoising steps. We observe that E tends to focus more on text-relevant features in an early step, t = T, and on structure-relevant features in a later step, t = 0. Figure 3 (b) shows how the attention maps Ā vary across the denoising steps. By varying the denoising step, we can thus capture both textual and structural features. To effectively capture the semantic details of multi-modal conditions, our model is trained across multiple denoising steps.
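For concreteness, the snippet below sketches one way to compute the training objective of Eq. (6) and the multi-step sampling of t; it is an illustration under assumptions, not the paper's code. F is taken to be a perceptual feature extractor (e.g., a VGG/LPIPS-style network) and R a face-identity embedding network (e.g., ArcFace [8]); the λ values and the encoder_features helper are placeholders.

```python
import torch
import torch.nn.functional as nnF

def mapping_loss(I_gt, w_m, G, percept, id_net, w_avg,
                 lam0=1.0, lam1=0.8, lam2=0.1, lam3=0.005):
    """Sketch of Eq. (6); `percept` plays the role of F and `id_net` the role of R.
    The lambda weights are placeholders, not the paper's settings."""
    I_hat = G(w_m)                                        # image rendered from w_t^m
    rec = nnF.mse_loss(I_hat, I_gt)                       # reconstruction term
    perc = nnF.mse_loss(percept(I_hat), percept(I_gt))    # perceptual term
    ident = 1.0 - nnF.cosine_similarity(id_net(I_hat), id_net(I_gt), dim=-1).mean()
    reg = (w_m - w_avg).pow(2).mean()                     # pull the latent toward w_bar
    return lam0 * rec + lam1 * perc + lam2 * ident + lam3 * reg

# Multi-step training (illustrative): sample a denoising step t each iteration so that
# both early, text-heavy features (t near T) and late, structure-heavy features (t near 0)
# are seen during training. `encoder_features` is an assumed helper around the frozen E.
#
#   t = torch.randint(0, T_max + 1, (1,)).item()
#   h_t, f_t, a_t = encoder_features(x, c, t)
#   loss = mapping_loss(I_gt, M(h_t), G, percept, id_net, w_avg)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```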
[Figure 6 (not reproduced): comparison with IDE-3D for text + semantic-mask inputs, using prompts such as "The person has brown hair, and sideburn.", "The person has black hair, and wavy hair.", and "The person has gray hair, and straight hair."; rows show Inputs, IDE-3D, and Ours.]
Input conditions     | Method             | Model     | Domain | FID↓  | LPIPS↓ | SSIM↑ | ID↑  | ACC↑  | mIoU↑
Text + semantic mask | TediGAN [58]       | GAN       | 2D     | 54.83 | 0.31   | 0.62  | 0.63 | 81.68 | 40.01
                     | IDE-3D [51]        | GAN       | 3D     | 39.05 | 0.40   | 0.41  | 0.54 | 47.07 | 10.98
                     | UaC [35]           | Diffusion | 2D     | 45.87 | 0.38   | 0.59  | 0.32 | 81.49 | 42.68
                     | ControlNet [62]    | Diffusion | 2D     | 46.41 | 0.41   | 0.53  | 0.30 | 82.42 | 42.77
                     | Collaborative [19] | Diffusion | 2D     | 48.23 | 0.39   | 0.62  | 0.31 | 74.06 | 30.69
                     | Ours               | GAN       | 2D     | 46.68 | 0.30   | 0.63  | 0.76 | 83.41 | 43.82
                     | Ours               | GAN       | 3D     | 44.91 | 0.28   | 0.64  | 0.78 | 83.05 | 43.74
Text + scribble map  | ControlNet [62]    | Diffusion | 2D     | 93.26 | 0.52   | 0.25  | 0.21 | -     | -
                     | Ours               | GAN       | 2D     | 55.60 | 0.32   | 0.56  | 0.72 | -     | -
                     | Ours               | GAN       | 3D     | 48.76 | 0.34   | 0.49  | 0.62 | -     | -

Table 1. Quantitative results of multi-modal face image generation on CelebAMask-HQ [29] with annotated text prompts [58].
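Table 1 reports ACC and mIoU, which measure the alignment between the input semantic mask and the mask parsed from the generated image (see Evaluation Metrics below). A minimal sketch of these mask-alignment metrics is shown here; obtaining the predicted label map requires a pretrained face parser, which is assumed and not shown.

```python
import torch

def mask_metrics(pred_mask: torch.Tensor, gt_mask: torch.Tensor, num_classes: int):
    """Pixel accuracy (ACC) and mean IoU between two (H, W) integer label maps."""
    acc = (pred_mask == gt_mask).float().mean().item()
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_mask == c, gt_mask == c
        union = (pred_c | gt_c).sum().item()
        if union == 0:          # skip classes absent from both masks
            continue
        ious.append((pred_c & gt_c).sum().item() / union)
    miou = sum(ious) / max(len(ious), 1)
    return acc, miou
```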
descriptions provided by [58] describing the facial attributes, such as black hair, sideburns, etc., corresponding to the CelebAMask-HQ dataset. For the face image generation task using a scribble map, we obtain the scribble maps by applying PiDiNet [49, 50] to the RGB images in CelebAMask-HQ. We additionally compute camera parameters based on [4, 10] for 3D-aware image generation.

Comparisons. We compare our method with GAN-based models, such as TediGAN [58] and IDE-3D [51], and DM-based models, such as Unite and Conquer (UaC) [35], ControlNet [62], and Collaborative diffusion (Collaborative) [19], for the face generation task using a semantic mask and a text prompt. IDE-3D is trained with a CLIP loss term, like TediGAN, to apply a text prompt for 3D-aware face image generation. ControlNet is used for face image generation using a text prompt and a scribble map. We use the official codes provided by the authors, and we downsample the results to 256 × 256 for comparison.

Evaluation Metrics. For quantitative comparisons, we evaluate the image quality and semantic consistency using 2k sampled semantic mask-text prompt and scribble map-text prompt pairs. Frechet Inception Distance (FID) [17], LPIPS [63], and Multiscale Structural Similarity (MS-SSIM) [56] are employed for the evaluation of visual quality and diversity. We also compute the ID similarity mean score (ID) [8, 57] before and after applying a text prompt. Additionally, we assess the alignment accuracy between the input semantic masks and the results using mean Intersection-over-Union (mIoU) and pixel accuracy (ACC) for the face generation task using a semantic mask.

4.2. Results

Qualitative Evaluations. Figure 5 shows visual comparisons between our method and existing methods for 2D face image generation using a text prompt and a semantic mask as input. We use the same semantic mask with different text prompts (a)-(c). TediGAN produces results consistent with the text prompt, as its latent codes are optimized using the input text prompt. However, the results are inconsistent with the input semantic mask, as highlighted in the red boxes. UaC shows good facial alignment with the input semantic mask, but the results contain unexpected attributes, such as glasses, that are not indicated in the inputs. Collaborative and ControlNet produce inconsistent, blurry, and unrealistic images. Our model is capable of preserving semantic consistency with the inputs and generating realistic facial images. As shown in Figure 5, our method preserves the structure of the semantic mask, such as the hairline, face position, and mouth shape, while changing the attributes through a text prompt.

Figure 6 compares our method with IDE-3D [51] to validate the performance of 3D-aware face image generation
[Figure (not reproduced): 3D-aware face image generation from input text prompts (e.g., "This young woman has straight hair, and eyeglasses and wears lipstick.", "This man has gray hair.", "He has double chin, sideburns, and bags under eyes.", "This man has mouth slightly open, and arched eyebrows. He is smiling."), showing the input view and novel views for each prompt.]
Figure 9. Effect of using I^d from the denoising U-Net and the GT image I^gt in model training. Using text prompts (1, 2) with (a) the semantic mask, we show face images using our model trained with (b) I_0^d, (c) I^gt, and (d) both.

Figure 10. Visual examples of 3D face style transfer. Our method generates stylized multi-view images by mapping the latent features of DM and GAN.
We also show the advantages of using cross-attention maps in our model. The quantitative and qualitative results are presented in Table 2 and Figure 8, respectively. When using only M, we can generate face images that roughly preserve the structures of a given semantic mask in Figure 8 (a), including the outline of the facial components (e.g., face, eyes) in Figure 8 (b). On the other hand, T enables the model to express face attribute details effectively, such as hair color and an open mouth, based on the multi-modal inputs in Figure 8 (c). The FID and ACC scores are higher than those of the model using only M in Table 2 (b). We further present the impact of adopting cross-attention maps in T for style modulation. Figure 8 (d) shows how the attention-based modulation approach enhances the quality of the results, particularly in terms of the sharpness of the desired face attributes and the overall consistency between the generated image and the multi-modal conditions. Table 2 (e) demonstrates the effectiveness of our method by showing improvements in FID, LPIPS, ID, and ACC. Our method, including both M and T with cross-attention maps, significantly improves the FID, showing our model's ability to generate high-fidelity images. From the improvement of the ID score, the cross-attention maps enable the details of the input conditions to be applied to the relevant facial components.

Model Training. We analyze the effect of the loss terms L_M and L_T by comparing the performance of the model trained using either I_0^d from the denoising U-Net or the GT image I^gt. The model trained using I_0^d produces the images in Figure 9 (b), which more closely reflect the multi-modal conditions (a), such as "goatee" and "hair contour". In Table 2 (c), the ACC score of this model is higher than that of the model trained only using I^gt in Table 2 (d). The images generated by the model trained with I^gt in Figure 9 (c) are more perceptually realistic, as evidenced by the lower LPIPS score compared to the model trained with I_0^d in Table 2 (c) and (d). Using I^gt also preserves more condition-irrelevant features, as inferred by the ID scores in Table 2 (c) and (d). In particular, our method combines the strengths of the two models, as shown in Figure 9 (d) and Table 2 (e).

4.4. Limitations and Future Works

Our method can be extended to multi-modal face style transfer (e.g., face → Greek statue) by mapping the latent spaces of the DM and GAN without CLIP losses or additional datasets, as shown in Figure 10. For the 3D-aware face style transfer task, we train our model using I_0^d, which replaces the GT image I^gt in our loss terms. This method, however, is limited in that it cannot transfer extremely distinct style attributes from the artistic domain to the photo-realistic domain of the GAN. To better transfer facial style in the 3D domain, we will investigate methods to map the diffusion features related to the input pose into the latent space of the GAN in future work.

5. Conclusion

We presented a diffusion-driven GAN inversion method that translates multi-modal inputs into photo-realistic face images in 2D and 3D domains. Our method interprets the pre-trained GAN's latent space and maps the diffusion features into this latent space, which enables the model to easily adopt multi-modal inputs, such as a visual input and a text prompt, for face image generation. We also proposed to train our model across multiple denoising steps, which further improves the output quality and consistency with the multiple inputs. We demonstrated the capability of our method by using text prompts with semantic masks or scribble maps as input for 2D and 3D-aware face image generation and style transfer.
References

[1] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. In ICCV, 2021. 2
[2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022. 2
[3] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P Breckon, and Chris G Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In ECCV, 2022. 2
[4] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022. 2, 5, 6
[5] Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu. Deepfacedrawing: Deep generation of face images from sketches. TOG, 39(4):72–1, 2020. 2
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018. 1
[7] Min Jin Chong and David Forsyth. Jojogan: One shot face stylization. In ECCV, 2022. 1, 2
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019. 5, 6
[9] Kangle Deng, Gengshan Yang, Deva Ramanan, and Jun-Yan Zhu. 3d-aware conditional image synthesis. In CVPR, 2023. 2
[10] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In CVPR Workshops, 2019. 6
[11] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021. 1
[12] Hoseok Do, EunKyung Yoo, Taehyeong Kim, Chul Lee, and Jin Young Choi. Quantitative manipulation of custom attributes on 3d-aware image synthesis. In CVPR, 2023. 1
[13] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021. 2
[14] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. TOG, 41(4):1–13, 2022. 2
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014. 1
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 2
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017. 6
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020. 1
[19] Ziqi Huang, Kelvin CK Chan, Yuming Jiang, and Ziwei Liu. Collaborative diffusion for multi-modal face generation and editing. In CVPR, 2023. 2, 6
[20] Jaeseok Jeong, Mingi Kwon, and Youngjung Uh. Training-free style transfer emerges from h-space in diffusion models. In WACV, 2024. 2
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 1, 2
[22] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR, 2020. 1, 2, 5
[23] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023. 1
[24] Gwanghyun Kim and Se Young Chun. Datid-3d: Diversity-preserved domain adaptation using text-to-image diffusion for 3d generative model. In CVPR, 2023. 2
[25] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR, 2022. 2
[26] Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu. Dense text-to-image generation with attention modulation. In ICCV, 2023. 2
[27] Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwangrok Ryoo, and Seungryong Kim. 3d gan inversion with pose optimization. In WACV, 2023. 2
[28] Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264, 2022. 2, 3
[29] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image manipulation. In CVPR, 2020. 5, 6
[30] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. Stgan: A unified selective transfer network for arbitrary image attribute editing. In CVPR, 2019. 1
[31] Zhian Liu, Maomao Li, Yong Zhang, Cairong Wang, Qi Zhang, Jue Wang, and Yongwei Nie. Fine-grained face swapping via regional gan inversion. In CVPR, 2023. 2
[32] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In CVPR, 2021. 2
[33] Tianxiang Ma, Kang Zhao, Jianxin Sun, Jing Dong, and Tieniu Tan. Free-style and fast 3d portrait synthesis. arXiv preprint arXiv:2306.15419, 2023. 2
[34] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021. 2
[35] Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. Unite and conquer: Plug & play multi-modal synthesis using diffusion models. In CVPR, 2023. 2, 6
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 1, 2
[37] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In SIGGRAPH, 2023. 1
[38] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019. 2
[39] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, 2021. 1, 2
[40] Malsha V Perera and Vishal M Patel. Analyzing bias in diffusion-based face generation models. arXiv preprint arXiv:2305.06402, 2023. 1
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 1, 2
[42] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In CVPR, 2021. 1, 3, 5
[43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2
[44] Rohit Saha, Brendan Duke, Florian Shkurti, Graham W Taylor, and Parham Aarabi. Loho: Latent optimization of hairstyles via orthogonalization. In CVPR, 2021. 2
[45] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022. 1, 2
[46] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In CVPR, 2020. 1
[47] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In CVPR, 2023. 2
[48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021. 1
[49] Zhuo Su, Matti Pietikäinen, and Li Liu. Bird: Learning binary and illumination robust descriptor for face recognition. In BMVC, 2019. 6
[50] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikainen, and Li Liu. Pixel difference networks for efficient edge detection. In ICCV, 2021. 6
[51] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. TOG, 41:1–10, 2022. 1, 2, 6
[52] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. TOG, 40(4):1–14, 2021. 2
[53] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023. 1, 3
[54] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. In SIGGRAPH, 2023. 1
[55] Weilun Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, and Houqiang Li. Semantic image synthesis via diffusion models. arXiv preprint arXiv:2207.00050, 2022. 2
[56] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. 6
[57] Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, and Nenghai Yu. Hairclip: Design your hair by text and reference image. In CVPR, 2022. 2, 6
[58] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In CVPR, 2021. 1, 2, 6, 7
[59] Xin Yang, Xiaogang Xu, and Yingcong Chen. Out-of-domain gan inversion via invertibility decomposition for photo-realistic human face manipulation. In ICCV, 2023. 2
[60] Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, et al. 3d gan inversion with facial symmetry prior. In CVPR, 2023. 2
[61] Yu Yin, Kamran Ghasedi, HsiangTao Wu, Jiaolong Yang, Xin Tong, and Yun Fu. Nerfinvertor: High fidelity nerf-gan inversion for single-shot real image animation. In CVPR, 2023. 2
[62] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 2, 3, 5, 6
[63] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 5, 6
[64] Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In CVPR, 2023. 1
[65] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In AAAI, 2019. 2
[66] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In ECCV, 2020. 2
[67] Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. Barbershop: Gan-based image compositing using segmentation masks. arXiv preprint arXiv:2106.01505, 2021. 2