
Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Jihyun Kim1,2, Changjae Oh3, Hoseok Do2, Soohyun Kim2, Kwanghoon Sohn1,4*
1 Yonsei University   2 AI Lab, CTO Division, LG Electronics
3 Queen Mary University of London   4 Korea Institute of Science and Technology (KIST)
{hyunys21, khsohn}@yonsei.ac.kr   [email protected]   {hoseok.do, soohyun1.kim}@lge.com

arXiv:2405.04356v1 [cs.CV] 7 May 2024

* Corresponding author. This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF2021R1A2C2006703).

Abstract

We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of Generative Adversarial Networks (GANs) and diffusion models (DMs) by mapping the multi-modal features of the DM into the latent space of pre-trained GANs. We present a simple mapping network and a style modulation network to link the two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations in the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images, which align well with the inputs. We validate our method using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.

Figure 1. We present a method to map the diffusion features to the latent space of a pre-trained GAN, which enables diverse tasks in multi-modal face image generation and style transfer. Our method can be applied to 2D and 3D-aware face image generation.

1. Introduction

In recent years, multi-modal image generation has achieved remarkable success, driven by advances in Generative Adversarial Networks (GANs) [15] and diffusion models (DMs) [11, 18, 48]. Facial image processing has become a popular application for a variety of tasks, including face image generation [21, 39], face editing [6, 12, 30, 36, 37, 46], and style transfer [7, 64]. Many tasks utilize the pre-trained StyleGAN [21, 22], which can generate realistic facial images and edit facial attributes by manipulating the latent space with GAN inversion [39, 42, 58]. In these tasks, using multiple modalities as conditions has become a popular approach, as it improves the user's controllability in generating realistic face images. However, existing GAN inversion methods [51, 58] align poorly with the inputs because they neglect the correlation between the multi-modal inputs. They struggle to map the different modalities into the latent space of the pre-trained GAN, for example by mixing latent codes or by optimizing the latent code converted from a given image according to the input text.

Recently, DMs have attracted increasing attention in multi-modal image generation thanks to their training stability and the flexibility of using multiple modalities as conditions. DMs [23, 53, 54] can control multiple modalities and render diverse images by manipulating latent or attention features across the time steps. However, existing text-to-image DMs rely on an autoencoder and a text encoder, such as CLIP [41], trained on unstructured datasets collected from the web [40, 45], which may lead to unrealistic image generation.
Moreover, some approaches address multi-modal face image generation in a 3D domain. With GAN inversion [14, 51], multi-view images can be easily acquired by manipulating the latent code of a pre-trained 3D GAN. In contrast, DMs are inefficient at learning 3D representations and struggle to generate multi-view images directly due to the lack of 3D ground-truth (GT) data for training [32, 47]; they can instead be used as a tool to acquire training datasets for 3D-aware image generation [24, 33].

In this paper, we present a versatile face generative model that uses text and visual inputs. We propose an approach that combines the strengths of DMs and GANs and generates photo-realistic images with flexible control over facial attributes, and that can be adapted to the 2D and 3D domains, as illustrated in Figure 1. Our method employs a latent mapping strategy that maps the diffusion features into the latent space of a pre-trained GAN using multi-denoising step learning, producing a latent code that encodes the details of the text prompts and visual inputs.

In summary, our main contributions are:
(i) We present a novel method to link a pre-trained GAN (StyleGAN [22], EG3D [4]) and DM (ControlNet [62]) for multi-modal face image generation.
(ii) We propose a simple mapping network that links the latent spaces of the pre-trained GAN and DM, and an attention-based style modulation network that enables the use of meaningful features related to the multi-modal inputs.
(iii) We present a multi-denoising step training strategy that enhances the model's ability to capture the textual and structural details of multi-modal inputs.
(iv) Our model can be applied to both 2D- and 3D-aware face image generation without additional data or loss terms and outperforms existing DM- and GAN-based methods.

2. Related Work

2.1. GAN Inversion

GAN inversion approaches have gained significant popularity in the face image generation task [7, 31, 51, 59] using a pre-trained 2D GAN, such as StyleGAN [21, 22]. This approach has been extended to 3D-aware image generation [27, 60, 61] by integrating 3D GANs, such as EG3D [4]. GAN inversion can be categorized into learning-based, optimization-based, and hybrid methods. Optimization-based methods [44, 67] estimate the latent code by minimizing the difference between an output and an input image. Learning-based methods [1, 52] train an encoder that maps an input image into the latent space of the pre-trained GAN. Hybrid methods [58, 66] combine these two approaches, producing an initial latent code and then refining it with additional optimization. Our work employs learning-based GAN inversion, where a DM serves as the encoder. We produce latent codes by leveraging semantic features in the denoising U-Net, which can generate images with controlled facial attributes.

2.2. Diffusion Model for Image Generation

Many studies have introduced text-to-image diffusion models [36, 43, 45] that generate images by encoding multi-modal inputs, such as text and image, into latent features via foundation models [41] and mapping them to the features of a denoising U-Net via an attention mechanism. ControlNet [62] performs image generation by incorporating various visual conditions (e.g., semantic masks, scribbles, edges) together with text prompts. Image editing models using DMs [16, 20, 26, 28, 34] have exhibited excellent performance by controlling the latent features or the attention maps of a denoising U-Net. Moreover, DMs can generate and edit images by adjusting latent features over multiple denoising steps [2]. We focus on using the latent features of a DM, including intermediate features and cross-attention maps, across denoising steps to link them with the latent space of a GAN and develop a multi-modal face image generation task.

2.3. Multi-Modal Face Image Generation

Face generative models have progressed by incorporating various modalities, such as text [25], semantic masks [38, 55], sketches [5, 9], and audio [65]. Several methods adopt StyleGAN, which can generate high-quality face images and edit facial attributes by controlling the style vectors. Transformer-based models [3, 13] are also utilized, improving the performance of face image generation by handling the correlation between multi-modal conditions through image quantization. A primary challenge for face generative models is to modify the facial attributes specified by the given conditions while minimizing changes to the other attributes. Some methods [39, 57] edit facial attributes by manipulating the latent codes of GAN models. TediGAN [58] controls multiple conditions by leveraging an encoder to convert an input image into latent codes and optimizing them with a pre-trained CLIP model. Recent works [19, 35] use DMs to exploit the flexibility of taking multiple modalities as conditions and generate facial images directly from DMs. Unlike existing methods, we use the pre-trained DM [62] as an encoder to further produce the latent codes for the pre-trained GAN models.

3. Method

3.1. Overview

Figure 2 illustrates the overall pipeline of our approach. During the reverse diffusion process, we use the middle and decoder blocks of a denoising U-Net in ControlNet [62] as an encoder E. A text prompt c and a visual condition x are taken as input to the denoising U-Net. Subsequently, E produces the feature maps h from the middle block, and the intermediate features f and the cross-attention maps a from the decoder blocks.
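As a concrete illustration of how E exposes these signals, the minimal PyTorch sketch below registers forward hooks on a denoising U-Net to collect the middle-block features h_t, the decoder-block features f_t, and the cross-attention activations a_t at each denoising step. The module names (mid_block, up_blocks, attn2) follow the diffusers U-Net layout and are assumptions rather than part of our method; extracting the attention probability maps themselves would additionally require hooking into the attention computation.

```python
import torch
import torch.nn as nn

class DiffusionFeatureExtractor:
    """Sketch: collect h_t (middle block), f_t (decoder blocks), and the
    cross-attention activations a_t from a denoising U-Net via forward hooks.
    Module names follow the diffusers UNet layout and are assumptions."""

    def __init__(self, unet: nn.Module):
        self.h, self.f, self.a = None, [], []
        unet.mid_block.register_forward_hook(self._save_h)
        for block in unet.up_blocks:                  # decoder blocks
            block.register_forward_hook(self._save_f)
        for name, module in unet.named_modules():
            if name.endswith("attn2"):                # cross-attention layers (assumed naming)
                module.register_forward_hook(self._save_a)

    def _save_h(self, module, inputs, output):
        self.h = output                               # middle-block feature map h_t

    def _save_f(self, module, inputs, output):
        self.f.append(output)                         # one decoder block's features f_t^n

    def _save_a(self, module, inputs, output):
        self.a.append(output)                         # cross-attention output at layer k

    def reset(self):
        """Call before each denoising step t to collect fresh features."""
        self.h, self.f, self.a = None, [], []
```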
Figure 2. Overview of our method. We use a diffusion-based encoder E, the middle and decoder blocks of a denoising U-Net, that extracts
the semantic features ht , intermediate features ft , and cross-attention maps at at denoising step t. We present the mapping network M
(Sec. 3.2) and the attention-based style modulation network (AbSMNet) T (Sec. 3.3) that are trained across t (Sec. 3.4). M converts ht
into the mapped latent code wtm , and T uses ft and at to control the facial attributes from the text prompt c and visual input x. The
modulation codes wtγ and wtβ are then used to scale and shift wtm to produce the final latent code, wt , that is fed to the pre-trained GAN
G. We obtain the generation output It′ from our model Y and we use the image I0d from the U-Net after the entire denoising process for
training T (Sec. 3.4). Note that only the networks drawn with dashed lines are trainable, while the others are frozen.

h is then fed into the mapping network M, which transforms the rich semantic features into a latent code w^m. The Attention-based Style Modulation Network (AbSMNet), T, takes f and a as input to generate the modulation latent codes, w^γ and w^β, that determine the facial attributes related to the inputs. The latent code w is then forwarded to the pre-trained GAN G, which generates the output image I′. Our model is trained across multiple denoising steps, and we use the denoising step t to indicate the features and images obtained at each step. With this pipeline, we aim to estimate the latent code w_t^* that is used as input to G to render a GT image I^gt:

w_t^* = arg min_{w_t} L(I^gt, G(w_t)),   (1)

where L(·, ·) measures the distance between I^gt and the rendered image I′ = G(w_t). We employ learning-based GAN inversion, which estimates the latent code with an encoder to reconstruct an image according to the given inputs.

3.2. Mapping Network

Our mapping network M aims to build a bridge between the latent space of the diffusion-based encoder E and that of the pre-trained GAN G. E uses a text prompt and a visual input, and these textual and image embeddings are aligned by the cross-attention layers [62]. The feature maps h from the middle block of the denoising U-Net in particular contain rich semantics that resemble the latent space of the generator [28]. Here we establish the link between the latent spaces of E and G by using h_t across the denoising steps t. Given h_t, we design M to produce a 512-dimensional latent code w_t^m ∈ R^{L×512} that can be mapped to the latent space of G:

w_t^m = M(h_t).   (2)

M is designed based on the structure of the map2style block in pSp [42], as seen in Figure 2. The network consists of convolutional layers that downsample the feature maps and a fully connected layer that produces the latent code w_t^m.
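For illustration, a minimal PyTorch sketch of M in the spirit of the map2style block is given below. The input channel count, the spatial size of h_t, and the number of latent codes L are illustrative assumptions; the actual architecture is described in the Supplementary Material.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch of M: convolutional downsampling of h_t followed by a fully
    connected layer that emits L latent codes of dimension 512 (Eq. 2).
    Channel counts, the spatial size, and L are illustrative assumptions."""

    def __init__(self, in_channels=1280, num_latents=14, spatial=8):
        super().__init__()
        layers, ch, size = [], in_channels, spatial
        while size > 1:                               # halve the spatial resolution
            layers += [nn.Conv2d(ch, 512, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch, size = 512, (size + 1) // 2
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Linear(512, num_latents * 512)
        self.num_latents = num_latents

    def forward(self, h_t):                           # h_t: (B, C, H, W)
        x = self.conv(h_t).flatten(1)                 # (B, 512)
        return self.fc(x).view(-1, self.num_latents, 512)   # w_t^m: (B, L, 512)
```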
3.3. Attention-based Style Modulation Network

By training M with learning-based GAN inversion, we can obtain w_t^m and use it as input to the pre-trained GAN for image generation. However, we observe that h_t shows limitations in capturing fine details of the facial attributes due to its limited spatial resolution and the data loss during encoding. Conversely, the feature maps of the DM's decoder blocks show rich semantic representations [53], benefiting from the aggregation of features from the DM's encoder blocks via skip connections. We hence propose a novel Attention-based Style Modulation Network (AbSMNet), T, that produces the style modulation latent codes w_t^γ, w_t^β ∈ R^{L×512} from f_t and a_t extracted from E. To better reflect the multi-modal representations in the final latent code w_t, we modulate w_t^m from M using w_t^γ and w_t^β, as shown in Figure 2.

We extract intermediate features f_t = {f_t^n}_{n=1}^{N} from N different blocks, and cross-attention maps a_t = {a_t^k}_{k=1}^{K} from K different cross-attention layers of the n-th block, in E, i.e., the decoder stage of the denoising U-Net.
The discriminative representations are captured more faithfully because f_t consists of N multi-scale feature maps that can capture facial attributes of different sizes, which allows finer control over face attributes. For simplicity, we upsample each intermediate feature map of f_t to the same size, obtaining the intermediate feature maps F_t = {F_t^n}_{n=1}^{N}, where F_t^n ∈ R^{H×W×C_n} has height H, width W, and depth C_n.

Moreover, a_t is used to amplify the controlled facial attributes, as it incorporates semantically related information from the text and visual input. To match the dimensions with F_t, we convert a_t to A_t = {A_t^n}_{n=1}^{N}, where A_t^n ∈ R^{H×W×C_n}, by max-pooling the output of the cross-attention layers in each decoder block and upsampling the max-pooled outputs. To capture global representations, we additionally compute Ā_t ∈ R^{H×W×1} by depth-wise averaging the max-pooled output of a_t over each word in the text prompt and upsampling it. As illustrated in Figures 3 (a) and (b), A_t and Ā_t represent the specific regions aligned with the input text prompt and visual input, such as a semantic mask, across the denoising steps t. By pixel-wise multiplication between F_t and A_t, we obtain the refined intermediate feature maps F̂_t that emphasize the representations related to the multi-modal inputs, as shown in Figure 3 (c). The improved average feature map F̄̂_t ∈ R^{H×W×1} is also obtained by multiplying Ā_t with F̄_t, where F̄_t ∈ R^{H×W×1} is obtained by first averaging the feature maps in F_t = {F_t^n}_{n=1}^{N} and then depth-wise averaging the outputs. F̂_t and F̄̂_t distinguish text- and structure-relevant semantic features, which improves the alignment with the inputs.
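The refinement step above can be sketched as follows. For readability, this simplified version pools each block's cross-attention maps to a single channel and broadcasts it over the feature channels, whereas the full model matches the attention maps to the feature depth C_n; the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def refine_features(feats, attns, size):
    """Sketch of the attention-guided refinement (Sec. 3.3).
    feats: list of N decoder feature maps, each (B, C_n, h_n, w_n).
    attns: list of N cross-attention maps, each (B, n_words, h_n, w_n),
           assumed to be already reshaped to spatial maps.
    size:  common output resolution (H, W).
    Returns the refined maps F_hat (list of (B, C_n, H, W)) and the
    refined average map Fbar_hat of shape (B, 1, H, W)."""
    up = lambda x: F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    feats_up = [up(f) for f in feats]
    # A_t^n: max-pool the attention maps over words and upsample (single-channel
    # simplification of the depth-matched maps used in the full model).
    A = [up(a.max(dim=1, keepdim=True).values) for a in attns]
    F_hat = [f * a for f, a in zip(feats_up, A)]            # F_hat^n = F^n ⊙ A^n
    # Abar_t: word-wise (depth-wise) average of the attention maps.
    Abar = torch.stack([up(a.mean(dim=1, keepdim=True)) for a in attns]).mean(dim=0)
    # Fbar_t: average over blocks, then over channels.
    Fbar = torch.stack([f.mean(dim=1, keepdim=True) for f in feats_up]).mean(dim=0)
    Fbar_hat = Abar * Fbar
    return F_hat, Fbar_hat
```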

We use F̂_t and F̄̂_t as input to the style modulation network that produces the modulation codes w_t^γ and w_t^β, as shown in Figure 4. We capture both local and global features by using F̂_t, which consists of feature maps representing different local regions of the face, and F̄̂_t, which represents the entire face. We concatenate the N intermediate feature maps of F̂_t, concat(F̂_t^1 · · · F̂_t^N), and forward the result to the scale and shift networks, which consist of convolutional layers and Leaky ReLU, forming the local modulation feature maps F̂_t^{γ,l} and F̂_t^{β,l}. We also estimate the global modulation feature maps, F̂_t^{γ,g} and F̂_t^{β,g}, by feeding F̄̂_t to the scale and shift networks. The final scale, F̂_t^γ, and shift, F̂_t^β, feature maps are estimated by the weighted summations:

F̂_t^γ = α_t^γ F̂_t^{γ,l} + (1 − α_t^γ) F̂_t^{γ,g},
F̂_t^β = α_t^β F̂_t^{β,l} + (1 − α_t^β) F̂_t^{β,g},   (3)

where α_t^γ and α_t^β are learnable weight parameters. Through the map2style module, we then convert F̂_t^γ and F̂_t^β into the final scale, w_t^γ ∈ R^{L×512}, and shift, w_t^β ∈ R^{L×512}, latent codes. With these modulation latent codes, we achieve more precise control over facial details while corresponding to the input multi-modal conditions at the pixel level.

Finally, the mapped latent code w_t^m from M is modulated by w_t^γ and w_t^β from T to obtain the final latent code w_t, which is used to generate the image I_t′:

w_t = w_t^m ⊙ w_t^γ ⊕ w_t^β,   (4)
I_t′ = G(w_t).   (5)

Figure 3. Visualization of cross-attention maps and intermediate feature maps. (a) The semantic relation between an input text and an input semantic mask in the spatial domain; the meaningful representations of the inputs are shown across all denoising steps and N different blocks. (b) The N different cross-attention maps A_t at denoising steps t = T and t = 0. (c) An example of the refined intermediate feature map F̂_T^1 at the 1st block and t = T, which is emphasized according to the input multi-modal conditions. The red and yellow regions of the maps indicate higher attention scores. As the denoising step approaches T, the text-relevant features appear more clearly, and as the denoising step approaches 0, the features of the visual input are better preserved.

Figure 4. Style modulation network in T. The refined intermediate feature maps F̂_t and F̄̂_t are used to capture local and global semantic representations, respectively, and are fed into the scale and shift networks. The weighted summations of these outputs are used as input to the map2style network, which finally generates the scale and shift modulation latent codes w_t^γ and w_t^β.
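A compact sketch of the modulation step in Equations (3)-(5) is shown below. The scale/shift networks are reduced to two convolutional layers each, and the map2style head is replaced by average pooling followed by a linear layer; these are simplifications with illustrative channel counts, not the exact architecture.

```python
import torch
import torch.nn as nn

def _map2style(cin, num_latents):
    # stand-in for the map2style head: pool to 1x1, then a linear projection
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(cin, num_latents * 512),
                         nn.Unflatten(1, (num_latents, 512)))

class StyleModulation(nn.Module):
    """Sketch of the scale/shift branches of AbSMNet and Eqs. (3)-(4)."""

    def __init__(self, local_channels, num_latents=14):
        super().__init__()
        def branch(cin):                              # conv + LeakyReLU scale/shift net
            return nn.Sequential(nn.Conv2d(cin, 512, 3, padding=1), nn.LeakyReLU(0.2),
                                 nn.Conv2d(512, 512, 3, padding=1), nn.LeakyReLU(0.2))
        self.scale_local, self.shift_local = branch(local_channels), branch(local_channels)
        self.scale_global, self.shift_global = branch(1), branch(1)
        self.alpha_gamma = nn.Parameter(torch.tensor(0.5))   # learnable weights in Eq. (3)
        self.alpha_beta = nn.Parameter(torch.tensor(0.5))
        self.to_w_gamma = _map2style(512, num_latents)
        self.to_w_beta = _map2style(512, num_latents)

    def forward(self, F_hat, Fbar_hat, w_m):
        x = torch.cat(F_hat, dim=1)                   # concatenate the N refined maps
        g = self.alpha_gamma * self.scale_local(x) + (1 - self.alpha_gamma) * self.scale_global(Fbar_hat)
        b = self.alpha_beta * self.shift_local(x) + (1 - self.alpha_beta) * self.shift_global(Fbar_hat)
        w_gamma, w_beta = self.to_w_gamma(g), self.to_w_beta(b)   # (B, L, 512) each
        return w_m * w_gamma + w_beta                 # Eq. (4): w_t = w_m ⊙ w_gamma ⊕ w_beta
```

The returned latent code is then passed to the frozen generator to obtain I_t′ = G(w_t), as in Equation (5).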

Figure 5. Visual examples of 2D face image generation using a text prompt and a semantic mask. For each semantic mask, we use three different text prompts (a)-(c), resulting in different output images (a)-(c).

3.4. Loss Functions

To optimize M and T, we use a reconstruction loss, a perceptual loss, and an identity loss for image generation, together with a regularization loss [42] that encourages the latent codes to stay close to the average latent code w̄.

For training M, we use the GT image I^gt as a reference to encourage the latent code w_t^m to generate a photo-realistic image:

L_M = λ_0^m ||I^gt − G(w_t^m)||_2 + λ_1^m ||F(I^gt) − F(G(w_t^m))||_2 + λ_2^m (1 − cos(R(I^gt), R(G(w_t^m)))) + λ_3^m ||E(z_t, t, x, c) − w̄||_2,   (6)

where R(·) is the pre-trained ArcFace network [8], F(·) is the feature extraction network [63], z_t is the noisy image, and the hyper-parameters λ_(·)^m control the effect of each loss term. Note that we freeze T while training M.

For training T, we feed I_0^d, produced by the encoder E, into the reconstruction and perceptual losses. With these losses, L_T encourages the network to control facial attributes while preserving the identity of I^gt:

L_T = λ_0^s ||I_0^d − G(w_t)||_2 + λ_1^s ||F(I_0^d) − F(G(w_t))||_2 + λ_2^s (1 − cos(R(I^gt), R(G(w_t)))) + λ_3^s ||E(z_t, t, x, c) − w̄||_2,   (7)

where the hyper-parameters λ_(·)^s control the effect of each loss term. Similar to Equation 6, we freeze M while training T.

We further introduce a multi-step training strategy that considers the evolution of the feature representation in E over the denoising steps. We observe that E tends to focus more on text-relevant features at an early step, t = T, and on structure-relevant features at a later step, t = 0. Figure 3 (b) shows how the attention maps Ā vary across the denoising steps. As the attention maps indicate, we can capture both textual and structural features by varying the denoising step. To effectively capture the semantic details of the multi-modal conditions, our model is therefore trained across multiple denoising steps.
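The training objective and the multi-denoising-step strategy can be summarized in the sketch below. The `models` container, the perceptual and ArcFace wrappers, and the loss weights are stand-ins for the components described above, and the alternating freezing of M and T is indicated only in comments.

```python
import torch
import torch.nn.functional as F

def training_step(batch, models, lam_M, lam_T, t_max):
    """Sketch of one training iteration with the losses of Eqs. (6)-(7) and the
    multi-denoising-step strategy. `models` bundles the frozen encoder E and
    generator G, the trainable M and T, a perceptual network, an ArcFace
    embedder, and the average latent w_bar; all are stand-ins."""
    I_gt, x, c = batch["image"], batch["visual_cond"], batch["text"]
    # Sample a denoising step so that text-relevant (t near T) and
    # structure-relevant (t near 0) features are both seen during training.
    t = int(torch.randint(0, t_max + 1, (1,)))
    h_t, f_t, a_t, z_t, I_d0 = models.encode(x, c, t)      # frozen DM encoder E
    w_m = models.M(h_t)                                    # mapped latent code
    w_t = models.T(f_t, a_t, w_m)                          # modulated latent code

    def inversion_loss(target, image, lam):
        rec = F.mse_loss(image, target)                                     # reconstruction
        per = models.perceptual(image, target).mean()                       # perceptual
        idt = 1 - F.cosine_similarity(models.arcface(image),
                                      models.arcface(I_gt), dim=-1).mean()  # identity w.r.t. I_gt
        reg = (w_m - models.w_bar).pow(2).mean()                            # pull toward w_bar
        return lam[0] * rec + lam[1] * per + lam[2] * idt + lam[3] * reg

    loss_M = inversion_loss(I_gt, models.G(w_m), lam_M)    # Eq. (6): T is frozen
    loss_T = inversion_loss(I_d0, models.G(w_t), lam_T)    # Eq. (7): M is frozen
    return loss_M, loss_T
```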

4. Experiments

4.1. Experimental Setup

We use ControlNet [62] as the diffusion-based encoder that receives multi-modal conditions, including text and visual conditions such as a semantic mask or a scribble map. StyleGAN [22] and EG3D [4] are exploited as the pre-trained 2D and 3D GANs, respectively. See the Supplementary Material for the training details, the network architecture, and additional results.

Datasets. We employ the CelebAMask-HQ [29] dataset, comprising 30,000 face RGB images and annotated semantic masks covering 19 facial-component categories such as skin, eyes, and mouth. We also use the textual descriptions provided by [58], which describe facial attributes such as black hair and sideburns, corresponding to the CelebAMask-HQ images.
λs3 ∥E(zt , t, x, c) − w̄∥2 , as skin, eyes, mouth, and etc. We also use textual de-


Figure 6. Visual examples of 3D-aware face image generation using a text prompt and a semantic mask. We show images generated from the inputs at arbitrary viewpoints.

Input conditions | Method | Model | Domain | FID↓ | LPIPS↓ | SSIM↑ | ID↑ | ACC↑ | mIoU↑
Text + semantic mask | TediGAN [58] | GAN | 2D | 54.83 | 0.31 | 0.62 | 0.63 | 81.68 | 40.01
Text + semantic mask | IDE-3D [51] | GAN | 3D | 39.05 | 0.40 | 0.41 | 0.54 | 47.07 | 10.98
Text + semantic mask | UaC [35] | Diffusion | 2D | 45.87 | 0.38 | 0.59 | 0.32 | 81.49 | 42.68
Text + semantic mask | ControlNet [62] | Diffusion | 2D | 46.41 | 0.41 | 0.53 | 0.30 | 82.42 | 42.77
Text + semantic mask | Collaborative [19] | Diffusion | 2D | 48.23 | 0.39 | 0.62 | 0.31 | 74.06 | 30.69
Text + semantic mask | Ours | GAN | 2D | 46.68 | 0.30 | 0.63 | 0.76 | 83.41 | 43.82
Text + semantic mask | Ours | GAN | 3D | 44.91 | 0.28 | 0.64 | 0.78 | 83.05 | 43.74
Text + scribble map | ControlNet [62] | Diffusion | 2D | 93.26 | 0.52 | 0.25 | 0.21 | - | -
Text + scribble map | Ours | GAN | 2D | 55.60 | 0.32 | 0.56 | 0.72 | - | -
Text + scribble map | Ours | GAN | 3D | 48.76 | 0.34 | 0.49 | 0.62 | - | -

Table 1. Quantitative results of multi-modal face image generation on CelebAMask-HQ [29] with annotated text prompts [58].

For the face image generation task using a scribble map, we obtain the scribble maps by applying PiDiNet [49, 50] to the RGB images in CelebAMask-HQ. We additionally compute camera parameters based on [4, 10] for 3D-aware image generation.

Comparisons. We compare our method with GAN-based models, such as TediGAN [58] and IDE-3D [51], and DM-based models, such as Unite and Conquer (UaC) [35], ControlNet [62], and Collaborative Diffusion (Collaborative) [19], for the face generation task using a semantic mask and a text prompt. IDE-3D is trained with a CLIP loss term, like TediGAN, to apply a text prompt for 3D-aware face image generation. ControlNet is used for face image generation using a text prompt and a scribble map. We use the official code provided by the authors, and we downsample the results to 256 × 256 for comparison.

Evaluation Metrics. For quantitative comparisons, we evaluate image quality and semantic consistency using 2k sampled semantic mask-text and scribble map-text prompt pairs. Frechet Inception Distance (FID) [17], LPIPS [63], and Multiscale Structural Similarity (MS-SSIM) [56] are employed to evaluate visual quality and diversity. We also compute the mean ID similarity score (ID) [8, 57] before and after applying a text prompt. Additionally, for the face generation task using a semantic mask, we assess the alignment accuracy between the input semantic masks and the results using mean Intersection-over-Union (mIoU) and pixel accuracy (ACC).
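For reference, a minimal sketch of the mask-alignment metrics (ACC and mIoU) is given below; it assumes an off-the-shelf face parser has already produced a label map for each generated image, and it is an illustration rather than the exact evaluation protocol.

```python
import torch

def mask_alignment_scores(pred_mask, gt_mask, num_classes=19):
    """Pixel accuracy (ACC) and mean IoU between the input semantic mask and the
    mask parsed from the generated image. Both inputs are (H, W) integer label
    maps; `pred_mask` is assumed to come from a pre-trained face parser."""
    valid = gt_mask >= 0
    acc = (pred_mask[valid] == gt_mask[valid]).float().mean().item()
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_mask == c, gt_mask == c
        union = (pred_c | gt_c).sum().item()
        if union == 0:
            continue                      # skip classes absent from both masks
        ious.append((pred_c & gt_c).sum().item() / union)
    miou = sum(ious) / max(len(ious), 1)
    return acc, miou
```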
4.2. Results

Qualitative Evaluations. Figure 5 shows visual comparisons between our method and existing methods for 2D face image generation using a text prompt and a semantic mask as input. We use the same semantic mask with three different text prompts (a)-(c). TediGAN produces results consistent with the text prompt, as its latent codes are optimized using the input text prompt. However, the results are inconsistent with the input semantic mask, as highlighted by the red boxes. UaC shows good facial alignment with the input semantic mask, but its results contain unexpected attributes, such as glasses, that are not indicated in the inputs. Collaborative and ControlNet produce inconsistent, blurry, and unrealistic images. Our model preserves semantic consistency with the inputs and generates realistic facial images. As shown in Figure 5, our method preserves the structure of the semantic mask, such as the hairline, face position, and mouth shape, while changing the attributes through the text prompt.

Figure 6 compares our method with IDE-3D [51] to validate the performance of 3D-aware face image generation using a semantic mask and a text prompt.
We use the same semantic mask with different text prompts in Figures 6 (a) and (b), and the same text prompt with different semantic masks in Figures 6 (c) and (d). The results of IDE-3D are well aligned with the semantic mask for the frontal face. However, IDE-3D fails to produce accurate results when a non-frontal face mask is used as input. Moreover, its results do not reflect the text prompt. Our method captures the details provided by the input text prompts and semantic masks, even in the 3D domain.

Figure 7 shows visual comparisons with ControlNet on 2D face generation from a text prompt and a scribble map. The results of ControlNet and our method are both consistent with the text prompt and the scribble map. ControlNet, however, tends to over-emphasize the characteristic details related to the input conditions. Our method can easily adapt to the pre-trained 3D GAN and produce photo-realistic multi-view images from various viewpoints.

Figure 7. Visual examples of 3D-aware face image generation using text prompts and scribble maps. Using (1-4) the text prompts and their corresponding (a) scribble maps, we compare the results of (b) ControlNet with (c) the multi-view images generated by our method.

Figure 8. Effect of M and T. (b) shows the results using only M, and (c) shows the effect of the cross-attention maps (A and Ā) in T. The major changes are highlighted with white boxes.

Method | M | T | A_t | I^gt | I_0^d | FID↓ | LPIPS↓ | ID↑ | ACC↑
(a) | ✓ | | | ✓ | ✓ | 62.08 | 0.29 | 0.62 | 81.09
(b) | ✓ | ✓ | | ✓ | ✓ | 48.68 | 0.28 | 0.66 | 82.86
(c) | ✓ | ✓ | ✓ | | ✓ | 54.27 | 0.31 | 0.58 | 80.58
(d) | ✓ | ✓ | ✓ | ✓ | | 61.60 | 0.29 | 0.62 | 80.04
(e) | ✓ | ✓ | ✓ | ✓ | ✓ | 44.91 | 0.28 | 0.78 | 83.05

Table 2. Ablation analysis on 3D-aware face image generation using a text prompt and a semantic mask. We compare (a) and (b) with (e) to show the effect of our style modulation network, and (c) and (d) with (e) to analyze the effect of I^gt and I_0^d in model training.

Quantitative Evaluations. Table 1 reports the quantitative results on CelebAMask-HQ with the text prompts of [58]. Our method using text prompts and semantic masks improves all metrics in the 2D and 3D domains compared with TediGAN and UaC. Our model using a 2D GAN significantly improves the LPIPS, ID, ACC, and mIoU scores, surpassing TediGAN, UaC, ControlNet, and Collaborative. This demonstrates our method's strong ability to generate photo-realistic images while better reflecting the input multi-modal conditions. For 3D-aware face image generation using a text prompt and a semantic mask, it is reasonable that IDE-3D shows the best FID score, as that method additionally uses an RGB image as input to estimate the latent code for face generation. Our LPIPS, SSIM, and ID scores are significantly better than those of IDE-3D, by 0.116, 0.23, and 0.24, respectively. Our method using a 3D GAN also exhibits superior ACC and mIoU scores for the 3D face generation task compared to IDE-3D, with score differences of 35.98% and 32.76%, likely due to its ability to reflect textual representations in spatial information. In face image generation using a text prompt and a scribble map, our method outperforms ControlNet in FID, LPIPS, SSIM, and ID in both the 2D and 3D domains. Note that the ACC and mIoU scores are only applicable to semantic mask-based methods.

4.3. Ablation Study

We conduct ablation studies to validate the effectiveness of our contributions, including the mapping network M, the AbSM network T, and the loss functions L_M and L_T.

Effectiveness of M and T. We conduct experiments with different settings to assess the effectiveness of M and T.
We also show the advantages of using cross-attention maps in our model. The quantitative and qualitative results are presented in Table 2 and Figure 8, respectively. When using only M, we can generate face images that roughly preserve the structure of the given semantic mask in Figure 8 (a), including the outline of the facial components (e.g., face, eyes), as shown in Figure 8 (b). On the other hand, T enables the model to express facial attribute details effectively, such as hair color and an open mouth, based on the multi-modal inputs, as shown in Figure 8 (c); the FID and ACC scores are also better than those of the model using only M in Table 2 (b). We further examine the impact of adopting cross-attention maps in T for style modulation. Figure 8 (d) shows how the attention-based modulation approach enhances the quality of the results, particularly in terms of the sharpness of the desired face attributes and the overall consistency between the generated image and the multi-modal conditions. Table 2 (e) demonstrates the effectiveness of our method by showing improvements in FID, LPIPS, ID, and ACC. Our full model, including both M and T with cross-attention maps, significantly improves FID, showing its ability to generate high-fidelity images. The improvement of the ID score indicates that the cross-attention maps enable the details of the input conditions to be applied to the relevant facial components.

Model Training. We analyze the effect of the loss terms L_M and L_T by comparing models trained using either I_0^d from the denoising U-Net or the GT image I^gt. The model trained using I_0^d produces the images in Figure 9 (b), which more closely reflect the multi-modal conditions (a), such as "goatee" and the hair contour. In Table 2 (c), the ACC score of this model is higher than that of the model trained using only I^gt in Table 2 (d). The images generated by the model trained with I^gt in Figure 9 (c) are more perceptually realistic, as evidenced by the lower LPIPS score compared to the model trained with I_0^d in Table 2 (c) and (d). Using I^gt also preserves more condition-irrelevant features, as indicated by the ID scores in Table 2 (c) and (d). Our method combines the strengths of the two training signals, as shown in Figure 9 (d) and Table 2 (e).

Figure 9. Effect of using I_0^d from the denoising U-Net and the GT image I^gt in model training. Using the text prompts (1, 2) with (a) the semantic mask, we show face images generated by our model trained with (b) I_0^d, (c) I^gt, and (d) both.

Figure 10. Visual examples of 3D face style transfer. Our method generates stylized multi-view images by mapping the latent features of the DM and the GAN.

4.4. Limitations and Future Works

Our method can be extended to multi-modal face style transfer (e.g., face → Greek statue) by mapping the latent spaces of the DM and the GAN without CLIP losses or an additional dataset, as shown in Figure 10. For the 3D-aware face style transfer task, we train our model using I_0^d in place of the GT image I^gt in our loss terms. This approach, however, is limited in that it cannot transfer extremely distinct style attributes from the artistic domain to the photo-realistic domain of the GAN. To better transfer facial style in the 3D domain, we will investigate methods to map the diffusion features related to the input pose into the latent space of the GAN in future work.

5. Conclusion

We presented a diffusion-driven GAN inversion method that translates multi-modal inputs into photo-realistic face images in the 2D and 3D domains. Our method interprets the pre-trained GAN's latent space and maps the diffusion features into this latent space, which enables the model to easily adopt multi-modal inputs, such as a visual input and a text prompt, for face image generation. We also proposed to train our model across multiple denoising steps, which further improves the output quality and the consistency with the multiple inputs. We demonstrated the capability of our method by using text prompts with semantic masks or scribble maps as input for 2D and 3D-aware face image generation and style transfer.
References

[1] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. In ICCV, 2021.
[2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
[3] Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P Breckon, and Chris G Willcocks. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In ECCV, 2022.
[4] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022.
[5] Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu. Deepfacedrawing: Deep generation of face images from sketches. TOG, 39(4):72–1, 2020.
[6] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
[7] Min Jin Chong and David Forsyth. Jojogan: One shot face stylization. In ECCV, 2022.
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[9] Kangle Deng, Gengshan Yang, Deva Ramanan, and Jun-Yan Zhu. 3d-aware conditional image synthesis. In CVPR, 2023.
[10] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In CVPR Workshops, 2019.
[11] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021.
[12] Hoseok Do, EunKyung Yoo, Taehyeong Kim, Chul Lee, and Jin Young Choi. Quantitative manipulation of custom attributes on 3d-aware image synthesis. In CVPR, 2023.
[13] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[14] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. TOG, 41(4):1–13, 2022.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
[19] Ziqi Huang, Kelvin CK Chan, Yuming Jiang, and Ziwei Liu. Collaborative diffusion for multi-modal face generation and editing. In CVPR, 2023.
[20] Jaeseok Jeong, Mingi Kwon, and Youngjung Uh. Training-free style transfer emerges from h-space in diffusion models. In WACV, 2024.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[22] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR, 2020.
[23] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
[24] Gwanghyun Kim and Se Young Chun. Datid-3d: Diversity-preserved domain adaptation using text-to-image diffusion for 3d generative model. In CVPR, 2023.
[25] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR, 2022.
[26] Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu. Dense text-to-image generation with attention modulation. In ICCV, 2023.
[27] Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwangrok Ryoo, and Seungryong Kim. 3d gan inversion with pose optimization. In WACV, 2023.
[28] Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264, 2022.
[29] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image manipulation. In CVPR, 2020.
[30] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. Stgan: A unified selective transfer network for arbitrary image attribute editing. In CVPR, 2019.
[31] Zhian Liu, Maomao Li, Yong Zhang, Cairong Wang, Qi Zhang, Jue Wang, and Yongwei Nie. Fine-grained face swapping via regional gan inversion. In CVPR, 2023.
[32] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In CVPR, 2021.
[33] Tianxiang Ma, Kang Zhao, Jianxin Sun, Jing Dong, and Tieniu Tan. Free-style and fast 3d portrait synthesis. arXiv preprint arXiv:2306.15419, 2023.
[34] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[35] Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. Unite and conquer: Plug & play multi-modal synthesis using diffusion models. In CVPR, 2023.
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[37] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In SIGGRAPH 2023, 2023.
[38] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[39] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, 2021.
[40] Malsha V Perera and Vishal M Patel. Analyzing bias in diffusion-based face generation models. arXiv preprint arXiv:2305.06402, 2023.
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML. PMLR, 2021.
[42] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In CVPR, 2021.
[43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[44] Rohit Saha, Brendan Duke, Florian Shkurti, Graham W Taylor, and Parham Aarabi. Loho: Latent optimization of hairstyles via orthogonalization. In CVPR, 2021.
[45] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
[46] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In CVPR, 2020.
[47] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In CVPR, 2023.
[48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
[49] Zhuo Su, Matti Pietikäinen, and Li Liu. Bird: Learning binary and illumination robust descriptor for face recognition. In BMVC, 2019.
[50] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikainen, and Li Liu. Pixel difference networks for efficient edge detection. In ICCV, 2021.
[51] Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. TOG, 41:1–10, 2022.
[52] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. TOG, 40(4):1–14, 2021.
[53] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
[54] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. In SIGGRAPH, 2023.
[55] Weilun Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, and Houqiang Li. Semantic image synthesis via diffusion models. arXiv preprint arXiv:2207.00050, 2022.
[56] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003.
[57] Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, and Nenghai Yu. Hairclip: Design your hair by text and reference image. In CVPR, 2022.
[58] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In CVPR, 2021.
[59] Xin Yang, Xiaogang Xu, and Yingcong Chen. Out-of-domain gan inversion via invertibility decomposition for photo-realistic human face manipulation. In ICCV, 2023.
[60] Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, et al. 3d gan inversion with facial symmetry prior. In CVPR, 2023.
[61] Yu Yin, Kamran Ghasedi, HsiangTao Wu, Jiaolong Yang, Xin Tong, and Yun Fu. Nerfinvertor: High fidelity nerf-gan inversion for single-shot real image animation. In CVPR, 2023.
[62] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
[63] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[64] Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In CVPR, 2023.
[65] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. In AAAI, 2019.
[66] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In ECCV, 2020.
[67] Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. Barbershop: Gan-based image compositing using segmentation masks. arXiv preprint arXiv:2106.01505, 2021.
