Towards Counterfactual Image Manipulation via CLIP

Yingchen Yu (Nanyang Technological University & Alibaba Group, Singapore)
Fangneng Zhan (Max Planck Institute for Informatics, Saarbrücken, Saarland, Germany)
Rongliang Wu (Nanyang Technological University, Singapore)
Jiahui Zhang (Nanyang Technological University, Singapore)
Shijian Lu∗ (Nanyang Technological University, Singapore)
Miaomiao Cui (DAMO Academy, Alibaba Group, Beijing, China)
Xuansong Xie (DAMO Academy, Alibaba Group, Beijing, China)
Xian-Sheng Hua (Zhejiang University, Hangzhou, Zhejiang, China)
Chunyan Miao (Nanyang Technological University, Singapore)

[Figure 1 panels: rows show Input, StyleCLIP, and Ours; the target texts are "Green Lipstick", "Purple Eyebrows", "Blue Dog", "Bald Dog", "Spotty Cat", and "Mouse Ears".]

Figure 1: Illustration of text-driven counterfactual manipulation: The state-of-the-art StyleCLIP [23] often struggles to meet
target counterfactual descriptions as it optimizes CLIP scores directly, which is susceptible to adversarial solutions. Our design
achieves more accurate and robust counterfactual editing via effective text embedding mapping and a novel contrastive loss
CLIP-NCE that comprehensively exploits semantic knowledge of CLIP.

∗ Corresponding author.
ABSTRACT
Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images. An intriguing yet challenging problem arises: Can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP), which can offer rich semantic knowledge even


for various counterfactual concepts. Different from in-domain manipulation, counterfactual manipulation requires more comprehensive exploitation of the semantic knowledge encapsulated in CLIP as well as more delicate handling of editing directions to avoid getting stuck in local minima or producing undesired editing. To this end, we design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives. In addition, we design a simple yet effective scheme that explicitly maps CLIP embeddings (of target text) to the latent space and fuses them with latent codes for effective latent code optimization and accurate editing. Extensive experiments show that our design achieves accurate and realistic editing when driven by target texts with various counterfactual concepts.

CCS CONCEPTS
• Computing methodologies → Computer vision.

KEYWORDS
computer vision, deep learning, stylegan, clip, image manipulation

ACM Reference Format:
Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jiahui Zhang, Shijian Lu, Miaomiao Cui, Xuansong Xie, Xian-Sheng Hua, and Chunyan Miao. 2022. Towards Counterfactual Image Manipulation via CLIP. In Proceedings of the 30th ACM International Conference on Multimedia (MM '22), October 10–14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3503161.3547935

1 INTRODUCTION
Empowered by Generative Adversarial Networks (GANs) [10], we have observed great advances in image manipulation in recent years [23, 32, 33, 35, 37, 42]. In particular, StyleGANs [14–16] keep extending the boundary of realistic image synthesis. Meanwhile, their learnt latent spaces possess disentanglement properties, enabling various image manipulations by directly editing the latent codes of a pretrained StyleGAN [2, 11, 29, 34].

As StyleGAN is pretrained with images of a specific domain, most related manipulation methods are restricted to certain domain-specific editing, e.g., changing hair styles for a facial dataset. The conventional way of moving beyond the domain is to retrain the model with task-specific samples, but this becomes infeasible in counterfactual concept generation where training samples are scarce. Eliminating the need for manual efforts or additional training data has recently been explored for StyleGAN-based image manipulation. For example, StyleCLIP [23] achieves text-guided image manipulation by editing StyleGAN latent codes under the sole supervision of CLIP. However, StyleCLIP optimizes CLIP scores directly, which is susceptible to adversarial solutions or getting stuck in local minima [19]. Thus, it is typically restricted to in-domain manipulation. StyleGAN-NADA [9] enables out-of-domain generation with a directional CLIP loss, which encourages generation diversity and avoids adversarial solutions by aligning CLIP-space directions between source and target text-image pairs. However, StyleGAN-NADA optimizes the entire generator for zero-shot domain adaptation, which often requires manual setting of the number of optimization iterations for different cases to maintain manipulation accuracy and quality. Additionally, the directional CLIP loss guides the latent code optimization with only the identical CLIP-space direction (from source to target text embedding), which does not explicitly regularize the editing strength. Thus, it is prone to allow the model to over-edit the latent codes and consequently leads to inaccurate or excessive editing.

We design CF-CLIP, a CLIP-based text-guided image manipulation network that allows accurate CounterFactual editing without requiring additional training data or optimization of the entire generator. For counterfactual editing against the learnt prior, it is indispensable to extract the semantic knowledge of CLIP for discovering the editing direction in the latent space of a pretrained GAN. To this end, we design a CLIP-based Noise Contrastive Estimation (CLIP-NCE) loss that explores the CLIP semantic knowledge comprehensively. Instead of only emphasizing the identical directions as in the directional CLIP loss [9], CLIP-NCE maximizes the mutual information of selected positive pairs while minimizing it along other directions (i.e. negative pairs) in the CLIP space, which yields auxiliary information and facilitates latent code optimization toward desired editing directions.

In addition, editing accuracy is essential for manipulation tasks, which means that manipulations should only take place in the target edit regions and faithfully reflect the target description. In this paper, we enhance the editing accuracy in two different ways. First, we augment the target images with different perspective views before computing CLIP-NCE. This introduces two benefits: 1) the augmentations preserve semantic information and prevent the model from converging to adversarial solutions; 2) the consistent semantics across different perspective views encourage the model to discover the underlying geometry information in CLIP [18], which helps for a better understanding of image semantics as well as precise editing locally or globally. Second, we design a simple yet effective text embedding mapping (TEM) module to explicitly utilize the semantic knowledge of CLIP embeddings. The CLIP-space embedding of the target text is separately mapped into the StyleGAN latent space to disentangle the semantics of text embeddings. After fusing the disentangled text embeddings with the input latent codes, the model can leverage the semantic information to highlight the target-related latent codes in relevant StyleGAN layers. TEM thus enables the model to accurately locate the target editing regions and effectively suppress unwanted editing.

The contributions of this work can be summarized in three aspects. First, we design CF-CLIP, a CLIP-based image manipulation framework that enables accurate and high-fidelity counterfactual editing given a target textual description. Second, we design a contrastive loss in the CLIP space, dubbed CLIP-NCE, which provides comprehensive guidance and enables faithful counterfactual editing. Third, we design a simple yet effective text embedding mapping module (TEM) that allows explicit exploitation of CLIP embeddings during latent code optimization to facilitate accurate editing.

2 RELATED WORK
2.1 Text-guided synthesis and manipulation
Most existing work in text-guided synthesis adopts conditional GANs [21] that treat text embeddings as conditions [38]. For example, Reed et al. [26] extract text embeddings from a pretrained encoder. Xu et al. [36] employ an attention mechanism to improve


the generation quality. Instead of training GANs from scratch, TediGAN [35] incorporates a pretrained StyleGAN for text-guided image synthesis and manipulation by mapping the text to the latent space of StyleGAN. In addition, a large-scale pretrained CLIP model [24] for joint vision-language representation has recently been released. With powerful CLIP representations, StyleCLIP [23] achieves flexible and high-fidelity manipulation by exploring the latent space of StyleGAN. Instead of confining generated images to the trained domain, CLIPstyler [18] eliminates the necessity of StyleGAN and allows for text-guided style transfer with arbitrary source images. Beyond that, StyleGAN-NADA [9] optimizes the pretrained generative model with text conditions only, which enables out-of-domain manipulation via domain adaptation. As optimizing latent codes for counterfactual manipulation is more challenging than in-domain manipulation, the optimization tends to get stuck in local minima or drift into excessive modification during training, leading to inaccurate editing or artifacts. Our model mitigates this issue effectively by providing comprehensive guidance in the CLIP space with the proposed CLIP-NCE loss.

2.2 Counterfactual image generation
Early studies of counterfactual image generation served as tools for explaining image classifiers. For example, Chang et al. [5] present generative infilling to produce counterfactual contents within masked regions that align with the data distribution. For robust out-of-domain classification, Sauer and Geiger [28] enable counterfactual generation by disentangling object shape, texture, and background without direct supervision. However, these counterfactual generations often lack fidelity and consistency around the mask boundary. This situation is greatly alleviated with the appearance of DALL-E [25], which has billions of parameters and is trained with large-scale text-image pairs. With such powerful expressivity, DALL-E enables high-fidelity generation from a prompting text, even with novel counterfactual concepts. Leveraging the power of the large-scale pretrained model CLIP [24], Liu et al. [19] present a training-free, zero-shot framework that can generate counterfactual images. Similarly, Gal et al. [9] introduce a zero-shot domain adaptation method that enables out-of-domain generation even when the target domain is counterfactual. However, the aforementioned methods mainly focus on global style or semantic changes when generating counterfactual images, while our proposed method achieves more flexible and coherent counterfactual manipulation both locally and globally.

3 PRELIMINARIES
Similar to most existing CLIP-based manipulation methods [9, 23], our method leverages StyleGAN and CLIP, where the former yields disentangled latent codes and high-fidelity generation, and the latter provides powerful semantic knowledge for guiding the editing direction. We therefore first briefly review the fundamentals of StyleGAN and CLIP and highlight the properties that allow counterfactual image editing.

3.1 StyleGAN
StyleGANs [15, 16] are unconditional generative models that can synthesize high-resolution images progressively from random noise. Specifically, random noise z ∈ R^512 is sampled from a Gaussian distribution and injected into a mapping network, which projects z into a learned latent space W ∈ R^512. The latent codes w are then injected into different layers of the synthesis network to control different semantics at different resolutions, which endows the W space with good disentanglement properties during training [7, 29]. Recent GAN-inversion studies present the W+ ∈ R^(n_latent×512) space for fine-grained control of image semantics [1, 3, 27, 30], where n_latent is the number of StyleGAN layers. Although the synthesis generator is typically constrained by the strong prior learned from the training domain, the disentanglement of the latent codes suggests the feasibility of recombining different attributes to achieve counterfactual or out-of-domain manipulation. However, discovering proper editing directions remains a challenge. One promising solution is to incorporate guidance from large-scale CLIP based on its powerful text-image representation.
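To make the W/W+ notation concrete, the minimal PyTorch sketch below shows how a latent code sampled in W can be broadcast to W+ for per-layer editing. The `mapping` and `synthesis` callables stand in for the two halves of a pretrained StyleGAN2 generator; their names and the n_latent = 18 setting (for a 1024×1024 generator) are our assumptions, not part of the paper.

```python
import torch

# Minimal sketch of W / W+ latent handling. `mapping` and `synthesis` are
# placeholders for the two halves of a pretrained StyleGAN2 generator
# (hypothetical callables, not an official API); N_LATENT = 18 corresponds
# to a 1024x1024 generator and is an assumption.
N_LATENT = 18

def sample_w_plus(mapping, batch_size: int = 1) -> torch.Tensor:
    z = torch.randn(batch_size, 512)              # Gaussian noise z in R^512
    w = mapping(z)                                # latent code in W, shape [B, 512]
    return w.unsqueeze(1).repeat(1, N_LATENT, 1)  # lift to W+, shape [B, n_latent, 512]

# Editing then amounts to adding a per-layer residual to the W+ code and decoding:
#   image = synthesis(w_plus + delta_w)
```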
3.2 CLIP
CLIP [24] is trained with over 400 million text-image pairs to learn a joint vision-language representation through contrastive learning. It consists of two encoders that encode images and texts into 512-dimensional embeddings, respectively. By minimizing the cosine distance between the encoded embeddings of large-scale text-image pairs, CLIP learns a powerful multi-modal embedding space where the semantic similarity between texts and images can be well measured. Thanks to the large-scale training, rich semantic knowledge is encapsulated in CLIP embeddings, which can guide counterfactual or out-of-domain generation for pretrained generators. However, effective exploitation of such powerful representations remains an open research problem, which is also the focus of our study.

3.2.1 Global CLIP Loss. Recently, StyleCLIP [23] directly utilizes the CLIP-space cosine distance to guide image manipulation according to the semantics of the target text, which can be formulated as a global CLIP loss:

$\mathcal{L}_{global} = 1 - \cos(E_I(I_{edit}), E_T(t_{tgt})),$   (1)

where cos stands for cosine similarity, E_I and E_T denote the image encoder and text encoder of CLIP, respectively, I_edit is the manipulated image, and t_tgt is the target textual description.
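For concreteness, the following is a minimal PyTorch sketch of the global CLIP loss in Eq. (1), assuming the OpenAI CLIP package and an edited image batch that has already been resized and normalized with CLIP's preprocessing; it is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

def global_clip_loss(edit_image: torch.Tensor, target_text: str) -> torch.Tensor:
    """Eq. (1): 1 - cos(E_I(I_edit), E_T(t_tgt)).

    `edit_image` is assumed to be [B, 3, 224, 224], already preprocessed
    with CLIP's normalization statistics."""
    tokens = clip.tokenize([target_text]).to(device)
    image_emb = F.normalize(clip_model.encode_image(edit_image), dim=-1)  # E_I(I_edit)
    text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)        # E_T(t_tgt)
    # cosine similarity of normalized embeddings is a plain dot product
    return (1.0 - (image_emb * text_emb).sum(dim=-1)).mean()
```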
3.2.2 Directional CLIP Loss. A known issue of the global CLIP loss is adversarial solutions [19], which means that the model tends to fool the CLIP classifier by adding meaningless pixel-level perturbations to the image [9]. To mitigate this issue, a directional CLIP loss [9] is proposed to align the CLIP-space directions between the source and target text-image pairs, which is defined as:

$\Delta T = E_T(t_{tgt}) - E_T(t_{src}),$
$\Delta I = E_I(I_{edit}) - E_I(I_{src}),$
$\mathcal{L}_{dir} = 1 - \cos(\Delta T, \Delta I),$   (2)

where I_src = G(w) is the source image, and t_src denotes the source text with a neutral description such as "Photo", "Face", or "Dog".
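A corresponding sketch of the directional CLIP loss in Eq. (2) is shown below, reusing `clip_model`, `clip`, `device`, and `F` from the previous snippet; normalizing the two direction vectors makes the cosine a plain dot product.

```python
def directional_clip_loss(edit_image: torch.Tensor, src_image: torch.Tensor,
                          target_text: str, source_text: str) -> torch.Tensor:
    """Eq. (2): align Delta_I = E_I(I_edit) - E_I(I_src) with Delta_T = E_T(t_tgt) - E_T(t_src)."""
    t_tgt = clip_model.encode_text(clip.tokenize([target_text]).to(device))
    t_src = clip_model.encode_text(clip.tokenize([source_text]).to(device))
    delta_t = F.normalize(t_tgt - t_src, dim=-1)
    delta_i = F.normalize(clip_model.encode_image(edit_image)
                          - clip_model.encode_image(src_image), dim=-1)
    return (1.0 - (delta_i * delta_t).sum(dim=-1)).mean()
```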

4 METHOD
Given a text prompt that describes the target counterfactual manipulation together with a source image, our goal is to faithfully


[Figure 2 diagram: the latent code w and the CLIP text embedding e_t = E_T(t_tgt) (e.g., for "Green Lipstick") are fused through fc layers and the Text Mapper (TEM), passed to a Mapper to produce Δw and w′ = w + Δw, decoded by StyleGAN into I_edit, then augmented to I_aug and encoded by CLIP for the L_NCE loss.]

Figure 2: The framework of the proposed text-driven counterfactual image manipulation: Given a randomly sampled or inverted latent code w, a Text Mapper (detailed structure shown at the bottom left) first projects the CLIP embedding of the target text t_tgt (i.e., e_t) to a latent space and fuses it with w to highlight the editing in relevant generator layers. The fused embedding is then propagated to a Mapper to yield the residual Δw for editing. With the manipulated latent code w′, manipulated images are generated by StyleGAN, which are further augmented and encoded in the CLIP space for loss calculation. We exploit the semantic knowledge in CLIP to discover the editing direction under the guidance of the proposed CLIP-NCE loss L_NCE.

produce manipulated images against the strong prior of the pretrained generator. Due to the lack of counterfactual examples in natural datasets, we achieve this goal in a zero-shot manner by leveraging the semantic knowledge of the pretrained CLIP model as the sole supervision. Since manipulating the latent codes for counterfactual manipulation is typically more challenging than in-domain manipulation, how to comprehensively extract the semantic information encapsulated in CLIP and carefully guide the editing directions in the latent space are the focal points of our approach.
4.1 Pipeline
The framework of our proposed approach is illustrated in Fig. 2. With the target text prompt t_tgt, the text embedding E_T(t_tgt) extracted by the text encoder of CLIP is first forwarded to the Text Embedding Mapping (TEM) module, which is trained to project the encoded text embedding into the latent space for more explicit guidance from the CLIP model. Subsequently, the projected feature is concatenated with the latent code w, which can be randomly generated or inverted from real images. The concatenated latent codes are then fed into a fully-connected layer for feature fusion. Similarly to StyleCLIP [23], we employ a mapper to obtain the residual Δw from the fused features. After summing it back to the source latent code w, the resulting w′ = w + Δw is passed to the synthesis network of StyleGAN to produce the manipulated image I_edit. Next, the manipulated images I_edit are augmented with different perspective views, and the resulting images I_aug are used to calculate the proposed CLIP-NCE loss.
4.2 CLIP-NCE
Inspired by CLIPstyler [18], a random perspective augmentation is applied to the manipulated image I_edit before calculating the CLIP-NCE loss. This augmentation scheme facilitates our framework in approaching its goal from two perspectives. First, it becomes harder for the model to fool CLIP with adversarial solutions, because it now has to simultaneously produce appropriate perturbations across most of the randomly augmented images. Second, we conjecture that during the large-scale pretraining process, CLIP may learn to model the geometry information of the same object under different views. Hence, the multiple views provided by the perspective augmentation yield CLIP representations with geometry information for different views, which helps the model explore the semantic information of the CLIP model in a 3D-structure-aware manner. Therefore, the augmented images can be formulated as:

$I_{aug} = aug(I_{edit}),$   (3)

where aug(·) is the random perspective augmentation.
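As one concrete (assumed) instantiation of aug(·), torchvision's RandomPerspective can generate such views; the distortion scale and the number of views below are illustrative choices only, since the paper defers the exact augmentation settings to its supplementary material.

```python
import torch
from torchvision import transforms

# A possible instantiation of aug(.) in Eq. (3): random perspective warps of
# the edited image. distortion_scale and n_views are illustrative values.
perspective = transforms.RandomPerspective(distortion_scale=0.5, p=1.0)

def augment_views(edit_image: torch.Tensor, n_views: int = 8) -> torch.Tensor:
    """Return n_views randomly warped copies of I_edit, stacked along the batch dim."""
    return torch.cat([perspective(edit_image) for _ in range(n_views)], dim=0)
```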
As mentioned in Section 3.2, StyleCLIP [23] adopts the global CLIP loss for text-guided manipulation. Besides the adversarial-solution issue, the global CLIP loss may also suffer from a local-minimum issue during optimization. Taking "Green Lipstick" as an example, the global CLIP loss could get stuck at a local minimum such that the model changes other, more entangled areas into green (e.g., hair) rather than the target lipstick. The directional CLIP loss [9] helps to mitigate this issue by aligning the CLIP-space directions between the text-image pairs of source and target. However, since the CLIP-space direction from source to target texts is almost fixed, the directional CLIP loss may over-emphasize this identical direction and neglect other useful information encapsulated in the CLIP-space embeddings, which tends to allow the latent codes to travel too far and results in excessive or inaccurate editing.

Therefore, we propose a novel CLIP-based Noise Contrastive Estimation (CLIP-NCE) loss to maximize/minimize the mutual information between positive/negative pairs, which allows us to leverage


the semantic information of the CLIP model in a more comprehensive manner. In order to apply the CLIP-NCE loss, we need to define the query, positive and negative samples as in common contrastive losses [31, 39–41]. CLIP-NCE then pulls positive samples towards the query Q and pushes negative samples away from it, as illustrated in Fig. 3.

[Figure 3 diagram: CLIP-space embeddings E_I(I_src), E_I(I_aug), E_T(t_tgt) ("Blue Dog"), and E_T(t_src) (prompts of "Dog"); the query Q and positives K_T^+, K_I^+ are pulled together while the negatives K^- are pushed away.]

Figure 3: Illustration of CLIP-NCE: The contrastive loss is computed in the CLIP space, where the direction from the source to the manipulated embedding (blue arrow) forms the query for optimization. Two types of positive samples are used (orange arrows), where K_T^+ encourages the query to align with the direction from source to target texts and K_I^+ regularizes the latent codes from traveling too far. The purple arrow highlights negative samples from the source image to prompts of the source text. By pulling positive pairs and pushing negative pairs, CLIP-NCE enables comprehensive exploitation of CLIP representations.

Firstly, we define the query of the contrastive loss as follows:

$Q = E_I(I_{aug}) - E_I(I_{src}),$   (4)

where Q represents the CLIP-space direction from the augmented image to the source image, and it is the only term that reflects the optimization of the network during training. Next, the positive samples are formulated as follows, consisting of two components:

$K_T^+ = E_T(t_{tgt}) - E_T(t_{src}),$
$K_I^+ = E_T(t_{tgt}) - E_I(I_{src}).$   (5)

The first term K_T^+ represents the CLIP-space direction from the target text to the source text. Similarly to the directional CLIP loss [9], it is used to regularize the editing direction to align with the text embedding direction from source text to target text. The second term K_I^+ is defined as the CLIP-space direction from the source image to the target text. Sharing the same minus term -E_I(I_src) with Q, the positive samples encourage the editing directions to point toward the target text embedding, which helps to regularize the editing strength of the latent codes when the model over-emphasizes the text embedding direction K_T^+.

To best distill the semantic information of the CLIP model, leveraging as much useful information in the CLIP space as possible may help to provide more comprehensive guidance. Hence, we define the negative samples K^- as the CLIP-space directions from the source image to various neutral text descriptions, which may prevent the model from producing lazy manipulations that are classified as neutral descriptions like the source text. We thus design the negative samples K^- to steer the editing away from the CLIP embeddings of neutral text descriptions:

$K^- = E_T(t_{src}) - E_I(I_{src}).$   (6)

To increase the diversity of source texts, we adopt prompt engineering [24] to fill t_src into text templates such as "a photo of a {Dog}." According to [4, 22], maximizing mutual information allows diverse yet plausible solutions and helps to avoid averaged solutions for different instances.

Therefore, we adopt the InfoNCE loss [31] for the carefully defined terms. By maximizing the mutual information between the two selected positive pairs and minimizing the mutual information between negative pairs in the CLIP space, the CLIP-NCE loss provides comprehensive guidance from CLIP and significantly facilitates the optimization of latent codes. Specifically, the proposed CLIP-NCE loss can be formulated as follows:

$\mathcal{L}_{NCE} = -\log \frac{e^{Q \cdot K_T^+ / \tau}}{e^{Q \cdot K_T^+ / \tau} + \sum_{K^-} e^{Q \cdot K^- / \tau}} - \log \frac{e^{Q \cdot K_I^+ / \tau}}{e^{Q \cdot K_I^+ / \tau} + \sum_{K^-} e^{Q \cdot K^- / \tau}},$   (7)

where τ is the temperature and is set to 0.1 in our approach.
4.3 Text Embedding Mapping
To enhance editing accuracy by explicitly incorporating the semantic knowledge of CLIP embeddings, we design a simple yet effective text embedding mapping (TEM) module to map the CLIP-space embedding of the target text t_tgt into the latent space and fuse it with the original latent code w. Given t_tgt, we first embed it into a 512-dim vector in the CLIP space with the text encoder E_T of CLIP, i.e. e_t = E_T(t_tgt). To disentangle the text embeddings and allow key editing directions to be propagated to the corresponding generator layers, we employ a text mapper consisting of n_latent mapping networks, each of which has 4 consecutive fully-connected layers and projects the text embedding into the latent space corresponding to its layer. The text mapper can be defined as:

$M_e(e_t) = (M_e^1(e_t), M_e^2(e_t), \ldots, M_e^{n\_latent}(e_t)),$   (8)

where n_latent refers to the number of StyleGAN layers. Afterward, we concatenate the projected embedding with the original latent code w and fuse it with a fully-connected layer M_f to obtain the fused embedding that lies in the latent space of StyleGAN:

$e_f = M_f(M_e(e_t) \oplus w),$   (9)

where ⊕ is the concatenation operation. Lastly, we apply the mapper of StyleCLIP [23] to obtain the modified latent code:

$w' = w + M_t(e_f),$   (10)

where M_t is the mapper that consists of 3 groups (coarse, medium, and fine) of mapping networks to yield the target latent codes.
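The sketch below illustrates one way the TEM module of Eqs. (8)-(9) could be realized: the per-layer 4-fc text mappers and the fusion fc follow the description above, while the layer widths and the LeakyReLU activation are assumptions of ours rather than details given in the paper.

```python
import torch
import torch.nn as nn

class TEM(nn.Module):
    """Sketch of the Text Embedding Mapping module (Eqs. 8-9)."""

    def __init__(self, n_latent: int = 18, dim: int = 512):
        super().__init__()

        def four_fc() -> nn.Sequential:
            layers = []
            for _ in range(4):                       # 4 consecutive fc layers (Eq. 8)
                layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
            return nn.Sequential(*layers)

        # One text mapper per StyleGAN layer.
        self.text_mappers = nn.ModuleList([four_fc() for _ in range(n_latent)])
        # Fusion fc M_f over the concatenated [projected text, latent code] (Eq. 9).
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, e_t: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # e_t: [B, 512] CLIP text embedding; w: [B, n_latent, 512] latent codes in W+.
        m_e = torch.stack([m(e_t) for m in self.text_mappers], dim=1)  # [B, n_latent, 512]
        e_f = self.fuse(torch.cat([m_e, w], dim=-1))                   # [B, n_latent, 512]
        return e_f  # then w' = w + M_t(e_f) via the StyleCLIP-style mapper (Eq. 10)
```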


[Figure 4 panels: columns show the target texts "Purple Eyebrows", "Cyberpunk", "Blue Goatee", "Green Lipstick", "Elf Ears", "Gamora", and "Rainbow Hair"; rows show Input, TediGAN, StyleCLIP, StyleGAN-NADA, and Ours.]

Figure 4: Qualitative comparisons with the state-of-the-art over CelebA-HQ [13]: Rows 1-2 show the target texts for counterfactual manipulation and the input images, respectively, and the remaining rows show manipulations by different methods. The proposed CF-CLIP achieves clearly better visual photorealism and more faithful manipulation with respect to the target texts.

4.4 Loss Functions
In addition to the proposed CLIP-NCE loss, we follow StyleCLIP [23] to employ a latent norm loss and an identity loss for face-related manipulation to regularize the editing so that the identity can be preserved. The latent norm loss is the L2 distance in the latent space, i.e. $\mathcal{L}_{L2} = \|w - w'\|_2$. The identity loss uses the pretrained face recognition model ArcFace R [8] and is defined as:

$\mathcal{L}_{ID} = 1 - \cos(R(I_{edit}), R(I_{src})).$   (11)

As the identity loss does not work for non-facial datasets, we employ a perceptual loss [12] to penalize the semantic discrepancy:

$\mathcal{L}_{perc} = \|\Phi(I_{edit}) - \Phi(I_{src})\|_1,$   (12)

where Φ is the activation of the relu5_2 layer of the VGG-19 model. Therefore, the overall loss function of our approach is formulated as a weighted combination of the aforementioned losses:

$\mathcal{L} = \lambda_{NCE}\mathcal{L}_{NCE} + \lambda_{L2}\mathcal{L}_{L2} + \lambda_{ID}\mathcal{L}_{ID} + \lambda_{perc}\mathcal{L}_{perc},$   (13)

where λ_NCE, λ_L2, λ_ID and λ_perc are empirically set at 0.3, 0.8, 0.2 and 0.01, respectively, in our implementation. Note that we set λ_ID to 0 for non-facial datasets and λ_perc to 0 for facial datasets.
to 0 for non-facial datasets and 𝜆𝑝𝑒𝑟𝑐 to 0 for facial datasets. conduct comparisons on face-related manipulation only.

[Figure 5/6 panels: target texts include "Green Tongue", "Pink Cat", "Mohawk Hairstyle", "Purple Nose", "Tiger Stripes", and "Lion Hair"; columns show Input, StyleCLIP, StyleGAN-NADA, and Ours.]

Figure 5: Visual comparison with the state-of-the-art over AFHQ Dog [6]. The target texts appear on the left column and the input images are sampled randomly.

Figure 6: Visual comparison with the state-of-the-art over AFHQ Cat [6]. The target texts appear on the left column and the input images are sampled randomly.

• StyleCLIP [23] is the first text-driven image manipulation method that leverages the power of CLIP. It presents 3 different techniques for image manipulation. We applied the StyleCLIP mapper to conduct comparisons with the official implementation.
• StyleGAN-NADA [9] enables text-driven, out-of-domain generator adaptation with CLIP. As it is almost impossible to find the optimal setting for each case, we empirically set the number of optimization iterations at 300, as the image quality becomes worse afterwards. A StyleCLIP mapper is trained for each case based on the shifted generator following the official implementation.

5.1.3 Implementation Details. We trained on inverted latent codes of CelebA-HQ [13] for face-related manipulation and on randomly sampled latent codes for the other domains. We adopt the Adam [17] optimizer with a learning rate of 0.5 as in [23]. The number of training iterations is uniformly set to 50,000 with a batch size of 2 for every case. All experiments were conducted on 4 NVIDIA V100 GPUs.
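A compact, self-contained sketch of this training setup (Adam, learning rate 0.5, 50,000 iterations, batch size 2) is shown below; the tiny linear module and the dummy objective are placeholders standing in for the actual mapper/TEM networks and the Eq. (13) loss.

```python
import torch
import torch.nn as nn

# Placeholder training loop mirroring Sec. 5.1.3: Adam with lr 0.5 (as in
# StyleCLIP), 50,000 iterations, batch size 2. The linear layer and the dummy
# objective stand in for the actual mapper/TEM networks and the Eq. (13) loss.
model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=0.5)

for step in range(50_000):
    w = torch.randn(2, 512)              # batch of 2 latent codes (sampled or inverted)
    loss = model(w).pow(2).mean()        # stand-in for the overall objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```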
5.2 Results
Figs. 4, 5 and 6 show qualitative results over CelebA-HQ [13], AFHQ [6] Dog and AFHQ Cat, respectively. More illustrations and quantitative analysis are provided in the Supplementary Materials.

We first compare CF-CLIP with TediGAN [35], StyleCLIP [23] and StyleGAN-NADA [9] on CelebA-HQ [13] with counterfactual manipulation of facial images. The driving texts are designed to cover both global (e.g., "Cyberpunk") and local (e.g., "Elf Ears") manipulations. As shown in Fig. 4, TediGAN [35] tends to produce artifacts instead of meaningful changes toward the target. StyleCLIP [23] struggles to yield promising counterfactual manipulation especially for local manipulations, e.g. it edits the background and clothes instead of the target regions for "Purple Eyebrows" and "Green Lipstick". StyleGAN-NADA enables out-of-domain manipulation and its manipulations generally meet the target descriptions, but it often suffers from degradation in both identity and image quality even after fine-tuning with a StyleCLIP mapper. CF-CLIP instead generates faithful and high-fidelity manipulations with minimal identity loss for various counterfactual text descriptions.

We also compare CF-CLIP with StyleCLIP [23] and StyleGAN-NADA [9] on the AFHQ [6] Dog and Cat domains. As shown in Fig. 5, StyleCLIP [23] hardly meets any counterfactual descriptions and tends to produce unwanted semantic changes. StyleGAN-NADA demonstrates promising editing directions toward the target texts, but the image quality and object semantics are degraded. In contrast, CF-CLIP yields faithful and visually appealing manipulations aligning with the target texts, with great preservation of image quality and identity information. For the AFHQ Cat domain, Fig. 6 suggests that StyleCLIP [23] makes global changes disregarding the target editing regions. StyleGAN-NADA can produce reasonable manipulation towards the target descriptions, but suffers from excessive or inaccurate editing. Despite slight identity loss, CF-CLIP editing faithfully aligns with the target descriptions with fair realism.

5.3 User Study
We also performed subjective user studies to compare the proposed CF-CLIP with state-of-the-art text-guided manipulation methods. The studies were performed over the datasets CelebA-HQ [13] and AFHQ [6] Dog & Cat. For each dataset, we randomly sampled 3 manipulated images from each of 8 different counterfactual editing descriptions for each benchmarking method, and then shuffled their order, which yields 24 groups of results for evaluation. The images were then presented to 30 recruited volunteers who were asked to perform two tasks. The first task is to measure Edit Accuracy, where the subjects provide a ranking for each method based on how well the manipulated images by different methods meet the target descriptions. The second task requires the subjects to rank the same images according to their Visual Realism, i.e., how naturally the counterfactual concepts fit with the manipulated images. For both tasks, 1 means the best and 4 the worst (3 for AFHQ Cat and Dog as TediGAN [35] is only for face-related manipulation). Table 1 shows the experimental results. It can be observed that


Table 1: User study results over CelebA-HQ [13], AFHQ [6] Dog and Cat. Edit Accuracy refers to whether the manipulations faithfully reflect the target text descriptions, and Visual Realism denotes the photorealism of the manipulated image. The numbers in the table are average rankings of users' preferences (lower is better).

Methods              | CelebA-HQ [13]              | AFHQ Dog [6]                | AFHQ Cat [6]
                     | Edit Acc. ↓  Visual Real. ↓ | Edit Acc. ↓  Visual Real. ↓ | Edit Acc. ↓  Visual Real. ↓
TediGAN [35]         | 3.23         2.99           | N/A          N/A            | N/A          N/A
StyleCLIP [23]       | 2.73         2.51           | 2.40         2.29           | 2.46         2.28
StyleGAN-NADA [9]    | 2.35         2.60           | 1.94         1.99           | 1.82         1.97
Ours                 | 1.69         1.89           | 1.66         1.71           | 1.71         1.74
[Figure 7 panels: rows show the prompts "Purple Eyebrows" and "Green Tongue"; columns show Input, w/ L_global, w/ L_dir, w/o TEM, w/o aug, w/ Affine, w/ Crop, and Ours.]

Figure 7: Qualitative ablation studies of CF-CLIP over CelebA-HQ [13] and AFHQ Dog [6]: The part in the blue box shows the
effectiveness of the proposed CLIP-NCE loss and TEM. The part in the orange box shows the effectiveness of our augmentation
scheme. The last column shows the CF-CLIP manipulation.

CF-CLIP outperforms the state-of-the-art consistently in both tasks over the different datasets.

5.4 Ablation Study
We study the individual contributions of our technical designs through several ablation studies, as illustrated in Fig. 7. More ablation studies and analysis can be found in the Supplementary Materials.

5.4.1 CLIP-NCE and TEM. We trained three models to examine the proposed CLIP-NCE loss and TEM module: 1) w/ L_global, which replaces the CLIP-NCE loss with the global CLIP loss [23]; 2) w/ L_dir, which replaces the CLIP-NCE loss with the directional CLIP loss [9]; 3) w/o TEM, which removes the proposed TEM module. As shown in Fig. 7, using the global CLIP loss L_global tends to manipulate the color of the eyes instead of the target eyebrows in the "Purple Eyebrows" case, and gets stuck in a local minimum in the second case. The directional CLIP loss L_dir leads to excessive editing of all related areas in the first case, and fails to edit the tongue color in the second case. Without TEM, the model fails to locate target areas accurately, e.g., it edits the eye color instead of the eyebrows. The ablation studies show that both the CLIP-NCE loss and the TEM module are indispensable for achieving accurate and realistic counterfactual manipulation.

5.4.2 Augmentation. We also trained three models to examine the effect of perspective augmentation: 1) w/o aug, which removes the augmentation scheme and calculates the loss directly on the generated images; 2) w/ Affine, which replaces the random perspective augmentation with random affine augmentation; 3) w/ Crop, which performs random crop augmentation and then resizes images back to their original resolution. The augmentation details are provided in the Supplementary Materials. Without perspective augmentation, the model loses certain understanding of the 3D structure of the target object/face and fails to locate the target regions (e.g., the eyeshadow in the first case and the dog mouth in the second one). 2D affine transformations on the manipulated images cannot fully substitute the effects of perspective transformations, as illustrated in both cases. We can also observe that random cropping has very limited effect and tends to lead to undesired editing (e.g., green eyes in the second case) due to the increased local views from random cropping.

6 CONCLUSION
This paper presents CF-CLIP, a novel text-guided image manipulation framework that achieves realistic and faithful counterfactual editing by leveraging the great potential of CLIP. To comprehensively explore the rich semantic information of CLIP for counterfactual concepts, we propose a novel contrastive loss, CLIP-NCE, to guide the editing from different perspectives based on predefined CLIP-space directions. In addition, a simple yet effective text embedding mapping (TEM) module is designed to explicitly leverage the CLIP embeddings of the target text for more precise editing of latent codes in relevant generator layers. Extensive experiments show that our approach can produce realistic and faithful editing given target texts with counterfactual concepts. Moving forward, more challenging counterfactual manipulations with drastic semantic changes will be explored to further unleash the user's creativity.


REFERENCES
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. 2019. Image2StyleGAN: How to embed images into the StyleGAN latent space?. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4432–4441.
[2] Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. 2021. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG) 40, 3 (2021), 1–21.
[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. 2021. ReStyle: A residual-based StyleGAN encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6711–6720.
[4] Alex Andonian, Taesung Park, Bryan Russell, Phillip Isola, Jun-Yan Zhu, and Richard Zhang. 2021. Contrastive feature loss for image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1934–1943.
[5] Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. 2018. Explaining image classifiers by counterfactual generation. arXiv preprint arXiv:1807.08024 (2018).
[6] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. 2020. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8188–8197.
[7] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. 2020. Editing in style: Uncovering the local semantics of GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5771–5780.
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
[9] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. 2021. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946 (2021).
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
[11] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. 2020. GANSpace: Discovering interpretable GAN controls. Advances in Neural Information Processing Systems 33 (2020), 9841–9850.
[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 694–711.
[13] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality, stability, and variation. International Conference on Learning Representations (2018).
[14] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems 34 (2021).
[15] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410.
[16] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8110–8119.
[17] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[18] Gihyun Kwon and Jong Chul Ye. 2021. CLIPstyler: Image style transfer with a single text condition. arXiv preprint arXiv:2112.00374 (2021).
[19] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. 2021. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573 (2021).
[20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In International Conference on Computer Vision. 3730–3738.
[21] Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
[22] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. 2020. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision. Springer, 319–345.
[23] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2085–2094.
[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[25] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
[26] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR, 1060–1069.
[27] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in style: A StyleGAN encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2287–2296.
[28] Axel Sauer and Andreas Geiger. 2021. Counterfactual generative networks. arXiv preprint arXiv:2101.06046 (2021).
[29] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. 2020. InterFaceGAN: Interpreting the disentangled face representation learned by GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[30] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. 2021. Designing an encoder for StyleGAN image manipulation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–14.
[31] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[32] Rongliang Wu and Shijian Lu. 2020. LEED: Label-free expression editing via disentanglement. In European Conference on Computer Vision. Springer, 781–798.
[33] Rongliang Wu, Gongjie Zhang, Shijian Lu, and Tao Chen. 2020. Cascade EF-GAN: Progressive facial expression editing with local focuses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5021–5030.
[34] Zongze Wu, Dani Lischinski, and Eli Shechtman. 2021. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12863–12872.
[35] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. 2021. TediGAN: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2256–2265.
[36] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1316–1324.
[37] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Kaiwen Cui, Aoran Xiao, Shijian Lu, and Ling Shao. 2021. Bi-level feature alignment for versatile image translation and manipulation. arXiv preprint arXiv:2107.03021 (2021).
[38] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, and Shijian Lu. 2021. Multimodal image synthesis and editing: A survey. arXiv preprint arXiv:2112.13592 (2021).
[39] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, and Changgong Zhang. 2022. Marginal contrastive correspondence for guided image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10663–10672.
[40] Fangneng Zhan, Jiahui Zhang, Yingchen Yu, Rongliang Wu, and Shijian Lu. 2022. Modulated contrast for versatile image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18280–18290.
[41] Jiahui Zhang, Shijian Lu, Fangneng Zhan, and Yingchen Yu. 2021. Blind image super-resolution via contrastive representation learning. arXiv preprint arXiv:2107.00708 (2021).
[42] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. 2016. Generative visual manipulation on the natural image manifold. In ECCV. Springer, 597–613.
