
ReVersion: Diffusion-Based Relation Inversion from Images

Ziqi Huang*  Tianxing Wu*  Yuming Jiang  Kelvin C.K. Chan  Ziwei Liu
S-Lab, Nanyang Technological University
{ziqi002, twu012, yuming002, chan0899, ziwei.liu}@ntu.edu.sg
arXiv:2303.13495v1 [cs.CV] 23 Mar 2023

Figure 1: We propose a new task, Relation Inversion: given a few exemplar images, where a relation co-exists in every image, we aim to find a relation prompt <R> to capture this interaction, and apply the relation to new entities to synthesize new scenes. The above images are generated by our ReVersion framework.

Abstract

Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior": real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) the relation prompt should capture the interaction between objects, enforced by the preposition prior; 2) the relation prompt should be disentangled away from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations. Project page and code are available.

* indicates equal contribution.

1. Introduction

Recently, text-to-image (T2I) diffusion models [34, 33, 37] have shown promising results and enabled subsequent explorations of various generative tasks. There have been several attempts [9, 36, 25] to invert a pre-trained text-to-image model, obtaining a text embedding representation to capture the object in the reference images. While existing methods have made substantial progress in capturing object appearances, such exploration for relations is rare. Capturing object relations is intrinsically a harder task, as it requires an understanding of both the interactions between objects and the composition of an image, and existing inversion methods are unable to handle the task due to entity leakage from the reference images. Yet, this is an important direction that is worth our attention.

In this paper, we study the Relation Inversion task, whose objective is to learn a relation that co-exists in the given exemplar images. Specifically, with objects in each exemplar image following a specific relation, we aim to obtain a relation prompt in the text embedding space of the pre-trained text-to-image diffusion model. By composing the relation prompt with user-devised text prompts, users are able to synthesize images using the corresponding relation, with customized objects, styles, backgrounds, etc.

To better represent high-level relation concepts with the learnable prompt, we introduce a simple yet effective preposition prior. The preposition prior is based on a premise and two observations in the text embedding space. Specifically, we find that 1) prepositions are highly related to relations, 2) prepositions and words of other Parts-of-Speech are individually clustered in the text embedding space, and 3) complex real-world relations can be expressed with a basic set of prepositions. Our experiments show that this language-based prior can be effectively used as high-level guidance for the relation prompt optimization.

Based on our preposition prior, we propose the ReVersion framework to tackle the Relation Inversion problem. Notably, we design a novel relation-steering contrastive learning scheme to steer the relation prompt towards a relation-dense region in the text embedding space. A set of basis prepositions are used as positive samples to pull the embedding into the sparsely activated region, while words of other Parts-of-Speech (e.g., nouns, adjectives) in the text descriptions are regarded as negatives so that the semantics related to object appearances are disentangled away. To encourage attention on object interactions, we devise a relation-focal importance sampling strategy. It constrains the optimization process so that high-level interactions rather than low-level details are emphasized, effectively leading to better relation inversion results.

As the first attempt in this direction, we further contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. The benchmark serves as an evaluation tool for future research on the Relation Inversion task. Results on a variety of relations demonstrate the power of the preposition prior and our ReVersion framework.

Our contributions are summarized as follows:

• We study a new problem, Relation Inversion, which requires learning a relation prompt for a relation that co-exists in several exemplar images. While existing T2I inversion methods mainly focus on capturing appearances, we take the initiative to explore relation, an under-explored yet important pillar in the visual world.

• We propose the ReVersion framework, where the relation-steering contrastive learning scheme steers the relation prompt using our "preposition prior" and effectively disentangles the learned relation away from object appearances. Relation-focal importance sampling further emphasizes high-level relations over low-level details.

• We contribute the ReVersion Benchmark, which serves as a diagnostic and benchmarking tool for the new task of Relation Inversion.

2. Related Work

Diffusion Models. Diffusion models [15, 42, 44, 34, 12, 43] have become a mainstream approach for image synthesis [5, 6, 28] apart from GANs [10], and have shown success in various domains such as video generation [13, 48, 41, 17], image restoration [38, 16], and many more [3, 11, 1, 2]. In the diffusion-based approach, models are trained using score-matching objectives [20, 49] at various noise levels, and sampling is done via iterative denoising. Text-to-image (T2I) diffusion models [33, 34, 6, 12, 22, 30, 37] have demonstrated impressive results in converting a user-provided text description into images. Motivated by their success, we build our framework on a state-of-the-art T2I diffusion model, Stable Diffusion [34].

Relation Modeling. Relation modeling has been explored in discriminative tasks such as scene graph generation [51, 24, 40, 21, 52, 53] and visual relationship detection [27, 54, 55]. These works aim to detect visual relations between objects in given images and classify them into a predefined, closed set of relations. However, the finite relation category set intrinsically limits the diversity of captured relations. In contrast, Relation Inversion regards relation modeling as a generative task, aiming to capture arbitrary, open-world relations from exemplar images and apply the resulting relation for content creation.

Diffusion-Based Inversion. Given a pre-trained T2I diffusion model, inversion [9, 36, 25, 23] aims to find a text embedding vector to express the concepts in the given exemplar images. For example, given several images of a particular "cat statue", Textual Inversion [9] learns a new word to describe the appearance of this item, i.e., finding a vector in LDM [34]'s text embedding space so that the new word can be composed into new sentences to achieve personalized creation. Rather than inverting appearance information (e.g., color, texture), our proposed Relation Inversion task extracts high-level object relations from exemplar images, which is a harder problem as it requires comprehending image compositions and object relationships.
Figure 2: ReVersion Framework. Given exemplar images and their entities' coarse descriptions, our ReVersion framework optimizes the relation prompt <R> to capture the relation that co-exists in all the exemplar images. During optimization, the relation-focal importance sampling strategy encourages <R> to focus on high-level relations, and the relation-steering contrastive learning scheme induces the relation prompt <R> towards our preposition prior and away from entities or appearances. Upon optimization, <R> can be used as a word in new sentences to make novel entities interact via the relation in the exemplar images.

3. The Relation Inversion Task

Relation Inversion aims to extract the common relation <R> from several exemplar images. Let I = {I_1, I_2, ..., I_n} be a set of exemplar images, and E_{i,A} and E_{i,B} be two dominant entities in image I_i. In Relation Inversion, we assume that the entities in each exemplar image interact with each other through a common relation R. A set of coarse descriptions C = {c_1, c_2, ..., c_n} is associated with the exemplar images, where c_i = "E_{i,A} <R> E_{i,B}" denotes the caption corresponding to image I_i. Our objective is to optimize the relation prompt <R> such that the co-existing relation can be accurately represented by the optimized prompt.

An immediate application of Relation Inversion is relation-specific text-to-image synthesis. Once the prompt is acquired, one can generate images with novel objects interacting with each other following the specified relation. More generally, this task reveals a new direction of inferring relations from a set of exemplar images. This could potentially inspire future research in representation learning, few-shot learning, visual relation detection, scene graph generation, and many more.
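For illustration only (this is not part of the original paper), the relation prompt <R> can be represented as a new placeholder token in the text encoder's vocabulary, in the spirit of Textual Inversion [9]. The sketch below assumes the Hugging Face transformers CLIP classes used by Stable Diffusion v1.x; the example coarse descriptions are taken from Figure 2, and all variable names are ours.

```python
from transformers import CLIPTokenizer, CLIPTextModel

# Load the CLIP text encoder used by Stable Diffusion v1.x (assumed checkpoint).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register "<R>" as a new token; only its embedding will be optimized later.
tokenizer.add_tokens(["<R>"])
text_encoder.resize_token_embeddings(len(tokenizer))
relation_token_id = tokenizer.convert_tokens_to_ids("<R>")

# Coarse descriptions c_i = "E_{i,A} <R> E_{i,B}" for each exemplar image.
coarse_descriptions = [
    "woman <R> man",
    "dog <R> dog, with trees",
    "woman <R> man in blue",
]
token_ids = tokenizer(coarse_descriptions, padding=True,
                      return_tensors="pt").input_ids
```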
4. The ReVersion Framework

4.1. Preliminaries

Stable Diffusion. Diffusion models are a class of generative models that gradually denoise a Gaussian prior x_T to the data x_0 (e.g., a natural image). The commonly used training objective L_DM [15] is:

\mathcal{L}_{DM}(\theta) := \mathbb{E}_{t, x_0, \epsilon}\left[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\right], \quad (1)

where x_t is a noisy image constructed by adding noise ε ~ N(0, I) to the natural image x_0, and the network ε_θ(·) is trained to predict the added noise. To sample data x_0 from a trained diffusion model ε_θ(·), we iteratively denoise x_t from t = T to t = 0 using the predicted noise ε_θ(x_t, t) at each timestep t.

Latent Diffusion Model (LDM) [34], the predecessor of Stable Diffusion, mainly introduced two changes to the vanilla diffusion model [15]. First, instead of directly modeling the natural image distribution, LDM models images' projections in an autoencoder's compressed latent space. Second, LDM enables text-to-image generation by feeding encoded text input to the UNet [35] ε_θ(·). The LDM loss is:

\mathcal{L}_{LDM}(\theta) := \mathbb{E}_{t, x_0, \epsilon}\left[\,\|\epsilon - \epsilon_\theta(x_t, t, \tau_\theta(c))\|^2\,\right], \quad (2)

where x is the autoencoder latent for images, and τ_θ(·) is a BERT text encoder [4] that encodes the text description c.

Stable Diffusion extends LDM by training on the larger LAION dataset [39] and changing the trainable BERT text encoder to the pre-trained CLIP [32] text encoder.
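For readers who prefer code to notation, Eq. (2) can be sketched as below. This is a simplified PyTorch illustration that assumes diffusers-style components (a DDPM-like scheduler and a conditional UNet); it is not the released implementation.

```python
import torch
import torch.nn.functional as F

def ldm_loss(unet, text_encoder, scheduler, latents, token_ids):
    """Sketch of Eq. (2): predict the noise added to image latents, conditioned on text."""
    noise = torch.randn_like(latents)                         # epsilon ~ N(0, I)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)    # x_t
    text_embeds = text_encoder(token_ids)[0]                  # tau_theta(c)
    noise_pred = unet(noisy_latents, t,
                      encoder_hidden_states=text_embeds).sample
    return F.mse_loss(noise_pred, noise)                      # ||eps - eps_theta(x_t, t, tau(c))||^2
```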
Inversion on Text-to-Image Diffusion Models. Existing inversion methods focus on appearance inversion. Given several images that all contain a specific entity, they [9, 36, 25] find a text embedding V* for the pre-trained T2I model. The obtained V* can then be used to generate this entity in different scenarios.

In this work, we aim to capture object relations instead. Given several exemplar images which share a common relation R, we aim to find a relation prompt <R> to capture this relation, such that "E_A <R> E_B" can be used to generate an image where E_A and E_B interact via the relation <R>.
Figure 3: Part-of-Speech (POS) Clustering. We use t-SNE [47] to visualize the word distribution in CLIP's input embedding space, where <R> is optimized in our ReVersion framework. We observe that words of the same Part-of-Speech (POS) are closely clustered together, while words of different POS are generally at a distance from each other.

Figure 4: Sparse Activation. We visualize the cosine similarities between real-world relations and basis prepositional words, and observe that a relation is generally sparsely activated w.r.t. the basis prepositions. Note that each row of similarity scores is sparsely distributed, with few peak values (shown in red).
4.2. Preposition Prior

Appearance inversion focuses on inverting low-level features of a specific entity, thus the commonly used pixel-level reconstruction loss is sufficient to learn a prompt that captures the shared information in exemplar images. In contrast, relation is a high-level visual concept. A pixel-wise loss alone cannot accurately extract the target relation; some linguistic priors need to be introduced to represent relations.

In this section, we present the "preposition prior", a language-based prior that steers the relation prompt towards a relation-dense region in the text embedding space. This prior is motivated by a well-acknowledged premise and two interesting observations on natural language.

Premise: Prepositions describe relations. In natural language, prepositions are words that express the relation between elements in a sentence [19]. This language prior naturally leads us to use prepositional words to regularize our relation prompt.

Observation I: POS clustering. As shown in Figure 3, in the text embedding space of language models, embeddings are generally clustered according to their Part-of-Speech (POS) labels. This observation, together with the Premise, inspires us to steer our relation prompt <R> towards the preposition subspace (i.e., the red region in Figure 3).

Observation II: Sparse activation. As shown in Figure 4, the feature similarities between a real-world relation and the prepositional words are sparsely distributed, and the activated prepositions are usually related to this relation's semantic meaning. For example, for the relation "swinging", the sparsely activated prepositions are "underneath", "down", "beneath", "aboard", etc., which together collaboratively describe the "swinging" interaction. This pattern suggests that only a subset of prepositions should be activated during optimization, leading to our noise-robust design in Section 4.3.

Based on the aforementioned analysis, we hypothesize that a common visual relation can be generally expressed as a set of basis prepositions, with only a small subset of highly semantically-related prepositions activated. Motivated by this, we design a relation-steering contrastive learning scheme to steer the relation prompt <R> into a relation-dense region in the text embedding space.
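The sparse-activation pattern in Figure 4 can be probed with a few lines of code. The sketch below is our illustration rather than the paper's analysis script; it simply measures cosine similarity between a relation phrase and basis preposition embeddings in CLIP's input embedding space, averaging sub-token embeddings as a simplifying assumption.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
embedding_table = text_encoder.get_input_embeddings().weight  # CLIP input embedding space

def token_embedding(word):
    # Average the input embeddings of a word's sub-tokens (assumption for multi-token words).
    ids = tokenizer(word, add_special_tokens=False).input_ids
    return embedding_table[ids].mean(dim=0)

prepositions = ["underneath", "down", "beneath", "aboard", "on", "with", "above"]
relation = token_embedding("swinging")
with torch.no_grad():
    for p in prepositions:
        sim = torch.cosine_similarity(relation, token_embedding(p), dim=0)
        print(f"{p:>12s}: {sim.item():.3f}")  # only a few prepositions peak, as in Figure 4
```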
4.3. Relation-Steering Contrastive Learning

Recall that our goal is to acquire a relation prompt <R> that accurately captures the co-existing relation in the exemplar images. A basic objective is to reconstruct the exemplar images using <R>:

\langle R \rangle = \arg\min_{\langle r \rangle} \mathbb{E}_{t, x_0, \epsilon}\left[\,\|\epsilon - \epsilon_\theta(x_t, t, \tau_\theta(c))\|^2\,\right], \quad (3)

where ε ~ N(0, I), <R> is the optimized text embedding, and ε_θ(·) is a pre-trained text-to-image diffusion model whose weights are frozen throughout optimization. <r> is the relation prompt being optimized, and is fed into the pre-trained T2I model as part of the text description c.

However, as discussed in Section 4.2, this pixel-level reconstruction loss mainly focuses on low-level reconstruction rather than visual relations. Consequently, directly applying this loss could result in appearance leakage and hence unsatisfactory relation inversion.

Motivated by our Premise and Observation I, we adopt the preposition prior as an important guidance to steer the relation prompt towards the relation-dense text embedding subspace. Specifically, we can use the prepositions as positive samples and words of other POS (i.e., nouns, adjectives) as negative samples to construct a contrastive loss. Following InfoNCE [31], this preliminary contrastive loss is derived as:

\mathcal{L}_{pre} = -\log \frac{e^{R^\top \cdot P_i / \gamma}}{e^{R^\top \cdot P_i / \gamma} + \sum_{k=1}^{K} e^{R^\top \cdot N_i^k / \gamma}}, \quad (4)

where R is the relation embedding and γ is the temperature parameter. P_i (i.e., the positive sample) is a randomly sampled preposition embedding at the i-th optimization iteration, and N_i = {N_i^1, ..., N_i^K} (i.e., the negative samples) is a set of randomly sampled embeddings from other POS. All embeddings are normalized to unit length.

Since the relation prompt should also be disentangled away from object appearance, we further propose to select the object descriptions of the exemplar images as the improved negative set. In this way, our choice of negatives serves two purposes: 1) it provides POS guidance away from non-prepositional clusters, and 2) it prevents appearance leakage by including exemplar object descriptions in the negative set.

In addition, Observation II (sparse activation) implies that only a small set of prepositions should be considered as true positives. Therefore, we need a contrastive loss that is tolerant of noise in the positive set (i.e., not all prepositions should be activated). Inspired by [29], we revise Equation 4 to a noise-robust contrastive loss as our final Steering Loss:

\mathcal{L}_{steer} = -\log \frac{\sum_{l=1}^{L} e^{R^\top \cdot P_i^l / \gamma}}{\sum_{l=1}^{L} e^{R^\top \cdot P_i^l / \gamma} + \sum_{m=1}^{M} e^{R^\top \cdot N_i^m / \gamma}}, \quad (5)

where P_i = {P_i^1, ..., P_i^L} refers to positive samples randomly drawn from a set of basis prepositions (more details are provided in the Supplementary File), and N_i = {N_i^1, ..., N_i^M} refers to the improved negative samples.
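A minimal sketch of the Steering Loss in Eq. (5) is given below. It assumes unit-normalized embeddings and follows the noise-tolerant, MIL-NCE-style formulation of [29]; the batching and variable names are our own, not the paper's released code.

```python
import torch

def steering_loss(relation, positives, negatives, gamma=0.07):
    """
    relation:  (d,)   relation embedding <R>
    positives: (L, d) basis preposition embeddings sampled this iteration
    negatives: (M, d) embeddings of other-POS words and exemplar object descriptions
    All inputs are assumed to be L2-normalized.
    """
    pos_logits = positives @ relation / gamma   # R^T . P_i^l / gamma
    neg_logits = negatives @ relation / gamma   # R^T . N_i^m / gamma
    pos = torch.exp(pos_logits).sum()
    neg = torch.exp(neg_logits).sum()
    return -torch.log(pos / (pos + neg))        # Eq. (5)
```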
4.4. Relation-Focal Importance Sampling

In the sampling process of diffusion models, high-level semantics usually appear first, and fine details emerge at later stages [50, 18]. As our objective is to capture the relation (a high-level concept) in exemplar images, it is undesirable to emphasize low-level details during optimization. Therefore, we adopt an importance sampling strategy to encourage the learning of high-level relations. Specifically, unlike previous reconstruction objectives, which sample the timestep t from a uniform distribution, we skew the sampling distribution so that a higher probability is assigned to larger t. The Denoising Loss for relation-focal importance sampling becomes:

\mathcal{L}_{denoise} = \mathbb{E}_{t \sim f, x_0, \epsilon}\left[\,\|\epsilon - \epsilon_\theta(x_t, t, \tau_\theta(c))\|^2\,\right], \qquad f(t) = \frac{1}{T}\left(1 - \alpha \cos \frac{\pi t}{T}\right), \quad (6)

where f(t) is the importance sampling function, which characterizes the probability density function from which t is sampled. The skewness of f(t) increases with α ∈ (0, 1]. We set α = 0.5 throughout our experiments. The overall optimization objective of the ReVersion framework is written as:

\langle R \rangle = \arg\min_{\langle r \rangle} \left( \lambda_{steer} \mathcal{L}_{steer} + \lambda_{denoise} \mathcal{L}_{denoise} \right), \quad (7)

where λ_steer and λ_denoise are the weighting factors.
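One simple way to draw timesteps from the skewed density f(t) = (1/T)(1 − α cos(πt/T)) in Eq. (6) is categorical sampling over the discrete timesteps; the helper below is our sketch, with hypothetical names.

```python
import math
import torch

def sample_timesteps(batch_size, T=1000, alpha=0.5, device="cpu"):
    """Draw t ~ f(t) = (1/T) * (1 - alpha * cos(pi * t / T)), favoring larger (noisier) t."""
    t = torch.arange(T, device=device, dtype=torch.float32)
    density = (1.0 / T) * (1.0 - alpha * torch.cos(math.pi * t / T))
    probs = density / density.sum()   # normalize over discrete timesteps
    return torch.multinomial(probs, batch_size, replacement=True)
```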
5. The ReVersion Benchmark

To facilitate fair comparison for Relation Inversion, we present the ReVersion Benchmark. It consists of diverse relations and entities, along with a set of well-defined text descriptions. This benchmark can be used for conducting qualitative and quantitative evaluations.

Relations and Entities. We define ten representative object relations with different abstraction levels, ranging from basic spatial relations (e.g., "on top of") and entity interactions (e.g., "shakes hands with") to abstract concepts (e.g., "is carved by"). A wide range of entities, such as animals, humans, and household items, are involved to further increase the diversity of the benchmark.

Exemplar Images and Text Descriptions. For each relation, we collect four to ten exemplar images containing different entities. We further annotate several text templates for each exemplar image to describe it at different levels of detail.¹ These training templates can be used for the optimization of the relation prompt.

Benchmark Scenarios. To validate the robustness of relation inversion methods, we design 100 inference templates composed of different object entities for each of the ten relations. This provides a total of 1,000 inference templates for performance evaluation.

¹For example, a photo of a cat sitting on a box could be annotated as 1) "cat <R> box", 2) "an orange cat <R> a black box", and 3) "an orange cat <R> a black box, with trees in the background".

6. Experiments

We present qualitative and quantitative results in this section; more experiments and analysis are in the Supplementary File. We adopt Stable Diffusion [34] for all experiments since it achieves a good balance between quality and speed. We generate images at 512 × 512 resolution.
Figure 5: Qualitative Results. Our ReVersion framework successfully captures the relation that co-exists in the exemplar images, and applies the extracted relation prompt <R> to compose novel entities.

6.1. Comparison Methods

Text-to-Image Generation using Stable Diffusion [34]. We use the original Stable Diffusion 1.5 as the text-to-image generation baseline. Since there is no ground-truth textual description for the relation in each set of exemplar images, we use the natural language that best describes the relation to replace the <R> token. For example, in Figure 6 (a), the co-existing relation in the reference images can be roughly described as "is painted on". Thus we use it to replace the <R> token in the inference template "Spiderman <R> building", resulting in the sentence "Spiderman is painted on building", which is then used as the text prompt for generation.

Textual Inversion [9]. For fair comparison with our method developed on Stable Diffusion 1.5, we use the diffusers [8] implementation of Textual Inversion [9] on Stable Diffusion 1.5. Based on the default hyper-parameter settings, we tuned the learning rate and batch size for its optimal performance on our Relation Inversion task. We use Textual Inversion's LDM objective to optimize <R> for 3,000 iterations, and generate images using the obtained <R>.

6.2. Qualitative Comparisons

In Figure 5, we provide the generation results using the <R> inverted by ReVersion. We observe that our framework is capable of 1) synthesizing the entities in the inference template and 2) ensuring that the entities follow the relation co-existing in the exemplar images.
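To make the inference setup concrete, the snippet below sketches how a learned <R> embedding could be injected into Stable Diffusion 1.5 for generation. It relies on the diffusers StableDiffusionPipeline plus manual embedding surgery; the checkpoint name and the saved-embedding file are assumptions rather than released artifacts.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Register the <R> token and copy in the learned embedding (assumed to be a 768-d tensor).
pipe.tokenizer.add_tokens(["<R>"])
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids("<R>")
learned = torch.load("relation_prompt.pt")  # hypothetical path to the optimized <R> embedding
with torch.no_grad():
    pipe.text_encoder.get_input_embeddings().weight[token_id] = learned

image = pipe("Spiderman <R> building", guidance_scale=7.5).images[0]
image.save("spiderman_R_building.png")
```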
Figure 6: Qualitative Comparisons with Existing Methods. Our method significantly surpasses both baselines in terms of relation accuracy and entity accuracy.

We then compare our method with 1) text-to-image generation via Stable Diffusion [34] and 2) Textual Inversion [9] in Figure 6. For example, in the first row, although the text-to-image baseline successfully generates both entities (Spiderman and building), it fails to paint Spiderman on the building as the exemplar images do. Text-to-image generation severely relies on the bias between the two entities: Spiderman usually climbs/jumps on buildings, instead of being painted onto them. Using exemplar images and our ReVersion framework alleviates this problem. In Textual Inversion, entities in the exemplar images, such as the canvas, are leaked into <R>, so the generated image shows a Spiderman on a canvas even when the word "canvas" is not in the inference prompt.

6.3. Quantitative Comparisons

We conduct a user study with 37 human evaluators to assess the performance of our ReVersion framework on the Relation Inversion task. We sampled 20 groups of images, each containing three images generated by different methods. For each group, apart from the generated images, the following information is presented: 1) exemplar images of a particular relation, and 2) the text description of the exemplar images. We then ask the evaluators to vote for the best generated image with respect to the following metrics.

Entity Accuracy. Given an inference template in the form of "Entity A <R> Entity B", we ask evaluators to determine whether Entity A and Entity B are both authentically generated in each image.

Relation Accuracy. Human evaluators are asked to evaluate whether the relations of the two entities in the generated image are consistent with the relation co-existing in the exemplar images. As shown in Table 1, our method clearly obtains better results under the two quality metrics.

Table 1: Quantitative Results. Percentage of votes where users favor our results vs. comparison methods. Our method outperforms the baselines under both metrics.

Method                      Relation    Entity
Text-to-Image Generation    7.86%       15.49%
Textual Inversion           8.94%       10.05%
Ours                        83.20%      74.46%

Table 2: Ablation Study. Suppressing steering or importance sampling introduces performance drops, which shows the necessity of both relation-steering and importance sampling.

Method                      Relation    Entity
w/o Steering                11.20%      10.90%
w/o Importance Sampling     11.20%      13.62%
Ours                        77.60%      75.48%
Figure 7: Qualitative Comparisons with Ablation Variants. Without relation-steering, <R> suffers from appearance leak (e.g., white puppy in (a), gray background in (b)) and inaccurate relation capture (e.g., dog not being on top of plate in (b)). Without importance sampling, <R> focuses on lower-level visual details (e.g., rattan around puppy in (a)) and misses high-level relations.

6.4. Ablation Study

From Table A4, we observe that removing steering or importance sampling results in deterioration in both relation accuracy and entity accuracy. This corroborates our observations that 1) relation-steering effectively guides <R> towards the relation-dense "preposition prior" and disentangles <R> away from exemplar entities, and 2) importance sampling emphasizes high-level relations over low-level details, helping <R> to be relation-focal. We further show the necessity of both modules qualitatively in Figure 7.

Effectiveness of Relation-Steering. In "w/o Steering", we remove the Steering Loss L_steer from the optimization process. As shown in Figure 7 (a), the appearance of the white puppy in the lower-left exemplar image is leaked into <R>, resulting in similar puppies in the generated images. In Figure 7 (b), many appearance elements are leaked into <R>, such as the gray background, the black cube, and the husky dog. The dog and the plate also do not follow the relation of "being on top of" shown in the exemplar images. Consequently, the images generated via <R> do not present the correct relation and introduce unwanted leaked imagery.

Effectiveness of Importance Sampling. We replace our relation-focal importance sampling with uniform sampling, and observe that <R> pays too much attention to low-level details rather than high-level relations. For instance, in Figure 7 (a) "w/o Importance Sampling", the basket rattan wraps around the puppy's head in the same way as in the exemplar image, instead of containing the puppy inside.

7. Conclusion

In this work, we take the first step forward and propose the Relation Inversion task, which aims to learn a relation prompt to capture the relation that co-exists in multiple exemplar images. Motivated by the preposition prior, our relation-steering contrastive learning scheme effectively guides the relation prompt towards relation-dense regions in the text embedding space. We also contribute the ReVersion Benchmark for performance evaluation. Our proposed Relation Inversion task could be a good inspiration for future works in various domains such as generative model inversion, representation learning, few-shot learning, visual relation detection, and scene graph generation.

Limitations. Our performance is dependent on the generative capabilities of Stable Diffusion. It might produce sub-optimal synthesis results for entities that Stable Diffusion struggles at, such as the human body and human face.

Potential Negative Societal Impacts. The entity relational composition capabilities of ReVersion could be applied maliciously to real human figures.
References

[1] Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior Wolf. SegDiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.
[2] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In NeurIPS, 2021.
[3] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. In ICLR, 2022.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
[6] Patrick Esser, Robin Rombach, Andreas Blattmann, and Bjorn Ommer. ImageBART: Bidirectional context with multinomial diffusion for autoregressive image synthesis. In NeurIPS, 2021.
[7] Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. In NeurIPS Workshop, 2020.
[8] Hugging Face. Diffusers. https://huggingface.co/docs/diffusers/index.
[9] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[10] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[11] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. In NeurIPS, 2022.
[12] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022.
[13] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv preprint arXiv:2205.11495, 2022.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[16] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. JMLR, 2022.
[17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
[18] Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, and Ziwei Liu. Collaborative diffusion for multi-modal face generation and editing. In CVPR, 2023.
[19] Rodney Huddleston and Geoffrey K. Pullum. The Cambridge Grammar of the English Language. Cambridge University Press, 2002.
[20] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. JMLR, 2005.
[21] Jingwei Ji, Ranjay Krishna, Fei-Fei Li, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In CVPR, pages 10236–10247, 2020.
[22] Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2Human: Text-driven controllable human image generation. ACM TOG, 2022.
[23] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
[24] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[25] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. arXiv preprint arXiv:2212.04488, 2022.
[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[27] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Fei-Fei Li. Visual relationship detection with language priors. In ECCV, 2016.
[28] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
[29] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, pages 9879–9889, 2020.
[30] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[31] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[36] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[37] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[38] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE TPAMI, 2022.
[39] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[40] Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. Video visual relation detection. In ACM MM, 2017.
[41] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[42] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[43] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
[44] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
[45] Angus Stevenson. Oxford Dictionary of English. Oxford University Press, USA, 2010.
[46] Patrick Tinsley, Adam Czajka, and Patrick Flynn. This face does not exist... but it might be yours! Identity leakage in generative models. In WACV, 2021.
[47] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9(11), 2008.
[48] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
[49] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 2011.
[50] Binxu Wang and John J. Vastola. Diffusion models generate images like painters: An analytical theory of outline first, details later. arXiv preprint arXiv:2303.02490, 2023.
[51] Danfei Xu, Yuke Zhu, Christopher B Choy, and Fei-Fei Li. Scene graph generation by iterative message passing. In CVPR, 2017.
[52] Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. Panoptic scene graph generation. In ECCV, pages 178–196. Springer, 2022.
[53] Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, and Ziwei Liu. Panoptic video scene graph generation. In CVPR, 2023.
[54] Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV, 2017.
[55] Bohan Zhuang, Lingqiao Liu, Chunhua Shen, and Ian Reid. Towards context-aware interaction recognition for visual relationship detection. In ICCV, 2017.
Supplementary
In this supplementary file, we provide more experimental comparisons in Section A, and elaborate on the ReVersion
benchmark details in Section B. We then provide further explanations on basis prepositions in Section C, and the implemen-
tation details of our ReVersion framework in Section D. The potential societal impacts of our work are discussed in Section E.
At the end of the supplementary file, we show various qualitative results of ReVersion in Section F.

A. More Experimental Comparisons


In this section, we provide more experimental comparison results and analysis. For each method in comparison, we use
the 1,000 inference templates in the ReVersion benchmark, and generate 10 images using each template.

A.1. Relation Accuracy Score


We devise an objective evaluation metric to measure the quality and accuracy of the inverted relation. To do this, we train relation classifiers that categorize the ten relations in our ReVersion benchmark. We then use these classifiers to determine whether the entities in the generated images follow the specified relation. We employ PSGFormer [52], a pre-trained scene-graph generation network, to extract relation feature vectors from a given image. The feature vectors are average-pooled and fed into linear SVMs for classification.
We calculate the Relation Accuracy Score as the percentage of generated images that follow the relation class in the exemplar images. Table A3 shows that our method outperforms text-to-image generation [34] and Textual Inversion [9]. Additionally, Table A4 reveals that removing the steering scheme or the importance sampling scheme results in a performance drop in relation accuracy.
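As an illustration of the classifier stage only (the PSGFormer feature extraction is omitted, and the file paths below are hypothetical placeholders), the linear SVMs can be fit with scikit-learn:

```python
import numpy as np
from sklearn.svm import LinearSVC

# X: average-pooled relation features extracted by PSGFormer (assumed precomputed),
# y: relation class labels in {0, ..., 9} for the ten benchmark relations.
X_train = np.load("psg_features_train.npy")      # shape (N, d)
y_train = np.load("relation_labels_train.npy")   # shape (N,)

clf = LinearSVC()
clf.fit(X_train, y_train)

# Relation Accuracy Score: fraction of generated images predicted as the target relation.
X_gen = np.load("psg_features_generated.npy")
target_relation = 3                               # e.g., index of "sits back to back with"
relation_accuracy = (clf.predict(X_gen) == target_relation).mean()
```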

A.2. Entity Accuracy Score


To evaluate whether the generated image contains the entities specified by the text prompt, we compute the CLIP score [32] between a revised text prompt and the generated image, which we refer to as the Entity Accuracy Score.
CLIP [32] is a vision-language model that has been trained on large-scale datasets. It uses an image encoder and a text encoder to project images and text into a common feature space. The CLIP score is calculated as the cosine similarity between the normalized image and text embeddings. A higher score usually indicates greater consistency between the output image and the text prompt. In our approach, we calculate the CLIP score between the generated image and the revised text prompt "E_A, E_B", which only includes the entity information.
In Table A3, we observe that our method outperforms Textual Inversion in terms of entity accuracy. This is because the <R> learned by Textual Inversion might contain leaked entity information, which might distract the model from generating the desired E_A and E_B. Our steering loss effectively prevents entity information from leaking into <R>, allowing for accurate entity synthesis. Furthermore, our approach achieves a comparable entity accuracy score to text-to-image generation using Stable Diffusion [34], and significantly surpasses it in terms of relation accuracy. Table A4 shows that removing the steering or importance sampling scheme results in a drop in entity accuracy.
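The Entity Accuracy Score can be computed with the public CLIP model; the following sketch uses the transformers CLIP classes and is our illustration of the described metric, not the authors' evaluation code. The image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_accuracy_score(image_path, entity_a, entity_b):
    """Cosine similarity between the generated image and the entity-only prompt 'E_A, E_B'."""
    image = Image.open(image_path)
    inputs = processor(text=[f"{entity_a}, {entity_b}"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

score = entity_accuracy_score("generated.png", "Spiderman", "building")
```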

Table A3: Additional Quantitative Results (Baselines).

Method                          Relation Accuracy Score (%) ↑    Entity Accuracy Score (%) ↑
Text-to-Image Generation [34]   35.16                            28.96
Textual Inversion [9]           37.85                            26.79
Ours                            38.17                            28.20

Table A4: Additional Quantitative Results (Ablation Study).

Method                          Relation Accuracy Score (%) ↑    Entity Accuracy Score (%) ↑
Ours w/o Steering               37.48                            27.66
Ours w/o Importance Sampling    34.64                            27.90
Ours                            38.17                            28.20
Figure A8: Qualitative Comparison with DreamBooth [36]. In (a), the Spiderman generated by DreamBooth is mostly climbing on the building rather than being a painting. In (b), DreamBooth fails to capture the "sits back to back with" relation. In (c), while DreamBooth successfully captures the relation, the appearance of the basket from the exemplar images is severely leaked into the generated images via <R>.

A.3. Comparison with Fine-Tuning Based Method


We further compare our method with a fine-tuning based method, DreamBooth [36]. To adapt DreamBooth to our relation inversion task, we follow the original implementation and design a text prompt "A photo of <R> relation" containing the unique identifier "<R>" to fine-tune the model. The class-specific prior preservation loss is also added with a text prompt "A photo of relation" to avoid overfitting and language drift. As shown in Figure A8, directly using DreamBooth on our task can result in poor object relations and appearance leakage. For example, in Figure A8 (a) the Spiderman is mostly climbing on the building rather than being a painting, while in Figure A8 (c) the appearance of the basket in the exemplar images is severely leaked into the DreamBooth-generated images.

B. ReVersion Benchmark Details


In this section, we provide the details of our ReVersion Benchmark. The full benchmark will be publicly available.
B.1. Relations
To benchmark the Relation Inversion task, we define ten diverse and representative object relations as follows:
• E_A is painted on (the surface of) E_B
• E_A is carved by / is made of the material of E_B
• E_A shakes hands with E_B
• E_A hugs E_B
• E_A sits back to back with E_B
• E_A is contained inside E_B
• E_A on / is on top of E_B
• E_A is hanging from E_B
• E_A is wrapped in E_B
• E_A rides (on) E_B

where E_A and E_B are the two entities that follow the specified relation. It is worth mentioning that the relations are best described by the exemplar images, and the text descriptions provided above are only approximate summaries of the true relations.

B.2. Exemplar Images


A wide range of entities, such as animals, humans, and household items, are involved to further increase the diversity of the benchmark. In Figure A9, we show the exemplar images and text descriptions for the relation "E_A sits back to back with E_B". The exemplar images contain both human figures and animals to emphasize the invariant "back to back" relation in different scenarios.

B.3. Text Descriptions


As shown in Figure A9, the text descriptions for each image contain several levels, from short sentences that only mention the class names to complex and comprehensive sentences that describe each entity and the scene background. The <R> in each description will be replaced by the learnable relation prompt during optimization.

B.4. Inference Templates


To evaluate the performance of relation inversion methods, we devise 100 inference templates for each relation. The inference templates contain diverse entity combinations to test the robustness and generalizability of the inverted relation <R>. To quantitatively evaluate relation inversion performance, we use each inference template to synthesize 10 images, resulting in a total of 1,000 synthesized images for each inverted <R>.
Below, we show the 100 inference templates for the relation "E_A sits back to back with E_B" (a programmatic sketch of how these templates are composed follows the list):
• man <R> man, man <R> woman, man <R> child, man <R> cat, man <R> rabbit, man <R> monkey, man <R> dog, man <R> hamster, man <R> kangaroo, man <R> panda,
• woman <R> man, woman <R> woman, woman <R> child, woman <R> cat, woman <R> rabbit, woman <R> monkey, woman <R> dog, woman <R> hamster, woman <R> kangaroo, woman <R> panda,
• child <R> man, child <R> woman, child <R> child, child <R> cat, child <R> rabbit, child <R> monkey, child <R> dog, child <R> hamster, child <R> kangaroo, child <R> panda,
• cat <R> man, cat <R> woman, cat <R> child, cat <R> cat, cat <R> rabbit, cat <R> monkey, cat <R> dog, cat <R> hamster, cat <R> kangaroo, cat <R> panda,
• rabbit <R> man, rabbit <R> woman, rabbit <R> child, rabbit <R> cat, rabbit <R> rabbit, rabbit <R> monkey, rabbit <R> dog, rabbit <R> hamster, rabbit <R> kangaroo, rabbit <R> panda,
• monkey <R> man, monkey <R> woman, monkey <R> child, monkey <R> cat, monkey <R> rabbit, monkey <R> monkey, monkey <R> dog, monkey <R> hamster, monkey <R> kangaroo, monkey <R> panda,
• dog <R> man, dog <R> woman, dog <R> child, dog <R> cat, dog <R> rabbit, dog <R> monkey, dog <R> dog, dog <R> hamster, dog <R> kangaroo, dog <R> panda,
• hamster <R> man, hamster <R> woman, hamster <R> child, hamster <R> cat, hamster <R> rabbit, hamster <R> monkey, hamster <R> dog, hamster <R> hamster, hamster <R> kangaroo, hamster <R> panda,
• kangaroo <R> man, kangaroo <R> woman, kangaroo <R> child, kangaroo <R> cat, kangaroo <R> rabbit, kangaroo <R> monkey, kangaroo <R> dog, kangaroo <R> hamster, kangaroo <R> kangaroo, kangaroo <R> panda,
• panda <R> man, panda <R> woman, panda <R> child, panda <R> cat, panda <R> rabbit, panda <R> monkey, panda <R> dog, panda <R> hamster, panda <R> kangaroo, panda <R> panda
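The 100 templates above are simply the cross product of ten subject entities and ten object entities; the short snippet below reproduces them (it is illustrative only, not part of the released benchmark code):

```python
entities = ["man", "woman", "child", "cat", "rabbit",
            "monkey", "dog", "hamster", "kangaroo", "panda"]
templates = [f"{a} <R> {b}" for a in entities for b in entities]
assert len(templates) == 100
```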
Benchmark Sample
"girl <R> boy",
"a girl with in white and green <R> a boy in white and light grey",
"a girl wearing white T-shirt and green skirt <R> a boy in white T-shirt and grey shorts, white background"

"cat <R> cat",


"a long haired cat <R> a long haired cat",
"a dark long haired cat <R> a grey long haired cat, white background",

"woman <R> man",


"a woman wearing in white trousers and blue shirt <R> a man in grey",
"a woman wearing in white trousers and blue shirt <R> a man in khaki trousers and light grey shirt, white background"

"girl <R> boy",


"a girl with in pink top and jeans <R> a boy with striped t-shirt and jeans",
"a girl with in pink top and jeans <R> a boy with striped t-shirt and jeans, grey sofa in background"

"boy <R> boy",


"a boy with shirt and trousers <R> another boy with shirt and trousers",
"a boy with shirt and trousers <R> another boy with shirt and trousers, white background",

"bear <R> bear",


"a bear <R> a bear in wooded area",
"a bear <R> a bear, bush in background"

"girl <R> boy",


"a young girl in purple dress <R> a young boy in white",
"a young girl in purple dress <R> a young boy in white, in the field"

"girl <R> boy",


"a teenager girl <R> a teenager boy, white background",
"a teenager girl wearing red shirt and jeans <R> a teenager boy in blue shirt and khaki trousers, white background"

"cat <R> cat",


"an orange cat <R> a brown and white cat",
"an orange cat <R> a brown and white cat on a wooden bench, grasses in background"

"boy <R> boy",


"a boy with shirt and jeans <R> a boy in shirt and jeans",
"a boy wearing shirt and jeans <R> another boy wearing shirt and jeans, white background"

Figure A9: Benchmark Sample. We present exemplar images and text descriptions that illustrate the relation where "E_A sits back to back with E_B". The exemplar images feature both human figures and animals to demonstrate the invariant "back to back" relationship in various scenarios. The text descriptions are provided at several levels, ranging from simple class name mentions to detailed descriptions of the entities and their surroundings. During optimization, the <R> in each description will be replaced with the learnable relation prompt.

C. Further Explanations on Basis Prepositions


As stated in the manuscript, we devise a set of basis prepositions to steer the learning process of the relation prompt. Specifically, we collect a comprehensive list of ~100 prepositions from [45], and drop the prepositions that describe non-visual relations (i.e., temporal relations, causal relations, etc.), while keeping the ones that are related to visual relations. For example, the prepositional word "until" is discarded as a temporal preposition, while words like "above", "beneath", and "toward" are kept as plausible basis prepositions.
The basis preposition set contains a total of 56 words, listed as follows:

aboard, about, above, across, after, against, along, alongside, amid, amidst, among, amongst, anti, around, astride, at, atop, before, behind, below, beneath, beside, between, beyond, by, down, following, from, in, including, inside, into, near, of, off, on, onto, opposite, out, outside, over, past, regarding, round, through, throughout, to, toward, towards, under, underneath, up, upon, versus, with, within.

D. Implementation Details
All experiments are conducted at 512×512 image resolution. To ensure that the terms λ_denoise·L_denoise and λ_steer·L_steer are of comparable magnitude, we set λ_denoise = 1.0 and λ_steer = 0.01. The temperature parameter γ in the steering loss L_steer is set to 0.07, following [14]. During the optimization process, we first initialize our relation prompt <R> using the word "and", then optimize the prompt with the AdamW [26] optimizer for 3,000 steps, with learning rate 2.5×10^-4 and batch size 2. In each iteration, 8 positive samples are randomly selected from the basis preposition set. During inference, we use classifier-free guidance for all experiments, including the baselines and ablation variants, with a constant guidance weight of 7.5.
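Putting the pieces together, one optimization step with the stated hyper-parameters might look roughly as follows. This is a hedged sketch under our own assumptions about the loop structure: sample_timesteps and steering_loss refer to the illustrative sketches given earlier in this document, the diffusion components are assumed to be diffusers-style modules, and everything except the <R> embedding is frozen.

```python
import torch
import torch.nn.functional as F

def reversion_step(relation_embedding, latents, token_ids, unet, text_encoder, scheduler,
                   positives, negatives, optimizer,
                   lambda_denoise=1.0, lambda_steer=0.01, gamma=0.07, alpha=0.5):
    """One step of Eq. (7): relation-focal denoising loss (Eq. 6) plus steering loss (Eq. 5)."""
    t = sample_timesteps(latents.shape[0], alpha=alpha, device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)          # x_t in latent space
    text_embeds = text_encoder(token_ids)[0]                        # embedding table patched with <R>
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeds).sample
    l_denoise = F.mse_loss(noise_pred, noise)
    l_steer = steering_loss(relation_embedding, positives, negatives, gamma=gamma)
    loss = lambda_denoise * l_denoise + lambda_steer * l_steer      # Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example optimizer over only the <R> embedding, as described above:
# optimizer = torch.optim.AdamW([relation_embedding], lr=2.5e-4)
```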
E. Potential Societal Impacts
Although ReVersion can generate diverse entity combinations through inverted relations, this capability can also be ex-
ploited to synthesize real human figures interacting in ways they never did. As a result, we strongly advise users to only use
ReVersion for proper recreational purposes.
The rapid advancement of generative models has unlocked new levels of creativity but has also introduced various societal
concerns. First, it is easier to create false imagery or manipulate data maliciously, leading to the spread of misinformation.
Second, data used to train these models might be revealed during the sampling process without explicit consent from the data
owner [46]. Third, generative models can suffer from the biases present in the training data [7]. We used the pre-trained
Stable Diffusion [34] for ReVersion, which has been shown to suffer from data bias in certain scenarios. For example, when
prompted with the phrase “a professor”, Stable Diffusion tends to generate human figures that are white-passing and male-
passing. We hope that more research will be conducted to address the risks and biases associated with generative models,
and we advise everyone to use these models with discretion.

F. More Qualitative Results


We show various qualitative results in Figures A10-A16, which are located at the end of this Supplementary File.

F.1. ReVersion with Diverse Styles and Backgrounds


As shown in Figure A10, we apply the <R> inverted by ReVersion in scenarios with diverse backgrounds and styles, and show that <R> robustly adapts to these environments with impressive results.

F.2. ReVersion with Arbitrary Entity Combinations


In Figures A11 and A12, we show that the <R> inverted by ReVersion can be applied to robustly relate arbitrary entity combinations. For example, in Figure A11, for the <R> extracted from the exemplar images where one entity is "painted on" the other entity, we enumerate over all combinations among "{cat / flower / guitar / hamburger / Michael Jackson / Spiderman} <R> {building / canvas / paper / vase / wall}", and observe that <R> successfully links these entities together via exactly the same relation as in the exemplar images.

F.3. Additional Qualitative Results


We show additional qualitative results of ReVersion in Figures A13, A14, A15, and A16.
Figure A10: ReVersion for Diverse Styles and Backgrounds. The <R> inverted by ReVersion can be applied robustly to relate entities in scenes with diverse backgrounds or styles (e.g., "cat <R> cat, in urban cityscape", "monkey <R> monkey, in sketch style", "otter <R> otter, in pixel art style").
Figure A11: Arbitrary Entity Combinations. The hRi inverted by ReVersion can be robustly applied to arbitrary entity combinations. For example, for the hRi extracted from the exemplar images where one entity is “painted on” the other entity, we enumerate over all combinations among “{cat / flower / guitar / hamburger / Michael Jackson / Spiderman} hRi {building / canvas / paper / vase / wall}”, and observe that hRi successfully links these entities together via exactly the same relation as in the exemplar images.
Figure A12: Arbitrary Entity Combinations. The hRi inverted by ReVersion can be applied to arbitrary entity combinations. For example, for the hRi extracted from the exemplar images where one entity “is made of the material of / is carved from” the other entity, we enumerate over all combinations among “{cat / swan / horse / lion / rose / rabbit} hRi {apple / carrot / clay / glass / jade / marble / metal / wood}”, and observe that hRi successfully links these entities together via exactly the same relation as in the exemplar images.
Figure A13: More Qualitative Results. (Generated results for prompts such as “cat hRi bicycle”, “panda hRi motorbike”, “cat hRi child”, “kangaroo hRi woman”, and “Mickey Mouse hRi Mickey Mouse”, using hRi prompts inverted from two sets of exemplar images.)


Figure A14: More Qualitative Results. (Generated results for prompts such as “monkey hRi monkey”, “Spiderman hRi Spiderman”, “ocean hRi pot”, “swimming pool hRi basket”, and “garden hRi cup”, using hRi prompts inverted from two sets of exemplar images.)


Figure A15: More Qualitative Results. (Generated results for prompts such as “cat hRi basket”, “dog hRi paper bag”, “hamster hRi cup”, “panda hRi pot”, and “rabbit hRi vase”.)


Figure A16: More Qualitative Results. (Generated results for prompts such as “bag hRi ceiling”, “lamp hRi bridge”, “schoolbag hRi handrail”, “cat hRi child”, “rabbit hRi child”, and “otter hRi otter”, using hRi prompts inverted from two sets of exemplar images.)
