
TokenFlow Arxiv

The document presents TokenFlow, a framework that enhances text-driven video editing by leveraging a pre-trained text-to-image diffusion model to generate high-quality videos while maintaining the original spatial layout and motion. This method ensures temporal consistency across edited frames by explicitly propagating diffusion features based on inter-frame correspondences, without requiring additional training. The authors demonstrate state-of-the-art editing results on various real-world videos, addressing limitations of existing video editing methods.


TOKENFLOW: CONSISTENT DIFFUSION FEATURES
FOR CONSISTENT VIDEO EDITING

Michal Geyer∗ Omer Bar-Tal∗ Shai Bagon Tali Dekel


Weizmann Institute of Science
*Indicates equal contribution.
Project webpage: https://diffusion-tokenflow.github.io
[Figure 1 panels. Rows: Input, Edit. Prompts shown: “A robot spinning a shiny silver ball”, “A cheetah in the Sahara desert”, “A Van Gogh portrait”, “A wolf in Machu Pichu”.]
Figure 1: TokenFlow enables consistent, high-quality semantic edits of real-world videos. Given
an input video (top row), our method edits it according to a target text prompt (middle and bottom
rows), while preserving the semantic layout and motion in the original scene.

ABSTRACT

The generative AI revolution has recently expanded to videos. Nevertheless, current
state-of-the-art video models are still lagging behind image models in terms
of visual quality and user control over the generated content. In this work, we
present a framework that harnesses the power of a text-to-image diffusion model
for the task of text-driven video editing. Specifically, given a source video and
a target text-prompt, our method generates a high-quality video that adheres to
the target text, while preserving the spatial layout and motion of the input video.
Our method is based on a key observation that consistency in the edited video can
be obtained by enforcing consistency in the diffusion feature space. We achieve
this by explicitly propagating diffusion features based on inter-frame correspon-
dences, readily available in the model. Thus, our framework does not require any
training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-
image editing method. We demonstrate state-of-the-art editing results on a variety
of real-world videos.

1 INTRODUCTION
The evolution of text-to-image models has recently facilitated advances in image editing and content
creation, allowing users to control various properties of both generated and real images. Nevertheless,
extending this progress to video still lags behind. A surge of large-scale
text-to-video generative models has emerged, demonstrating impressive results in generating clips
solely from textual descriptions. However, despite the progress made in this area, existing video
models are still in their infancy, being limited in resolution, video length, or the complexity of video
dynamics they can represent. In this paper, we harness the power of a state-of-the-art pre-trained
text-to-image model for the task of text-driven editing of natural videos. Specifically, our goal is to
generate high-quality videos that adhere to the target edit expressed by an input text prompt, while
preserving the spatial layout and motion of the original video. The main challenge in leveraging an
image diffusion model for video editing is to ensure that the edited content is consistent across all
video frames – ideally, each physical point in the 3D world undergoes coherent modifications across
time. Existing and concurrent video editing methods that are based on image diffusion models have
demonstrated that global appearance coherency across the edited frames can be achieved by extending
the self-attention module to include multiple frames (Wu et al., 2022; Khachatryan et al., 2023b;
Ceylan et al., 2023; Qi et al., 2023). Nevertheless, this approach is insufficient for achieving the
desired level of temporal consistency, as motion in the video is only implicitly preserved through the
attention module. Consequently, professional or semi-professional users often resort to elaborate
video editing pipelines that entail additional manual work. In this work, we propose a framework to
tackle this challenge by explicitly enforcing the original inter-frame correspondences on the edit.
Intuitively, natural videos contain redundant information across frames, e.g., depict similar appear-
ance and shared visual elements. Our key observation is that the internal representation of the video
in the diffusion model exhibits similar properties. That is, the level of redundancy and temporal
consistency of the frames in the RGB space and in the diffusion feature space are tightly correlated.
Based on this observation, the pillar of our approach is to achieve a consistent edit by ensuring that
the features of the edited video are consistent across frames. Specifically, we enforce that the edited
features convey the same inter-frame correspondences and redundancy as the original video features.
To do so, we leverage the original inter-frame feature correspondences, which are readily available
in the model. This leads to an effective method that directly propagates the edited diffusion features
based on the original video dynamics. This approach allows us to harness the generative prior of
a state-of-the-art image diffusion model without additional training or fine-tuning, and can work in
conjunction with an off-the-shelf diffusion-based image editing method (e.g., Meng et al. (2022);
Hertz et al. (2022); Zhang & Agrawala (2023); Tumanyan et al. (2023)).
To summarize, we make the following key contributions:
• A technique, dubbed TokenFlow, that enforces semantic correspondences of diffusion features
across frames, significantly increasing temporal consistency in videos generated
by a text-to-image diffusion model.
• Novel empirical analysis studying the properties of diffusion features across a video.
• State-of-the-art editing results on diverse videos, depicting complex motions.

2 RELATED WORK
Text-driven image & video synthesis. Seminal works designed GAN architectures to synthesize
images conditioned on text embeddings (Reed et al., 2016; Zhang et al., 2016). With the ever-growing
scale of vision-language datasets and pretraining strategies (Radford et al., 2021; Schuhmann
et al., 2022), there has been remarkable progress in text-driven image generation capabilities.
Users can synthesize high-quality visual content using simple text prompts. Much of this progress
is also attributed to diffusion models (Sohl-Dickstein et al., 2015; Croitoru et al., 2022; Dhariwal
& Nichol, 2021; Ho et al., 2020; Nichol & Dhariwal, 2021) which have been established as state-
of-the-art text-to-image generators (Nichol et al., 2021; Saharia et al., 2022; Ramesh et al., 2022;
Rombach et al., 2022; Sheynin et al., 2022; Bar-Tal et al., 2023). Such models have been extended
for text-to-video generation, by extending 2D architectures to the temporal dimension (e.g., using
temporal attention Ho et al. (2022b)) and performing large-scale training on video datasets (Ho
et al., 2022a; Blattmann et al., 2023; Singer et al., 2022). Recently, Gen-1 (Esser et al., 2023) tai-
lored a diffusion model architecture for the task of video editing, by conditioning the network on
structure/appearance representations. Nevertheless, due to their extensive computation and memory
requirements, existing video diffusion models are still in their infancy and are largely restricted to short
clips, or exhibit lower visual quality compared to image models. On the other side of the spectrum,
a promising recent line of work leverages a pre-trained image diffusion model for video synthesis
tasks, without additional training (Fridman et al., 2023; Wu et al., 2022; Lee et al., 2023a; Qi et al.,
2023). Our work falls into this category, employing a pretrained text-to-image diffusion model for
the task of video editing, without any training or finetuning.
[Figure 3 panels. Columns: Original, Per-frame editing, Ours. Rows: Sample Frames, Features (PCA), x-t slice.]
Figure 3: Diffusion features across time. Left: Given an input video (top row), we apply DDIM inversion
on each frame and extract features from the highest resolution decoder layer in $\epsilon_\theta$. We apply PCA on the
features (i.e., output tokens from the self-attention module) extracted from all frames and visualize the first
three components (second row). We further visualize an x-t slice (marked in red on the original frame) for both
RGB and features (bottom row). The feature representation is consistent across time: corresponding regions
are encoded with similar features across the video. Middle: Frames and feature visualization for an edited video
obtained by applying an image editing method (Tumanyan et al. (2023)) on each frame; inconsistent patterns
in RGB are also evident in the feature space (e.g., on the dog’s body). Right: Our method enforces the edited
video to convey the same level of feature consistency as the original video, which translates into a coherent and
high-quality edit in RGB space.

Consistent video stylization. A common approach for video stylization involves applying image
editing techniques (e.g., style transfer) on a frame-by-frame basis, followed by a post-processing
stage to address temporal inconsistencies in the edited video (Lai et al. (2018b); Lei et al. (2020;
2023)). Although these methods effectively reduce high-frequency temporal flickering, they are not
designed to handle frames that exhibit substantial variations in content, which often occur when
applying text-based image editing techniques (Qi et al., 2023). Kasten et al. (2021) propose to decompose
a video into a set of 2D atlases, each providing a unified representation of the background
or of a foreground object throughout the video. Edits applied to the 2D atlases are automatically
mapped back to the video, thus achieving temporal consistency with minimal effort. Bar-Tal et al.
(2022); Lee et al. (2023b) leverage this representation to perform text-driven editing. However, the
atlas representation is limited to videos with simple motion and requires long training, limiting the
applicability of this technique and of the methods built upon it. Our work is also related to classical
works that demonstrated that small patches in a natural video extensively repeat across frames
(Shahar et al., 2011; Cheung et al., 2005), and thus consistent editing can be simplified by editing
a subset of keyframes and propagating the edit across the video by establishing patch correspondences
using handcrafted features and optical flow (Ruder et al., 2016; Jamriška et al., 2019) or by
training a patch-based GAN (Texler et al., 2020). Nevertheless, such propagation methods struggle
to handle videos with illumination changes, or with complex dynamics. Importantly, they rely
on a user-provided consistent edit of the keyframes, which remains a labor-intensive task yet to be
automated. Yang et al. (2023) combine keyframe editing with a propagation method by Jamriška
et al. (2019). They edit keyframes using a text-to-image diffusion model while enforcing optical
flow constraints on the edited keyframes. However, since optical flow estimation between distant
frames is not reliable, their method fails to consistently edit keyframes that are far apart (as seen in
our Supplementary Material - SM), and as a result, fails to consistently edit most videos.
Our work shares a similar motivation as this approach that benefits from the temporal redundancies
in natural videos. We show that such redundancies are also present in the feature space of a
text-to-image diffusion model, and leverage this property to achieve consistency.
Controlled generation via diffusion features manipulation. Recently, a surge of works
demonstrated how text-to-image diffusion models can be readily adapted to various editing and
generation tasks, by performing simple operations on the intermediate feature representation
of the diffusion network (Chefer et al., 2023; Hong et al., 2022; Ma et al., 2023; Tumanyan
et al., 2023; Hertz et al., 2022; Patashnik et al., 2023; Cao et al., 2023). Luo et al. (2023);
Zhang et al. (2023) demonstrated semantic appearance swapping using diffusion feature correspondences.
Hertz et al. (2022) observed that by manipulating the cross-attention layers, it is possible
to control the relation between the spatial layout of the image and each word in the text.
Plug-and-Play Diffusion (PnP, Tumanyan et al. (2023)) analyzed the spatial features and the self-attention
maps and found that they capture semantic information at high spatial granularity. Tune-A-Video
(Wu et al., 2022) observed that by extending the self-attention module to operate on more than a single
frame, it is possible to generate frames that share a common global appearance. Qi et al. (2023);
Ceylan et al. (2023); Khachatryan et al. (2023a); Shin et al. (2023); Liu et al. (2023) leverage this
property to achieve globally-coherent video edits. Nevertheless, as demonstrated in Sec. 5, inflating
the self-attention module is insufficient for achieving fine-grained temporal consistency. Prior
and concurrent works either compromise visual quality, or exhibit limited temporal consistency. In
this work, we also perform video editing via simple operations in the feature space of a pre-trained
text-to-image model, but we explicitly encourage the features of the model to be temporally consistent
through TokenFlow.

[Figure 2 panels. Rows: Source, Target I, Target II. Column labels: (a) Reconstructed Target, (b) Warped Source, (c) Nearest-Neighbour Field.]
Figure 2: Fine-grained feature correspondences. Features (i.e., output tokens from the self-attention
modules) extracted from a source frame are used to reconstruct nearby frames. This is done by: (a)
swapping each feature in the target by its nearest feature in the source, in all layers and all generation
time steps, and (b) simple warping in RGB space, using a nearest neighbour field (c), computed between
the source and target features extracted from the highest resolution decoder layer. The target is faithfully
reconstructed, demonstrating the high level of spatial granularity and shared content between the features.

[Figure 4 diagram. Top: input video $I$, DDIM inversion, token extraction from $\epsilon_\theta$, computation of the NN field $\gamma$. Bottom: noisy video $J_t$; (I) joint editing of sampled keyframes with extended attention under the prompt “colourful painting”, giving $T_{base}$; (II) TokenFlow propagation $\mathcal{F}_\gamma$; denoised video $J_{t-1}$.]
Figure 4: TokenFlow pipeline. Top: Given an input video $I$, we DDIM invert each frame, extract its tokens,
i.e., output features from the self-attention modules, from each timestep and layer, and compute inter-frame
feature correspondences using a nearest-neighbor (NN) search. Bottom: The edited video is generated as
follows: at each denoising step $t$, (I) we sample keyframes from the noisy video $J_t$ and jointly edit them using
an extended-attention block; the set of resulting edited tokens is $T_{base}$. (II) We propagate the edited tokens
across the video according to the pre-computed correspondences of the original video features. To denoise
$J_t$, we feed each frame to the network, and replace the generated tokens with the tokens obtained from the
propagation step (II).

3 PRELIMINARIES
Diffusion Models. Diffusion probabilistic models (DPM) (Sohl-Dickstein et al., 2015; Croitoru
et al., 2022; Dhariwal & Nichol, 2021; Ho et al., 2020; Nichol & Dhariwal, 2021) are a class of
generative models that aim to approximate a data distribution $q$ through a progressive denoising
process. Starting from a Gaussian i.i.d. noisy image $x_T \sim \mathcal{N}(0, I)$, the diffusion model $\epsilon_\theta$ gradually
denoises it, until reaching a clean image $x_0$ drawn from the target distribution $q$. DPM can learn a
conditional distribution by incorporating additional guiding signals, such as text conditioning.
Song et al. (2020) derived DDIM, a deterministic sampling algorithm given an initial noise $x_T$.
Applying this algorithm in the reverse order (a.k.a. DDIM inversion), starting from the clean $x_0$,
yields the intermediate noisy images $\{x_t\}_{t=1}^{T}$ used to generate it.
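To make the relation between DDIM sampling and DDIM inversion concrete, the following is a minimal NumPy sketch under stated assumptions, not the implementation used in this work. The noise predictor `toy_eps` (which ignores its input), the schedule `alphas`, and all function names are hypothetical stand-ins for $\epsilon_\theta$ and the model's noise schedule.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two cumulative signal levels."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(x0, eps_model, alphas):
    """Run the update from clean to noisy; returns the latents [x_1, ..., x_T]."""
    x, latents = x0, []
    for t in range(1, len(alphas)):
        # Inversion approximates the noise at step t by the prediction at step t-1.
        x = ddim_step(x, eps_model(x, t - 1), alphas[t - 1], alphas[t])
        latents.append(x)
    return latents

def ddim_sample(x_T, eps_model, alphas):
    """Run the same update from noisy to clean, starting at x_T."""
    x = x_T
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x

# Hypothetical stand-ins: a noise predictor that ignores its input, and a
# hand-picked schedule with alphas[0] = 1 (clean image) decreasing with t.
fixed_noise = np.array([0.5, -1.0, 0.25])
toy_eps = lambda x, t: fixed_noise
alphas = np.linspace(1.0, 0.05, 11)

x0 = np.array([1.0, 2.0, -0.5])
latents = ddim_invert(x0, toy_eps, alphas)          # T = 10 latents
x0_rec = ddim_sample(latents[-1], toy_eps, alphas)  # round trip back to x0
```

With this toy predictor the round trip is exact up to floating-point error. With a learned $\epsilon_\theta$ the prediction changes between adjacent steps, so inversion is only approximate.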
Stable Diffusion. Stable Diffusion (SD) (Rombach et al., 2022) is a prominent text-to-image diffusion
model that operates in a latent image space. A pretrained encoder maps RGB images to this
space, and a decoder decodes latents back to high-resolution images. In more detail, SD is based
on a U-Net architecture (Ronneberger et al., 2015), which comprises residual, self-attention, and
cross-attention blocks. The residual block convolves the activations from a previous layer, while
cross-attention manipulates features according to the text prompt. In the self-attention block, features
are projected into queries $Q$, keys $K$, and values $V$. The Attention operation (Vaswani
et al., 2017) computes the affinities between the $d$-dimensional projections $Q$, $K$ to yield the output
of the layer:
$$A \cdot V \quad \text{where} \quad A = \mathrm{Attention}(Q; K) \quad \text{and} \quad \mathrm{Attention}(Q; K) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) \tag{1}$$
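As a small illustration of Eq. (1), the sketch below computes single-frame self-attention with NumPy. It assumes $Q$, $K$, $V$ are already-projected arrays of shape (tokens, $d$); the random inputs and the function names are illustrative only and do not correspond to Stable Diffusion's actual modules.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Eq. (1): Softmax(Q K^T / sqrt(d)) V for arrays of shape (tokens, d)."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (tokens, tokens) affinities
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out = self_attention(Q, K, V)  # one output token per query token
```

Each row of the affinity matrix sums to one, so every output token is a convex combination of the value tokens.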

[Figure 5 panels. Four input videos, each shown with two edits. Top left: “A Van Gogh portrait”, “A marble sculpture”. Top right: “Ice sculpture of a car”, “Sand sculpture of a car on the beach”. Bottom left: “A robotic wolf”, “A colourful polygonal illustration”. Bottom right: “Maui from Moana Movie”, “A Pixar animation”.]
Figure 5: Results. Sample results of our method. We refer the reader to our webpage and SM for more
examples and full-video results.

4 METHOD
Given an input video $I = [I^1, \ldots, I^n]$, and a text prompt $\mathcal{P}$ describing the target edit, our goal is to
generate an edited video $J = [J^1, \ldots, J^n]$ that adheres to the text $\mathcal{P}$, while preserving the original
motion and semantic layout of $I$. To achieve this, our framework leverages a pretrained and fixed
text-to-image diffusion model $\epsilon_\theta$.
Naïvely leveraging $\epsilon_\theta$ for video editing, by applying an image editing method on each frame independently
(e.g., Hertz et al. (2022); Tumanyan et al. (2023); Meng et al. (2022); Zhang & Agrawala
(2023)), results in content inconsistencies across frames (e.g., Fig. 3 middle column). Our key
finding is that these inconsistencies can be alleviated by enforcing consistency among the internal
diffusion features across frames, during the editing process.
Natural videos typically depict coherent and shared content across time. We observe that the internal
representation of natural videos in ϵθ has similar properties. This is illustrated in Fig. 3, where we
visualize the features extracted from a given video (first column). As seen, the features depict a
shared and consistent representation across frames, i.e., corresponding regions exhibit similar repre-
sentation. We further observe that the original video features provide fine-grained correspondences
between frames, using a simple nearest neighbour search (Fig 2). Moreover, we show that these
corresponding features are interchangeable for the diffusion model – we can faithfully synthesize
one frame by swapping its features by their corresponding ones in a nearby frame (Fig 2(a)).
Nevertheless, when an edit is applied to each frame individually, the consistency of the features
breaks (Fig. 3 middle column). This implies that the level of consistency in RGB space is correlated
with the consistency of the internal features of the frames. Hence, our key idea is to manipulate
the features of the edited video to preserve the level of consistency and inter-frame correspondences
of the original video features.
As illustrated in Fig. 4, our framework, dubbed TokenFlow, alternates at each generation timestep
between two main components: (i) sampling a set of keyframes and jointly editing them according to
$\mathcal{P}$; this stage results in shared global appearance across the keyframes, and (ii) propagating the features
from the keyframes to all of the frames based on the correspondences provided by the original
video features; this stage explicitly preserves the consistency and fine-grained shared representation
of the original video features. Both stages are done in combination with an image editing technique
$\hat{\epsilon}_\theta$ (e.g., Tumanyan et al. (2023)). Intuitively, the benefit of alternating between keyframe editing
and propagation is twofold: first, sampling random keyframes at each generation step increases the
robustness to a particular selection. Second, since each generation step results in more consistent
features, the sampled keyframes in the next step will be edited more consistently.

[Figure 6 panels. Columns (edit prompts): “A rainbow-textured dog”, “A shiny metal sculpture”, “An origami of a stork”. Rows: Input, TAV, PNP, Gen-1, Text2Video, Fate-Zero, Rerender a Video, Ours.]
Figure 6: Comparison. We compare our method against Tune-A-Video (TAV, Wu et al. (2022)), PnP-
Diffusion (Tumanyan et al., 2023) applied per frame, Gen-1 (Esser et al., 2023), Text2Video-Zero (Khachatryan
et al., 2023a) and Fate-Zero (Qi et al., 2023). We refer the reader to our supplementary material for full-video
comparisons.
Pre-processing: extracting diffusion features. Given an input video $I$, we apply DDIM inversion
(see Sec. 3) on each frame $I^i$, which yields a sequence of latents $[x_1^i, \ldots, x_T^i]$. For each
generation timestep $t$, we feed the latent $x_t^i$ of each frame $i \in [n]$ to the model and extract the tokens
$\phi(x_t^i)$ from the self-attention module of every layer in the network $\epsilon_\theta$ (Fig. 4, top). We will later use
these tokens to establish inter-frame correspondences between diffusion features.
4.1 KEYFRAME SAMPLING AND JOINT EDITING
Our observations imply that given the features of a single edited frame, we can generate the next
frames by propagating its features to their corresponding locations. Most videos, however, cannot
be represented by a single keyframe. To account for that, we consider multiple keyframes,
from which we obtain a set of features (tokens), $T_{base}$, that will later be propagated to the entire
video. Specifically, at each generation step, we randomly sample a set of keyframes $\{J^i\}_{i \in \kappa}$ in
fixed frame intervals (see SM for details). We jointly edit the keyframes by extending the self-attention
block to simultaneously process them (Wu et al., 2022), thus encouraging them to share a
global appearance. In more detail, the inputs to the modified block are the self-attention features from
all keyframes $\{Q^i\}_{i \in \kappa}$, $\{K^i\}_{i \in \kappa}$, $\{V^i\}_{i \in \kappa}$, where $Q^i$, $K^i$, $V^i$ are the queries, keys, and values of
frame $i \in \kappa$, $\kappa = \{i_1, \ldots, i_k\}$. The keys of all frames are concatenated, and the extended-attention is:
$$\mathrm{ExtAttn}\left(Q^i; [K^{i_1}, \ldots, K^{i_k}]\right) = \mathrm{Softmax}\left(\frac{Q^i \left[K^{i_1}, \ldots, K^{i_k}\right]^{T}}{\sqrt{d}}\right) \tag{2}$$
The output of the block for frame $i$ is given by:
$$\phi(J^i) = \hat{A} \cdot [V^{i_1}, \ldots, V^{i_k}] \quad \text{where} \quad \hat{A} = \mathrm{ExtAttn}\left(Q^i; [K^{i_1}, \ldots, K^{i_k}]\right) \tag{3}$$
Intuitively, each keyframe queries all other keyframes, and aggregates information from them. This
results in a roughly unified appearance in the edited frames (Wu et al., 2022; Khachatryan et al.,
2023b; Ceylan et al., 2023; Qi et al., 2023). We define $T_{base} = \{\phi(J^i)\}_{i \in \kappa}$, for each layer in the
network (Fig. 4 bottom middle).
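Eqs. (2) and (3) can be illustrated with a short NumPy sketch. It assumes the per-keyframe queries, keys, and values are stacked into arrays of shape ($k$, tokens, $d$); the function names and the random inputs are hypothetical and only serve to show the concatenation over keyframes.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def extended_attention(Qs, Ks, Vs):
    """Eqs. (2)-(3): each keyframe attends to the keys/values of all keyframes.

    Qs, Ks, Vs have shape (k, tokens, d); the result has the same shape.
    """
    k, n, d = Qs.shape
    K_all = Ks.reshape(k * n, d)  # concatenate keys of all keyframes
    V_all = Vs.reshape(k * n, d)  # concatenate values of all keyframes
    A_hat = softmax(Qs @ K_all.T / np.sqrt(d))  # (k, tokens, k * tokens)
    return A_hat @ V_all

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
# Sanity check: three identical keyframes should behave like one frame.
Qs, Ks, Vs = (np.stack([M] * 3) for M in (Q, K, V))
ext_out = extended_attention(Qs, Ks, Vs)
single_out = softmax(Q @ K.T / np.sqrt(4)) @ V
```

When all keyframes are identical, the extended block reduces to ordinary self-attention, which is a convenient check that the concatenation is wired correctly.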
4.2 EDIT PROPAGATION VIA TOKENFLOW
Given $T_{base}$, we propagate it across the video based on the token correspondences extracted from
the original video. At each generation step $t$, we compute the nearest neighbors (NN) between each original
frame’s tokens, $\phi(x_t^i)$, and its two adjacent keyframes’ tokens, $\phi(x_t^{i+})$, $\phi(x_t^{i-})$, where $i+$ is the
index of the closest future keyframe, and $i-$ the index of the closest past keyframe. Denote the
resulting NN fields $\gamma^{i+}$, $\gamma^{i-}$:
$$\gamma^{i\pm}[p] = \arg\min_{q} \mathcal{D}\left(\phi(x^i)[p], \phi(x^{i\pm})[q]\right) \tag{4}$$
where $p$, $q$ are spatial locations in the token feature map, and $\mathcal{D}$ is cosine distance. For simplicity,
we omit the generation timestep $t$; our method is applied in all time-steps and self-attention layers.
Once we obtain $\gamma^{\pm}$, we use it to propagate the edited frames’ tokens $T_{base}$ to the rest of the video,
by linearly combining the tokens in $T_{base}$ corresponding to each spatial location $p$ and frame $i$:
$$\mathcal{F}_\gamma(T_{base}, i, p) = w_i \cdot \phi(J^{i+})[\gamma^{i+}[p]] + (1 - w_i) \cdot \phi(J^{i-})[\gamma^{i-}[p]] \tag{5}$$
where $\phi(J^{i\pm}) \in T_{base}$ and $w_i \in (0, 1)$ is a scalar proportional to the distance between frame $i$
and its adjacent keyframes (see SM), ensuring a smooth transition. Note that $\mathcal{F}$ also modifies the
tokens of the sampled keyframes. That is, we modify the self-attention blocks to output a linear
combination of the tokens in $T_{base}$ for all frames, including the keyframes, according to the original
video token correspondences.
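The nearest-neighbour field of Eq. (4) and the blend of Eq. (5) can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: tokens are (tokens, $d$) arrays for a single layer and timestep, the weight `w` is passed in directly (its definition is in the SM), and the "edit" applied to the keyframe tokens is an arbitrary stand-in.

```python
import numpy as np

def nn_field(frame_tokens, key_tokens):
    """Eq. (4): for each location p, index q of the closest keyframe token
    under cosine distance. Both inputs have shape (tokens, d)."""
    a = frame_tokens / np.linalg.norm(frame_tokens, axis=1, keepdims=True)
    b = key_tokens / np.linalg.norm(key_tokens, axis=1, keepdims=True)
    cosine_distance = 1.0 - a @ b.T
    return cosine_distance.argmin(axis=1)

def propagate(edited_next, edited_prev, gamma_next, gamma_prev, w):
    """Eq. (5): blend edited keyframe tokens gathered through the NN fields."""
    return w * edited_next[gamma_next] + (1.0 - w) * edited_prev[gamma_prev]

rng = np.random.default_rng(0)
key_tokens = rng.standard_normal((8, 16))  # original tokens of a keyframe
perm = rng.permutation(8)
frame_tokens = key_tokens[perm]            # same content, moved around

gamma = nn_field(frame_tokens, key_tokens)  # should recover the motion
edited_key = 2.0 * key_tokens + 1.0         # stand-in for an edited keyframe
moved_edit = propagate(edited_key, edited_key, gamma, gamma, w=0.25)
```

In this toy setting the frame is a pure rearrangement of the keyframe, so the field recovers the rearrangement and the edited tokens follow the original motion.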
Overall algorithm. We summarize our video editing algorithm in Alg. 1: We
first perform DDIM inversion on the input video $I$ and extract the sequence
of noisy latents $\{x_t^i\}_{t=1}^{T}$ for all frames $i \in [n]$ (Fig. 4, top). We then denoise
the video, alternating between keyframe editing and TokenFlow propagation: At
each generation step $t$, we randomize $k < n$ keyframe indices, and denoise
them using an image editing technique (e.g., Tumanyan et al. (2023); Meng et al.
(2022); Zhang & Agrawala (2023)) combined with extended-attention (Eq. 3, Fig.
4 (I)). We then denoise the entire video $J_t$ by combining the image-editing technique
with TokenFlow (Eq. 5, Fig. 4 (II))
at every self-attention block in every layer of the network. Note that each layer includes a residual
connection between the input and output of the self-attention block, thus performing TokenFlow at
each layer is necessary.

Algorithm 1 TokenFlow editing
Input:
    $I = [I^1, \ldots, I^n]$    ▷ Input video
    $\mathcal{P}$    ▷ Target text prompt
    $\hat{\epsilon}_\theta$    ▷ Diffusion-based image editing technique
$\{x_t^i\}_{t=1}^{T}, \{\phi(x^i)\}_{i=1}^{n} \leftarrow$ DDIM-Inv$[I^i]$  $\forall i \in [n], t \in [T]$
$J_T^1, \ldots, J_T^n \leftarrow x_T^1, \ldots, x_T^n$
For $t = T, \ldots, 1$ do
    $\kappa = \{i_1, \ldots, i_k\} \leftarrow$ sample keyframe indices
    $\mathcal{F}_\gamma \leftarrow \gamma^{i\pm}$  $\forall i \in [n]$    ▷ compute NN field
    $\{J_{t-1}^j\}_{j \in \kappa} \leftarrow \hat{\epsilon}_\theta[\{J_t^j\}_{j \in \kappa}; \mathrm{ExtAttn}]$
    $T_{base} \leftarrow \phi(\{J_{t-1}^j\}_{j \in \kappa})$    ▷ extract keyframes’ tokens
    $J_{t-1} \leftarrow \hat{\epsilon}_\theta[J_t; \mathrm{TokenFlow}(\mathcal{F}_\gamma(T_{base}))]$
Output: $J = [J_0^1, \ldots, J_0^n]$
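The bookkeeping around the loop of Alg. 1 (which keyframes are sampled, and which keyframes each frame blends) can be sketched in plain Python. The exact sampling scheme and the weight $w_i$ are specified in the SM, so everything below is an assumption made for illustration: a random offset with a fixed stride, clamping at the ends of the video, and a weight that is linear in the frame index.

```python
import random

def sample_keyframes(n_frames, interval, rng):
    """Keyframes at a fixed interval, starting from a random offset."""
    offset = rng.randrange(interval)
    return list(range(offset, n_frames, interval))

def adjacent_keyframes(i, keyframes):
    """Closest past (i-) and future (i+) keyframes of frame i, clamped at the ends."""
    past = [k for k in keyframes if k <= i]
    future = [k for k in keyframes if k >= i]
    i_minus = past[-1] if past else keyframes[0]
    i_plus = future[0] if future else keyframes[-1]
    return i_minus, i_plus

def blend_weight(i, i_minus, i_plus):
    """Assumed weight on the future keyframe: linear, reaching 1 at i+."""
    if i_plus == i_minus:
        return 1.0
    return (i - i_minus) / (i_plus - i_minus)

def propagation_plan(n_frames, keyframes):
    """For every frame: (frame index, i-, i+, weight on i+)."""
    plan = []
    for i in range(n_frames):
        i_minus, i_plus = adjacent_keyframes(i, keyframes)
        plan.append((i, i_minus, i_plus, blend_weight(i, i_minus, i_plus)))
    return plan

rng = random.Random(0)
n_frames, interval, n_steps = 10, 4, 3
for t in range(n_steps, 0, -1):
    keyframes = sample_keyframes(n_frames, interval, rng)
    plan = propagation_plan(n_frames, keyframes)
    # A full implementation would now (I) jointly edit the keyframes and
    # (II) denoise every frame with tokens blended according to `plan`.
```

Re-sampling the offset at every step mirrors the randomized keyframe selection described above; the clamping rule for frames outside the keyframe range is a design choice of this sketch.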

5 RESULTS
We evaluate our method on DAVIS videos (Pont-Tuset et al., 2017) and on Internet videos depicting
animals, food, humans, and various objects in motion. The spatial resolution of the videos is 384×672
or 512×512 pixels, and they consist of 40 to 200 frames. We use various text prompts on each
video to obtain diverse editing results. Our evaluation dataset comprises 61 text-video pairs. We
utilize PnP-Diffusion (Tumanyan et al., 2023) as the frame editing method, and we use the same
hyper-parameters for all our results. PnP-Diffusion may fail to accurately preserve the structure of
each frame due to inaccurate DDIM inversion (see Fig. 3, middle column, right frame: the dog’s
head is distorted). Our method improves robustness to this, as multiple frames contribute to the
generation of each frame in the video. Our framework can be combined with any diffusion-based
image editing technique that accurately preserves the structure of the images; results with different
image editing techniques (e.g., Meng et al. (2022); Zhang & Agrawala (2023)) are available in the
SM. Figs. 1 and 5 show sample frames from the edited videos. Our edits are temporally consistent and
adhere to the edit prompt. The man’s head is changed to Van-Gogh or marble (top left); importantly,
the man’s identity and the scene’s background are consistent throughout the video. The patterns of
the polygonal wolf (bottom left) are the same across time: the body is consistently orange while the
chest is blue. We refer the reader to the SM for implementation details and video results.
Baselines. We compare our method to state-of-the-art and concurrent works: (i) Fate-Zero (Qi
et al., 2023) and (ii) Text2Video-Zero (Khachatryan et al., 2023b), that utilize a text-to-image model
for video editing using self-attention inflation. (iii) Re-render a Video (Yang et al., 2023) that edits
keyframes by adding optical flow optimization to self-attention inflation of an image model, and then
propagates the edit from the keyframes to the rest of the video using an off-the-shelf propagation
method. (iv) Tune-a-Video (Wu et al., 2022) that fine-tunes the text-to-image model on the given test
video. (v) Gen-1 (Esser et al., 2023), a video diffusion model that was trained on a large-scale image
and video dataset. (vi) Per-frame diffusion-based image editing baseline, PnP-Diffusion (Tumanyan
et al., 2023). We additionally consider the following two baselines: (i) Text2LIVE (Bar-Tal et al.,
2022), which utilizes a layered video representation (NLA) (Kasten et al., 2021) and performs test-time
training using CLIP losses. Note that NLA requires foreground/background separation masks and
takes ∼ 10 hours to train. (ii) Applying PnP-Diffusion on a single keyframe and propagating the edit
to the entire video using Jamriška et al. (2019).

5.1 QUALITATIVE EVALUATION

Fig. 6 provides a qualitative comparison of our method to prominent baselines; please refer to SM for
the full videos. Our method (bottom row) outputs videos that better adhere to the edit prompt while
maintaining temporal consistency of the resulting edited video, whereas other methods struggle to meet
both these goals. Tune-A-Video (second row) inflates the 2D image model into a video model, and
fine-tunes it to overfit the motion of the video; thus, it is suitable for short clips. For long videos
it struggles to capture the motion, resulting in meaningless edits, e.g., the shiny metal sculpture.
Applying PnP for each frame independently (third row) results in exquisite edits adhering to the
edit prompt that, as expected, lack any temporal consistency. The results of Gen-1 (fourth row) also
suffer from some temporal inconsistencies (the beak of the origami stork changes color). Moreover,
their frame quality is significantly worse than that of a text-to-image diffusion model. The edits of
Text2Video-Zero and Fate-Zero (fifth and sixth rows) suffer from severe jittering as these methods
rely heavily on the extended attention mechanism to implicitly encourage consistency. The results
of Rerender-a-Video exhibit notable long-range inconsistencies and artifacts arising primarily from
their reliance on optical flow estimation for distant frames (e.g., keyframes), which is known to be
sub-optimal (see our video results in the SM; when the wolf turns its head, the nose color changes).
We provide qualitative comparison to Text2LIVE and to an RGB propagation baseline in the SM.

5.2 Q UANTITATIVE EVALUATION


We evaluate our method in terms of (i) edit fidelity, measured by computing the average similarity
between the CLIP embedding (Radford et al., 2021) of each edited frame and the target text prompt,
and (ii) temporal consistency. Following Ceylan et al. (2023); Lai et al. (2018a), temporal
consistency is measured by (a) computing the optical flow of the original video using Teed & Deng
(2020), warping the edited frames according to it, and measuring the warping error, and (b) a user
study. We adopt a Two-alternative Forced Choice (2AFC) protocol suggested in Kolkin et al. (2019);
Park et al. (2020), where participants are shown the input video, our result and a baseline result,
and are asked to determine which video is more temporally consistent and better preserves the
motion of the original video. The survey consists of 2000-3000 judgments per baseline obtained
using Amazon Mechanical Turk. We note that the warping error could not be measured for Gen1 since
their product platform does not output the same number of frames as the input.

Table 1: We evaluate our method in temporal consistency by computing warp-error and conducting a
user study, and in fidelity to the target text prompt using CLIP similarity. See Sec. 5 for more details.

Method                     Warp-err ↓ (×10^-3)   User preference of our method   CLIP score ↑
LDM recon.                 2.0                   -                               0.23
PnP-Diffusion              11.3                  94%                             0.33
Text2Video-Zero            12.5                  78%                             0.33
Tune-a-Video               30.0                  82%                             0.31
Fate-Zero                  6.9                   71%                             0.32
Gen1                       -                     70%                             0.32
Rerender-a-Video           1.8                   71%                             0.32
Ours w joint attention     5.9                   90%                             0.33
Ours w/o rand keyframes    3.7                   -                               0.33
Ours                       3.0                   -                               0.33

Figure 7: Limitations (edit prompt: “A tractor”). Our method edits the video according to the feature
correspondences of the original video, hence it cannot handle edits that require structural deviations.

Table 1 compares our method to baselines. Our method achieves the highest CLIP score, showing a
good fit between the edited video and the input guidance prompt. Furthermore, our method has a low
warping error, indicating temporally consistent results. We note that Rerender-a-Video optimizes
for the warping error and uses optical flow to propagate the edit, and hence has the lowest warping
error; however, this reliance on optical flow often creates artifacts and long-range
inconsistencies which are not reflected in the warping error. Nonetheless, they are apparent in the
user study, which shows that users significantly favored our method over all baselines in terms of
temporal consistency. Additionally, we consider the reference baseline of passing the original video
through the LDM auto-encoder without performing editing (LDM recon.). This baseline provides an
upper bound on the temporal consistency achievable by the LDM auto-encoder. As expected, the CLIP
similarity of this baseline is poor as it does not involve any editing. However, this baseline does not
achieve zero warp error either, due to the imperfect reconstruction of the LDM auto-encoder, which
hallucinates high-frequency information.
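
To make the two automatic metrics concrete, the following is a minimal sketch in plain Python, not the evaluation code used for Table 1. It assumes that CLIP embeddings and the optical flow of the original video have already been computed elsewhere; the function names, the grayscale list-of-lists frame format, the backward-flow convention and the nearest-neighbour sampling are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(frame_embeddings, text_embedding):
    """Average similarity between each edited-frame embedding and the prompt embedding.

    The embeddings are assumed to come from a CLIP model run beforehand.
    """
    sims = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)


def warp_error(edited_prev, edited_next, flow):
    """Mean squared error between edited_next and edited_prev warped by the flow.

    Frames are H x W lists of grayscale values. flow[y][x] = (dx, dy) is assumed
    to be the displacement, estimated on the ORIGINAL video, that maps pixel
    (x, y) of the next frame back to the previous frame. Sampling is
    nearest-neighbour; pixels that fall outside the frame are skipped.
    """
    h, w = len(edited_next), len(edited_next[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = round(x + dx), round(y + dy)
            if 0 <= sx < w and 0 <= sy < h:
                diff = edited_next[y][x] - edited_prev[sy][sx]
                total += diff * diff
                count += 1
    return total / count if count else 0.0


# Toy usage: the next frame is the previous one shifted right by one pixel,
# so a flow of (-1, 0) everywhere explains it exactly.
prev = [[1, 2], [3, 4]]
nxt = [[0, 1], [0, 3]]
flow = [[(-1, 0), (-1, 0)], [(-1, 0), (-1, 0)]]
print(warp_error(prev, nxt, flow))
print(clip_score([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]))
```

Because the flow is taken from the original video, an edit that follows the original motion yields a low error regardless of its appearance, which is consistent with the observation above that the metric can miss artifacts visible to users.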
We further evaluate our correspondences and video representation by measuring the accuracy of
video reconstruction using TokenFlow. Specifically, we reconstruct the video using the same
pipeline as our editing method, only removing the keyframe editing part. Table 2 reports the PSNR
and LPIPS distance of this reconstruction, compared to vanilla DDIM reconstruction. As seen,
TokenFlow reconstruction slightly improves on DDIM inversion, demonstrating a robust frame
representation. This improvement can be attributed to the keyframe randomization: it increases
robustness to challenging frames, since each frame is reconstructed from multiple other frames
during the generation. Notably, our evaluation focuses on accurate correspondences within the
feature space during generation, rather than on evaluating RGB frame correspondences, which are
not essential to our method.
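
For reference, PSNR can be computed as in the sketch below. This is a generic definition written for illustration, not the evaluation script behind Table 2; the flat-list frame format and the `max_value` parameter are assumptions. LPIPS relies on a learned network and is therefore not sketched here.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized frames.

    Frames are flat lists of pixel intensities in [0, max_value].
    """
    if len(reference) != len(reconstruction):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] intensity scale gives about 20 dB.
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

A video-level score would typically average this value over frames; whether the reported numbers average per frame or per video is not stated here.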

5.3 ABLATION STUDY
First, we ablate the use of TokenFlow (Sec. 4.2) for enforcing temporal consistency. In this
experiment, we replace TokenFlow with extended attention (Eq. 3) and compute it between each frame
of the edited video and the keyframes (w joint attention). Second, we ablate the randomization of
the keyframe selection at each generation step (w/o random keyframes). In this experiment, we use
the same keyframe indices (evenly spaced in time) throughout the generation. Table 1 (bottom) shows
the quantitative results of our ablations; the resulting videos can be found in the SM. As seen,
TokenFlow ensures a higher degree of temporal consistency, indicating that solely relying on the
extension of self-attention to multiple frames is insufficient for achieving fine-grained
temporal consistency. Additionally, fixing the keyframes creates an artificial partition of the video
into short clips between the fixed keyframes, which reflects poorly on the consistency of the result.

Table 2: We reconstruct the video using the TokenFlow pipeline, excluding keyframe editing. We
evaluate the TokenFlow representation with PSNR and LPIPS metrics. Our reconstruction improves on
vanilla DDIM inversion, highlighting the robustness of the TokenFlow representation.

Method            PSNR ↑    LPIPS ↓
LDM recon.        31.13     0.03
DDIM inversion    25.32     0.14
Ours              25.74     0.13

6 DISCUSSION
We presented a new framework for text-driven video editing using an image diffusion model. We
study the internal representation of a video in the diffusion feature space, and demonstrate that
consistent video editing can be achieved via consistent diffusion feature representation during the
generation. Our method outperforms existing baselines, demonstrating a significant improvement in
temporal consistency. As for limitations, our method is tailored to preserve the motion of the original
video, and as such, it cannot handle edits that require structural changes (Fig. 7). Moreover, our
method is built upon a diffusion-based image editing technique to allow the structure preservation
of the original frames. When the image-editing technique fails to preserve the structure, our method
enforces correspondences that are meaningless in the edited frames, resulting in visual artifacts.
Lastly, the LDM decoder introduces some high frequency flickering (Blattmann et al., 2023). A
possible solution for this would be to combine our framework with an improved decoder (e.g.,
Blattmann et al. (2023), Zhu et al. (2023)). We note that this minor level of flickering can be easily
eliminated with existing post-process deflickering (see SM). Our work sheds new light on the internal
representation of natural videos in the space of diffusion models (e.g., temporal redundancies), and
how they can be leveraged for enhancing video synthesis. We believe it can inspire future research
in harnessing image models for video tasks, and for the design of text-to-video models.

7 ACKNOWLEDGEMENT
We thank Narek Tumanyan for his valuable comments and discussion. We thank Hila Chefer for
proofreading the paper. We thank the authors of Gen-1 and of Fate-Zero for their help in run-
ning their comparisons. This project received funding from the Israeli Science Foundation (grant
2303/20), the Carolito Stiftung, and the NVIDIA Applied Research Accelerator Program. Dr. Bagon
is a Robin Chemers Neustein AI Fellow.

REFERENCES
Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered
image and video editing. In European Conference on Computer Vision, pp. 707–723. Springer, 2022.
Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled
image generation. arXiv preprint arXiv:2302.08113, 2023.
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten
Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2023.
Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-
free mutual self-attention control for consistent image synthesis and editing, 2023.
Duygu Ceylan, Chun-Hao Paul Huang, and Niloy Jyoti Mitra. Pix2video: Video editing using image diffusion.
ArXiv, abs/2303.12688, 2023.
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based
semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826, 2023.
V. Cheung, B.J. Frey, and N. Jojic. Video epitomes. In 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), 2005.
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A
survey. arXiv preprint arXiv:2209.04747, 2022.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural
Information Processing Systems, 2021.
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure
and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
Rafail Fridman, Amit Abecasis, Yoni Kasten, and Tali Dekel. Scenescape: Text-driven consistent scene gener-
ation. arXiv preprint arXiv:2302.01133, 2023.
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt
image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems, 2020.
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma,
Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with
diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video
diffusion models. arXiv:2204.03458, 2022b.
Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion
models using self-attention guidance. arXiv preprint arXiv:2210.00939, 2022.

Ondřej Jamriška, Šárka Sochorová, Ondřej Texler, Michal Lukáč, Jakub Fišer, Jingwan Lu, Eli Shechtman, and
Daniel Sýkora. Stylizing video by example. ACM Transactions on Graphics, 2019.
Yoni Kasten, Dolev Ofri, Oliver Wang, and Tali Dekel. Layered neural atlases for consistent video editing.
ACM Transactions on Graphics (TOG), 2021.
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant
Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video
generators. ArXiv, abs/2303.13439, 2023a.
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant
Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video
generators. arXiv preprint arXiv:2303.13439, 2023b.

Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and
self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 10051–10060, 2019.

Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning
blind video temporal consistency. In European Conference on Computer Vision, 2018a.

Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. Learning
blind video temporal consistency. In Proceedings of the European conference on computer vision (ECCV),
pp. 170–185, 2018b.

Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, and Jia-Bin Huang. Shape-aware text-
driven layered video editing. arXiv preprint arXiv:2301.13173, 2023a.

Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, and Jia-Bin Huang. Shape-aware
text-driven layered video editing demo. arXiv preprint arXiv:2301.13173, 2023b.

Chenyang Lei, Yazhou Xing, and Qifeng Chen. Blind video temporal consistency via deep video prior. In
Advances in Neural Information Processing Systems, 2020.

Chenyang Lei, Xuanchi Ren, Zhaoxiang Zhang, and Qifeng Chen. Blind video deflickering by neural filtering
with a flawed atlas. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2023.

Shaoteng Liu, Yuecheng Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-
attention control. ArXiv, abs/2303.04761, 2023.

Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures:
Searching through time and space for semantic correspondence. arXiv, 2023.

Wan-Duo Kurt Ma, JP Lewis, W Bastiaan Kleijn, and Thomas Leung. Directed diffusion: Direct control of
object placement through attention guidance. arXiv preprint arXiv:2302.13153, 2023.

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit:
Guided image synthesis and editing with stochastic differential equations. In International Conference on
Learning Representations, 2022.

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever,
and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion
models. arXiv preprint arXiv:2112.10741, 2021.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Inter-
national Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.

Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman, Alexei A. Efros, and Richard Zhang.
Swapping autoencoder for deep image manipulation. In Advances in Neural Information Processing Systems,
2020.

Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. Localizing object-level
shape variations with text-to-image diffusion models. arXiv preprint arXiv:2303.11306, 2023.

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool.
The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen.
Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv:2303.09535, 2023.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-
try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural
language supervision. In International conference on machine learning. PMLR, 2021.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional
image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative
adversarial text to image synthesis. In International conference on machine learning, pp. 1060–1069. PMLR,
2016.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 10684–10695, 2022.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image
segmentation. In International Conference on Medical image computing and computer-assisted intervention,
2015.
Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. In Pattern Recognition
- 38th German Conference (GCPR), 2016.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-
image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy,
Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-
scale dataset for training next generation image-text models. ArXiv, abs/2210.08402, 2022.
Oded Shahar, Alon Faktor, and Michal Irani. Space-time super-resolution from a single video. In CVPR 2011,
2011.
Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman.
Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang gil Lee, and Sung-Hoon Yoon. Edit-a-video: Single video
editing with object-aware consistency. ArXiv, abs/2303.07945, 2023.
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron
Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation
without text-video data, 2022.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265.
PMLR, 2015.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International
Conference on Learning Representations, 2020.
Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–
ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16.
Springer, 2020.

Ondřej Texler, David Futschik, Michal Kučera, Ondřej Jamriška, Šárka Sochorová, Menglei Chai, Sergey
Tulyakov, and Daniel Sýkora. Interactive video stylization using few-shot patch-based training. ACM Trans-
actions on Graphics, 2020.
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven
image-to-image translation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30,
2017.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu
Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video
generation. arXiv preprint arXiv:2212.11565, 2022.
Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-
video translation, 2023.
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas.
Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. 2017 IEEE
International Conference on Computer Vision (ICCV), pp. 5908–5916, 2016.
Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and
Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic
correspondence. arXiv preprint arXiv:2305.15347, 2023.
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, and Gang Hua.
Designing a better asymmetric vqgan for stablediffusion, 2023.

Table 3: We report the average runtime, in seconds, of running our method and competing methods on
a video of 40 frames.

Method                 Runtime (s)
Tune-a-Video (TAV)     2684
Text2Video-Zero        198
Rerender-a-Video       285
Fate-Zero              349
PnP                    208
Ours (preprocess)      50
Ours (sampling)        187
Ours (total)           237

We provide additional implementation details below. We refer the reader to the HTML file attached
to our Supplementary Material for video results.

A IMPLEMENTATION DETAILS
Stable Diffusion. We use Stable Diffusion as our pre-trained text-to-image model, with the
StableDiffusion-v-2-1 checkpoint provided on the official HuggingFace webpage.
DDIM inversion. In all of our experiments, we use DDIM deterministic sampling with 50 steps.
For inverting the video, we follow Tumanyan et al. (2023) and use DDIM inversion with a
classifier-free guidance scale of 1 and 1000 forward steps, and extract the self-attention input
tokens from this process similarly to Qi et al. (2023).
Runtime. Since we do not compute the attention module on most video frames (i.e., we only compute
the self-attention output on the keyframes), our method is efficient in run-time, and the sampling
of the video reduces the time of per-frame editing by 20%. The inversion process with 1000 steps is
the main bottleneck of our method in terms of run-time, and in many cases a significantly smaller
number of steps is sufficient (e.g., 50). Table 3 reports runtime comparisons using 50 steps in all
methods. Notably, our sampling time is indeed faster than that of per-frame editing (PnP).
Hyper-parameters. In Eq. 5 we set $w_i$ to be:
$$w_i = \sigma\left(\frac{d_{-}}{d_{+} + d_{-}}\right), \quad \text{where } d_{+} = \|i - i^{+}\|, \; d_{-} = \|i - i^{-}\| \tag{6}$$
where $\sigma$ is a sigmoid function, and $i^{+}$ and $i^{-}$ are the future and past neighboring
keyframes of $i$, respectively.
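
The weight of Eq. 6 can be evaluated as in the sketch below. It is an illustration only: the function name `blend_weight` and the argument names `i_prev` and `i_next` (standing for the past and future neighboring keyframes) are hypothetical, and how the weight enters Eq. 5 is not repeated here.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def blend_weight(i, i_prev, i_next):
    """Weight w_i of Eq. 6 for frame i lying between keyframes i_prev and i_next."""
    d_plus = abs(i - i_next)   # distance to the future keyframe
    d_minus = abs(i - i_prev)  # distance to the past keyframe
    if d_plus + d_minus == 0:
        raise ValueError("the two neighboring keyframes must be distinct")
    return sigmoid(d_minus / (d_plus + d_minus))


# Weights for the frames between keyframes 0 and 8.
for frame in range(0, 9):
    print(frame, round(blend_weight(frame, 0, 8), 4))
```

Since the argument of the sigmoid lies in [0, 1], the weight as written ranges from 0.5 at the past keyframe to roughly 0.73 at the future keyframe.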
For sampling the edited video we set the classifier-free guidance scale to 7.5. At each timestep, we
sample random keyframes in frame intervals of 8.
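
One plausible reading of sampling random keyframes in frame intervals of 8 is sketched below, alongside the fixed, evenly spaced selection used in the ablation of Sec. 5.3. This is an assumption-laden illustration rather than the released implementation: it supposes that the video is split into consecutive intervals of 8 frames and that one keyframe is drawn uniformly and independently from each interval; the function names are hypothetical.

```python
import random


def sample_keyframes(num_frames, interval=8, rng=None):
    """Draw one random keyframe index from each consecutive interval of frames."""
    rng = rng or random
    keyframes = []
    for start in range(0, num_frames, interval):
        end = min(start + interval, num_frames)  # the last interval may be shorter
        keyframes.append(rng.randrange(start, end))
    return keyframes


def fixed_keyframes(num_frames, interval=8):
    """Evenly spaced keyframes, kept identical across generation steps (ablation)."""
    return list(range(0, num_frames, interval))


# Re-sample the keyframes at every generation step of a 40-frame video.
rng = random.Random(0)
for step in range(3):
    print(step, sample_keyframes(40, interval=8, rng=rng))
print(fixed_keyframes(40, interval=8))
```

Re-drawing the indices at every step means that, over the course of the generation, each frame is propagated from different keyframes, which avoids the artificial partition into short clips that fixed keyframes create.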
Baselines. For running the baseline of Tune-a-Video (Wu et al., 2022) we used their official
repository. For Gen-1 (Esser et al., 2023) we used their platform on the Runway website. This
platform outputs a video that is not of the same length and frame rate as the input video;
therefore, we could not compute the warping error on their results. For Text2Video-Zero
(Khachatryan et al., 2023b) we used their official repository, with their depth conditioning
configuration. For Fate-Zero (Qi et al., 2023) we used their official repository, and verified the
run configurations with the authors.
