
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann*  Tim Dockhorn*  Sumith Kulal*  Daniel Mendelevitch
Maciej Kilian  Dominik Lorenz  Yam Levi  Zion English  Vikram Voleti
Adam Letts  Varun Jampani  Robin Rombach
Stability AI

* Equal contributions.

Figure 1. Stable Video Diffusion samples. Top: Text-to-Video generation. Middle: (Text-to-)Image-to-Video generation. Bottom: Multi-view synthesis via Image-to-Video finetuning.
Abstract

We present Stable Video Diffusion — a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base
model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.

1. Introduction

Driven by advances in generative image modeling with diffusion models [36, 64, 67, 72], there has been significant recent progress on generative video models both in research [8, 40, 78, 91] and real-world applications [51, 70]. Broadly, these models are either trained from scratch [39] or finetuned (partially or fully) from pretrained image models with additional temporal layers inserted [8, 30, 41, 78]. Training is often carried out on a mix of image and video datasets [39].

While research around improvements in video modeling has primarily focused on the exact arrangement of the spatial and temporal layers [8, 39, 41, 78], none of the aforementioned works investigate the influence of data selection. This is surprising, especially since the significant impact of the training data distribution on generative models is undisputed [12, 100]. Moreover, for generative image modeling, it is known that pretraining on a large and diverse dataset and finetuning on a smaller but higher-quality dataset significantly improves the performance [12, 67]. Since many previous approaches to video modeling have successfully drawn on techniques from the image domain [8, 40, 41], it is noteworthy that the effect of data and training strategies, i.e., the separation of video pretraining at lower resolutions and high-quality finetuning, has yet to be studied. This work directly addresses these previously uncharted territories.

We believe that the significant contribution of data selection is heavily underrepresented in today's video research landscape despite being well-recognized among practitioners when training video models at scale. Thus, in contrast to previous works, we draw on simple latent video diffusion baselines [8] for which we fix architecture and training scheme and assess the effect of data curation. To this end, we first identify three different video training stages that we find crucial for good performance: text-to-image pretraining, video pretraining on a large dataset at low resolution, and high-resolution video finetuning on a much smaller dataset with higher-quality videos. Borrowing from large-scale image model training [12, 60, 62], we introduce a systematic approach to curate video data at scale and present an empirical study on the effect of data curation during video pretraining. Our main findings imply that pretraining on well-curated datasets leads to significant performance improvements that persist after high-quality finetuning.

A general motion and multi-view prior. Drawing on these findings, we apply our proposed curation scheme to a large video dataset comprising roughly 600 million samples and train a strong pretrained text-to-video base model, which provides a general motion representation. We exploit this and finetune the base model on a smaller, high-quality dataset for high-resolution downstream tasks such as text-to-video (see Figure 1, top row) and image-to-video, where we predict a sequence of frames from a single conditioning image (see Figure 1, mid rows). Human preference studies reveal that the resulting model outperforms state-of-the-art image-to-video models.

Furthermore, we also demonstrate that our model provides a strong multi-view prior and can serve as a base to finetune a multi-view diffusion model that generates multiple consistent views of an object in a feedforward manner and outperforms specialized novel view synthesis methods such as Zero123XL [13, 54] and SyncDreamer [55]. Finally, we demonstrate that our model allows for explicit motion control by specifically prompting the temporal layers with motion cues and also via training LoRA modules [30, 43] on datasets resembling specific motions only, which can be efficiently plugged into the model.

To summarize, our core contributions are threefold: (i) We present a systematic data curation workflow to turn a large uncurated video collection into a quality dataset for generative video modeling. Using this workflow, we (ii) train state-of-the-art text-to-video and image-to-video models, outperforming all prior models. Finally, we (iii) probe the strong prior of motion and 3D understanding in our models by conducting domain-specific experiments. Specifically, we provide evidence that pretrained video diffusion models can be turned into strong multi-view generators, which may help overcome the data scarcity typically observed in the 3D domain [13].

2. Background

Most recent works on video generation rely on diffusion models [36, 80, 83] to jointly synthesize multiple consistent frames from text- or image-conditioning. Diffusion models implement an iterative refinement process by learning to gradually denoise a sample from a normal distribution and have been successfully applied to high-resolution text-to-image [12, 60, 64, 67, 71] and video synthesis [8, 27, 39, 78, 91].

In this work, we follow this paradigm and train a latent [67, 88] video diffusion model [8, 21] on our video dataset. We provide a brief overview of related works which utilize latent video diffusion models (Video-LDMs)
in the following paragraph; a full discussion that includes approaches using GANs [9, 28] and autoregressive models [41] can be found in App. B.

Latent Video Diffusion Models. Video-LDMs [8, 29, 30, 33, 93] train the main generative model in a latent space of reduced computational complexity [20, 67]. Most related works make use of a pretrained text-to-image model and insert temporal mixing layers of various forms [1, 8, 27, 29, 30] into the pretrained architecture. Ge et al. [27] additionally rely on temporally correlated noise to increase temporal consistency and ease the learning task. In this work, we follow the architecture proposed in Blattmann et al. [8] and insert temporal convolution and attention layers after every spatial convolution and attention layer. In contrast to works that only train temporal layers [8, 30] or are completely training-free [49, 109], we finetune the full model. For text-to-video synthesis in particular, most works directly condition the model on a text prompt [8, 93] or make use of an additional text-to-image prior [21, 78]. In our work, we follow the former approach and show that the resulting model is a strong general motion prior, which can easily be finetuned into an image-to-video or multi-view synthesis model. Additionally, we introduce micro-conditioning [60] on frame rate. We also employ the EDM framework [48] and significantly shift the noise schedule towards higher noise values, which we find to be essential for high-resolution finetuning. See Section 4 for a detailed discussion of the latter.

Data Curation. Pretraining on large-scale datasets [76] is an essential ingredient for powerful models in several tasks such as discriminative text-image [62, 100] and language [25, 59, 63] modeling. By leveraging efficient language-image representations such as CLIP [45, 62, 100], data curation has similarly been successfully applied to generative image modeling [12, 60, 76]. However, discussions of such data curation strategies have largely been missing in the video generation literature [39, 41, 78, 90], and processing and filtering strategies have been introduced in an ad-hoc manner. Among the publicly accessible video datasets, the WebVid-10M [6] dataset has been a popular choice [8, 78, 110] despite being watermarked and suboptimal in size. Additionally, WebVid-10M is often used in combination with image data [76] to enable joint image-video training. However, this amplifies the difficulty of separating the effects of image and video data on the final model. To address these shortcomings, this work presents a systematic study of methods for video data curation and further introduces a general three-stage training strategy for generative video models, producing a state-of-the-art model.

3. Curating Data for HQ Video Synthesis

In this section, we introduce a general strategy to train a state-of-the-art video diffusion model on large datasets of videos. To this end, we (i) introduce data processing and curation methods, for which we systematically analyze the impact on the quality of the final model in Section 3.3 and Section 3.4, and (ii) identify three different training regimes for generative video modeling. In particular, these regimes consist of
• Stage I: image pretraining, i.e. a 2D text-to-image diffusion model [12, 60, 67].
• Stage II: video pretraining, which trains on large amounts of videos.
• Stage III: video finetuning, which refines the model on a small subset of high-quality videos at higher resolution.
We study the importance of each regime separately in Sections 3.2 to 3.4.

3.1. Data Processing and Annotation

Figure 2. Our initial dataset contains many static scenes and cuts, which hurts training of generative video models. Left: Average number of clips per video before and after our processing, revealing that our pipeline detects many additional cuts. Right: Distribution of the average optical flow score for one of these subsets before our processing, which contains many static clips.

We collect an initial dataset of long videos which forms the base data for our video pretraining stage. To avoid cuts and fades leaking into synthesized videos, we apply a cut detection pipeline¹ in a cascaded manner at three different FPS levels. Figure 2, left, provides evidence for the need for cut detection: after applying our cut-detection pipeline, we obtain a significantly higher number (∼4×) of clips, indicating that many video clips in the unprocessed dataset contain cuts beyond those obtained from metadata.

¹https://github.com/Breakthrough/PySceneDetect
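For illustration, the following is a minimal sketch of how such a cut-detection pass could look using PySceneDetect (the library referenced in the footnote). Running the detector with different frame-skip values is one way to realize a cascade over effective FPS levels; the skip values and threshold below are illustrative assumptions, not the settings used in our pipeline.

```python
# Sketch of cascaded cut detection with PySceneDetect (see footnote 1).
# Frame skipping emulates running the detector at different FPS levels;
# the skip values and threshold are illustrative, not the paper's settings.
from scenedetect import open_video, SceneManager, ContentDetector


def detect_cuts(video_path: str, frame_skips=(0, 2, 7), threshold: float = 27.0):
    """Return a sorted list of cut points (in seconds) merged over all passes."""
    cut_points = set()
    for skip in frame_skips:                      # one pass per effective FPS level
        video = open_video(video_path)
        manager = SceneManager()
        manager.add_detector(ContentDetector(threshold=threshold))
        manager.detect_scenes(video, frame_skip=skip)
        for start, _end in manager.get_scene_list():
            cut_points.add(round(start.get_seconds(), 2))
    return sorted(cut_points)


if __name__ == "__main__":
    print(detect_cuts("example_video.mp4"))       # hypothetical input file
```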
Next, we annotate each clip with three different synthetic captioning methods: First, we use the image captioner CoCa [103] to annotate the mid-frame of each clip and use
V-BLIP [104] to obtain a video-based caption. Finally, we generate a third description of the clip via an LLM-based summarization of the first two captions.

The resulting initial dataset, which we dub Large Video Dataset (LVD), consists of 580M annotated video clip pairs, forming 212 years of content.

However, further investigation reveals that the resulting dataset contains examples that can be expected to degrade the performance of our final video model, such as clips with less motion, excessive text presence, or generally low aesthetic value. We therefore additionally annotate our dataset with dense optical flow [22, 46], which we calculate at 2 FPS and with which we filter out static scenes by removing any videos whose average optical flow magnitude is below a certain threshold. Indeed, when considering the motion distribution of LVD (see Figure 2, right) via optical flow scores, we identify a subset of close-to-static clips therein. Moreover, we apply optical character recognition [4] to weed out clips containing large amounts of written text. Lastly, we annotate the first, middle, and last frames of each clip with CLIP [62] embeddings, from which we calculate aesthetics scores [76] as well as text-image similarities. Statistics of our dataset, including the total size and average duration of clips, are provided in Tab. 1.

Table 1. Comparison of our dataset before and after filtering with publicly available research datasets.

                      LVD     LVD-F   LVD-10M  LVD-10M-F  WebVid  InternVid
#Clips                577M    152M    9.8M     2.3M       10.7M   234M
Clip Duration (s)     11.58   10.53   12.11    10.99      18.0    11.7
Total Duration (y)    212.09  50.64   3.76     0.78       5.94    86.80
Mean #Frames          325     301     335      320        -       -
Mean Clips/Video      11.09   4.76    1.2      1.1        1.0     32.96
Motion Annotations?   ✓       ✓       ✓        ✓          ✗       ✗
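To make the static-scene filter concrete, here is a minimal sketch of an average optical flow score computed with OpenCV's Farnebäck method [22, 46], sampling frames at roughly 2 FPS; the cutoff value is a placeholder assumption, not the threshold selected in Section 3.3.

```python
# Sketch of the optical-flow-based motion score used to drop static clips:
# sample frames at ~2 FPS, average the dense Farneback flow magnitude, and
# reject clips whose score falls below a cutoff. The cutoff is illustrative.
import cv2
import numpy as np


def motion_score(path: str, sample_fps: float = 2.0) -> float:
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or sample_fps
    step = max(int(round(native_fps / sample_fps)), 1)   # frames between samples

    prev, magnitudes, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0


def keep_clip(path: str, threshold: float = 0.5) -> bool:  # threshold is a placeholder
    return motion_score(path) >= threshold
```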
Figure 3. Effects of image-only pretraining and data curation on video pretraining on LVD-10M: A video model with spatial layers initialized from a pretrained image model clearly outperforms a similar one with randomly initialized spatial weights, as shown in Figure 3a. Figure 3b emphasizes the importance of data curation for pretraining, since training on a curated subset of LVD-10M with the filtering threshold proposed in Section 3.3 improves upon training on the entire uncurated LVD-10M. (a) Initializing spatial layers from pretrained image models greatly improves performance. (b) Video data curation boosts performance after video pretraining.

3.2. Stage I: Image Pretraining

We consider image pretraining as the first stage in our training pipeline. Thus, in line with concurrent work on video models [8, 39, 78], we ground our initial model on a pretrained image diffusion model, namely Stable Diffusion 2.1 [67], to equip it with a strong visual representation. To analyze the effects of image pretraining, we train and compare two identical video models as detailed in App. D on a 10M subset of LVD, one with and one without pretrained spatial weights. We compare these models using a human preference study (see App. E for details) in Figure 3a, which clearly shows that the image-pretrained model is preferred in both quality and prompt-following.

3.3. Stage II: Curating a Video Pretraining Dataset

A systematic approach to video data curation. For multimodal image modeling, data curation is a key element of many powerful discriminative [62, 100] and generative [12, 38, 65] models. However, since there are no equally powerful off-the-shelf representations available in the video domain to filter out unwanted examples, we rely on human preferences as a signal to create a suitable pretraining dataset. Specifically, we curate subsets of LVD using different methods described below and then consider the human-preference-based ranking of latent video diffusion models trained on these datasets.

More specifically, for each type of annotation introduced in Section 3.1 (i.e., CLIP scores, aesthetic scores, OCR detection rates, synthetic captions, optical flow scores), we start from an unfiltered, randomly sampled 9.8M-sized subset of LVD, LVD-10M, and systematically remove the bottom 12.5%, 25%, and 50% of examples. Note that for the synthetic captions, we cannot filter in this sense. Instead, we assess Elo rankings [19] for the different captioning methods from Section 3.1. To keep the number of total subsets tractable, we apply this scheme separately to each type of annotation. We train models with the same training hyperparameters on each of these filtered subsets and compare the results of all models within the same class of annotation with an Elo ranking [19] for human preference votes. Based on these votes, we consequently select the best-performing filtering threshold for each annotation type. The details of this study are presented and discussed in App. E. Applying this filtering approach to LVD results in a final pretraining dataset of 152M training examples, which we refer to as LVD-F, cf. Tab. 1.
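As a concrete reference for the Elo ranking [19] used above, the sketch below updates per-model ratings from pairwise human preference votes; the K-factor and initial rating are common defaults, not values reported in this paper.

```python
# Sketch of an Elo ranking over pairwise human preference votes, as used to
# compare models trained on differently filtered subsets. The K-factor and
# initial rating of 1000 are common defaults, not the paper's choices.
from collections import defaultdict


def elo_rankings(votes, k: float = 32.0, init: float = 1000.0):
    """votes: iterable of (winner_model_id, loser_model_id) pairs."""
    rating = defaultdict(lambda: init)
    for winner, loser in votes:
        expected_win = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400.0))
        rating[winner] += k * (1.0 - expected_win)   # winner gains what it "missed"
        rating[loser] -= k * (1.0 - expected_win)    # loser loses the same amount
    return dict(sorted(rating.items(), key=lambda kv: kv[1], reverse=True))


# Example: three hypothetical captioning models compared by human raters.
print(elo_rankings([("coca", "vblip"), ("llm_summary", "coca"), ("llm_summary", "vblip")]))
```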
Curated training data improves performance. In this section, we demonstrate that the data curation approach described above improves the training of our video diffusion models. To show this, we apply the filtering strategy described above to LVD-10M and obtain a four times smaller subset, LVD-10M-F. Next, we use it to train a baseline model that follows our standard architecture and training
schedule and evaluate the preference scores for visual quality and prompt-video alignment compared to a model trained on uncurated LVD-10M.

We visualize the results in Figure 3b, where we can see the benefits of filtering: In both categories, the model trained on the much smaller LVD-10M-F is preferred. To further show the efficacy of our curation approach, we compare the model trained on LVD-10M-F with similar video models trained on WebVid-10M [6], which is the most recognized research-licensed dataset, and InternVid-10M [96], which is specifically filtered for high aesthetics. Although LVD-10M-F is also four times smaller than these datasets, the corresponding model is preferred by human evaluators in both spatiotemporal quality and prompt alignment, as shown in Figure 4b.

Data curation helps at scale. To verify that our data curation strategy from above also works on larger, more practically relevant datasets, we repeat the experiment above and train a video diffusion model on a filtered subset with 50M examples and a non-curated one of the same size. We conduct a human preference study and summarize the results of this study in Figure 4c, where we can see that the advantages of data curation also come into play with larger amounts of data. Finally, we show that dataset size is also a crucial factor when training on curated data in Figure 4d, where a model trained on 50M curated samples is superior to a model trained on LVD-10M-F for the same number of steps.

Figure 4. Summarized findings of Sections 3.3 and 3.4: Pretraining on curated datasets consistently boosts performance of generative video models during video pretraining at small (Figures 4a and 4b) and larger scales (Figures 4c and 4d). Remarkably, this performance improvement persists even after 50k steps of video finetuning on high-quality data (Figure 4e). (a) User preference for LVD-10M-F and WebVid [6]. (b) User preference for LVD-10M-F and InternVid [96]. (c) User preference at the 50M-sample scale. (d) User preference when scaling datasets. (e) Relative Elo progression over time during Stage III.

3.4. Stage III: High-Quality Finetuning

In the previous section, we demonstrated the beneficial effects of systematic data curation for video pretraining. However, since we are primarily interested in optimizing the performance after video finetuning, we now investigate how these differences after Stage II translate to the final performance after Stage III. Here, we draw on training techniques from latent image diffusion modeling [12, 60] and increase the resolution of the training examples. Moreover, we use a small finetuning dataset comprising 250K pre-captioned video clips of high visual fidelity.

To analyze the influence of video pretraining on this last stage, we finetune three identical models, which only differ in their initialization. We initialize the weights of the first with a pretrained image model and skip video pretraining, a common choice among many recent video modeling approaches [8, 78]. The remaining two models are initialized with the weights of the latent video models from the previous section, specifically, the ones trained on 50M curated and uncurated video clips. We finetune all models for 50K steps and assess human preference rankings early during finetuning (10K steps) and at the end to measure how performance differences progress in the course of finetuning. We show the obtained results in Figure 4e, where we plot the Elo improvements of user preference relative to the model ranked last, which is the one initialized from an image model. Moreover, the finetuning resumed from curated pretrained weights ranks consistently higher than the one initialized from video weights after uncurated training.

Given these results, we conclude that (i) the separation of video model training into video pretraining and video finetuning is beneficial for the final model performance after finetuning, and that (ii) video pretraining should ideally occur on a large-scale, curated dataset, since performance differences after pretraining persist after finetuning.

4. Training Video Models at Scale

In this section, we borrow takeaways from Section 3 and present results of training state-of-the-art video models at scale. We first use the optimal data strategy inferred from the ablations to train a powerful base model at 320 × 576 resolution (App. D.2). We then perform finetuning to yield several strong state-of-the-art models for different tasks such as text-to-video in Section 4.2, image-to-video in Section 4.3, and frame interpolation in Section 4.4. Finally, we demonstrate that our video pretraining can serve as a strong implicit 3D prior, by tuning our image-to-video models on multi-view generation in Section 4.5 and outperforming concurrent work, in particular Zero123XL [13, 54] and SyncDreamer [55], in terms of multi-view consistency.
Figure 5. Samples at 576 × 1024. Top: Image-to-video samples (conditioned on the leftmost frame). Bottom: Text-to-video samples.

Table 2. UCF-101 zero-shot text-to-video generation. Comparing our base model to baselines (numbers from the literature).

Method               FVD (↓)
CogVideo (ZH) [41]   751.34
CogVideo (EN) [41]   701.59
Make-A-Video [78]    367.23
Video LDM [8]        550.61
MagicVideo [110]     655.00
PYOCO [27]           355.20
SVD (ours)           242.02

Figure 6. Our 25-frame Image-to-Video model is preferred by human voters over GEN-2 [70] and PikaLabs [51].

4.1. Pretrained Base Model

As discussed in Section 3.2, our video model is based on Stable Diffusion 2.1 [67] (SD 2.1). Recent works [42] show that it is crucial to adapt the noise schedule when training image diffusion models, shifting towards more noise for higher-resolution images. As a first step, we finetune the fixed discrete noise schedule from our image model towards continuous noise [83] using the network preconditioning proposed in Karras et al. [48] for images of size 256 × 384. After inserting temporal layers, we then train the model on LVD-F on 14 frames at resolution 256 × 384. We use the standard EDM noise schedule [48] for 150k iterations and batch size 1536. Next, we finetune the model to generate 14 frames at 320 × 576 for 100k iterations using batch size 768. We find that it is important to shift the noise schedule towards more noise for this training stage, confirming results by Hoogeboom et al. [42] for image models. For further training details, see App. D. We refer to this model as our base model, which can be easily finetuned for a variety of tasks as we show in the following sections. The base model has learned a powerful motion representation; for example, it significantly outperforms all baselines for zero-shot text-to-video generation on UCF-101 [84] (Tab. 2). Evaluation details can be found in App. E.
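To illustrate the kind of noise-schedule shift discussed above, the snippet below samples training noise levels σ from the log-normal distribution of the EDM framework [48] and applies the corresponding preconditioning; the shifted (P_mean, P_std) values are illustrative assumptions, not the settings used for SVD.

```python
# Sketch of EDM-style sigma sampling and preconditioning [48]. Shifting the
# noise schedule "towards more noise" corresponds to increasing P_mean (and
# possibly P_std) of the log-normal sigma distribution; the shifted values
# below are illustrative assumptions, not the settings used for SVD.
import torch

SIGMA_DATA = 0.5                      # EDM default data standard deviation


def sample_sigmas(batch: int, p_mean: float, p_std: float) -> torch.Tensor:
    return torch.exp(p_mean + p_std * torch.randn(batch))


def edm_precond(sigma: torch.Tensor):
    """Return (c_skip, c_out, c_in, c_noise) as defined in Karras et al. [48]."""
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / (sigma**2 + SIGMA_DATA**2).sqrt()
    c_in = 1.0 / (sigma**2 + SIGMA_DATA**2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise


# Base-resolution pretraining vs. a higher-noise schedule for high-resolution
# finetuning (hypothetical values).
sigmas_base = sample_sigmas(4, p_mean=-1.2, p_std=1.2)   # EDM default
sigmas_hires = sample_sigmas(4, p_mean=0.7, p_std=1.6)   # shifted towards more noise
print(sigmas_base, sigmas_hires)
```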
4.2. High-Resolution Text-to-Video Model

We finetune the base text-to-video model on a high-quality video dataset of ∼1M samples. Samples in the dataset generally contain lots of object motion, steady camera motion, and well-aligned captions, and are of high visual quality altogether. We finetune our base model for 50k iterations at resolution 576 × 1024 (again shifting the noise schedule towards more noise) using batch size 768. Samples are shown in Figure 5; more can be found in App. E.
Figure 7. Applying three camera motion LoRAs (horizontal, zooming, static) to the same conditioning frame (on the left).

4.3. High-Resolution Image-to-Video Model

Besides text-to-video, we finetune our base model for image-to-video generation, where the video model receives a still input image as conditioning. Accordingly, we replace the text embeddings that are fed into the base model with the CLIP image embedding of the conditioning. Additionally, we concatenate a noise-augmented [37] version of the conditioning frame channel-wise to the input of the UNet [69]. We do not use any masking techniques and simply copy the frame across the time axis. We finetune two models, one predicting 14 frames and another one predicting 25 frames; implementation and training details can be found in App. D. We occasionally found that standard vanilla classifier-free guidance [34] can lead to artifacts: too little guidance may result in inconsistency with the conditioning frame, while too much guidance can result in oversaturation. Instead of using a constant guidance scale, we found it helpful to linearly increase the guidance scale across the frame axis (from small to high). Details can be found in App. D. Samples are shown in Figure 5; more can be found in App. E.

In Figure 6 we compare our model with state-of-the-art, closed-source video generative models, in particular GEN-2 [21, 70] and PikaLabs [51], and show that our model is preferred in terms of visual quality by human voters. Details on the experiment, as well as many more image-to-video samples, can be found in App. E.
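As a minimal sketch of the frame-wise guidance described above (this mirrors the idea of linearly increasing classifier-free guidance [34] across frames, not the exact implementation in App. D), the helper below broadcasts a per-frame scale over the denoiser outputs; the scale range is illustrative.

```python
# Sketch of linearly increasing classifier-free guidance across the frame axis:
# early frames stay close to the conditioning image (low guidance), later
# frames receive stronger guidance. The scale range is illustrative.
import torch


def framewise_cfg(eps_uncond: torch.Tensor,
                  eps_cond: torch.Tensor,
                  min_scale: float = 1.0,
                  max_scale: float = 3.0) -> torch.Tensor:
    """eps_* have shape (batch, frames, channels, height, width)."""
    num_frames = eps_cond.shape[1]
    scales = torch.linspace(min_scale, max_scale, num_frames, device=eps_cond.device)
    scales = scales.view(1, num_frames, 1, 1, 1)          # broadcast over B, C, H, W
    return eps_uncond + scales * (eps_cond - eps_uncond)  # standard CFG, per frame


# Toy usage with random tensors standing in for denoiser outputs.
uncond = torch.randn(1, 14, 4, 40, 72)
cond = torch.randn(1, 14, 4, 40, 72)
print(framewise_cfg(uncond, cond).shape)  # torch.Size([1, 14, 4, 40, 72])
```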
4.3.1 Camera Motion LoRA

To facilitate controlled camera motion in image-to-video generation, we train a variety of camera motion LoRAs within the temporal attention blocks of our model [30]; see App. D for exact implementation details. We train these additional parameters on a small dataset with rich camera-motion metadata. In particular, we use three subsets of the data for which the camera motion is categorized as "horizontally moving", "zooming", and "static". In Figure 7 we show samples of the three models for identical conditioning frames; more samples can be found in App. E.
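For reference, the sketch below shows one common way to realize such LoRA adapters [43]: wrapping the linear projections of the temporal attention blocks with low-rank updates. The module-name filter, projection names (a diffusers-style to_q/to_k/to_v), and rank are assumptions for illustration, not the implementation detailed in App. D.

```python
# Sketch of adding LoRA adapters [43] to linear projections inside temporal
# attention blocks. The "temporal" name filter, projection names, and rank are
# illustrative assumptions; the paper's exact setup is given in App. D.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                   # freeze pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                    # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


def add_camera_motion_lora(unet: nn.Module, rank: int = 16) -> nn.Module:
    """Wrap q/k/v projections of modules whose name marks them as temporal attention."""
    for name, module in list(unet.named_modules()):
        if "temporal" not in name:
            continue
        for proj in ("to_q", "to_k", "to_v"):
            child = getattr(module, proj, None)
            if isinstance(child, nn.Linear):
                setattr(module, proj, LoRALinear(child, rank=rank))
    return unet
```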
Figure 8. Generated multi-view frames of a GSO test object using our SVD-MV model (i.e., SVD finetuned for multi-view generation), SD2.1-MV [68], Scratch-MV, SyncDreamer [55], and Zero123XL [13].

4.4. Frame Interpolation

To obtain smooth videos at high frame rates, we finetune our high-resolution text-to-video model into a frame interpolation model. We follow Blattmann et al. [8] and concatenate the left and right frames to the input of the UNet via masking. The model learns to predict three frames within the two conditioning frames, effectively increasing the frame rate by four. Surprisingly, we found that a very small number of iterations (≈10k) suffices to get a good model. Details and samples can be found in App. D and App. E, respectively.

4.5. Multi-View Generation

To obtain multiple novel views of an object simultaneously, we finetune our image-to-video SVD model on multi-view datasets [13, 14, 106].

Datasets. We finetuned our SVD model on two datasets, where the SVD model takes a single image and outputs a sequence of multi-view images: (i) a subset of Objaverse [14] consisting of 150K curated and CC-licensed synthetic 3D objects from the original dataset [14]. For each object, we rendered 360° orbital videos of 21 frames with
a randomly sampled HDRI environment map and elevation angles between [−5°, 30°]. We evaluate the resulting models on an unseen test dataset consisting of 50 sampled objects from the Google Scanned Objects (GSO) dataset [18]. (ii) MVImgNet [106], consisting of casually captured multi-view videos of general household objects. We split the videos into ∼200K train and 900 test videos. We rotate the frames captured in portrait mode to landscape orientation.

The Objaverse-trained model is additionally conditioned on the elevation angle of the input image, and outputs orbital videos at that elevation angle. The MVImgNet-trained models are not conditioned on pose and can choose an arbitrary camera path in their generations. For details on the pose conditioning mechanism, see App. E.

Models. We refer to our finetuned multi-view model as SVD-MV. We perform an ablation study on the importance of the video prior of SVD for multi-view generation. To this effect, we compare the results from SVD-MV, i.e., from a video prior, to those finetuned from an image prior, i.e., the text-to-image model SD2.1 (SD2.1-MV), and to a model trained without a prior, i.e., from random initialization (Scratch-MV). In addition, we compare with the current state-of-the-art multi-view generation models Zero123 [54], Zero123XL [13], and SyncDreamer [55].

Metrics. We use the standard metrics of Peak Signal-to-Noise Ratio (PSNR), LPIPS [107], and CLIP [62] similarity scores (CLIP-S) between the corresponding pairs of ground-truth and generated frames on 50 GSO test objects.

Training. We train all our models for 12k steps (∼16 hours) with 8 80GB A100 GPUs using a total batch size of 16, with a learning rate of 1e-5.
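As a sketch of how these per-frame metrics can be computed (using the lpips and open_clip packages as convenient stand-ins, not necessarily our evaluation code), one might write:

```python
# Sketch of the evaluation metrics: PSNR, LPIPS [107], and CLIP similarity
# (CLIP-S) between ground-truth and generated frames. The lpips and open_clip
# packages are stand-ins for illustration; images are float tensors (3, H, W).
import torch
import lpips
import open_clip

_lpips = lpips.LPIPS(net="alex")
_clip, _, _preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
_clip.eval()


def psnr(gt: torch.Tensor, pred: torch.Tensor) -> float:
    """Inputs in [0, 1]."""
    mse = torch.mean((gt - pred) ** 2)
    return float(10.0 * torch.log10(1.0 / mse))


def lpips_score(gt: torch.Tensor, pred: torch.Tensor) -> float:
    """LPIPS expects inputs scaled to [-1, 1] with a batch dimension."""
    with torch.no_grad():
        return float(_lpips(gt[None] * 2 - 1, pred[None] * 2 - 1))


def clip_s(gt_pil, pred_pil) -> float:
    """Cosine similarity between CLIP image embeddings of two PIL images."""
    with torch.no_grad():
        feats = _clip.encode_image(torch.stack([_preprocess(gt_pil), _preprocess(pred_pil)]))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return float((feats[0] * feats[1]).sum())
```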
Results. Figure 9(a) shows the average metrics on the GSO test dataset. The higher performance of SVD-MV compared to SD2.1-MV and Scratch-MV clearly demonstrates the advantage of the learned video prior in the SVD model for multi-view generation. In addition, as in the case of other models finetuned from SVD, we found that a very small number of iterations (≈12k) suffices to get a good model. Moreover, SVD-MV is competitive with state-of-the-art techniques at a lower training cost (12k iterations in 16 hours), whereas existing models are typically trained for much longer (for example, SyncDreamer was trained for four days specifically on Objaverse). Figure 9(b) shows the convergence of the different finetuned models. After only 1k iterations, SVD-MV has much better CLIP-S and PSNR scores than its image-prior and no-prior counterparts.

Figure 8 shows a qualitative comparison of multi-view generation results on a GSO test object and Figure 10 on an MVImgNet test object. As can be seen, our generated frames are multi-view consistent and realistic. More details on the experiments, as well as more multi-view generation samples, can be found in App. E.

Figure 9. (a) Multi-view generation metrics on the Google Scanned Objects (GSO) test dataset (table below). SVD-MV outperforms the image-prior (SD2.1-MV) and no-prior (Scratch-MV) variants, as well as other state-of-the-art techniques. (b) Training progress of multi-view generation models with CLIP-S (solid, left axis) and PSNR (dotted, right axis) computed on the GSO test dataset. SVD-MV shows consistently better metrics from the start of finetuning.

Method             LPIPS ↓   PSNR ↑   CLIP-S ↑
SyncDreamer [55]   0.18      15.29    0.88
Zero123 [54]       0.18      14.87    0.87
Zero123XL [13]     0.20      14.51    0.87
Scratch-MV         0.22      14.20    0.76
SD2.1-MV [68]      0.18      15.06    0.83
SVD-MV (ours)      0.14      16.83    0.89

Figure 10. Generated novel multi-view frames for MVImgNet using our SVD-MV model, SD2.1-MV [68], and Scratch-MV.

5. Conclusion

We present Stable Video Diffusion (SVD), a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video synthesis. To construct its pretraining dataset, we conduct a systematic data selection and scaling study, and propose a method to curate vast amounts of video data and turn large and noisy video collections into suitable datasets for generative video models. Furthermore, we introduce three distinct stages of video model training which we separately analyze to assess their impact on the final model performance. Stable Video Diffusion provides a powerful video representation from which we finetune video models for state-of-the-art image-to-video synthesis and other highly relevant applications such as LoRAs for camera control. Finally, we provide a pioneering study on multi-view finetuning of video diffusion models and show that SVD constitutes a strong 3D prior, which obtains state-of-the-art results in multi-view synthesis while using only a
fraction of the compute of previous methods.

We hope these findings will be broadly useful in the generative video modeling literature. A discussion of our work's broader impact and limitations can be found in App. A.

References

[1] Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv:2304.08477, 2023.
[2] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment, 2021.
[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H. Campbell, and Sergey Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.
[4] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9365–9374, 2019.
[5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
[6] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval, 2022.
[7] Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Björn Ommer. iPOKE: Poking a still image for controlled stochastic video synthesis. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[8] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-resolution video synthesis with latent diffusion models. arXiv:2304.08818, 2023.
[9] Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A. Efros, and Tero Karras. Generating long videos of dynamic scenes. 2022.
[10] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[11] Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Improved conditional VRNNs for video prediction. In IEEE International Conference on Computer Vision (ICCV), 2019.
[12] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, and Devi Parikh. Emu: Enhancing image generation models using photogenic needles in a haystack, 2023.
[13] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. arXiv:2307.05663, 2023.
[14] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
[15] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
[16] Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. arXiv:2105.05233, 2021.
[17] Michael Dorkenwald, Timo Milbich, Andreas Blattmann, Robin Rombach, Konstantinos G. Derpanis, and Björn Ommer. Stochastic image-to-video synthesis using cINNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[18] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
[19] Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., New York, 1978.
[20] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. arXiv:2012.09841, 2020.
[21] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models, 2023.
[22] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. Pages 363–370, 2003.
[23] Gereon Fox, Ayush Tewari, Mohamed Elgharib, and Christian Theobalt. StyleVideoGAN: A temporal generative model using a pretrained StyleGAN. In British Machine Vision Conference (BMVC), 2021.
[24] Jean-Yves Franceschi, Edouard Delasalles, Mickaël Chen, Sylvain Lamprier, and Patrick Gallinari. Stochastic latent residual video prediction. In Proceedings of the 37th International Conference on Machine Learning, 2020.
[25] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027, 2020.
[26] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer. In Computer Vision – ECCV 2022, pages 102–118. Springer Nature Switzerland, 2022.
[27] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023.
[28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[29] Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, and Hang Xu. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv:2309.03549, 2023.
[30] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725, 2023.
[31] Sonam Gupta, Arti Keshari, and Sukhendu Das. RV-GAN: Recurrent GAN for unconditional video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2024–2033, 2022.
[32] Nicholas Guttenberg and CrossLabs. Diffusion with offset noise, 2023.
[33] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation, 2023.
[34] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[35] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.
[36] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
[37] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282, 2021.
[38] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models. arXiv:2210.02303, 2022.
[39] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models. arXiv:2210.02303, 2022.
[40] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv:2204.03458, 2022.
[41] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers, 2022.
[42] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. arXiv:2301.11093, 2023.
[43] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685, 2021.
[44] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
[45] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021.
[46] Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015.
[47] Emmanuel Kahembwe and Subramanian Ramamoorthy. Lower dimensional kernels for video discriminators. Neural Networks, 132:506–520, 2020.
[48] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv:2206.00364, 2022.
[49] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators, 2023.
[50] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
[51] Pika Labs. https://www.pika.art/, 2023.
[52] Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv:1804.01523, 2018.
[53] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. arXiv:2305.08891, 2023.
[54] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object, 2023.
[55] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv:2309.03453, 2023.
[56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017.
[57] Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv, 2020.
[58] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023.
[59] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv:2306.01116, 2023.
[60] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv:2307.01952, 2023.
[61] Giovanni Puccetti, Maciej Kilian, and Romain Beaumont. Training contrastive captioners. LAION blog, 2023.
[62] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.
[63] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
[64] Aditya Ramesh. How DALL·E 2 works, 2022.
[65] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[66] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[67] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752, 2021.
[68] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752, 2021.
[69] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. arXiv:1505.04597, 2015.
[70] RunwayML. Gen-2 by Runway. https://research.runwayml.com/gen2, 2023.
[71] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021.
[72] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[73] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017.
[74] Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. International Journal of Computer Vision, 2020.
[75] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv:2202.00512, 2022.
[76] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[77] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv:2308.16512, 2023.
[78] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. arXiv:2209.14792, 2022.
[79] Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3626–3636, 2022.
[80] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015.
[81] Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Understanding and mitigating copying in diffusion models, 2023.
[82] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020.
[83] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456, 2020.
[84] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012.
[85] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision – ECCV 2020, pages 402–419. Springer, 2020.
[86] Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. In International Conference on Learning Representations, 2021.
[87] Suramya Tomar. Converting video formats with FFmpeg. Linux Journal, 2006(146):10, 2006.
[88] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Advances in Neural Information Processing Systems, 2021.
[89] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. ICLR, 2017.
[90] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv:2210.02399, 2022.
[91] Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher Pal. MCVD: Masked conditional video diffusion for prediction, generation, and interpolation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[92] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.
[93] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv:2308.06571, 2023.
[94] Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. G3AN: Disentangling appearance and motion for video generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[95] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. LaVie: High-quality video generation with cascaded latent diffusion models. arXiv:2309.15103, 2023.
[96] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. InternVid: A large-scale video-text dataset for multimodal understanding and generation, 2023.
[97] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2020.
[98] Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, and Nan Duan. GODIVA: Generating open-domain videos from natural descriptions. arXiv:2104.14806, 2021.
[99] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. In European Conference on Computer Vision, pages 720–736. Springer, 2022.
[100] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data, 2023.
[101] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[102] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers, 2021.
[103] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models, 2022.
[104] Keunwoo Peter Yu. VideoBLIP. https://github.com/yukw777/VideoBLIP, 2023.
[105] Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. Generating videos with dynamics-aware implicit generative adversarial networks. In International Conference on Learning Representations, 2022.
[106] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. MVImgNet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9150–9161, 2023.
[107] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
[108] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models. arXiv:2311.04145, 2023.
[109] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation, 2023.
[110] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. MagicVideo: Efficient video generation with latent diffusion models. arXiv:2211.11018, 2022.
Contents
1. Introduction 2

2. Background 2

3. Curating Data for HQ Video Synthesis 3


3.1. Data Processing and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2. Stage I: Image Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3. Stage II: Curating a Video Pretraining Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.4. Stage III: High-Quality Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4. Training Video Models at Scale 5


4.1. Pretrained Base Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2. High-Resolution Text-to-Video Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.3. High-Resolution Image-to-Video Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3.1 Camera Motion LoRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.4. Frame Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.5. Multi-View Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

5. Conclusion 8

A. Broader Impact and Limitations 15

B. Related Work 15

C. Data Processing 16

D. Model and Implementation Details 18


D.1. Diffusion Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
D.2. Base Model Training and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
D.3. High-Resolution Text-to-Video Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
D.4. High-Resolution Image-to-Video Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
D.4.1 Linearly Increasing Guidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
D.4.2 Camera Motion LoRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
D.5. Interpolation Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
D.6. Multi-view generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

E. Experiment Details 21
E.1. Details on Human Preference Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
E.1.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
E.1.2 Elo Score Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
E.2. Details on Experiments from Section 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
E.2.1 Architectural Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
E.2.2 Calibrating Filtering Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
E.2.3 Finetuning Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
E.3. Human Eval vs SOTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
E.4. UCF101 FVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
E.5. Additional Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
E.5.1 Additional Text-to-Video Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
E.5.2 Additional Image-to-Video Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
E.5.3 Additional Camera Motion LoRA Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
E.5.4 Temporal Prompting via Temporal Cross-Attention Layers . . . . . . . . . . . . . . . . . . . . . . . 24
E.5.5 Additional Samples on Multi-View Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

Appendix
A. Broader Impact and Limitations
Broader Impact: Generative models for different modalities promise to revolutionize the landscape of media creation and use. While exploring their creative applications, reducing their potential for creating misinformation and harm is a crucial aspect before real-world deployment. Furthermore, risk analyses need to highlight and evaluate the differences
between the various existing model types, such as interpolation, text-to-video, animation and long-form generation. Before
these models are used in practice, a thorough investigation of the models themselves, their intended uses, safety aspects,
associated risks and potential biases is essential.
Limitations: While our approach excels at short video generation, it comes with some fundamental shortcomings w.r.t. long video synthesis: although a latent approach provides efficiency benefits, generating multiple key frames at once is expensive both during training and inference, and future work on long video synthesis should either try a cascade of very coarse frame generation or build dedicated tokenizers for video generation. Furthermore, videos generated with our approach sometimes suffer from too little generated motion. Lastly, video diffusion models are typically slow to sample and have high VRAM requirements, and our model is no exception. Diffusion distillation methods [39, 58, 75] are promising candidates for faster synthesis.

B. Related Work
Video Synthesis. Many approaches based on various models such as variational RNNs [3, 11, 15, 24, 52], normalizing flows [7, 17], autoregressive transformers [26, 31, 41, 97–99, 102], and GANs [9, 23, 47, 57, 73, 74, 79, 86, 89, 92, 94, 105] have tackled video synthesis. Most of these works, however, have generated videos either at low resolution [3, 7, 11, 15, 17, 24, 52, 57, 86, 89, 92, 105] or on comparably small and noisy datasets [10, 84, 101] that were originally proposed to train discriminative models.
Driven by increasing amounts of available compute resources and by datasets better suited for generative modeling, such as WebVid-10M [6], more competitive approaches have been proposed recently, mainly based on scalable, explicit likelihood-based approaches such as diffusion [39, 40, 78] and autoregressive models [90]. Motivated by a lack of available clean video data, all of these approaches leverage joint image-video training [8, 39, 78, 110], and most methods ground their models in pretrained image models [8, 78, 110]. Another commonality between these and most subsequent approaches to (text-to-)video synthesis [27, 93, 95] is the use of dedicated expert models to generate the actual visual content at a coarse frame rate and to temporally upscale this low-fps video to temporally smooth final outputs at 24-32 fps [8, 39, 78]. Similar to the image domain, diffusion-based approaches can be mainly separated into cascaded approaches [39] following [27, 37] and latent diffusion models [8, 108, 110] translating the approach of Rombach et al. [67] to the video domain. While most of these works aim at learning a general motion representation and are consequently trained on large and diverse datasets, another well-recognized branch of diffusion-based video synthesis tackles personalized video generation by finetuning pretrained text-to-image models on narrower datasets tailored to a specific domain [30] or application, partly including non-deep motion priors [108]. Finally, many recent works tackle the task of image-to-video synthesis, where the start frame is already given and the model has to generate the consecutive frames [30, 93, 108]. Importantly, as shown in our work (see Figure 1), when combined with off-the-shelf text-to-image models, image-to-video models can be used to obtain a full text-(to-image)-to-video pipeline.
Multi-View Generation. Several recent works, such as Zero-123 [54] and SyncDreamer [55], propose techniques to adapt and finetune image generation models such as Stable Diffusion (SD) for multi-view generation, thereby leveraging the image priors of SD. One issue with Zero-123 [54] is that the generated multi-views can be inconsistent with respect to each other, as they are generated independently with pose conditioning. Some follow-up works try to address this view-consistency problem by jointly synthesizing the multi-view images. MVDream [77] proposes to jointly generate 4 views of an object using a shared attention module across images. SyncDreamer [55] proposes to estimate a 3D voxel structure in parallel to the multi-view image diffusion process to maintain consistency across the generated views.
Despite rapid progress in multi-view generation research, these approaches rely on single-image generation models such as SD. We believe that our video generative model is a better candidate for multi-view generation, as multi-view images are a specific form of video in which the camera moves around an object. As a result, it is much easier to adapt a video generative model for multi-view generation than to adapt an image generative model. In addition, the temporal attention layers in our video model naturally assist in generating consistent multi-views of an object without needing any explicit 3D structures as in [55].

C. Data Processing
In this section, we provide more details about our processing pipeline including their outputs on a few public video examples
for demonstration purposes.

Motivation We start from a large collection of raw video data which is not useful for generative text-video (pre)training [66, 96] because of the following adverse properties: First, in contrast to discriminative approaches to video modeling, generative video models are sensitive to motion inconsistencies such as cuts, of which raw and unprocessed video data usually contains many, cf. Figure 2, left. Moreover, our initial data collection is biased towards still videos, as indicated by the peak at zero motion in Figure 2, right. Since generative models trained on this data would obviously learn to generate videos containing cuts and still scenes, this emphasizes the need for cut detection and motion annotations to ensure temporal quality. Another critical ingredient for training generative text-video models is captions, ideally more than one per video [81], that are well aligned with the video content. The last important component for generative video training that we consider here is the high visual quality of the training examples.
The design of our processing pipeline addresses the above points. To ensure temporal quality, we detect cuts with a cascaded approach directly after download, clip the videos accordingly, and estimate optical flow for each resulting video clip. After that, we apply three synthetic captioners to every clip and further extract frame-level CLIP similarities to all of these text prompts to be able to filter out outliers. Finally, visual quality at the frame level is assessed using a CLIP-embedding-based aesthetics score [76]. We describe each step in more detail in what follows.

Figure 11. Comparing a common cut detector with our cascaded approach: while standard single-fps cut detection at a single threshold only detects sudden scene changes, more continuous transitions such as fades tend to remain undetected, whereas our cascaded approach reliably detects these as well.

Cascaded Cut Detection. Similar to previous work [96], we use PySceneDetect² to detect cuts in our base video clips. However, as qualitatively shown in Figure 11, we observe many fade-ins and fade-outs between consecutive scenes, which are not detected when running the cut detector at a single threshold and only at native fps. Thus, in contrast to previous work, we apply a cascade of three cut detectors operating at different frame rates and different thresholds to detect both sudden changes and slow ones such as fades.
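The cascade can be sketched as follows. This is a minimal illustration using the PySceneDetect API; the number of detectors, the frame-skip values, and the thresholds are placeholders rather than the exact settings used for our dataset.

import torch  # unused here, kept only to match the document's PyTorch-centric environment
from scenedetect import open_video, SceneManager
from scenedetect.detectors import ContentDetector


def detect_cuts(path: str, threshold: float, frame_skip: int):
    """Run a single content-based cut detector, skipping frames to emulate a lower frame rate."""
    video = open_video(path)
    manager = SceneManager()
    manager.add_detector(ContentDetector(threshold=threshold))
    manager.detect_scenes(video, frame_skip=frame_skip)
    # Returns a list of (start, end) FrameTimecode pairs, one per detected scene.
    return manager.get_scene_list()


def cascaded_cuts(path: str):
    """Union of cut points from detectors run at different effective frame rates and thresholds."""
    # (frame_skip, threshold) pairs are illustrative: sudden cuts are caught at native fps,
    # slow fades by the more sensitive, lower-frame-rate passes.
    configs = [(0, 27.0), (2, 15.0), (5, 10.0)]
    cut_frames = set()
    for frame_skip, threshold in configs:
        for start, _ in detect_cuts(path, threshold, frame_skip):
            cut_frames.add(start.get_frames())
    return sorted(cut_frames)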

Keyframe-Aware Clipping. We clip the videos using FFMPEG [87] directly after cut detection by extracting the timestamps of the keyframes in the source videos and snapping detected cuts onto the closest keyframe timestamp that does not cross the detected cut. This allows us to quickly extract clips without cuts via seeking and, unlike inserting new keyframes into each video, is not prohibitively slow at scale.
2 https://github.com/Breakthrough/PySceneDetect

Figure 12. Example of a static video (optical flow score 0.043). Since such static scenes can have a negative impact on generative video-text (pre-)training, we filter them out.

Optical Flow. As motivated in Section 3.1 and Figure 2, it is crucial to provide a means of filtering out static scenes. To enable this, we extract dense optical flow maps at 2 fps using the OpenCV [46] implementation of the Farnebäck algorithm [22]. To keep storage size tractable, we spatially downscale the flow maps such that the shortest side is at 16 px resolution. By averaging these maps over time and spatial coordinates, we obtain a global motion score for each clip, which we use to filter out static scenes via a threshold on the minimum required motion, chosen as detailed in App. E.2.2. Since this only yields a rough approximation, for the final Stage III finetuning we compute more accurate dense optical flow maps using RAFT [85] at 800 × 450 resolution and compute the motion scores analogously. Since the high-quality finetuning data is much smaller than the pretraining dataset, the RAFT-based flow computation remains tractable.
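A rough sketch of the global motion score described above is given below. The Farnebäck parameters are assumptions chosen for illustration, and whether the flow maps or their magnitudes are downscaled first is an implementation detail the text does not specify.

import cv2
import numpy as np


def global_motion_score(frames: list) -> float:
    """frames: grayscale uint8 frames of one clip, sampled at roughly 2 fps."""
    scores = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense optical flow between consecutive sampled frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
        )
        magnitude = np.linalg.norm(flow, axis=-1)  # (H, W) per-pixel flow magnitude
        # Downscale so the shortest side is 16 px to keep storage tractable.
        h, w = magnitude.shape
        scale = 16.0 / min(h, w)
        small = cv2.resize(magnitude, (max(int(round(w * scale)), 1), max(int(round(h * scale)), 1)))
        scores.append(small.mean())
    # Averaging over time and spatial coordinates yields the clip-level motion score.
    return float(np.mean(scores))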

Figure 13. Comparison of various synthetic captioners (CoCa, VBLIP, and an LLM combination of both). We observe that CoCa often captures good spatial details, whereas VBLIP tends to capture temporal details. We use an LLM to combine these two, and experiment with all three types of synthetic captions.

Synthetic Captioning. At million-sample scale, it is not feasible to hand-annotate data points with prompts. Hence, we resort to synthetic captioning. However, in light of recent insights on the importance of caption diversity [81], and taking potential failure cases of synthetic captioning models into consideration, we extract three captions per clip using i) the image-only captioning model CoCa [61], which describes spatial aspects well, ii) the video captioner VideoBLIP [104], which also captures temporal aspects, and iii) a lightweight LLM that combines the two captions and thereby compensates for potential flaws in each of them. Examples of the resulting captions are shown in Figure 13.

Caption similarities and Aesthetics. Extracting CLIP [62] image and text representations has proven very helpful for data curation in the image domain, since computing the cosine similarity between the two allows us to assess text-image alignment for a given example [76] and thus to filter out examples with erroneous captions. Moreover, it is possible to extract scores for visual aesthetics [76]. Although CLIP can only process images, so that this is only possible at the single-frame level, we opt to extract both CLIP-based i) text-image similarities and ii) aesthetics scores for the first, center, and last frames of each video clip. As shown in Section 3.3 and App. E.2.2, training text-video models on data curated using these scores improves i) the text-following abilities and ii) the visual quality of the generated samples compared to models trained on unfiltered data.
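For illustration, the per-frame CLIP text-image similarity could be computed as sketched below using open_clip. The model name, pretrained tag, and the frame selection are assumptions; the aesthetics score would additionally require the separately trained CLIP-based aesthetics predictor [76], which is omitted here.

import open_clip
import torch
from PIL import Image

# Assumed model and pretrained tag; any CLIP variant with open_clip weights works the same way.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


@torch.no_grad()
def clip_similarity(frames: list, caption: str) -> float:
    """Mean cosine similarity between the caption and the first, center, and last frame of a clip."""
    picks = [frames[0], frames[len(frames) // 2], frames[-1]]
    images = torch.stack([preprocess(f) for f in picks])
    image_feats = model.encode_image(images)
    text_feats = model.encode_text(tokenizer([caption]))
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (image_feats @ text_feats.T).mean().item()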

Text Detection. In early experiments, we noticed that models trained on early versions of LVD-F developed a tendency to generate videos depicting excessive amounts of written text, which is arguably not a desired feature for a text-to-video model. To this end, we applied the off-the-shelf text detector CRAFT [4] to annotate the start, middle, and end frames of each clip in our dataset with bounding-box information for all written text. Using this information, we filtered out all clips with a total area of detected bounding boxes larger than 7% of the frame area to construct the final LVD-F.
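The filtering criterion itself can be sketched as follows. The box format and the per-frame aggregation are assumptions; the text detections themselves would come from CRAFT [4].

import numpy as np


def text_area_ratio(boxes, height: int, width: int) -> float:
    """boxes: (x0, y0, x1, y1) text detections for one frame; returns the fraction of the frame covered."""
    covered = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        covered[int(y0):int(y1), int(x0):int(x1)] = True  # union, so overlapping boxes are not double-counted
    return float(covered.mean())


def keep_clip(per_frame_boxes, height: int, width: int, max_ratio: float = 0.07) -> bool:
    """per_frame_boxes: detections for the start, middle, and end frame of a clip."""
    return max(text_area_ratio(b, height, width) for b in per_frame_boxes) <= max_ratio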

Figure 14. An example of a video with lots of unwanted text (text area ratio 0.102). We apply text detection, annotate bounding boxes around the text, and then compute the ratio between the area of all the boxes and the size of the frame.

D. Model and Implementation Details


D.1. Diffusion Models
In this section, we give a concise summary of DMs. We make use of the continuous-time DM framework [48, 83]. Let pdata(x0) denote the data distribution and let p(x; σ) be the distribution obtained by adding i.i.d. σ²-variance Gaussian noise to the data. Note that for sufficiently large σmax, p(x; σmax) ≈ N(0, σmax²). DMs use this fact and, starting from high-variance Gaussian noise xM ∼ N(0, σmax²), sequentially denoise towards σ0 = 0. In practice, this iterative refinement process can be implemented through the numerical simulation of the Probability Flow ordinary differential equation (ODE) [83]

    dx = −σ̇(t) σ(t) ∇x log p(x; σ(t)) dt,    (1)

where ∇x log p(x; σ) is the score function [44]. DM training reduces to learning a model sθ(x; σ) for the score function ∇x log p(x; σ). The model can, for example, be parameterized as ∇x log p(x; σ) ≈ sθ(x; σ) = (Dθ(x; σ) − x)/σ² [48], where Dθ is a learnable denoiser that tries to predict the clean x0. The denoiser Dθ is trained via denoising score matching (DSM)

    E_{(x0, c) ∼ pdata(x0, c), (σ, n) ∼ p(σ, n)} [ λσ ‖Dθ(x0 + n; σ, c) − x0‖₂² ],    (2)

where p(σ, n) = p(σ) N(n; 0, σ²), p(σ) is a distribution over noise levels σ, λσ : R⁺ → R⁺ is a weighting function, and c is an arbitrary conditioning signal. In this work, we follow the EDM-preconditioning framework [48], parameterizing the learnable denoiser Dθ as

    Dθ(x; σ) = c_skip(σ) x + c_out(σ) Fθ(c_in(σ) x; c_noise(σ)),    (3)

where Fθ is the network to be trained.
where Fθ is the network to be trained.
Classifier-free guidance. Classifier-free guidance [35] is a method used to guide the iterative refinement process of a DM towards a conditioning signal c. The main idea is to mix the predictions of a conditional and an unconditional model,

    D^w(x; σ, c) = w D(x; σ, c) − (w − 1) D(x; σ),    (4)

where w ≥ 0 is the guidance strength. The unconditional model can be trained jointly alongside the conditional model in a single network by randomly replacing the conditional signal c with a null embedding in Eq. (2), e.g., 10% of the time [35]. In this work, we use classifier-free guidance, for example, to guide video generation towards the text conditioning.
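For concreteness, Eq. (4) and the associated conditioning dropout amount to the following at a single step. This is a minimal sketch: denoiser stands in for the denoiser Dθ, and null_cond for the null embedding; neither is part of the original text.

import torch


def guided_denoise(denoiser, x: torch.Tensor, sigma: torch.Tensor, cond, null_cond, w: float) -> torch.Tensor:
    """Classifier-free guidance, Eq. (4): mix conditional and unconditional denoiser outputs."""
    d_cond = denoiser(x, sigma, cond)         # D(x; sigma, c)
    d_uncond = denoiser(x, sigma, null_cond)  # D(x; sigma), obtained via the null embedding
    return w * d_cond - (w - 1.0) * d_uncond


def maybe_drop_conditioning(cond, null_cond, p_drop: float = 0.1):
    """During training, randomly replace c with the null embedding so the same network also learns the unconditional model."""
    return null_cond if torch.rand(()) < p_drop else cond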

D.2. Base Model Training and Architecture


As discussed in the main text, we start from the publicly available Stable Diffusion 2.1 (SD 2.1) model [67]. In the EDM framework [48], SD 2.1 has the following preconditioning functions:

    c_skip^SD2.1(σ) = 1,    (5)
    c_out^SD2.1(σ) = −σ,    (6)
    c_in^SD2.1(σ) = 1/√(σ² + 1),    (7)
    c_noise^SD2.1(σ) = arg min_{j ∈ [1000]} (σ − σj),    (8)

where σj+1 > σj. The distribution over noise levels p(σ) used for the original SD 2.1 training is a uniform distribution over the 1000 discrete noise levels {σj}_{j ∈ [1000]}. One issue with the training of SD 2.1 (and in particular with its noise distribution p(σ)) is that even for the maximum discrete noise level σ1000 the signal-to-noise ratio [50] is still relatively high, which results in issues when, for example, generating very dark images [32, 53]. Guttenberg and CrossLabs [32] proposed offset noise, a modification of the training objective in Eq. (2) that makes p(n | σ) a non-isotropic Gaussian. In this work, we instead opt to modify the preconditioning functions and the distribution over training noise levels altogether.
Image model finetuning. We replace the above preconditioning functions with

    c_skip(σ) = (σ² + 1)⁻¹,    (10)
    c_out(σ) = −σ/√(σ² + 1),    (11)
    c_in(σ) = 1/√(σ² + 1),    (12)
    c_noise(σ) = 0.25 log σ,    (13)

which can be recovered in the EDM framework [48] by setting σdata = 1; these preconditioning functions were originally proposed in [75]. We also use the noise distribution and weighting function proposed in Karras et al. [48], namely log σ ∼ N(Pmean, Pstd²) and λ(σ) = (1 + σ²) σ⁻², with Pmean = −1.2 and Pstd = 1. We then finetune the neural network backbone Fθ of SD 2.1 for 31k iterations using this setup. For the first 1k iterations, we freeze all parameters of Fθ except for the time-embedding layer and train on SD 2.1's original training resolution of 512 × 512. This allows the model to adapt to the new preconditioning functions without unnecessarily modifying the internal representations of Fθ. Afterwards, we train all layers of Fθ for another 30k iterations on images of size 256 × 384, which is the resolution used in the initial stage of video pretraining.
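A sketch of these preconditioning functions and of the noise-level sampling is given below. This is a minimal illustration of Eqs. (3) and (10)-(13); f_theta is a placeholder for the network Fθ, the broadcasting of σ over image dimensions is left to the caller, and the loss weighting λ(σ) is omitted.

import torch


def precond(sigma: torch.Tensor):
    """EDM preconditioning with sigma_data = 1, cf. Eqs. (10)-(13)."""
    c_skip = 1.0 / (sigma**2 + 1.0)
    c_out = -sigma / torch.sqrt(sigma**2 + 1.0)
    c_in = 1.0 / torch.sqrt(sigma**2 + 1.0)
    c_noise = 0.25 * torch.log(sigma)
    return c_skip, c_out, c_in, c_noise


def sample_sigma(shape, p_mean: float = -1.2, p_std: float = 1.0) -> torch.Tensor:
    """Training noise levels: log sigma ~ N(P_mean, P_std^2)."""
    return torch.exp(p_mean + p_std * torch.randn(shape))


def denoise(f_theta, x: torch.Tensor, sigma: torch.Tensor, cond) -> torch.Tensor:
    """D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; c_noise), cf. Eq. (3)."""
    c_skip, c_out, c_in, c_noise = precond(sigma)
    return c_skip * x + c_out * f_theta(c_in * x, c_noise, cond)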
Video pretraining. We use the resulting model as the image backbone of our video model. We then insert temporal convolution and attention layers. In particular, we follow the exact setup from [8], inserting a total of 656M new parameters into the UNet and bumping its total size (spatial and temporal layers) to 1521M parameters. We then train the resulting UNet on 14 frames at resolution 256 × 384 for 150k iterations using AdamW [56] with learning rate 10−4 and a batch size of 1536. We train the model for classifier-free guidance [34] and drop out the text-conditioning 15% of the time. Afterwards, we increase the spatial resolution to 320 × 576 and train for an additional 100k iterations, using the same settings as for the lower-resolution training except for a reduced batch size of 768 and a shift of the noise distribution towards more noise; in particular, we increase Pmean to 0. During training, the base model as well as the high-resolution Text/Image-to-Video models are all conditioned on the frame rate and a motion score of the input video. This allows us to vary the amount of motion in a generated video at inference time.

D.3. High-Resolution Text-to-Video Model
We finetune our base model on a high quality dataset of ∼ 1M samples at resolution 576 × 1024. We train for 50k iterations
at a batch size of 768, learning rate 3 × 10−5 , and set Pmean = 0.5 and Pstd = 1.4. Additionally, we track an exponential
moving average of the weights at decay rate 0.9999. The final checkpoint is chosen using a combination of visual inspection
and human evaluation.

D.4. High-Resolution Image-to-Video Model


We can finetune our base text-to-video model for the image-to-video task. In particular, during training, we use one additional frame on which the model is conditioned. We do not use any text-conditioning but rather replace the text embeddings fed into the base model with the CLIP image embedding of the conditioning frame. Additionally, we concatenate a noise-augmented [37] version of the conditioning frame channel-wise to the input of the UNet [69]. In particular, we add a small amount of noise of strength log σ ∼ N(−3.0, 0.5²) to the conditioning frame and then feed it through the standard SD 2.1 encoder. The mean of the encoder distribution is then concatenated to the input of the UNet (copied across the time axis). Initially, we finetune our base model for the image-to-video task at the base resolution (320 × 576) for 50k iterations using a batch size of 768 and learning rate 3 × 10−5. Since the conditioning signal is very strong, we again shift the noise distribution towards more noise, i.e., Pmean = 0.7 and Pstd = 1.6. Afterwards, we finetune the base image-to-video model on a high-quality dataset of ∼1M samples at resolution 576 × 1024. We train two versions: one to generate 14 frames and one to generate 25 frames. We train both models for 50k iterations at a batch size of 768, learning rate 3 × 10−5, and set Pmean = 1.0 and Pstd = 1.6. Additionally, we track an exponential moving average of the weights at decay rate 0.9999. The final checkpoints are chosen using a combination of visual inspection and human evaluation.

D.4.1 Linearly Increasing Guidance


We occasionally found that standard vanilla classifier-free guidance [34] (see Eq. (4)) can lead to artifacts: too little guidance
may result in inconsistency with the conditioning frame while too much guidance can result in oversaturation. Instead of
using a constant guidance scale, we found it helpful to linearly increase the guidance scale across the frame axis (from small
to high). A PyTorch implementation of this novel technique can be found in Figure 15.

D.4.2 Camera Motion LoRA


To facilitate controlled camera motion in image-to-video generation, we train a variety of camera motion LoRAs within the temporal attention blocks of our model [30]. In particular, we train low-rank matrices of rank 16 for 5k iterations. Additional samples can be found in Figure 20.

D.5. Interpolation Model Details


Similar to the text-to-video and image-to-video models, we finetune our interpolation model starting from the base text-to-video model, cf. App. D.2. To enable interpolation, we reduce the number of output frames from 14 to 5, of which we use the first and last as conditioning frames, which we feed to the UNet [69] backbone of our model via the concat-conditioning mechanism [67]. To this end, we embed these frames into the latent space of our autoencoder, resulting in two image encodings zs, ze ∈ R^{c×h×w}, where c = 4, h = 52, w = 128. To form a latent frame sequence of the same shape as the noise input of the UNet, i.e., R^{5×c×h×w}, we use a learned mask embedding zm ∈ R^{c×h×w} and form a latent sequence z = {zs, zm, zm, zm, ze} ∈ R^{5×c×h×w}. We concatenate this sequence channel-wise with the noise input and additionally with a binary mask, where 1 indicates the presence of a conditioning frame and 0 that of a mask embedding. The final input for the UNet is thus of shape (5, 9, 52, 128). In line with previous work [8, 39, 78], we use noise augmentation for the two conditioning frames, which we apply in the latent space. Moreover, we replace the CLIP text representation for the cross-attention conditioning with the corresponding CLIP image representations of the start and end frame, which we concatenate to form a conditioning sequence of length 2.
We train the model on our high-quality dataset at spatial resolution 576 × 1024 using AdamW [56] with a learning rate of 10−4 in combination with exponential moving averaging at decay rate 0.9999, and use a shifted noise schedule with Pmean = 1 and Pstd = 1.2. Surprisingly, we find this model, which we train with a comparably small batch size of 256, to converge extremely fast and to yield consistent and smooth outputs after only 10k iterations. We take this as further evidence of the usefulness of the motion representation our base text-to-video model has learned.
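The masked conditioning described above can be illustrated as follows. This is a minimal sketch under the stated shapes; the function name and the batch handling are placeholders, and the noise augmentation of zs and ze is omitted.

import torch


def build_interpolation_input(noise: torch.Tensor, z_s: torch.Tensor, z_e: torch.Tensor,
                              z_m: torch.Tensor) -> torch.Tensor:
    """noise: (5, 4, 52, 128) latent noise; z_s, z_e: encoded start/end frames; z_m: learned mask embedding."""
    # Latent frame sequence {z_s, z_m, z_m, z_m, z_e}.
    z = torch.stack([z_s, z_m, z_m, z_m, z_e], dim=0)                      # (5, 4, 52, 128)
    # Binary mask: 1 marks a conditioning frame, 0 a mask embedding.
    mask = torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0]).view(5, 1, 1, 1).expand(5, 1, 52, 128)
    # Channel-wise concatenation yields the UNet input of shape (5, 9, 52, 128).
    return torch.cat([noise, z, mask], dim=1)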

import torch
from einops import rearrange, repeat


def append_dims(x: torch.Tensor, target_dims: int) -> torch.Tensor:
    """Appends singleton dimensions to the end of a tensor until it has target_dims dimensions."""
    dims_to_append = target_dims - x.ndim
    if dims_to_append < 0:
        raise ValueError(
            f"input has {x.ndim} dims but target_dims is {target_dims}, which is less"
        )
    return x[(...,) + (None,) * dims_to_append]


class LinearPredictionGuider:
    def __init__(
        self,
        max_scale: float,
        num_frames: int,
        min_scale: float = 1.0,
    ):
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.num_frames = num_frames
        # Per-frame guidance scales, increasing linearly from min_scale to max_scale.
        self.scale = torch.linspace(min_scale, max_scale, num_frames).unsqueeze(0)

    def __call__(self, x: torch.Tensor, sigma: float) -> torch.Tensor:
        # x holds the unconditional and conditional denoiser outputs, stacked along the batch axis.
        x_u, x_c = x.chunk(2)

        # Reshape from (batch * frames, ...) to (batch, frames, ...) to apply per-frame scales.
        x_u = rearrange(x_u, "(b t) ... -> b t ...", t=self.num_frames)
        x_c = rearrange(x_c, "(b t) ... -> b t ...", t=self.num_frames)
        scale = repeat(self.scale, "1 t -> b t", b=x_u.shape[0])
        scale = append_dims(scale, x_u.ndim).to(x_u.device)

        # Classifier-free guidance with a per-frame guidance scale, cf. Eq. (4).
        return rearrange(x_u + scale * (x_c - x_u), "b t ... -> (b t) ...")

Figure 15. PyTorch code for our novel linearly increasing guidance technique.
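For illustration, the guider above would be applied to the stacked denoiser outputs at every sampler step, roughly as follows. This is a minimal usage sketch; the shapes, the guidance range, and the way the unconditional and conditional branches are batched are assumptions rather than the exact sampling code.

import torch

# Hypothetical shapes: 2 videos of 14 frames, 4 latent channels, 40 x 72 latents.
b, t, c, h, w = 2, 14, 4, 40, 72
guider = LinearPredictionGuider(max_scale=3.0, num_frames=t, min_scale=1.0)

# In practice, `denoised` would hold the denoiser outputs for the unconditional and
# conditional branches, concatenated along the batch axis as (2 * b * t, c, h, w).
denoised = torch.randn(2 * b * t, c, h, w)
guided = guider(denoised, sigma=1.0)   # -> (b * t, c, h, w)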

D.6. Multi-view generation


We finetune the high-resolution image-to-video model on our specific rendering of the Objaverse dataset. We render 21 frames per orbit of an object in the dataset at 576 × 576 resolution, and finetune the 25-frame image-to-video model to generate these 21 frames. We feed one view of the object as the image condition. In addition, we feed the elevation of the camera as conditioning to the model. We first pass the elevation through a timestep embedding layer that embeds the sine and cosine of the elevation angle at various frequencies and concatenates them into a vector. This vector is finally concatenated to the overall vector conditioning of the UNet.
We trained for 12k iterations with a total batch size of 16 across 8 A100 GPUs with 80GB VRAM, at a learning rate of 1 × 10−5.
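The elevation conditioning can be sketched as follows. The embedding dimension, the frequency schedule, and the exact concatenation are assumptions for illustration; the actual model reuses its timestep-embedding layer.

import math
import torch


def elevation_embedding(elevation_rad: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Embed sin/cos of the camera elevation at several frequencies and concatenate into one vector.

    elevation_rad: (B,) elevation angles in radians. Returns a (B, dim) conditioning vector that
    would be concatenated to the UNet's overall vector conditioning.
    """
    half = dim // 4
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)  # (half,)
    sin_elev = torch.sin(elevation_rad)[:, None] * freqs[None, :]
    cos_elev = torch.cos(elevation_rad)[:, None] * freqs[None, :]
    return torch.cat(
        [sin_elev.sin(), sin_elev.cos(), cos_elev.sin(), cos_elev.cos()], dim=-1
    )  # (B, 4 * half)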

E. Experiment Details
E.1. Details on Human Preference Assessment
For the majority of the evaluations conducted in this paper, we employ human evaluation, as we observed it to contain the most reliable signal. For text-to-video tasks and all ablations conducted for the base model, we generate video samples from a list of 64 test prompts. We then employ human annotators to collect preference data on two axes: i) visual quality and ii) prompt following. More details on how the study was conducted (App. E.1.1) and how the rankings were computed (App. E.1.2) are listed below.

E.1.1 Experimental Setup


Given all models in one ablation axis (e.g., four models with varying aesthetics or motion-score thresholds), we compare each prompt for each pair of models (1v1). For every such comparison, we collect on average three votes per task from different annotators, i.e., three each for visual quality and prompt following, respectively. Performing a complete assessment of all pairwise comparisons gives us robust and reliable signals on model performance trends and the effect of varying thresholds. Sample interfaces that the annotators interact with are shown in Figure 16. The order of prompts and the order between models are fully randomized. Frequent attention checks are in place to ensure data quality.

(a) Sample instructions for evaluating visual quality of videos. (b) Sample instructions for evaluating the prompt following of videos.

Figure 16. Our human evaluation framework, as seen by the annotators. The prompt & task order and model choices are fully randomized.

E.1.2 Elo Score Calculation


To calculate rankings when comparing more than two models based on 1v1 comparisons as outlined in App. E.1.1, we use Elo scores (higher is better) [19], which were originally proposed as a scoring method for chess players but have more recently also been applied to compare instruction-tuned generative LLMs [2, 5]. For a set of competing players with initial ratings Rinit participating in a series of zero-sum games, the Elo rating system updates the ratings of the two players involved in a particular game based on the expected and actual outcome of that game. Before a game between two players with ratings R1 and R2, the expected outcomes for the two players are calculated as

    E1 = 1 / (1 + 10^((R2 − R1)/400)),    (15)
    E2 = 1 / (1 + 10^((R1 − R2)/400)).    (16)

After observing the result of the game, the ratings Ri are updated via the rule

    Ri = Ri + K · (Si − Ei),  i ∈ {1, 2},    (17)

where Si indicates the outcome of the match for player i. In our case, Si = 1 if player i wins and Si = 0 if player i loses. The constant K can be seen as a weight putting emphasis on more recent games. We choose K = 1 and bootstrap the final Elo ranking for a given series of comparisons based on 1000 individual Elo ranking calculations with randomly shuffled order. Before comparing the models, we choose the start rating for every model as Rinit = 1000.
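A sketch of this bootstrapped Elo computation is given below. The aggregation of the bootstrap runs into a single ranking, here a simple average of the per-run ratings, is an assumption.

import random


def expected(r_a: float, r_b: float) -> float:
    """Expected outcome for player a against player b, cf. Eqs. (15) and (16)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(matches, models, k=1.0, r_init=1000.0, n_bootstrap=1000, seed=0):
    """matches: list of (winner, loser) pairs from the 1v1 human comparisons."""
    rng = random.Random(seed)
    totals = {m: 0.0 for m in models}
    for _ in range(n_bootstrap):
        ratings = {m: r_init for m in models}
        # Each bootstrap run replays all matches in a random order, cf. Eq. (17).
        for winner, loser in rng.sample(matches, len(matches)):
            e_w = expected(ratings[winner], ratings[loser])
            e_l = expected(ratings[loser], ratings[winner])
            ratings[winner] += k * (1.0 - e_w)
            ratings[loser] += k * (0.0 - e_l)
        for m in models:
            totals[m] += ratings[m]
    return {m: totals[m] / n_bootstrap for m in models}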

E.2. Details on Experiments from Section 3


E.2.1 Architectural Details
Architecturally, all models trained for the analysis presented in Section 3 are identical. To create a temporal UNet [69] based on an existing spatial model, we follow Blattmann et al. [8] and add temporal convolution and (cross-)attention layers after each corresponding spatial layer. As the base 2D UNet we use the architecture from Stable Diffusion 2.1, whose weights we further use to initialize the spatial layers for all runs but the second one presented in Figure 3a, where we intentionally skip this initialization to create a baseline for demonstrating the effect of image pretraining. In contrast to Blattmann et al. [8], we train all layers, including the spatial ones, and do not freeze the spatial layers after initialization. All models are trained with the AdamW [56] optimizer with a learning rate of 10−4 and a batch size of 256. Moreover, in contrast to our models from Section 4, we do not translate the noise process to continuous time but use the standard linear schedule used in Stable Diffusion 2.1, including offset noise [32], in combination with the v-parameterization [35]. We omit the text-conditioning in 10% of the cases to enable classifier-free guidance [35] during inference. To generate samples for the evaluations, we use 50 steps of the deterministic DDIM sampler [82] with a classifier-free guidance scale of 12 for all models.

Figure 17. Results of the dedicated experiments conducted to identify the most useful filtering threshold for each ablation axis. For each of these ablation studies, we train four identical models using the architecture detailed in App. E.2.2 on different subsets of LVD-10M, which we create by systematically increasing the thresholds, i.e., filtering out more and more examples.

E.2.2 Calibrating Filtering Thresholds


Here we present the outcomes of our study on filtering thresholds as introduced in Section 3.3. As stated there, we conduct an experiment for the optimal filtering threshold for each type of annotation while not filtering for any of the other types. The only difference is our assessment of the most suitable captioning method, where we simply compare all of the captioning methods used. We train each model on videos consisting of 8 frames at resolution 256 × 256 for exactly 40k steps with a batch size of 256, which roughly corresponds to 10M training examples seen during training. For evaluation, we create samples based on 64 pre-selected prompts for each model and conduct a human preference study as detailed in App. E.1. Figure 17 shows the ranking results of these human preference studies for each annotation axis, for spatio-temporal sample quality and prompt following. Additionally, we show an averaged 'aggregated' score.
For captioning, we see that, surprisingly, the captions generated by the simple CLIP-based image captioning method CoCa of Yu et al. [103] clearly have the most beneficial influence on the model. However, since recent research recommends using more than one caption per training example, we sample one of the three distinct captions during training. We nonetheless reflect the outcome of this experiment by shifting the caption sampling distribution towards CoCa captions, using pCoCa = 0.5, pV-BLIP = 0.25, pLLM = 0.25.
For motion filtering, we choose to filter out the 25% most static examples, although the aggregated preference score of the model trained with this filtering method does not rank as high in human preference as the non-filtered score. The rationale behind this is that the non-filtered model ranks best mostly because it ranks best in the category 'prompt following', which is less important than the 'quality' category when assessing the effect of motion filtering. Thus, we choose the 25% threshold as mentioned above, since it achieves competitive performance in both 'prompt following' and 'quality'.
For aesthetics filtering, where, as for motion thresholding, the 'quality' category is more important than the 'prompt following' category, we choose to filter out the 25% with the lowest aesthetics score, while for CLIP-score thresholding we even omit 50%, since the model trained with the corresponding threshold clearly performs best. Finally, for text filtering, we filter out the 25% of samples with the largest text area covering the videos, since this threshold ranks highest both in the 'quality' category and on average.
Using these filtering methods, we reduce the size of LVD by more than a factor of 3, cf. Tab. 1, but obtain a much cleaner dataset as shown in Section 3. For the remaining experiments in Section 3.3, we use the identical architecture and hyperparameters as stated above. We only vary the dataset as detailed in Section 3.3.

E.2.3 Finetuning Experiments


For the finetuning experiments shown in Section 3.4, we again follow the architecture, training hyperparameters, and sampling procedure stated at the beginning of this section. The only notable differences are the exchange of the dataset and an increase of the resolution from the pretraining resolution of 256 × 256 to 512 × 512, while still generating videos consisting of 8 frames. We train all models presented in this section for 50k steps.

E.3. Human Eval vs SOTA
For comparison of our image-to-video model with state-of-the-art models like Gen-2 [70] and Pika [51], we randomly choose 64 conditioning images generated from a 1024 × 576 finetune of SDXL [60]. We employ the same framework as in App. E.1.1 to evaluate and compare the visual quality of the generated samples with those of the other models.
For Gen-2, we sample the image-to-video model from the web UI. We fixed the same seed of 23, used the default motion
value of 5 (on a scale of 10), and turned on the “Interpolate” and “Remove watermark” features. This results in 4-second
samples at 1408 × 768. We then resize the shorter side to yield 1056 × 576 and perform a center-crop to match our resolution
of 1024 × 576. For our model, we sample our 25-frame image-to-video finetune to give 28 frames and also interpolate using
our interpolation model to yield samples of 3.89 seconds at 28 FPS. We crop the Gen-2 samples to 3.89 seconds to avoid
biasing the annotators.
For Pika, we sample the image-to-video model from the Discord bot. We fixed the same seed of 23, used the motion value
of 2 (on a scale of 0-4), and specified a 16:9 aspect ratio. This results in 3-second samples at 1024 × 576, which matches our
resolution. For our model, we sample our 25-frame image-to-video finetune to give 28 frames and also interpolate using our
interpolation model to yield samples of 3.89 seconds at 28 FPS. We then crop our samples to 3 seconds to match Pika and
avoid biasing the annotators. Since Pika samples have a small “Pika Labs” watermark in the bottom right, we pad that region
with black pixels for both Pika and our samples to also avoid bias.

E.4. UCF101 FVD


In this section, we describe the zero-shot UCF101 FVD computation for our base text-to-video model. The UCF101 dataset [84] consists of 13,320 video clips, which are classified into 101 action categories. All videos have a frame rate of 25 FPS and a resolution of 240 × 320. To compute FVD, we generate 13,320 videos (16 frames at 25 FPS, classifier-free guidance with scale w = 7) using the same distribution of action categories, that is, for example, 140 videos of “TableTennisShot”, 105 videos of “PlayingPiano”, etc. We condition the model directly on the action category (“TableTennisShot”, “PlayingPiano”, etc.) and do not use any text modification. Our samples are generated at our model's native resolution of 320 × 576 (16 frames), and we downsample to 240 × 432 using bilinear interpolation with antialiasing, followed by a center crop to 240 × 320. We extract features using a pretrained I3D action classification model [10]; in particular, we use a torchscript³ provided by Brooks et al. [9].
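The resizing and cropping step can be sketched as follows; this is a minimal illustration, and the actual feature extraction then runs the I3D torchscript on the result.

import torch
import torch.nn.functional as F


def preprocess_for_fvd(video: torch.Tensor) -> torch.Tensor:
    """video: (T, C, 320, 576) generated frames. Returns (T, C, 240, 320) for I3D feature extraction."""
    # Bilinear downsampling with antialiasing to 240 x 432 preserves the aspect ratio.
    video = F.interpolate(video, size=(240, 432), mode="bilinear", align_corners=False, antialias=True)
    # Center crop the width from 432 to 320.
    left = (432 - 320) // 2
    return video[..., left:left + 320]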

E.5. Additional Samples


Here we show additional samples for the models introduced in App. D.2 and Secs. 4.2, 4.3 and 4.5.

E.5.1 Additional Text-to-Video Samples


In Figure 18 we show additional samples from our text-to-video model introduced in Section 4.2.

E.5.2 Additional Image-to-Video Samples


In Figure 19 we show additional samples from our image-to-video model introduced in Section 4.3.

E.5.3 Additional Camera Motion LoRA Samples


In Figure 20 we show additional samples for our motion LoRA’s tuned for camera control as presented in Section 4.3.1.

E.5.4 Temporal Prompting via Temporal Cross-Attention Layers


Our architecture follows Blattmann et al. [8], who introduced dedicated temporal cross-attention layers, used interleaved with the spatial cross-attention layers of the standard 2D UNet [16, 36]. While probing our text-to-video model from Section 4.2, we noticed that it is possible to independently prompt the model spatially and temporally by using different text prompts as inputs for the spatial and temporal cross-attention conditionings, see Figure 21. To achieve this, we use a dedicated spatial prompt to describe the general content of the scene to be depicted, while the motion of that scene is fed to the model via a separate temporal prompt, which is the input to the temporal cross-attention layers. We provide an example
³ https://www.dropbox.com/s/ge9e5ujwgetktms/i3d_torchscript.pt with keyword arguments rescale=True, resize=True, return_features=True.
Figure 18. Additional Text-to-Video samples. Captions from top to bottom: “A hiker is reaching the summit of a mountain, taking in
the breathtaking panoramic view of nature.”, “A unicorn in a magical grove, extremely detailed.”, “Shoveling snow”, “A beautiful fluffy
domestic hen sitting on white eggs in a brown nest, eggs are under the hen.”, and “A boat sailing leisurely along the Seine River with the
Eiffel Tower in background by Vincent van Gogh”.

of these first experiments, indicating this implicit disentanglement of motion and content, in Figure 21, where we show that varying the temporal prompt while fixing the random seed and the spatial prompt leads to spatially similar scenes whose global motion follows the temporal prompt.

E.5.5 Additional Samples on Multi-View Synthesis


In Figures 22 to 24 we show additional visual examples for SVD-MV, trained on MVImageNet.

Figure 19. Additional Image-to-Video samples. The leftmost frame is used for conditioning.

Figure 20. Additional Image-to-Video samples with camera motion LoRAs (conditioned on the leftmost frame). The first, second, and third rows correspond to horizontal, static, and zooming camera motion, respectively.

Figure 21. Text-to-video samples using the prompt “Flowers in a pot in front of a mountainside” (for spatial cross-attention). We adjust
the camera control by replacing the prompt in the temporal attention using “”, “panning”, “rotating”, and “zooming” (from top to bottom).
While not being trained for this inference task, the model performs surprisingly well.

Figure 22. Additional multi-view generation samples from GSO test dataset, and comparison with other methods.

Figure 23. Additional multi-view generation samples from the GSO test dataset.

Figure 24. Additional multi-view generation samples from the MVI dataset, and comparison with other methods. The top row shows ground-truth frames, the second row shows sample frames from SVD-MV (ours), the third row from SD2.1-MV, and the bottom row from Scratch-MV.
