
ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

Zhouyong Liu   Shun Luo   Wubin Li   Jingben Lu   Yufan Wu   Shilei Sun   Chunguo Li   Luxi Yang*
School of Information Science and Engineering, Southeast University
* Corresponding author

arXiv:2011.10185v2 [cs.CV] 1 Jun 2021










[Figure 1: extrapolation results, columns DVF, MCNet, Ours, Ground truth, reported with PSNR/SSIM.]
Figure 1. Example of video frame extrapolation. Top: the extrapolated result; middle: zoomed local details; bottom: the occlusion map depicting the residuals between the extrapolated image and the ground truth. The color bar, with a gradient from blue to red, indicates the residual intensity in the range 0 to 255. The zoomed-in details indicate that ConvTransformer performs better at generating local high-frequency details, while the occlusion maps demonstrate that ConvTransformer is superior in accurate pixel intensity prediction.
Abstract

Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks. Although CNNs perform well whenever large labeled training samples are available, they perform poorly on video frame synthesis because objects deform and move, scene lighting changes, and cameras move within a video sequence. In this paper, we present a novel and general end-to-end architecture, called convolutional Transformer or ConvTransformer, for video frame sequence learning and video frame synthesis. The core ingredient of ConvTransformer is the proposed attention layer, i.e., the multi-head convolutional self-attention layer, which learns the sequential dependence of a video sequence. ConvTransformer uses an encoder, built upon the multi-head convolutional self-attention layer, to encode the sequential dependence between the input frames, and then a decoder decodes the long-term dependence between the target synthesized frames and the input frames. Experiments on the video future frame extrapolation task show ConvTransformer to be superior in quality while being more parallelizable than recent approaches built upon convolutional LSTM (ConvLSTM). To the best of our knowledge, this is the first time that a ConvTransformer architecture has been proposed and applied to video frame synthesis.

1. Introduction

Video frame synthesis, comprising the two subtasks of interpolation and extrapolation, is one of the classical and fundamental problems in the video processing and computer vision community. Abrupt motion artifacts and temporal aliasing in a video sequence can be suppressed with the help of video frame synthesis, and hence it can be applied to numerous applications ranging from motion deblurring [4], video frame rate up-sampling [6, 3], video editing [32, 51], and novel view synthesis [10] to autonomous driving [48].

Numerous solutions have been proposed for this problem over the past decades and have achieved substantial improvement. The traditional video frame synthesis pipeline usually involves two consecutive steps, i.e., optical flow estimation and optical-flow-based frame warping [45, 50].
The quality of the synthesized frames, however, is easily affected by the estimated optical flow. Recent advancements in deep neural networks have successfully improved a number of tasks, such as classification [16], segmentation [33] and visual tracking [18], and have also promoted the development of video frame interpolation and extrapolation.

Long et al. treated video frame interpolation as an image generation task, and trained a generic convolutional neural network (CNN) to directly interpolate the in-between frame [22]. However, because a generic CNN has difficulty capturing the multi-modal distribution of video frames, severe blurriness exists in their interpolated results. Lately, Niklaus et al. considered frame interpolation as a local convolution problem, and proposed an adaptive convolutional operation [27] and a separable convolutional process [28] for each pixel in the target interpolated frame. However, these kernel-based algorithms typically suffer from heavy computation. Instead of relying only on CNNs, optical-flow-embedded neural networks have also been investigated and applied to interpolating middle frames. The deep voxel flow (DVF) method, for instance, implicitly incorporates 3D optical flow across time and space to synthesize middle frames. Besides, Bao et al. proposed an algorithm that explicitly embeds optical flow information, namely DAIN [1], in which the depth map [7, 19] and optical flow [19] are explicitly generated by pretrained sub-networks [19, 7] and used to guide the contextual feature capturing pipeline. Although these methods can interpolate perceptually pleasing frames using the accurately estimated flow generated by the pretrained optical flow estimation sub-networks [19, 7], the optical flow estimation networks are sensitive to the manually annotated dataset. Besides, these interpolation methods are mainly developed on two consecutive frames, so the high-order motion information of the video frame sequence is ignored and not well exploited. Furthermore, these specially designed interpolation methods cannot be applied to the other video frame synthesis task, i.e., video frame extrapolation, because the latter frame used to estimate flow is not available in the extrapolation task.

Unlike the video frame interpolation task, video frame extrapolation can be treated as a conditional prediction problem, that is, the future frames are predicted from the previous video frame sequence in a recurrent model. As a pioneer of recurrent-based extrapolation methods, Shi et al. proposed a convolutional LSTM (ConvLSTM) [34] to recurrently generate future frames. However, due to the capability limitation of this simple and generic recurrent architecture, the predicted frames usually suffer from blurriness. With the advances in research, several ConvLSTM variations [40, 24, 12] were proposed to improve the performance. Specifically, Villegas et al. investigated a motion and content decomposition LSTM model, i.e., MCNet [40], to predict future frames. PredNet [24] uses top-down context to guide the training process. Besides, the inception operation has been introduced into LSTM to obtain a broader view for synthesis [12]. While the extrapolated frames look somewhat better, accurate pixel estimation is still challenging for multi-modal natural scenes. Additionally, the recurrent models of ConvLSTM and its variations are hard to train, and meanwhile suffer from a heavy computational burden when the sequence is long.

Aside from these recurrent models, plain-CNN-architecture-based methods have also been investigated to generate future frames. Liu et al. developed a deep voxel flow (DVF) incorporated neural network to generate future frames [21], in which the 3D optical flow across time and space is implicitly estimated. However, due to the lack of multi-frame joint training, DVF [21] has a limitation in long-term multiple frame prediction. Different from these implicit motion estimation methods, Wu et al. proposed an explicit motion detection and semantic estimation method to extrapolate future frames [46]. Although it performs well in situations where there are explicit foreground motions, it suffers from a restriction in natural video scenes, where the distinction between the foreground motion target and the background is not clear.

Although these unidirectional prediction methods, i.e., predicting along the frame sequence, can be applied to the video frame interpolation task, they suffer from heavy performance degradation compared with state-of-the-art interpolation algorithms. This is because the bi-directional information provided by the latter frames cannot be exploited to guide the middle frame generation pipeline in these recurrent and forward CNN-based extrapolation models.

In order to bridge this gap, we propose, in this work, a general video frame synthesis architecture, named convolutional Transformer (ConvTransformer), which unifies and simplifies the video frame extrapolation and video frame interpolation pipelines as an end-to-end encoder-decoder problem. A multi-head convolutional self-attention is proposed to model the long-range dependence between the frames in a video sequence. Compared with previously elaborately designed interpolation methods, ConvTransformer is a simple but efficient architecture for extracting the high-order motion information present in a video frame sequence. Besides, in comparison with previous ConvLSTM-based recurrent models [34, 40, 24, 12], ConvTransformer can be implemented in parallel in both the training and testing stages.

We evaluate ConvTransformer on several benchmarks, i.e., UCF101 [35], Vimeo90K [49], Sintel [13], REDS [26], HMDB [17] and Adobe240fps [37]. Experimental results demonstrate that ConvTransformer performs better in extrapolating future frames when compared with previous ConvLSTM-based recurrent models, and achieves competitive results against state-of-the-art elaborately designed interpolation algorithms.
The main contributions of this paper are therefore as follows.

• A novel architecture named ConvTransformer is proposed for video frame synthesis.

• A novel multi-head convolutional self-attention is proposed for modeling the long-range spatial and temporal dependence in video frame sequences.

• The effectiveness and superiority of the proposed ConvTransformer are comprehensively analyzed in this paper.

2. Related Work

2.1. Video Frame Synthesis

Video frame synthesis, including the two subtasks of interpolation and extrapolation, is a hot research topic and has been extensively investigated in the literature [23, 21, 42, 27, 28, 39, 40, 41]. In the following, we give a detailed discussion of video frame interpolation and extrapolation, respectively.

Video frame interpolation. To interpolate the middle frames between two adjacent frames, traditional approaches mainly consist of two steps: motion estimation and pixel synthesis. With accurately estimated motion vectors or optical flow, the target frames can be interpolated via conventional warping methods. However, accurate motion estimation remains a challenging problem because motion blur, occlusion and brightness changes frequently appear in natural images.

Recent years have witnessed significant advances in deep neural networks, which have shown great power in solving a variety of learning problems [16, 33, 18, 32, 51] and have attracted much attention for the video frame interpolation task. Long et al. developed a general CNN to directly synthesize the middle frames [23]. However, a generic CNN that only stacks several convolutional layers is limited in capturing the dynamic motion of natural scenes, and thereby their method usually yields severe visual artifacts, e.g., motion blur and ringing. Lately, Niklaus et al. treated the frame interpolation problem as a local convolutional kernel prediction problem, and proposed an adaptive convolution operation [27] and a separable convolution process [28] for each pixel in the frame. To some extent, these methods perform better than the previous general-CNN-based method. But even so, a high memory footprint is a typical characteristic of these kernel-based algorithms, and thereby restricts their application. Unlike these CNN- and kernel-based algorithms, a pixel-phase information learning network, PhaseNet [25], was proposed for frame interpolation. To a certain extent, it suppresses artifacts compared with previous methods. Its performance, however, is easily affected by complicated factors, e.g., large motion and disparity. With the rapid development of deep neural networks for optical flow estimation [19] and depth map generation [7, 19], several implicitly or explicitly flow- and depth-guided neural networks [14, 2, 29, 1] have been investigated for interpolating middle frames. The typical pipeline of these algorithms first estimates optical flow or depth information with pre-trained sub-networks [19, 7], then introduces a warping layer to adaptively warp the input frames, and finally builds an information fusion layer to generate the target middle frames. These methods work well at anticipating occlusion and artifacts, and thus interpolate sharp images. However, these methods are restricted by the pre-trained optical flow sub-network PWC-Net [19] and depth map estimation sub-network [7, 19], which can be easily affected by the training set. Last but not least, the architectures of these methods are all specially and elaborately designed, and hence their generalization to the other video frame synthesis task, i.e., video frame extrapolation, is limited.

Video frame extrapolation. Extrapolating future video frames still remains a challenging task because of the multi-modal nature of videos and unexpected incidents in the future. Traditional solutions treat this problem as a recurrent prediction problem [34, 40, 24]. Shi et al. proposed a convolutional LSTM (ConvLSTM) architecture [34], a pioneering recurrent model, to generate future frames. Given the flexibility of natural scenes and the limited representation ability of this simple architecture, the predicted frames naturally suffer from blurriness. Lately, several improved algorithms based on the ConvLSTM architecture were proposed and, to some extent, achieved better performance. Specifically, Villegas et al. proposed a decomposition LSTM model, namely MCNet [40], in which motion and content are decomposed. In order to utilize contextual information, Lotter et al. proposed a top-down context guided LSTM model, PredNet [24]. Besides, the inception mechanism has been introduced into LSTM to obtain a broader view for synthesis [12]. Aside from these LSTM-based recurrent models, plain-CNN-architecture-based models have also been investigated to generate future frames. Liu et al. proposed a 3D-optical-flow-guided neural network, dubbed DVF [21], to extrapolate frames. However, due to the limitation in multi-frame joint training, DVF [21] cannot work well on long-term multiple frame prediction. Apart from these implicit motion estimation methods, Wu et al. proposed an explicit motion detection and semantic estimation method to extrapolate future frames [46]. Although it works well when there are explicit foreground motions in the frame sequence, e.g., a car or a runner, it suffers from a restriction in natural video scenes, where the distinction between the foreground and background is not clear.
[Figure 2: architecture diagram — input frames pass through a shared feature embedding (Conv + LReLU stack) and positional encoding, then N× encoding layers (multi-head convolutional self-attention, feed-forward, Add & Norm), N× decoding layers (query self-attention, multi-head convolutional attention, Add & Norm, feed-forward) driven by middle-frame queries, and finally the shared SFFN produces the generated middle frames.]
Figure 2. An overview of ConvTransformer $G_{\theta_G}$. ConvTransformer $G_{\theta_G}$ consists of four parts: feature embedding, encoder, decoder and prediction. In the feature embedding part, a shared convolutional neural network transforms the input video frames in RGB format into a compact feature map space. In the following encoder step, several stacked encoding layers incorporating the multi-head convolutional self-attention module are used to extract the long-term dependence among the input video sequence. Thirdly, the encoded video feature map sequence and the target query frames are passed to the decoder layers; the sequential dependence between the target frames and the input video sequence is decoded in this step. Finally, the target frame sequence in RGB format is generated in the prediction step with the shared network SFFN.

Although recent years have witnessed great progress in the two sub-tasks of video interpolation and extrapolation, there is still a lack of a general and unified video frame synthesis architecture that performs well on both of them. In order to overcome this issue, we propose a unified architecture, named convolutional Transformer (ConvTransformer), for video frame synthesis. A simple but efficient multi-head convolutional self-attention architecture is proposed to model the long-range dependence between the frames of a video sequence. Experimental results on several benchmarks demonstrate that the proposed ConvTransformer works well in both video frame interpolation and extrapolation. To the best of our knowledge, this is the first time that a ConvTransformer architecture has been proposed and successfully applied to video frame synthesis.

2.2. Transformer Network

Transformer [38] is a novel architecture for learning long-range sequential dependence, which abandons the traditional building style of directly using RNN or LSTM architectures. It has been successfully applied to numerous natural language processing (NLP) tasks, such as machine translation [15] and speech processing [8, 43]. Recently, the basic Transformer architecture has been successfully applied to the fields of image generation [30], image recognition [9], and object detection [5]. Specifically, Carion et al. proposed the DETR [5] object detection method, which achieves competitive results on the COCO dataset compared with Faster R-CNN [31]. By collapsing the two spatial dimensions into one dimension, DETR reasons about the relations of pixels and the global image context. Although DETR has successfully applied the Transformer to the computer vision task of object detection, it is hard to use the basic Transformer to model the long-term dependence among two-dimensional video frames, which are not only temporally related, but also spatially related. In order to overcome this issue, a convolutional Transformer (ConvTransformer) is proposed in this work and successfully applied to video frame synthesis, including the two subtasks of interpolation and extrapolation.

3. Convolutional Transformer Architecture

The overall architecture of ConvTransformer $G_{\theta_G}$, as shown in Figure 2, consists of five main components, that is, the feature embedding module $F_{\theta_F}$, the positional encoding module $P_{\theta_P}$, the encoder module $E_{\theta_E}$, the decoder module $D_{\theta_D}$, and the synthesis feed-forward network $S_{\theta_S}$. In this section, we first provide an overview of the video frame synthesis pipeline realised by the ConvTransformer architecture, and then give an illustrated introduction to the proposed ConvTransformer. Finally, the implementation details and the training loss are introduced.
3.1. Algorithm Review

Given a video frame sequence $\tilde{X} = \{\tilde{X}_0, \tilde{X}_1, \cdots, \tilde{X}_n\} \in \mathbb{R}^{H \times W \times C}$, where $n$ is the length of the sequence and $H$, $W$ and $C$ denote height, width, and the number of channels, respectively, our goal is to synthesize intermediate frames $\hat{X} = \{\hat{X}_{i+t_0}, \hat{X}_{i+t_1}, \cdots, \hat{X}_{i+t_k}\}$ at times $t_m \in [0, 1]$, or to extrapolate future frames $\hat{X} = \{\hat{X}_{n+1}, \hat{X}_{n+2}, \cdots, \hat{X}_{n+m_k}\}$ at orders $m_k \in \mathbb{N}$. Specifically, the future frame forecasting problem can be viewed as:

$$\hat{X}_{n+1}, \cdots, \hat{X}_{n+m_k} = \underset{X_{n+1}, \cdots, X_{n+m_k}}{\arg\max}\; P\big(X_{n+1}, \cdots, X_{n+m_k} \mid \tilde{X}\big) \quad (1)$$

Here, $P(\cdot \mid \cdot)$ stands for conditional probability.

Firstly, the Feature Embedding module embeds the input video frames and generates representative feature maps. Subsequently, the extracted feature maps of each frame are added to the positional maps, which serve as positional identities. Next, the positioned frame feature maps are passed as inputs to the Encoder to exploit the long-range sequential dependence among the frames of the video sequence. After getting the encoded high-level feature maps, these feature maps and the positioned frame queries are simultaneously passed into the Decoder, which decodes the sequential dependence between the query frames and the input video frames. Finally, the decoded feature maps are fed into the Synthesis Feed-Forward Network (SFFN) to generate the final interpolated middle frames or extrapolated frames.
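For illustration, the data flow just described can be sketched in PyTorch as below. Every module here is a deliberately tiny stand-in (single convolutions, a crude scalar positional term) chosen only so that the snippet runs end to end; the real components are the ones defined in Sections 3.2-3.6 and the appendix, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 8
embed = nn.Sequential(nn.Conv2d(3, d_model, 3, padding=1), nn.LeakyReLU(0.2))    # stand-in for F, Eq. (2)
encoder = nn.Sequential(nn.Conv2d(d_model, d_model, 3, padding=1), nn.LeakyReLU(0.2))  # stand-in for E, Eq. (6)
decoder = nn.Sequential(nn.Conv2d(d_model, d_model, 3, padding=1), nn.LeakyReLU(0.2))  # stand-in for D, Eq. (7)
sffn = nn.Conv2d(d_model, 3, 3, padding=1)                                        # stand-in for S, Eq. (12)

frames = torch.rand(4, 3, 32, 32)                     # input sequence of n = 4 RGB frames
J = embed(frames)                                     # shared embedding applied to every frame, Eq. (2)
pos = torch.sin(torch.arange(1, 5).float()).view(4, 1, 1, 1)   # crude stand-in for the PosMap of Eqs. (3)-(4)
Z = J + pos                                           # Eq. (5): add positional identity
Z_hat = encoder(Z)                                    # Eq. (6): encode sequential dependence
query = J[-1:]                                        # extrapolation query: last frame's features (Sec. 3.7)
Q_hat = decoder(query)                                # Eq. (7); the real decoder also attends to Z_hat
x_hat = sffn(Q_hat)                                   # Eq. (12): one synthesized frame, shape (1, 3, 32, 32)
```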
3.2. Feature Embedding $F_{\theta_F}$

In order to extract a compact feature representation for subsequent effective learning, a representative feature is computed by a 4-layer convolution with Leaky ReLU activation and hidden dimension $d_{model}$. Given a video frame $X_i \in \mathbb{R}^{H \times W \times 3}$, the embedded feature maps $J_i \in \mathbb{R}^{H \times W \times d_{model}}$ are obtained as:

$$J_i = F_{\theta_F}(X_i), \quad i \in [1, n] \quad (2)$$

It is worth mentioning that all input video frames share not only the same embedding net architecture $F_{\theta_F}$, but also the parameters $\theta_F$.

3.3. Positional Encoding $P_{\theta_P}$

Since our model contains no recurrence across the video frame sequence, some information about the relative or absolute position of the video frames must be injected into the frames' feature maps, so that the order information of the video frame sequence can be utilized. To this end, "positional encodings" are added at each layer in the encoder and decoder. Note that the positional encoding in ConvTransformer is a 3D tensor, which differs from that in the original Transformer architecture built for vector sequences. The positional encoding has the same dimensions as the frame feature maps, so that the two can be summed directly. In this work, we use sine and cosine functions of different frequencies to encode the position of each frame in the video sequence:

$$\mathrm{PosMap}_{(p,(i,j,2k))} = \sin\!\big(p / 10000^{2k/d_{model}}\big) \quad (3)$$

$$\mathrm{PosMap}_{(p,(i,j,2k+1))} = \cos\!\big(p / 10000^{2k/d_{model}}\big) \quad (4)$$

where $p \in [1, n]$ is the positional token, $(i, j)$ represents the spatial location of the features, and the channel dimension is indexed by $2k$. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We choose this function because it allows the model to easily learn to attend to relative positions: for any fixed offset $m$, $\mathrm{PosMap}_{(p+m)}$ can be represented as a linear function of $\mathrm{PosMap}_{(p)}$.

Given the embedded feature maps $J_i$, the positioned embedding process can be viewed as the following equation:

$$Z_i = J_i \oplus \mathrm{PosMap}_{(i)}, \quad i \in [1, n] \quad (5)$$

where the $\oplus$ operation represents element-wise tensor addition.
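A sketch of how the 3D positional map of Eqs. (3)-(5) can be built in PyTorch is given below. It assumes an even $d_{model}$; the function name and the (frame, channel, height, width) tensor layout are illustrative choices, not a prescribed interface.

```python
import math
import torch

def positional_map(n, d_model, H, W):
    """3D positional encoding of Eqs. (3)-(4): one (d_model, H, W) map per frame position
    p in [1, n]; every spatial location (i, j) carries the same sinusoidal channel code,
    so the map can be added directly to the frame feature maps as in Eq. (5)."""
    pos = torch.arange(1, n + 1, dtype=torch.float32).unsqueeze(1)            # p = 1..n, shape (n, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                         # 1 / 10000^(2k / d_model)
    code = torch.zeros(n, d_model)
    code[:, 0::2] = torch.sin(pos * div)                                      # even channels, Eq. (3)
    code[:, 1::2] = torch.cos(pos * div)                                      # odd channels,  Eq. (4)
    return code.view(n, d_model, 1, 1).expand(n, d_model, H, W)               # repeat over H x W

# usage: Z = J + positional_map(J.size(0), J.size(1), J.size(2), J.size(3))   # Eq. (5)
```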
3.4. Encoder $E_{\theta_E}$ and Decoder $D_{\theta_D}$

Encoder: As shown in Figure 2, the encoder is modeled as a stack of $N$ identical layers consisting of two sub-layers, i.e., a multi-head convolutional self-attention layer and a simple 2D convolutional feed-forward network. A residual connection is adopted around each of the two sub-layers, followed by group normalization [47]. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of the same dimension $d_{model}$. Given a positioned feature sequence $Z = \{Z_0, Z_1, \cdots, Z_i, \cdots, Z_{n-1}, Z_n\} \in \mathbb{R}^{H \times W \times d_{model}}$, the encoded feature sequence $\hat{Z} = \{\hat{Z}_0, \hat{Z}_1, \cdots, \hat{Z}_n\} \in \mathbb{R}^{H \times W \times d_{model}}$ is learned, and the encoding operation can be represented as:

$$\hat{Z} = E_{\theta_E}(Z) \quad (6)$$

Decoder: The decoder is also composed of a stack of $N$ identical layers, each of which consists of three sub-layers. In addition to the two sub-layers implemented in the Encoder, an additional layer called query self-attention is inserted to perform convolutional self-attention over the output frame queries. Given a query sequence $Q = \{Q_0, Q_1, \cdots, Q_n\} \in \mathbb{R}^{H \times W \times d}$, the decoding process can be conducted as:

$$\hat{Q} = D_{\theta_D}(\hat{Z}, Q) \quad (7)$$

It should be emphasized that the encoding and decoding processes are all conducted in parallel.
[Figure 3: left, the Convolutional Self-Attention block — shared CNNs produce query, key and value maps for every frame; query/key pairs are concatenated and passed through a shared CNN to form attention maps, which are combined with the value maps by element-wise production and summation. Right, the Multi-Head Attention block — several such heads run in parallel, are concatenated, and pass through a linear layer.]
Figure 3. (left) Convolutional Self-Attention. (right) Multi-Head Attention in parallel.
3.5. Multi-Head Convolutional Self-Attention

We call our particular attention "Convolutional Self-Attention" (as shown in Figure 3), which is computed upon feature maps. The convolutional self-attention operation can be described as mapping a query map and a set of key-value map pairs to an output, where the query map, key maps, value maps, and output are all 3D tensors. Given an input comprised of sequential feature maps $U = \{U_0, U_1, \cdots, U_n\} \in \mathbb{R}^{H \times W \times d_{model}}$, we apply a convolutional sub-network to generate the query map and the paired key-value maps of each frame, i.e., $U' = \{\{Q_0, K_0, V_0\}, \{Q_1, K_1, V_1\}, \cdots, \{Q_n, K_n, V_n\}\} \in \mathbb{R}^{H \times W \times d_{model}}$.

Given the set $\{Q_i, K_i, V_i\}$ of frame $U_i$, the attention map $H_{(i,j)} \in \mathbb{R}^{H \times W \times 1}$ between frames $U_i$ and $U_j$ can be generated by applying a compatibility sub-network $M_{\theta_M}$ to the query map $Q_i$ and the corresponding key map $K_j$, which can be represented as the following equation:

$$H_{(i,j)} = M_{\theta_M}(Q_i, K_j) \quad (8)$$

After getting all the corresponding attention maps $H_{(i)} = \{H_{(i,1)}, H_{(i,2)}, \cdots, H_{(i,n)}\} \in \mathbb{R}^{H \times W \times 1}$, we concatenate $H_{(i)}$ along the third dimension, and a SoftMax operation is then applied to $H_{(i)} \in \mathbb{R}^{H \times W \times n}$ along the dimension $\dim = 3$:

$$H_{(i)} = \mathrm{SoftMax}(H_{(i)})_{\dim}, \quad \dim = 3 \quad (9)$$

Finally, the output $\hat{V}_i$ is obtained by summing the element-wise products of the attention maps $H_{(i,j)}$ and the corresponding value maps $V_j$. This operation can be represented as:

$$\hat{V}_{(i)} = \sum_{j=1}^{n} H_{(i,j)} \odot V_j \quad (10)$$

In order to jointly attend to information from different representation subspaces at different feature spaces, a multi-head pipeline is adopted. The process can be viewed as:

$$\mathrm{MultiHead}(\hat{V}_i) = \mathrm{Concat}(\hat{V}_i^1, \cdots, \hat{V}_i^h) \quad (11)$$
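The per-frame computation of Eqs. (8)-(10) can be sketched as follows for a single head. The depths and kernel sizes of the Q/K/V and compatibility sub-networks are illustrative assumptions; the multi-head case of Eq. (11) would run $h$ such modules on channel groups and concatenate their outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSelfAttention(nn.Module):
    """Single-head sketch of the convolutional self-attention of Eqs. (8)-(10)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Conv2d(d_model, 3 * d_model, 3, padding=1)      # shared per-frame Q/K/V maps
        self.compat = nn.Sequential(                                   # compatibility sub-network M, Eq. (8)
            nn.Conv2d(2 * d_model, d_model, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(d_model, 1, 3, padding=1))

    def forward(self, U):                      # U: (n, d_model, H, W) sequential feature maps
        n = U.size(0)
        Q, K, V = self.qkv(U).chunk(3, dim=1)  # each (n, d_model, H, W)
        outputs = []
        for i in range(n):
            # attention map H_(i,j) for every frame j, Eq. (8)
            scores = torch.cat([self.compat(torch.cat([Q[i:i+1], K[j:j+1]], dim=1))
                                for j in range(n)], dim=0)             # (n, 1, H, W)
            attn = F.softmax(scores, dim=0)                            # normalize over frames, Eq. (9)
            outputs.append((attn * V).sum(dim=0, keepdim=True))        # element-wise product + sum, Eq. (10)
        return torch.cat(outputs, dim=0)                               # (n, d_model, H, W)
```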
3.6. Synthesis Feed-Forward Network $S_{\theta_S}$

In order to synthesize the final photo-realistic video frames, we construct a frame synthesis feed-forward network, which consists of 2 cascaded sub-networks built upon a U-Net-like structure. The frame states $\hat{Q}$ decoded by the previous decoder are fed into the SFFN in parallel. This process can be represented as:

$$\hat{X}_i = S_{\theta_S}(\hat{Q}_i), \quad i \in [1, N'] \quad (12)$$

3.7. Initialization of the Query Set Q

As an indispensable part of the Decoder, the query set $Q$ is critical for accurate extrapolation and interpolation. Specifically, given 4 input frames $X = \{X_1, X_2, X_3, X_4\}$, ConvTransformer extrapolates 3 frames $\hat{X} = \{\hat{X}_1, \hat{X}_2, \hat{X}_3\}$. The query $Q_i$ equals the embedded feature maps of the last input frame, i.e., $J_4$. On the other hand, given a 6-frame sequence $X = \{X_{t-3}, X_{t-2}, X_{t-1}, X_{t+1}, X_{t+2}, X_{t+3}\}$, ConvTransformer interpolates 3 frames between frames $X_{t-1}$ and $X_{t+1}$, i.e., $\hat{X} = \{\hat{X}_{t-0.5}, \hat{X}_t, \hat{X}_{t+0.5}\}$. The query $Q_i$ is calculated as the element-wise average of the two adjacent frames' feature maps, i.e., $J_{t-1}$ and $J_{t+1}$. Specifically, $Q_i = J_{t-1} \,\Delta\, J_{t+1}$, where the operation $\Delta$ denotes element-wise averaging.
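The two initializations can be written directly on the embedded feature maps, as sketched below. The tensor shapes and the repetition of the same query across the three target positions are illustrative assumptions (in Figure 2 the queries are additionally tagged with middle-frame positional encodings before entering the decoder).

```python
import torch

# Extrapolation: with 4 embedded input frames J (4, d_model, H, W),
# every future-frame query reuses the feature maps of the last observed frame, J_4.
J = torch.rand(4, 64, 32, 32)
queries_extra = J[-1].unsqueeze(0).repeat(3, 1, 1, 1)                 # 3 queries, all equal to J_4

# Interpolation: with a 6-frame sequence, the query is the element-wise average
# (the Delta operation in the text) of the frames adjacent to the gap, J_{t-1} and J_{t+1}.
J6 = torch.rand(6, 64, 32, 32)
J_prev, J_next = J6[2], J6[3]                                         # J_{t-1}, J_{t+1}
queries_inter = (0.5 * (J_prev + J_next)).unsqueeze(0).repeat(3, 1, 1, 1)   # queries for t-0.5, t, t+0.5
```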
Table 1. Details of the training, validation and test sets.

  Split      | Datasets
  Training   | Vimeo90K (64612 sequences), Adobe240fps (2120 sequences)
  Validation | Vimeo90K (20 sequences), Adobe240fps (120 sequences)
  Test       | Vimeo90K (935 sequences), UCF101 (2533 sequences), Adobe240fps (2660 sequences), Sintel (1581 sequences), HMDB (2684 sequences)

3.8. Training Loss $\mathcal{L}_{G_{\theta_G}}$

In this work, we choose the most widely used content loss, i.e., the pixel-wise MSE loss, to guide the optimization of ConvTransformer. The loss is calculated as:

$$\mathcal{L}_{G_{\theta_G}} = \frac{1}{N'} \sum_{i=1}^{N'} \big\| \hat{X}_i - Y_i \big\|_2 \quad (13)$$

Here, $N'$ stands for the number of synthesized results, $\hat{X}_i$ represents the synthesized target frame, and $Y_i$ is the corresponding ground truth.
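A direct reading of Eq. (13) is sketched below. Note that the text calls this a pixel-wise MSE loss while the printed formula uses an ℓ2 norm; the squared variant would correspond to torch.nn.MSELoss.

```python
import torch

def sequence_content_loss(x_hat, y):
    """Eq. (13): average per-frame l2 distance between the N' synthesized frames x_hat
    and their ground truths y, both of shape (N', 3, H, W)."""
    n = x_hat.size(0)
    return sum(torch.norm(x_hat[i] - y[i], p=2) for i in range(n)) / n
```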
terpolation. Besides, the visual comparisons of synthesized
4. Experiments and Analysis images with zoomed details and residual occlusion maps are
illustrated in Figure 1 and Figure 4, respectively.
In this section, we provide the details for experiments, As observed in Table 2, the proposed ConvTransformer
and results that demonstrate the performance and efficiency has given rise to better performance than DVF [21] and
of ConvTransformer, and compare it with previous pro- MCNet [40]. More concretely, taking the next frame ex-
posed specialized video frame extrapolation methods, elab- trapolation as an example, the relative performance gains of
orately designed video frame interpolation algorithms and ConvTransformer over the DVF and MCNet [40] models,
general video frame synthesis solutions on several bench- in terms of index PSNR, are 2.7140dB and 1.8983dB on
marks. Besides, to further validate the proposed Con- Vimeo 90k [49], 1.6819dB and 2.2137dB on Adeobe240fps
vTransformer, we conducted several ablation studies. [37], as well as 0.1321dB and 1.6734dB on UCF101 [36].
4.1. Datasets Besides, ConvTransformer, in terms of average comparison,
achieves 1.5094dB and 1.9285dB advantage on DVF [21]
To create the trainset of video frame sequence, we and MCNet [40] respectively. Additionally, the superior-
leverage the frame sequence from the Vimeo90K [49] and ity of ConvTransformer becomes larger in multiple future
Adobe240fps [37] dataset. On the other hand, we also frames extrapolation. Specifically, ConvTransformer gains
exploit several other widely used benchmarks, including an advantage of 2.22dB in PSNR criterion over the method
UCF101 [36], Sintel [13], REDS [26] and HMDB [17], for DVF [21], while it is 1.68dB in previous next frame extrap-
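The training schedule above can be expressed with standard PyTorch utilities, as sketched below; interpreting the "decay step of 20000" as multiplying the learning rate by 0.95 every 20000 iterations is an assumption, and `model` is any placeholder network standing in for ConvTransformer.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)                    # placeholder for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # initial learning rate 1e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000, gamma=0.95)

for iteration in range(600000):                                # 6 x 10^5 iterations in total
    optimizer.zero_grad()
    # loss = sequence_content_loss(model(batch), target)       # Eq. (13); data loading omitted
    # loss.backward()
    optimizer.step()
    scheduler.step()                                           # exponential decay of the learning rate
```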
4.3. Comparison with the State of the Art

In order to evaluate the performance of the proposed ConvTransformer, we compare our finally trained ConvTransformer on several public benchmarks with the state-of-the-art video frame synthesis method DVF [21], the representative ConvLSTM-based video frame extrapolation algorithm MCNet [40], and specially designed video frame interpolation solutions, i.e., SepConv [28], CyclicGen [20], DAIN [1] and BMBC [29]. For a fair comparison, we reimplemented and retrained these methods with the same training set used for our ConvTransformer. Two widely used image quality metrics, i.e., peak signal-to-noise ratio (PSNR) [11] and structural similarity (SSIM) [44], are adopted as the objective evaluation criteria. The quantitative results of video frame extrapolation are tabulated in Table 2, while Table 3 presents the quantitative comparison of video frame interpolation. Besides, visual comparisons of the synthesized images with zoomed details and residual occlusion maps are illustrated in Figure 1 and Figure 4, respectively.

As observed in Table 2, the proposed ConvTransformer gives rise to better performance than DVF [21] and MCNet [40]. More concretely, taking next frame extrapolation as an example, the relative performance gains of ConvTransformer over the DVF and MCNet [40] models, in terms of PSNR, are 2.7140 dB and 1.8983 dB on Vimeo90K [49], 1.6819 dB and 2.2137 dB on Adobe240fps [37], as well as 0.1321 dB and 1.6734 dB on UCF101 [36]. Besides, in terms of the average comparison, ConvTransformer achieves advantages of 1.5094 dB and 1.9285 dB over DVF [21] and MCNet [40], respectively. Additionally, the superiority of ConvTransformer becomes larger in multiple future frame extrapolation. Specifically, ConvTransformer gains an advantage of 2.22 dB in PSNR over DVF [21], whereas the gain is 1.68 dB for next frame extrapolation on the same benchmark, Adobe240fps [37].

Figure 1 visualizes the qualitative comparisons. As observed from the zoomed-in regions in Figure 1, the proposed ConvTransformer extrapolates more photo-realistic frames, while the predicted frames generated by the previous methods DVF [21] and MCNet [40] suffer from image degradation, such as image distortion and local over-smoothing. Besides, the residual occlusion maps suggest that ConvTransformer has superiority in accurate pixel value estimation, as compared with DVF [21] and MCNet [40].
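For reference, the PSNR values reported throughout can be computed as below for images in the range [0, 255]; SSIM follows [44] and is usually taken from an existing implementation.

```python
import torch

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE), assuming both images
    share the same intensity range [0, max_val]."""
    mse = torch.mean((pred.float() - target.float()) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```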
[Figure 4: interpolation results, columns SepConv, DAIN, CyclicGen, CyclicGen-large, BMBC, DVF, Ours, Ground truth, reported with PSNR/SSIM.]
Figure 4. Visual comparisons of ConvTransformer with other state-of-the-art video frame interpolation methods: SepConv [28], DAIN [1], CyclicGen [20], CyclicGen-large [20], BMBC [29] and DVF [21].

Table 2. Video frame extrapolation: quantitative evaluation of ConvTransformer against state-of-the-art methods (PSNR / SSIM).

  Next frame:
  Model      | UCF101 [36]      | Adobe240fps [37] | Vimeo90K [49]    | Average
  DVF [21]   | 29.1493 / 0.9181 | 28.7414 / 0.9254 | 27.8021 / 0.9073 | 28.5642 / 0.9169
  MCNet [40] | 27.6080 / 0.8504 | 28.2096 / 0.8796 | 28.6178 / 0.8726 | 28.1451 / 0.8675
  Ours       | 29.2814 / 0.9205 | 30.4233 / 0.9457 | 30.5161 / 0.9406 | 30.0736 / 0.9356

  Next 3 frames:
  Model      | UCF101 [36]      | Adobe240fps [37] | Vimeo90K [49]    | Average
  DVF [21]   | 26.1174 / 0.8779 | 25.8625 / 0.8598 | 24.5277 / 0.8432 | 25.5025 / 0.8603
  MCNet [40] | 25.0179 / 0.7766 | 24.9485 / 0.7721 | 25.4455 / 0.7671 | 25.1373 / 0.7719
  Ours       | 26.7584 / 0.8874 | 28.0835 / 0.9045 | 27.1441 / 0.8926 | 27.3286 / 0.8948

In a nutshell, the quantitative and qualitative results show that the proposed ConvTransformer, incorporating the multi-head convolutional self-attention mechanism, can efficiently model the long-range sequential dependence among video frames, and thus extrapolate high-quality future frames.

Beyond the video frame extrapolation comparison, we also make a video frame interpolation comparison with popular and state-of-the-art methods, including the general video frame synthesis method DVF [21] and five specialized interpolation solutions, namely SepConv [28], CyclicGen [20], CyclicGen-large [20], DAIN [1] and BMBC [29]. The performance indices of these methods are reported in Table 3. We can easily find that ConvTransformer attains better performance than the previous synthesis method DVF [21]. In terms of the PSNR and SSIM indices, the proposed ConvTransformer introduces relative performance gains of 1.36 dB and 0.0205 on average. On the other hand, compared with specially designed interpolation methods, ConvTransformer outperforms most solutions and achieves competitive results against the best algorithm, DAIN [1]. Concretely, considering the PSNR and SSIM criteria, ConvTransformer achieves 31.57 dB and 0.9151 on average, which is better than the popular methods SepConv-Lf [28], CyclicGen [20], CyclicGen-large [20] and BMBC [29]. Although ConvTransformer does not outperform the state-of-the-art interpolation method DAIN, the performance gap between them is not large. Furthermore, compared with the elaborately designed algorithm DAIN [1], ConvTransformer is a more general model, which not only works well on video frame interpolation, but also performs well on video frame extrapolation. On the contrary, DAIN [1] cannot be used for the video frame extrapolation task, because the latter frame, used for estimating the depth map and optical flow map, is not available in video frame extrapolation.

According to the visual and quantitative comparisons above, two dominant conclusions can be drawn. First, ConvTransformer is an efficient model for synthesizing photo-realistic extrapolated and interpolated frames. Second, in contrast to the specially designed extrapolation and interpolation methods, i.e., MCNet [40], SepConv [28], CyclicGen [20], CyclicGen-large [20], DAIN [1] and BMBC [29], ConvTransformer is a unified and general architecture, that is, it performs well in both subtasks.

4.4. Ablation Study

In order to evaluate and justify the efficiency and superiority of each part of the proposed ConvTransformer architecture, several ablation experiments have been conducted in this work. Specifically, we gradually modify the baseline ConvTransformer model and compare the differences.

4.4.1 Investigation of Positional Encoding and Residual Connection

We separately eliminate the positional encoding module, the residual connection, and both of them, and dub these three degraded networks ConvTransformer-wo-pos, ConvTransformer-wo-res and ConvTransformer-wo-res-wo-pos, in which the abbreviation wo represents without, pos indicates positional encoding and res is an abbreviation of residual connection. These three degraded models are trained with the same training set and implementation as applied to ConvTransformer. We tabulate their performance on the extrapolation task in terms of the objective quantitative indices PSNR and SSIM in Table 4, and show visual comparisons in Figure 5.
Table 3. Video frame interpolation: quantitative evaluation of ConvTransformer against state-of-the-art methods (PSNR / SSIM).

  Model                    | Sintel [13]    | UCF101 [36]    | Adobe240fps [37] | HMDB [17]      | Vimeo [49]     | REDS [26]      | Average
  Specialized:
  SepConv-Lf [28]          | 31.68 / 0.9470 | 32.51 / 0.9473 | 36.36 / 0.9844   | 33.90 / 0.9483 | 33.49 / 0.9663 | 21.32 / 0.6965 | 31.54 / 0.9149
  DAIN [1]                 | 31.37 / 0.9452 | 32.72 / 0.9506 | 36.15 / 0.9837   | 33.89 / 0.9487 | 33.95 / 0.9701 | 21.85 / 0.7203 | 32.65 / 0.9197
  CyclicGen [20]           | 31.54 / 0.9104 | 33.03 / 0.9303 | 35.60 / 0.9691   | 34.16 / 0.9242 | 33.04 / 0.9319 | 21.29 / 0.5686 | 31.44 / 0.8724
  CyclicGen-large [20]     | 31.19 / 0.9004 | 32.43 / 0.9239 | 34.89 / 0.9626   | 33.75 / 0.9208 | 32.04 / 0.9163 | 20.92 / 0.5465 | 30.87 / 0.8617
  BMBC [29]                | 27.01 / 0.9223 | 27.92 / 0.9412 | 28.58 / 0.9569   | 28.42 / 0.9384 | 30.61 / 0.9629 | 21.21 / 0.7056 | 27.29 / 0.9045
  General:
  DVF [21]                 | 30.71 / 0.9303 | 32.31 / 0.9454 | 33.24 / 0.9613   | 33.59 / 0.9469 | 30.99 / 0.9379 | 20.44 / 0.6460 | 30.21 / 0.8946
  Ours                     | 31.44 / 0.9469 | 32.48 / 0.9504 | 36.42 / 0.9844   | 33.37 / 0.9492 | 34.03 / 0.9637 | 21.68 / 0.6959 | 31.57 / 0.9151

[Figure 5: ablation panels with PSNR/SSIM of 24.83/0.71 (wo-pos-wo-res), 24.70/0.71 (wo-pos-w-res), 26.04/0.76 (w-pos-wo-res) and 26.11/0.76 (ConvTransformer), alongside the ground truth; occlusion color bar in the range 0 to 255.]
Figure 5. Effect of positional encoding and residual connection. The first three columns show the visual results of the ablation experiments. The second row shows the zoomed-in local details, while the third row shows the occlusion maps with a color bar displaying the pixel residual in the range 0 to 255. The zoomed-in local details indicate that the positional encoding module and the residual connection are helpful for synthesizing photo-realistic images.

Table 4. Comparative results achieved with the ablation of residual connection and positional encoding.

  Model                         | Residual Connection | Positional Encoding | PSNR    | SSIM
  ConvTransformer               | yes                 | yes                 | 29.8883 | 0.9383
  ConvTransformer-wo-res        | no                  | yes                 | 29.6912 | 0.9367
  ConvTransformer-wo-pos        | yes                 | no                  | 28.4882 | 0.9173
  ConvTransformer-wo-res-wo-pos | no                  | no                  | 28.4070 | 0.9104

Table 5. Comparative results achieved by ConvTransformer with different head numbers.

  Metric | ConvTransformer-H-1 | ConvTransformer-H-2 | ConvTransformer-H-3 | ConvTransformer-H-4
  PSNR   | 33.0054             | 33.7861             | 34.0274             | 34.2438
  SSIM   | 0.9641              | 0.9695              | 0.9714              | 0.9731

As summarized in Table 4, ConvTransformer attains the best performance and achieves relative performance gains of 0.1971 dB, 1.4001 dB and 1.4813 dB in comparison with the three degraded models ConvTransformer-wo-res, ConvTransformer-wo-pos and ConvTransformer-wo-res-wo-pos, respectively. Besides, the visual comparisons in Figure 5 confirm this quantitative analysis. As shown in Figure 5, the stairs in the zoomed-in area demonstrate that ConvTransformer with positional encoding and residual connection has an advantage in suppressing artifacts and preserving local high-frequency details.

We also visualize the convergence process of these three degraded networks in Figure 7. The convergence curves are consistent with the quantitative and qualitative comparisons above. As observed in Figure 7, we find that the performance of ConvTransformer is easily affected by the positional encoding module, and that the residual connection stabilizes the training process.

To sum up everything that has been stated so far, the positional encoding and residual connection are beneficial for the proposed ConvTransformer to perform well.

4.4.2 Investigation of the Number of Heads

The number of heads H is a hyperparameter that allows ConvTransformer to jointly attend to information from different representation subspaces at different positions. In order to justify the efficiency of the multi-head architecture, several multi-head variation experiments have been implemented in this work; the quantitative indices in terms of PSNR and SSIM, and visual examples, are illustrated in Table 5 and Figure 6, respectively.

As listed in Table 5, ConvTransformer-H-4, in terms of PSNR, attains a relative performance gain of 1.2438 dB in comparison with ConvTransformer-H-1. Besides, the visual comparisons in Figure 6 indicate that ConvTransformer-H-4 generates more photo-realistic images. From these quantitative and visual comparisons, we can ascertain that more heads help the ConvTransformer architecture incorporate information from different representation subspaces at different frame states.
[Figure 6: interpolation results from models with 1 head, 2 heads, 3 heads and 4 heads, alongside the ground truth, reported with PSNR/SSIM.]
Figure 6. Visual comparisons of ConvTransformer variants with different head numbers. The zoomed-in details are shown in the second row, and the third row illustrates the occlusion maps. The zoomed-in details demonstrate that ConvTransformer with more heads has an advantage in synthesizing details.

[Figure 7: plot of peak signal-to-noise ratio (dB, roughly 14-26) versus iteration count (0-400000) for the variants w_res/w_pos, w_res/wo_pos, wo_res/w_pos and wo_res/wo_pos.]
Figure 7. PSNR convergence curves of ConvTransformer with the ablation of residual connection and positional encoding. w_res/w_pos denotes ConvTransformer with both residual connection and positional encoding, w_res/wo_pos denotes ConvTransformer with the positional encoding ablated, and wo_res/w_pos indicates that the residual connection is ablated.

Figure 8. PSNR convergence curves of ConvTransformer under different head numbers, i.e., 1 head, 2 heads, 3 heads and 4 heads.

Table 6. Quantitative evaluation of ConvTransformer with varying numbers of layers in the Encoder and Decoder.

  Metric | ConvTransformer-L-1 | ConvTransformer-L-2 | ConvTransformer-L-3 | ConvTransformer-L-5
  PSNR   | 34.2438             | 34.4623             | 34.4989             | 34.6031
  SSIM   | 0.9731              | 0.9741              | 0.9744              | 0.9754

4.4.3 Investigation of the Number of Layers

The number of layers N is a hyperparameter that allows us to vary the capacity and computational cost of the encoder and decoder modules in ConvTransformer. To investigate the trade-off between performance and computational cost mediated by this hyperparameter, we conduct experiments with ConvTransformer for a range of different N values. Such modified networks are named ConvTransformer-L-1, ConvTransformer-L-2, and so on. These layer variation networks were fully trained, and their performance indices are reported in Table 6.

As tabulated in Table 6, we can easily find that ConvTransformer-L-5 acquires the best performance. In terms of the PSNR index, ConvTransformer-L-5 introduces a relative performance gain of 0.3593 dB in comparison with ConvTransformer-L-1. Consequently, to achieve excellent performance, it is desirable that the encoder and decoder contain more layers. However, more layers take much more time for the networks to converge, and consume more memory in training and testing. In summary, although more layers improve the representative ability of ConvTransformer, in practice we should set an appropriate value of N and take the training time and memory burden into consideration.
[Figure 9: a grid of attention maps, rows Head 0 to Head 3, columns X0 to X5.]
Figure 9. The attention maps $H_i$ for the decoding query $Q_i$ on the input sequence $\{X_0, X_1, ..., X_5\}$ under different heads. The color bar, with a gradient from yellow to blue, represents the attention value in the range 0 to 1. The comparisons along each vertical column demonstrate that each head is responsible for exploiting a different dependence. For example, the attention map $H_{i-(1,3)}$, which looks like a yellow map, indicates that frame $X_3$ contributes local high-frequency information to the target synthesized frame in Head 1. Besides, the attention map $H_{i-(2,2)}$, predominantly blue, indicates that $X_2$ supplies low-pass information for synthesizing the target frame in Head 2.

4.5. Long-Term Frame Sequence Dependence Analysis

In order to verify whether the proposed multi-head convolutional self-attention can efficiently capture the long-range sequential dependence within a video frame sequence, we visualize the attention maps of the decoder query $Q_i$ on the input sequence $X$ at decoder layer 1. As shown in Figure 9, the corresponding attention map $H_{i-(k,j)}$, normalized to the range [0, 1], represents the exploitation of input frame $X_j$ for synthesizing the target frame $\hat{Q}_i$ in head $k$. It should be emphasized that attention values close to 1 are drawn in blue, while yellow represents attention values close to 0.

By comparing vertically at different positions, such as $X_0$, $X_1$, $X_2$, $X_3$, $X_4$ and $X_5$, we find that different heads are responsible for incorporating different information for synthesizing the target frames. For instance, the attention map $H_{i-(1,0)}$, predominantly blue, indicates that the attention layer in head 1 at position 0 mainly responds to capturing the low-pass background dependence between frame $X_0$ and the target frame. Besides, along the lines of $H_{i-(1,0)}$, the attention map $H_{i-(1,2)}$, which looks like a yellow map, reveals that frame $X_2$ supplies local high-frequency information for synthesizing the target frame in head 1.

In summary, through the analysis above and the attention maps shown in Figure 9, a conclusion can be drawn that the proposed multi-head convolutional self-attention can efficiently model different long-term information dependencies, i.e., foreground information, background information and local high-frequency information.

5. Conclusion

In this work, we propose a novel video frame synthesis architecture, ConvTransformer, in which a multi-head convolutional self-attention is proposed to model the long-range spatial and temporal relations of frames in a video sequence. Extensive quantitative and qualitative evaluations demonstrate that ConvTransformer is a concise, compact and efficient model. The successful implementation of ConvTransformer sheds light on applying it to other video tasks that need to exploit the long-term sequential dependence in video frames.

References

[1] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3703-3712, 2019.
[2] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[3] W. Bao, X. Zhang, L. Chen, L. Ding, and Z. Gao. High-order model and dynamic filtering for frame rate up-conversion. IEEE Transactions on Image Processing, 27(8):3813-3826, 2018.
[4] T. Brooks and J. T. Barron. Learning to synthesize motion blur. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6833-6841, 2019.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[6] R. Castagno, P. Haavisto, and G. Ramponi. A method for motion adaptive frame rate up-conversion. IEEE Transactions on Circuits and Systems for Video Technology, 6(5):436-446, 1996.
[7] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, volume 29, 2016.
[8] Linhao Dong, Shuang Xu, and Bo Xu. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884-5888. IEEE, 2018.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5515-5524, 2016.
[11] Alain Horé and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 2010.
[12] M. Hosseini, A. S. Maida, M. Hosseini, and G. Raju. Inception-inspired LSTM for next-frame video prediction.
[13] Joel Janai, Fatma Guney, Jonas Wulff, Michael J. Black, and Andreas Geiger. Slow Flow: Exploiting high-speed cameras for accurate and diverse optical flow reference data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3597-3607, 2017.
[14] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9000-9008, 2018.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171-4186, 2019.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, pages 1097-1105, 2012.
[17] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
[18] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with Siamese region proposal network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8971-8980, 2018.
[19] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pages 2041-2050. IEEE Computer Society, 2018.
[20] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019.
[21] Ziwei Liu, Raymond A. Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 4463-4471, 2017.
[22] G. Long, L. Kneip, J. M. Alvarez, and H. Li. Learning image matching by simply watching video. European Conference on Computer Vision, 2016.
[23] Gucan Long, Laurent Kneip, Jose M. Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In European Conference on Computer Vision, pages 434-450. Springer, 2016.
[24] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In 5th International Conference on Learning Representations, ICLR 2017, 2017.
[25] Simone Meyer, Oliver Wang, Henning Zimmer, Max Grosse, and Alexander Sorkine-Hornung. Phase-based frame interpolation for video. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[27] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2270-2279, 2017.
[28] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 261-270, 2017.
[29] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. BMBC: Bilateral motion estimation with bilateral cost volume for video interpolation. arXiv preprint arXiv:2007.12622, 2020.
[30] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. In International Conference on Machine Learning, pages 4055-4064. PMLR, 2018.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[32] W. Ren, J. Zhang, X. Xu, L. Ma, X. Cao, G. Meng, and W. Liu. Deep video dehazing with semantic segmentation. IEEE Transactions on Image Processing, 28(4):1895-1908, 2019.
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pages 234-241. Springer International Publishing, Cham, 2015.
[34] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28:802-810, 2015.
[35] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Computer Science, 2012.
[37] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1279-1288, 2017.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[39] Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. In NeurIPS, 2019.
[40] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In Proceedings of the International Conference on Learning Representations, ICLR, 2017.
[41] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In International Conference on Machine Learning, ICML, 2017.
[42] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 613-621, 2016.
[43] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, et al. Transformer-based acoustic modeling for hybrid speech recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6874-6878. IEEE, 2020.
[44] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 2004.
[45] Manuel Werlberger, Thomas Pock, Markus Unger, and Horst Bischof. Optical flow guided TV-L1 video interpolation and restoration. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 273-286. Springer, 2011.
[46] Yue Wu, Rongrong Gao, Jaesik Park, and Qifeng Chen. Future video synthesis with object motion prediction. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5538-5547, 2020.
[47] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3-19, 2018.
[48] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3530-3538, 2017.
[49] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106-1125, 2019.
[50] Zhefei Yu, Houqiang Li, Zhangyang Wang, Zeng Hu, and Chang Wen Chen. Multi-level video frame interpolation: Exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology, 23(7):1235-1248, 2013.
[51] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. ACM Transactions on Graphics, 23(3):600-608, 2004.
Appendix
A. Appendix-1: Details of the SFFN module

[Figure 10 diagram: stacks of Conv, ConvBlock and DeconvBlock layers with LReLU activations map the decoded high-level feature maps (3 × C × H × W) to the synthesized target frames (3 × 3 × H × W); an add operation implements the residual connection.]

Figure 10. The design of the SFFN module. SFFN consists of two U-Net-like sub-networks, i.e., SFFN1 and SFFN2. For simplicity, the pooling operations and the skip connections of the U-shaped architecture are omitted. SFFN1 contains 5 layers, while SFFN2 is deeper and includes 10 layers. The basic unit ConvBlock contains two convolutional layers, and DeconvBlock contains one deconvolutional layer and one convolutional layer. Given the decoded feature map sequence of size 3 × C × H × W, the target synthesized frame sequence of size 3 × 3 × H × W in RGB format is generated with the use of SFFN1 and SFFN2. A residual connection is built between SFFN1 and the final output in order to accelerate convergence.
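
To make the caption concrete, the following is a minimal PyTorch sketch of the SFFN building blocks and of one shallow branch. It assumes ConvBlock = two 3 × 3 convolutions with LReLU, DeconvBlock = one deconvolution followed by one convolution, and a residual addition between the shallow branch and the final output; all class names, channel widths and the exact wiring of the residual connection are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        # Two 3x3 convolutions, each followed by LReLU (basic unit in Figure 10).
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True))
        def forward(self, x):
            return self.body(x)

    class DeconvBlock(nn.Module):
        # One deconvolution followed by one convolution, each with LReLU.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True))
        def forward(self, x):
            return self.body(x)

    class ShallowBranch(nn.Module):
        # Shallow U-Net-like branch (SFFN1 in the caption): 2 ConvBlocks, 2 DeconvBlocks,
        # then a Conv head that projects to 3 RGB channels.
        def __init__(self, in_ch=128):
            super().__init__()
            self.down = nn.Sequential(ConvBlock(in_ch, 256), ConvBlock(256, 512))
            self.up = nn.Sequential(DeconvBlock(512, 256), DeconvBlock(256, 128))
            self.head = nn.Sequential(
                nn.Conv2d(128, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(64, 32, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(32, 3, 1))
        def forward(self, feat):                      # feat: (N, in_ch, H, W) decoded feature maps
            return self.head(self.up(self.down(feat)))

    class SFFN(nn.Module):
        # Two branches; the shallow-branch output is added to the final output
        # (residual connection). The "deep" branch reuses ShallowBranch as a stand-in
        # for the deeper 10-layer sub-network.
        def __init__(self, in_ch=128):
            super().__init__()
            self.branch1 = ShallowBranch(in_ch)
            self.branch2 = ShallowBranch(in_ch)
        def forward(self, feat):
            coarse = self.branch1(feat)
            return self.branch2(feat) + coarse        # residual connection to the final output

Since pooling is omitted here (as in the figure), all spatial sizes are preserved and the residual addition is shape-compatible.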

B. Appendix-2: Proof of the effectiveness of the positional map proposed in ConvTransformer


As shown in Figure 11, the position of each frame is first encoded with a positional vector (PosVector) along the channel dimension, and these PosVectors are then expanded to the target positional map (PosMap) by repeating them along the vertical (height) and horizontal (width) directions. According to equations (3) and (4), for an arbitrary pixel coordinate (i, j) in frame p + m, PosMap(p+m,(i,j)) can be represented as follows.

[Figure 11 diagram: PosVector 1–6 are expanded along the vertical and horizontal directions into PosMap 1–6.]
Figure 11. An illustration of positional map generation.

PosMap_{(p+m,(i,j,2k))} = \sin(w_k (p+m)) = \sin(w_k p)\cos(w_k m) + \cos(w_k p)\sin(w_k m)    (14)

PosMap_{(p+m,(i,j,2k+1))} = \cos(w_k (p+m)) = \cos(w_k p)\cos(w_k m) - \sin(w_k p)\sin(w_k m)    (15)

\begin{bmatrix} PosMap_{(p+m,(i,j,2k))} \\ PosMap_{(p+m,(i,j,2k+1))} \end{bmatrix}
= \begin{bmatrix} \cos(w_k m) & \sin(w_k m) \\ -\sin(w_k m) & \cos(w_k m) \end{bmatrix}
  \begin{bmatrix} \sin(w_k p) \\ \cos(w_k p) \end{bmatrix}    (16)

\begin{bmatrix} PosMap_{(p+m,(i,j,2k))} \\ PosMap_{(p+m,(i,j,2k+1))} \end{bmatrix}
= M \begin{bmatrix} PosMap_{(p,(i,j,2k))} \\ PosMap_{(p,(i,j,2k+1))} \end{bmatrix}    (17)

Here w_k denotes 1/10000^{2k/d_model}, and M is the rotation matrix in equation (16), which depends only on the frame offset m and not on p. As illustrated in equations (14)–(17), PosMap_{p+m} can be represented as a linear function of PosMap_p, which proves that the proposed PosMap can efficiently model the relative positional relationship between different frames.
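
As a numerical sanity check on the derivation above (not part of the paper's code), the following NumPy sketch builds sinusoidal PosVectors, expands them into PosMaps by repeating over height and width, and verifies that PosMap_{p+m} is obtained from PosMap_p by the rotation in equation (16). It assumes the standard frequency convention w_k = 1/10000^{2k/d_model}; the function names and sizes are illustrative.

    import numpy as np

    def pos_vector(p, d_model=128):
        # Sinusoidal positional vector of frame index p along the channel dimension.
        k = np.arange(d_model // 2)
        w = 1.0 / (10000.0 ** (2 * k / d_model))       # w_k (assumed convention)
        vec = np.empty(d_model)
        vec[0::2] = np.sin(w * p)                      # even channels: sin(w_k p)
        vec[1::2] = np.cos(w * p)                      # odd channels:  cos(w_k p)
        return vec

    def pos_map(p, d_model=128, h=4, w=6):
        # Repeat the PosVector along height and width to obtain a (d_model, h, w) PosMap.
        return np.tile(pos_vector(p, d_model)[:, None, None], (1, h, w))

    p, m, d_model = 3, 2, 128
    k = np.arange(d_model // 2)
    wk = 1.0 / (10000.0 ** (2 * k / d_model))
    c = np.cos(wk * m)[:, None, None]                  # entries of the rotation matrix M
    s = np.sin(wk * m)[:, None, None]
    pm = pos_map(p, d_model)
    pred_even = c * pm[0::2] + s * pm[1::2]            # right-hand side of equation (16)
    pred_odd = -s * pm[0::2] + c * pm[1::2]
    target = pos_map(p + m, d_model)
    assert np.allclose(target[0::2], pred_even)        # matches equation (14)
    assert np.allclose(target[1::2], pred_odd)         # matches equation (15)

Because the rotation depends only on the offset m, the same linear map relates PosMap_p to PosMap_{p+m} for every pixel location, which is exactly the property used above.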

C. Appendix-3: The parameter settings of ConvTransformer


Table 7 tabulates the parameter settings of one ConvTransformer. It uses 4 attention heads, 7 encoder layers and 7 decoder layers.
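
For convenience, these hyper-parameters can be collected in a small configuration object; the sketch below is illustrative (the field names are ours, not from the released code).

    from dataclasses import dataclass

    @dataclass
    class ConvTransformerConfig:
        # Hyper-parameters as tabulated in Table 7 (field names are illustrative).
        d_model: int = 128           # channels produced by the feature-embedding layer
        num_heads: int = 4           # heads per attention layer
        head_depth: int = 32         # d_model / num_heads
        num_encoder_layers: int = 7
        num_decoder_layers: int = 7
        attn_kernel_size: int = 3    # 3 x 3 convolutions in Q_Net / K_V_Net / Att_Net
        num_target_frames: int = 3   # synthesized frames per query sequence

    cfg = ConvTransformerConfig()
    assert cfg.num_heads * cfg.head_depth == cfg.d_model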

D. Appendix-4: Attention Calculation Process


The calculation processes of the multi-head convolutional self-attention in the encoder, the query self-attention in the decoder, and the multi-head convolutional attention in the decoder are illustrated in Figure 12.

Table 7. The parameter settings of ConvTransformer for synthesizing 3 target frames. It includes 7 encoder layers and 7 decoder layers, and each attention layer of the encoder and decoder consists of 4 heads. d_model is 128, and hence the depth of each head is 32. In each head, the Q_Net generates the query feature maps, while the K_V_Net generates the key feature maps and the value feature maps.

Module: Input Sequence
    Settings: ---
    Output size: s × 3 × h × w

Module: Feature Embedding
    Settings: Conv2D 3 × 3, stride 1, n = 128 (× 4)
    Output size: s × 128 × h × w

Module: Encoder (× 7 encoding layers)
    Multi-Head Convolutional Self-Attention (conducted on the input frame sequence), Heads 0–3, each with:
        Q_Net_Enc:   Conv2D 3 × 3, stride 1, n = 32
        K_V_Net_Enc: Conv2D 3 × 3, stride 1, n = 32
        Att_Net_Enc: Conv2D 3 × 3, stride 1, n = 1
    Feed Forward: Conv2D 3 × 3, stride 1, n = 128
    Output size: s × 128 × h × w

Module: Query Sequence
    Settings: ---
    Output size: 3 × 128 × h × w

Module: Decoder (× 7 decoding layers)
    Query Self-Attention (conducted on the query frame sequence), Heads 0–3, each with:
        Q_Net_Que:   Conv2D 3 × 3, stride 1, n = 32
        K_V_Net_Que: Conv2D 3 × 3, stride 1, n = 32
        Att_Net_Que: Conv2D 3 × 3, stride 1, n = 1
    Multi-Head Convolutional Attention (conducted on the query frame sequence and the encoded input frame sequence), Heads 0–3, each with:
        Q_Net_Dec:   Conv2D 3 × 3, stride 1, n = 32
        K_V_Net_Dec: Conv2D 3 × 3, stride 1, n = 32
        Att_Net_Dec: Conv2D 3 × 3, stride 1, n = 1
    Feed Forward: Conv2D 3 × 3, stride 1, n = 128
    Output size: s × 128 × h × w

Module: Prediction (SFFN)
    SFFN0:
        ConvBlock0:   Conv2D 3 × 3, stride 1, n = 256 (two layers)
        ConvBlock1:   Conv2D 3 × 3, stride 1, n = 512 (two layers)
        DeconvBlock0: Conv2D 3 × 3, stride 1, n = 256 (two layers)
        DeconvBlock1: Conv2D 3 × 3, stride 1, n = 128 (two layers)
        Conv:         Conv2D 3 × 3, stride 1, n = 64; Conv2D 3 × 3, stride 1, n = 32; Conv2D 1 × 1, stride 1, n = 3
    SFFN1:
        ConvBlock0:   Conv2D 3 × 3, stride 1, n = 32 (two layers)
        ConvBlock1:   Conv2D 3 × 3, stride 1, n = 64 (two layers)
        ConvBlock2:   Conv2D 3 × 3, stride 1, n = 128 (two layers)
        ConvBlock3:   Conv2D 3 × 3, stride 1, n = 256 (two layers)
        ConvBlock4:   Conv2D 3 × 3, stride 1, n = 512 (two layers)
        DeconvBlock0: Conv2D 3 × 3, stride 1, n = 256 (two layers)
        DeconvBlock1: Conv2D 3 × 3, stride 1, n = 128 (two layers)
        DeconvBlock2: Conv2D 3 × 3, stride 1, n = 64 (two layers)
        DeconvBlock3: Conv2D 3 × 3, stride 1, n = 32 (two layers)
        Conv:         Conv2D 1 × 1, stride 1, n = 3
    Output size: s × 3 × h × w

Module: Output Sequence
    Settings: ---
    Output size: s × 3 × h × w

[Figure 12 diagram. Three parallel pipelines are shown: convolutional self-attention conducted on the input frame sequence (encoder), query self-attention conducted on the query frame sequence (decoder), and convolutional attention conducted on the query frame sequence and the encoded input frame sequence (decoder). In each pipeline, the Q_Net and K_V_Net apply convolutions to produce s × C × h × w query and key/value feature maps; these maps are expanded (dim = 0 / dim = 1), repeated and reshaped to (s ∗ s) × C × h × w, concatenated along the channel dimension, and passed to the Att_Net to obtain (s ∗ s) × 1 × h × w attention maps. The attention-weighted value maps are then reshaped, summed over dim = 1 and squeezed back to s × C × h × w.]

Figure 12. The calculation processes of the multi-head convolutional self-attention in the encoder, the query self-attention in the decoder, and the multi-head convolutional attention in the decoder.
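
To make the pipeline in Figure 12 concrete, the following is a minimal PyTorch sketch of a single head of convolutional self-attention on an input frame sequence. The 3 × 3 Q_Net / K_V_Net producing 32-channel maps and the 1-channel Att_Net follow Table 7, while the softmax normalisation over key frames is our assumption rather than something specified in the paper.

    import torch
    import torch.nn as nn

    class ConvSelfAttentionHead(nn.Module):
        # One head of multi-head convolutional self-attention (see Figure 12).
        # Input:  x of shape (s, C, h, w) -- a sequence of s frame feature maps.
        # Output: tensor of shape (s, depth, h, w) -- attention-weighted value maps.
        def __init__(self, in_ch=128, depth=32):
            super().__init__()
            self.q_net = nn.Conv2d(in_ch, depth, 3, padding=1)    # query feature maps
            self.kv_net = nn.Conv2d(in_ch, depth, 3, padding=1)   # key/value feature maps
            self.att_net = nn.Conv2d(2 * depth, 1, 3, padding=1)  # per-pixel attention score

        def forward(self, x):
            s = x.shape[0]
            q = self.q_net(x)                                     # (s, depth, h, w)
            kv = self.kv_net(x)                                   # (s, depth, h, w)
            # Pair every query frame i with every key/value frame j.
            q_rep = q.unsqueeze(1).expand(-1, s, -1, -1, -1)      # (s, s, depth, h, w)
            kv_rep = kv.unsqueeze(0).expand(s, -1, -1, -1, -1)    # (s, s, depth, h, w)
            pair = torch.cat([q_rep, kv_rep], dim=2)              # (s, s, 2*depth, h, w)
            pair = pair.reshape(s * s, -1, *x.shape[-2:])         # (s*s, 2*depth, h, w)
            att = self.att_net(pair).reshape(s, s, 1, *x.shape[-2:])
            att = torch.softmax(att, dim=1)                       # normalise over key frames (our choice)
            weighted = att * kv_rep                               # (s, s, depth, h, w)
            return weighted.sum(dim=1)                            # aggregate over key frames

A full attention layer would presumably run four such heads in parallel (depth 32 each), concatenate their outputs to recover the 128 channels, and pass the result to the 3 × 3 feed-forward convolution, in line with Table 7; the cross-attention variant in the decoder would draw its query maps from the query frame sequence and its key/value maps from the encoded input frame sequence.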

