
Inception Recurrent Neural Network Architecture For Video Frame Prediction

This research paper presents a novel Inception-based convolutional recurrent neural network (RNN) architecture for video frame prediction, enhancing traditional gated convolutional RNNs by utilizing multi-channel kernels to capture complex spatio-temporal dynamics in videos. The proposed model outperforms existing methods, such as PredNet, on benchmark datasets while maintaining a manageable increase in training time. The study explores different configurations of the Inception architecture and validates the effectiveness of the model through various experimental results.


SN Computer Science (2023) 4:69

https://doi.org/10.1007/s42979-022-01498-y

ORIGINAL RESEARCH

Inception Recurrent Neural Network Architecture for Video Frame Prediction
Matin Hosseini1 · Anthony Maida2 · Seyedmajid Hosseini1,2 · Raju Gottumukkala1

Received: 10 February 2022 / Accepted: 7 November 2022 / Published online: 27 November 2022
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2022

Abstract
Video frame prediction is needed for various computer-vision-based systems such as self-driving vehicles and video stream-
ing. This paper proposes a novel Inception-based convolutional recurrent neural network (RNN) as an enhancement to a basic
gated convolutional RNN. A basic gated convolutional RNN has fixed-size kernels that are hyperparameters of the network.
Our model replaces the single-size kernel in the convolutional RNN with Inception-like multi-channel kernels. Multiple
kernel sizes allow the capturing of spatio-temporal dynamics of multiple objects in the video compared to one single-sized
kernel. Our model is tested within a predictive coding framework to improve video frame prediction. We seek to determine
whether multi-kernel convolutional gated RNNs improve performance compared to basic convolutional RNNs. We study
different variants of the proposed multi-kernel convolutional RNNs, namely LSTM and GRU, with both Inception V1 and
Inception V2 configurations. We observe that the proposed model offers improved video frame prediction performance compared to
existing PredNet-based video prediction methods, at a minor additional cost in training time.

Keywords Video frame prediction · Deep learning · Video prediction · PredNet · Predictive coding

* Raju Gottumukkala
  [email protected]
  Matin Hosseini
  [email protected]
  Anthony Maida
  [email protected]
  Seyedmajid Hosseini
  [email protected]

1 School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette 70504, LA, USA
2 Informatics Research Institute, University of Louisiana at Lafayette, Lafayette 70504, LA, USA

Introduction
Unsupervised frame prediction models can anticipate upcoming events and are a key component of intelligent decision-making systems such as autonomous vehicles [1], anomaly detection [2], public safety monitoring [3], and robot navigation [4]. Video frame prediction involves developing intelligent systems that predict the next few video frames from previous frames, and that can be used in multimodal learning [5]. Many computer vision problems such as video transcoding [3], video frame synthesis [6], anomaly detection [7], autonomous car guidance [8], and robotic motion planning [9] use video frame prediction.
The human brain is naturally good at predicting real-world dynamics as they happen, through real-time synchronization between sensory perception and neural operations. For instance, when one is watching live sports, the brain can predict the speed and direction of the ball quite naturally. Similarly, the brain can anticipate other drivers' movements to avoid accidents. Predictive coding provides a computational framework that imitates brain function through information flow and error propagation between the feed-forward and feed-backward networks [10]. The PredNet framework developed by Lotter et al. [11] was inspired by the predictive coding framework and uses a convolutional LSTM-based (Long Short-Term Memory) architecture for next-frame video prediction.
Convolutional Neural Networks (CNNs) are a popular deep learning technique applied extensively in various image processing tasks, with architectures such as AlexNet [12], VGG [13], ResNet [14], and GoogLeNet [15], due to their ability to capture spatial correlations. Recurrent neural networks such as LSTM and GRU (Gated Recurrent Units) are quite good at modeling sequential dependencies [16, 17], which makes these models good candidates for capturing
spatio-temporal dynamics in the video. LSTMs have been applied to video sequence learning by several researchers. Srivastava et al. [16] designed an encoder-decoder LSTM for predicting future frames and the current input. Lotter et al. [18] also designed an LSTM-based architecture consisting of a convolution layer, an LSTM, and de-convolution for next-frame video prediction. One of the pioneering LSTM articles, by Shi et al. [19], incorporated convolution operations into LSTMs for precipitation nowcasting. This convolutional LSTM has become part of several deep learning architectures. Villegas et al. [20] developed a video frame prediction model that combines a Motion Encoder and a Content Encoder using convolutional LSTM. Lotter et al. [11] proposed the PredNet architecture, which is the first deep learning model inspired by predictive coding principles [21]. PredNet is not a comprehensive emulation of predictive coding [22, 23] and several improvements are possible. Our prior work [24] introduced a new variation of the predictive coding architecture based on the Rao and Ballard [21] protocols.
The following are the key contributions of the paper:

• We introduce an Inception-inspired recurrent neural network architecture for video frame prediction.
• The proposed architecture is validated against two versions of Inception, namely Inception V2, which uses two 3 × 3 kernels in sequence, and Inception V1, which uses 5 × 5 kernels.
• We observe that the proposed 3-layer Inception module outperforms 4-layer convolutional recurrent networks on two standard benchmark datasets, namely KITTI [25] and KTH [26].
• We also observe that the proposed model works very well for transfer learning, where the model was trained on the UCF11 dataset and tested on the KTH dataset.

We find that the proposed model automatically learns the best kernel size by deciding whether to forget or remember the previous state, instead of relying on manually predefined hyper-parameters (such as a manually chosen kernel size). Furthermore, we eliminated max-pooling to reduce the computational cost, because max-pooling is already used across layers in the PredNet architecture. We evaluate the LSTM and GRU models for both Inception V1 and Inception V2 on standard benchmark datasets. There is a 3 × 3 convolutional kernel before the LSTM module covering the application of the 1 × 1 kernel in the Inception architecture. We also eliminate the 1 × 1 convolutional kernels before each of the 1 × 1, 3 × 3, and 5 × 5 kernels. Our proposed model offers a performance improvement in video frame prediction compared to convolutional LSTM (ConvLSTM) and GRU (ConvGRU) within the PredNet video frame prediction architecture.
The paper is organized as follows. The next section reviews related work, and the subsequent section presents the methods, namely the general Inception LSTM and GRU architectures, formulations, algorithms, and datasets. Section "Experimental Results and Analysis" provides the experimental results and analysis of the proposed methods on two well-known datasets, namely KITTI and KTH. A brief discussion related to the methods is provided in "Discussion". Finally, we provide conclusions and the scope of future work in "Limitations and Future Work".

Related Work
In general, video frame prediction aims to predict the next few frames given the observed previous frames. Deep learning approaches to video frame prediction roughly fall into three categories. Motion-feature-based strategies capture the object trajectory dynamics to forecast future frames [20, 27–30]. These object-centric methods are computationally efficient but do not perform well when multiple objects are in the video. The other two categories are the well-known recurrent and non-recurrent methods. Fully-connected networks and convolutional networks are some of the prominent non-recurrent neural network architectures. Non-recurrent neural networks have low computational overhead but poor prediction performance compared to recurrent neural networks, because of their inability to capture sequential dependencies [28, 31, 32]. The autoencoder is a popular non-recurrent architecture that has been used in several models to predict future frames [33].
Recurrent neural networks have become quite popular due to their ability to model sequential tasks and to accumulate and forget selective information. Hsieh et al. [31] used 2-dimensional recurrence as input in the autoencoder for decomposition and disentanglement of motion and video components. Both LSTMs and GRUs are popular gated recurrent neural network models. LSTMs and GRUs differ in the arrangement of their gates [17, 34]. Srivastava et al. [16] use a multi-layer LSTM for video prediction. There are several LSTM-based architectures for video frame prediction. PredRNN [35] and PredRNN++ used spatiotemporal LSTMs with dual memory architectures. Ref. [36] proposed replacing the fully connected LSTM with a convolutional LSTM. There are also several hybrid architectures that incorporate convolutions in recurrent network architectures. Convolutional neural networks (CNNs) are widely used for various image classification
tasks, with architectures such as AlexNet [12], VGG [13], ResNet [14], and GoogLeNet [15]. GoogLeNet proposed Inception CNN layers to create a wider network instead of a deeper one. Lotter et al. [18] developed an architecture that comprises convolution, LSTM, and de-convolution for next-frame video prediction. Shi et al. [19] proposed a convolutional LSTM that uses the convolution operation instead of the dot product. Lotter et al. [11] proposed an architecture known as PredNet that was inspired by predictive coding principles [21]. We used PredNet as the starting point for our implementation. A GRU cell has a similar structure to an LSTM but differs in terms of the parameters, the controlled exposure to memory content, and the input gate's location. Some experiments, such as those by Rafael et al. [37], revealed that LSTM and GRU have comparable performance with the exception of adding one bias to the forget gate. Given the similarities and dissimilarities, it is hard to conclude which type of gating unit performs better [38]. There are other video frame prediction architectures, such as predCNN [39], VideoFlow [40] and SDC-Net [41]. Still, the scope of analysis in this paper is limited to implementing Inception-LSTM units in the PredNet architecture.
Our proposed model integrates the Inception modules from GoogLeNet [15] into the recurrent neural network modules. The Inception modules are integrated into each gate of the convolutional RNN, thereby transforming the gates from single-kernel to multi-kernel. One of the recent architectures, from Alom et al. [42], also tested the integration of the Inception module into a recurrent convolutional layer for image classification. However, Alom et al. designed an architecture that concatenates the Inception unit, recurrent convolutional layers, and residual units. Our proposed design is quite different in that we integrate the Inception module inside the RNN module, thereby transforming the gates from single-kernel to multi-kernel units. The GoogLeNet architecture in our design helps capture multiple scales in the convolutional layers without the need to introduce new layers. The Inception architecture has multiple variations of kernel sizes, namely 1 × 1, 3 × 3, and 5 × 5. This work builds on our prior work [43], where we experimented with the Inception LSTM idea on the KITTI dataset. The present paper presents a general Inception recurrent neural network architecture that differs from our prior work. Specifically, we provide designs for multiple recurrent neural network modules, namely LSTM and GRU, for different variations of kernel sizes. We evaluated the proposed methods on three datasets: KITTI, KTH, and UCF-11. Finally, we investigate the proposed neural network architecture's effectiveness in learning and transferring knowledge from one dataset to another through transfer learning.

Methods
We first describe the proposed Inception recurrent neural network architecture-based method. Then we describe Inception LSTM and Inception GRU in detail. Finally, we describe how the proposed architecture is applied for video prediction.

Inception-Inspired Recurrent Networks
The convolutional LSTM [44] takes multi-channel images as input and applies convolution at each gate in the LSTM cell. Video has complex spatio-temporal dynamics, where objects may have different sizes and may also move at varied paces within the video. For example, some objects might move very slowly, while others have faster movements. The embedding of convolution into the LSTM helps capture the spatio-temporal features of images, an essential requirement for video frame prediction with multiple objects that have complex spatio-temporal dynamics [19]. The kernel size selection also plays an essential role in capturing different objects' spatio-temporal dynamics. Smaller kernel sizes are suitable for capturing slower motions, while larger kernels capture faster motions [19]. One of the limitations of convolutional LSTM is that the kernel size is fixed, which means the network can work well for objects moving at either a slow pace or a fast pace. Introducing the flexibility to include multiple kernels in the recurrent neural network architecture allows the capture of complex spatio-temporal dynamics. We included three convolutions with varying kernel sizes for each gate in one cell (see Fig. 1). The kernel sizes range from a small 1 × 1 kernel to a larger 5 × 5 kernel. The proposed architecture is sensitive to videos that contain objects with varying image sizes.
In the convolutional LSTM, the standard dot product operations are replaced by convolutional operations as follows:

\[ i_t = \sigma(W_{ix} * x_t + W_{ih} * h_{t-1} + b_i) \tag{1a} \]

\[ f_t = \sigma(W_{fx} * x_t + W_{fh} * h_{t-1} + b_f) \tag{1b} \]

\[ \tilde{c}_t = i_t \odot \tanh(W_{cx} * x_t + W_{ch} * h_{t-1} + b_c) \tag{1c} \]

\[ c_t = f_t \odot c_{t-1} + \tilde{c}_t \tag{1d} \]

\[ o_t = \sigma(W_{ox} * x_t + W_{oh} * h_{t-1} + b_o) \tag{1e} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{1f} \]

Here $*$ indicates the convolution operation and $\odot$ denotes element-wise multiplication. $i_t$, $f_t$ and $o_t$ denote the values for the input, forget, and output gates, respectively. Each gate uses the convolution operation and produces a multi-channel image. The above equations describe the basic convolutional LSTM used in Lotter et al. [11].
The GRU is a simplified version of the LSTM in which the cell state is omitted from the architecture. As a result, the number of gates for updating the information is reduced. This means the GRU has fewer parameters than the LSTM. In the GRU, the value of the forget gate or reset gate is passed to the output gate. The convolutional GRU is presented as follows:

\[ z_t = \sigma(W_{zx} * x_t + W_{zh} * h_{t-1} + b_z) \tag{2a} \]

\[ r_t = \sigma(W_{rx} * x_t + W_{rh} * h_{t-1} + b_r) \tag{2b} \]

\[ c_t = \sigma(W_{cx} * x_t + W_{ch} * (r_t \odot h_{t-1}) + b_c) \tag{2c} \]

\[ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot c_t \tag{2d} \]

$r_t$ is the reset gate, $z_t$ is the update gate, $h_t$ is the hidden state, and $c_t$ is the candidate hidden state at the current time $t$. $W_{zx}$, $W_{zh}$ are the weights for the update gate. $W_{rx}$, $W_{rh}$ are the weights for the reset gate. $W_{cx}$, $W_{ch}$ are the weight parameters for the candidate hidden state.
We present two versions of the Inception LSTM based on the Inception configuration. Figure 1 shows a V1 Inception net embedded within an LSTM cell. Figure 2 shows the Inception LSTM version 2 within one layer of the PredNet architecture.

Fig. 1 Inception LSTM inspired by Inception V1. This module uses three different kernel sizes within the LSTM module

Fig. 2 Inception LSTM inspired by Inception version 2 combined with the error correction module

Inception LSTM
These designs use different kernel sizes, which offers the flexibility to capture objects with different motion rates (fast motion and slow motion) in the video. Inception LSTM is a
convolutional LSTM that uses multi-kernel gates in its architecture. The inputs to all gates are the stacking of convolution operations with different kernel sizes. The equations for the Inception LSTM version 1 (Fig. 1) are as follows:

\[ i_t = \sigma\left( \begin{bmatrix} W_{i,1\times1} * [x_t, h_{t-1}], \\ W_{i,3\times3} * [x_t, h_{t-1}], \\ W_{i,5\times5} * [x_t, h_{t-1}] \end{bmatrix} + b_i \right) \tag{3a} \]

\[ f_t = \sigma\left( \begin{bmatrix} W_{f,1\times1} * [x_t, h_{t-1}], \\ W_{f,3\times3} * [x_t, h_{t-1}], \\ W_{f,5\times5} * [x_t, h_{t-1}] \end{bmatrix} + b_f \right) \tag{3b} \]

\[ g_t = \sigma\left( \begin{bmatrix} W_{g,1\times1} * [x_t, h_{t-1}], \\ W_{g,3\times3} * [x_t, h_{t-1}], \\ W_{g,5\times5} * [x_t, h_{t-1}] \end{bmatrix} + b_g \right) \tag{3c} \]

\[ o_t = \sigma\left( \begin{bmatrix} W_{o,1\times1} * [x_t, h_{t-1}], \\ W_{o,3\times3} * [x_t, h_{t-1}], \\ W_{o,5\times5} * [x_t, h_{t-1}] \end{bmatrix} + b_o \right) \tag{3d} \]

\[ c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{3e} \]

\[ h_t = o_t \odot \tanh(c_t) \tag{3f} \]

As in the convolutional LSTM, $i_t$, $f_t$ and $o_t$ denote the values for the input, forget, and output gates in the Inception LSTM, respectively. $W_i$, $W_f$, $W_g$ and $W_o$ denote the trainable weights of the corresponding gates, which are applied to the concatenation of $x_t$ and $h_{t-1}$, where $h_{t-1}$ is the previous output of the LSTM cell. $b_i$, $b_f$, $b_g$ and $b_o$ are the biases for each gate. The $\odot$ symbol denotes element-wise multiplication, and $\times$ is the times symbol used in kernel sizes. We use the Inception design for each gate instead of a single convolution kernel. The $*$ indicates the convolution operation, and the square brackets in the above equations denote kernel stacking. Each gate has three different kernels, with sizes 1 × 1, 3 × 3, and 5 × 5 respectively, following the design of an Inception layer. For example, in $W_{i,1\times1}$ the subscript $i$ denotes the weights for the input gate and 1 × 1 denotes the kernel size. All convolutions in our Inception networks are same-padding convolutions, and the outputs of the stacked convolutions are passed to the input gate through a non-linear sigmoid. The cell state and recurrent connection ($h$) are defined similarly to the original convolutional LSTM. Figure 1 shows the V1 Inception LSTM cell.
Inception V2 uses fewer parameters: the 5 × 5 kernel is replaced by two 3 × 3 kernels connected in sequence. This design has the advantage of reducing the number of parameters from $25n_c$ to $18n_c$ ($n_c$ denotes the number of channels). The equations used in Inception LSTM V2 for the input gate are given in equation (4a). The other equations have a similar structure.

\[ i_t = \sigma\left( \begin{bmatrix} W_{i,1\times1} * [x_t, h_{t-1}], \\ W^{1}_{i,3\times3} * [x_t, h_{t-1}], \\ W^{2}_{i,3\times3} * [W^{3}_{i,3\times3} * [x_t, h_{t-1}]] \end{bmatrix} + b_i \right) \tag{4a} \]

\[ f_t = \sigma\left( \begin{bmatrix} W_{f,1\times1} * [x_t, h_{t-1}], \\ W^{1}_{f,3\times3} * [x_t, h_{t-1}], \\ W^{2}_{f,3\times3} * [W^{3}_{f,3\times3} * [x_t, h_{t-1}]] \end{bmatrix} + b_f \right) \tag{4b} \]

\[ g_t = \sigma\left( \begin{bmatrix} W_{g,1\times1} * [x_t, h_{t-1}], \\ W^{1}_{g,3\times3} * [x_t, h_{t-1}], \\ W^{2}_{g,3\times3} * [W^{3}_{g,3\times3} * [x_t, h_{t-1}]] \end{bmatrix} + b_g \right) \tag{4c} \]

\[ o_t = \sigma\left( \begin{bmatrix} W_{o,1\times1} * [x_t, h_{t-1}], \\ W^{1}_{o,3\times3} * [x_t, h_{t-1}], \\ W^{2}_{o,3\times3} * [W^{3}_{o,3\times3} * [x_t, h_{t-1}]] \end{bmatrix} + b_o \right) \tag{4d} \]

\[ c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{4e} \]

\[ h_t = o_t \odot \tanh(c_t) \tag{4f} \]

Inception LSTM V2 uses three 3 × 3 convolution kernels for each gate, in comparison to Inception LSTM V1, which has one 3 × 3 kernel. Inception V2 benefits from stacking two different 3 × 3 convolutions ($W^{2}_{3\times3}$ and $W^{3}_{3\times3}$) at a lower computational cost than one $W_{5\times5}$ convolution.
Figure 2 shows the Inception V2 model embedded in one PredNet architecture layer. In this model, the inputs to the cell are $x$, $c$, and $h$. $x_t$ is the current image frame, consisting of 3 feature maps for the red, green, and blue channels. $h_t$ is the stack of feature maps, the output of the previous time step. In this model, a Conv2D can match the number of output feature maps to the recurrent input feature maps. Moreover, the cell state works exactly like that of a standard convolutional LSTM. We initialized $C_0$ to be a grey image with three different channels.
The order in which each unit in the model is updated must also be specified. For clarity, our implementation is described in Algorithm 1, which follows the PredNet architecture.
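The reduction from $25n_c$ to $18n_c$ quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it counts weights per input/output channel pair, assumes that every branch uses the same number of channels and that biases are ignored, and all function names are hypothetical rather than taken from the authors' code.

```python
def kernel_weights(size):
    """Weights in one square kernel of the given side, per channel pair."""
    return size * size


def branch_weights(kernel_sizes):
    """Total weights of a branch whose kernels are applied in sequence."""
    return sum(kernel_weights(k) for k in kernel_sizes)


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied in sequence."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


# One 5 x 5 kernel versus two 3 x 3 kernels in sequence.
single_5x5 = branch_weights([5])
stacked_3x3 = branch_weights([3, 3])

# Whole gate: V1 stacks 1x1, 3x3, 5x5; V2 stacks 1x1, 3x3, (3x3 then 3x3).
v1_gate = sum(branch_weights(b) for b in ([1], [3], [5]))
v2_gate = sum(branch_weights(b) for b in ([1], [3], [3, 3]))

print(single_5x5, stacked_3x3, stacked_receptive_field([3, 3]), v1_gate, v2_gate)
```

Under these assumptions the two stacked 3 × 3 kernels cover the same 5 × 5 receptive field with 18 instead of 25 weights per channel pair, and the per-gate total drops from 35 to 28.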

Algorithm 1 PredNet algorithm using Inception LSTM

Require: x_t
 1: A_0^t ← x_t
 2: E_l^0, R_l^0 ← 0
 3: for t = 1 to T do
 4:   for l = L to 0 do
 5:     if l = L then
 6:       IncRNN_L^t = R(E_L^{t-1}, R_L^{t-1})
 7:     else
 8:       IncRNN_l^t = R(E_l^{t-1}, R_l^{t-1}, Up(R_{l+1}^t))
 9:   for l = L to 0 do
10:     if l = 0 then
11:       Â_0^t = RELU(IncLSTM(R_l^t))
12:     else
13:       Â_l^t = RELU(IncRNN(R_l^t))
14:     E_l^t = [RELU(A_l^t − Â_l^t); RELU(Â_l^t − A_l^t)]
15:     if l > L then
16:       Â_l^t = Maxpool(IncRNN(R_l^t))

The loss function for the PredNet [11] model is given below:

\[ L_{train} = \sum_{t} \lambda_t \sum_{l} \frac{\lambda_l}{n_l} \sum_{n_l} E_l^t \tag{5} \]

where $\lambda_t$ is the weighting factor by time, $\lambda_l$ is the weighting factor by layer, and $n_l$ is the number of units in layer $l$. In addition, $E_l^t$ is the error unit obtained from the predicted ($\hat{A}_l^t$) and target ($A_l^t$) images, as defined in Line 14 of Algorithm 1. Moreover, Fig. 3 illustrates the $E_l^t$ error units.

Fig. 3 Inception recurrent neural network within the predictive coding architecture (PredNet)

Inception GRU
We also studied the Inception-like GRU and compared it with the convolutional GRU and the Inception LSTM. The Inception GRU formulas are presented as follows:

\[ z_t = \sigma\left( \begin{bmatrix} W_{z,1\times1} * [x_t, h_{t-1}], \\ W_{z,3\times3} * [x_t, h_{t-1}], \\ W_{z,5\times5} * [x_t, h_{t-1}] \end{bmatrix} + b_z \right) \tag{6a} \]

\[ r_t = \sigma\left( \begin{bmatrix} W_{r,1\times1} * [x_t, h_{t-1}], \\ W_{r,3\times3} * [x_t, h_{t-1}], \\ W_{r,5\times5} * [x_t, h_{t-1}] \end{bmatrix} + b_r \right) \tag{6b} \]
\[ c_t = \sigma\left( \begin{bmatrix} W_{c,1\times1} * [x_t, r_t \odot h_{t-1}], \\ W_{c,3\times3} * [x_t, r_t \odot h_{t-1}], \\ W_{c,5\times5} * [x_t, r_t \odot h_{t-1}] \end{bmatrix} + b_c \right) \tag{6c} \]

\[ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot c_t \tag{6d} \]

The Inception GRU uses three different kernels with sizes 1 × 1, 3 × 3 and 5 × 5, which is similar to the Inception LSTM. However, $c_t$ denotes the candidate hidden state, as in the convolutional GRU. Figs. 4 and 5 show the Inception GRU V1 and V2 models.

Fig. 4 Inception GRU using the Inception V1 with the different kernels

Fig. 5 Inception GRU V2 using the Inception V2 with the different kernels

The Inception recurrent neural networks are embedded into the PredNet architecture [11]. The overall architecture is shown in Fig. 3. The system consists of an Inception RNN module that extracts spatio-temporal features of an image using different convolutions. Convolution helps the image prediction by using shared weights that implicitly preserve the spatio-temporal dynamics of the image. The use of Inception in the architecture means that the kernel size does not have to be included as a hyper-parameter.
The Inception GRU V2 is similar to the Inception LSTM V2: it uses two 3 × 3 kernels in parallel to the 1 × 1 and 3 × 3 kernels, instead of a 5 × 5 kernel, to reduce the computational cost. Figure 5 shows the architecture of Inception GRU V2. The Inception GRU V2 formulas combine Inception LSTM V2 and Inception GRU and are presented as follows:

\[ z_t = \sigma\left( \begin{bmatrix} W_{z,1\times1} * [x_t, h_{t-1}], \\ W^{1}_{z,3\times3} * [x_t, h_{t-1}], \\ W^{2}_{z,3\times3} * [W^{3}_{z,3\times3} * [x_t, h_{t-1}]] \end{bmatrix} + b_z \right) \tag{7a} \]
\[ r_t = \sigma\left( \begin{bmatrix} W_{r,1\times1} * [x_t, h_{t-1}], \\ W^{1}_{r,3\times3} * [x_t, h_{t-1}], \\ W^{2}_{r,3\times3} * [W^{3}_{r,3\times3} * [x_t, h_{t-1}]] \end{bmatrix} + b_r \right) \tag{7b} \]

\[ c_t = \sigma\left( \begin{bmatrix} W_{c,1\times1} * [x_t, r_t \odot h_{t-1}], \\ W^{1}_{c,3\times3} * [x_t, r_t \odot h_{t-1}], \\ W^{2}_{c,3\times3} * [W^{3}_{c,3\times3} * [x_t, r_t \odot h_{t-1}]] \end{bmatrix} + b_c \right) \tag{7c} \]

\[ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot c_t \tag{7d} \]

Inception GRU V2 has two 3 × 3 kernels instead of the one 5 × 5 kernel in formula 6 for Inception GRU V1. In (7c), $W_{c,5\times5}$ is replaced by the two kernels $W^{2}_{c,3\times3}$ and $W^{3}_{c,3\times3}$. The same pattern applies to the reset gate (7b) and the update gate (7a), where the 5 × 5 kernel is replaced by two 3 × 3 kernels.

Fig. 6 KITTI dataset sample video frames

Fig. 7 KTH dataset sample video frames

PredNet Implementation
The Inception LSTM proposed in the previous section is evaluated in the PredNet architecture developed by Lotter et al. [18], which is one of the earliest predictive coding models implemented in a deep learning framework. The representation layer in PredNet consists of convolutional LSTMs (cLSTMs). Figure 3 shows the PredNet predictive element for a two-layer model. The output of the representation module projects to the error calculation module, which sends its output back to the representation module. The model learns to predict the next frame by comparing its prediction with the target frame, using the prediction error as the cost function.
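To make the role of the prediction error concrete, the following sketch computes the rectified error units of Line 14 of Algorithm 1 and the weighted loss of Eq. (5) on flattened toy values. It is a minimal illustration under stated assumptions, not the authors' implementation: frames are plain lists of pixel values, $n_l$ is taken to be the length of the error vector, and the function names, weights and numbers are all hypothetical.

```python
def relu(value):
    """Rectified linear unit on a single number."""
    return value if value > 0.0 else 0.0


def error_unit(target, predicted):
    """E = [ReLU(A - A_hat); ReLU(A_hat - A)], concatenated into one list."""
    positive = [relu(a - a_hat) for a, a_hat in zip(target, predicted)]
    negative = [relu(a_hat - a) for a, a_hat in zip(target, predicted)]
    return positive + negative


def training_loss(errors, time_weights, layer_weights):
    """Eq. (5): sum over t of lambda_t * sum over l of (lambda_l / n_l) * sum(E_l^t).

    errors[t][l] is the flattened error unit of layer l at time step t.
    """
    total = 0.0
    for lam_t, layers in zip(time_weights, errors):
        for lam_l, e in zip(layer_weights, layers):
            total += lam_t * (lam_l / len(e)) * sum(e)
    return total


# Toy example: one time step, one layer, two pixels (values are made up).
target = [0.2, 0.8]
predicted = [0.5, 0.6]
e = error_unit(target, predicted)
loss = training_loss([[e]], time_weights=[1.0], layer_weights=[1.0])
print(e, loss)
```

Splitting the error into positive and negative rectified parts keeps every entry non-negative, so the plain sum in Eq. (5) behaves like an absolute-error penalty.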

Datasets
The KITTI dataset is constructed for real-world autonomous driving vehicles to detect 3D objects in streets using two-color and gray-scale cameras, a laser scanner, and a GPS localization system. There are at least 15 pedestrians and 30 vehicles in each video, plus environmental objects such as buildings, trees, etc.
The KTH dataset consists of six common human actions performed by 25 participants in four different scenarios. We used the Walking dataset for our experiments. Figures 6 and 7 show some different frames of both datasets.
UCF11 contains 11 action categories [45]: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. This dataset gives us a more diverse set of activities. The images are reduced to 160 × 120 RGB images to make them more compatible and comparable with the KTH dataset in the transfer learning domain. Figure 8 shows UCF11 sample frames that partly resemble the KTH dataset.

Fig. 8 UCF dataset sample video frames

Table 1 Performance on the KITTI dataset

Model        Layers  MAE     MSE     SSIM    PSNR
ConvLSTM     2       0.0533  0.0102  0.8115  19.91
ConvLSTM     3       0.0475  0.0081  0.8471  20.91
ConvLSTM     4       0.0458  0.0076  0.8576  21.19
ConvGRU      2       0.0540  0.0102  0.8072  19.91
ConvGRU      3       0.0549  0.0102  0.8025  19.91
ConvGRU      4       0.0476  0.0079  0.8500  21.02
IncLSTM V1   2       0.0496  0.0090  0.8335  20.45
IncLSTM V1   3       0.0447  0.0073  0.8637  21.3
IncLSTM V1   4       0.0436  0.0070  0.8682  21.5
IncLSTM V2   2       0.0499  0.0091  0.8313  20.40
IncLSTM V2   3       0.0450  0.0074  0.8618  21.3
IncLSTM V2   4       0.0441  0.0071  0.8676  21.48
IncGRU V1    2       0.0502  0.0089  0.8315  20.50
IncGRU V1    3       0.0475  0.0079  0.8472  21.02
IncGRU V1    4       0.0519  0.0092  0.8203  20.36
IncGRU V2    2       0.0501  0.0089  0.8301  20.50
IncGRU V2    3       0.0467  0.0078  0.8527  21.07
IncGRU V2    4       0.0458  0.0075  0.8588  21.24

The number in the parenthesis indicates the number of layers

Table 2 Performance of different models on KTH dataset

Model        Layers  MAE     MSE     SSIM    PSNR
ConvLSTM     2       0.0441  0.0071  0.8676  21.48
ConvLSTM     3       0.0106  0.0004  0.9617  33.97
ConvLSTM     4       0.0116  0.0005  0.9558  33.01
IncLSTM V1   2       0.0106  0.0004  0.9610  33.97
IncLSTM V1   3       0.0103  0.0004  0.9636  33.97
IncLSTM V1   4       0.0109  0.0004  0.9589  33.97
IncLSTM V2   2       0.0107  0.0005  0.9600  33.01
IncLSTM V2   3       0.0104  0.0004  0.9626  33.97
IncLSTM V2   4       0.0106  0.0004  0.9613  33.97
IncGRU V1    2       0.0108  0.0005  0.9597  33.01
IncGRU V1    3       0.0105  0.0004  0.9623  33.97
IncGRU V1    4       0.0117  0.0005  0.9558  33.01
IncGRU V2    2       0.0110  0.0005  0.9586  33.01
IncGRU V2    3       0.0109  0.0004  0.9599  33.97
IncGRU V2    4       0.0103  0.0004  0.9602  33.97
ConvGRU      2       0.0144  0.0011  0.9310  29.58
ConvGRU      3       0.0115  0.0005  0.9561  33.01
ConvGRU      4       0.2207  0.1496  0.6386  8.250
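The PSNR and MSE columns of Tables 1 and 2 are closely related. As a hedged illustration, the sketch below uses the standard definition PSNR = 10 log10(MAX² / MSE) and assumes pixel intensities scaled to [0, 1], so that MAX = 1. That scaling is an assumption made here because it reproduces the tabulated values, not something stated in the text, and the function name is hypothetical.

```python
import math


def psnr_from_mse(mse, max_value=1.0):
    """Peak signal-to-noise ratio in decibels for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


# MSE values taken from Tables 1 and 2.
for mse in (0.0102, 0.0071, 0.0005, 0.0004):
    print(mse, round(psnr_from_mse(mse), 2))
```

For example, an MSE of 0.0004 gives about 33.98, against 33.97 in the tables, and 0.0005 gives about 33.01. The small last-digit differences suggest the tabulated PSNR was truncated rather than rounded, and they explain why several rows with the same rounded MSE share an identical PSNR.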

using RGB frames). The two and three-layer variants have


channel counts of [3, 48] and [3, 48, 96], respectively. The
height and width of the input image for each higher layer are
downsampled using 2 × 2 max pooling.
Figure 13 shows the comparison of convolutional LSTM
versus Inception LSTM V1, V2, and Inception GRU for a
sample video frame. One can observe that the predicted
images for all the models appear similar with no visible dif-
ference. This study uses the Mean Squared Error (MSE), the
Mean Absolute Error (MAE), and the Structural Similarity
Index (SSIM) for quantitative comparison. Figure 12 shows
MSE and SSIM performance measures on both KTH and
KITTI datasets for different numbers of previous frames. Our test sample includes 83 videos. We found the best results when using nine previous frames to forecast future frames. We calculate the mean and 95% confidence interval. Although the MSE (left) of the Inception-like LSTM is lower than that of the convolutional LSTM, the confidence intervals overlap. The SSIM measure shows the structural similarity of the predicted and ground-truth images. Inception V1 shows a better SSIM. The MSE and SSIM results indicate that the model reaches maximum performance after receiving at least five frames of the previous history, so at least five previous frames are relevant. We may be observing floor effects for the MSE.

Tables 1 and 2 compare the results of the Inception GRU, Inception V1 and V2, convolutional GRU, and convolutional LSTM for the KITTI and KTH datasets. The MSE of the Inception V1 LSTM and Inception GRU is lower than that of the other methods. Also, Inception V1 with three layers gives better performance than the four-layer convolutional LSTM. It is important to note that the Inception LSTM achieved better performance with fewer layers.

Fig. 9  Prediction results of Inception LSTM V1 with different numbers of layers receiving ten input frames

Experimental Results and Analysis

This paper evaluates the proposed video prediction architectures on the KITTI, KTH, and UCF11 datasets. We use the same architecture configuration as PredNet with respect to the image frames (160×128×3 and 160×120×3 pixels) and the feature maps in each layer, which match [3, 48, 96, 192]. The frame size in the KITTI dataset is 160×128×3 pixels after pre-processing [11]. We test the model with two, three, and four layers using 50 training epochs. We compared the model with the state-of-the-art convolutional PredNet models for video frame prediction. We use 41K frames of KITTI to train the model, 8.8K for testing, and 8.8K for validation. The source code and performance details of the data are publicly available at https://github.com/matinhosseiny/ [46].

Network Training and Testing

The models were tested on 2-, 3-, and 4-layer networks on the KITTI [25], KTH [47], and UCF11 [48] datasets (both

Results

We first train and test the model on different datasets to investigate its performance in video prediction. We then evaluate transfer learning by applying a model trained on one dataset to another. Furthermore, we visualize the similarity and error of the proposed architectures on the datasets.
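The mean and 95% confidence interval reported for the test videos can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: it assumes one score (for example, MSE) per test video and a normal approximation with z = 1.96, since the paper does not state which interval construction was used. The function names and the sample values are hypothetical.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, half_width) of a normal-approximation 95% CI.

    Assumption: `scores` holds one metric value per test video.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a CI")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(scores)
    # Standard error of the mean, scaled by z = 1.96 for a 95% interval.
    half_width = 1.96 * sd / math.sqrt(n)
    return mean, half_width


def intervals_overlap(a, b):
    """True if two (mean, half_width) intervals share at least one point."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return (mean_a - half_a) <= (mean_b + half_b) and (
        mean_b - half_b
    ) <= (mean_a + half_a)


if __name__ == "__main__":
    # Hypothetical per-video MSE values, for illustration only.
    model_a = [0.0061, 0.0070, 0.0066, 0.0059, 0.0073]
    model_b = [0.0064, 0.0075, 0.0069, 0.0062, 0.0078]
    ci_a, ci_b = mean_ci95(model_a), mean_ci95(model_b)
    print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Overlapping intervals, as observed for the MSE, mean that a lower mean alone does not establish a reliable difference between two models.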

Fig. 10  KITTI dataset next-frame prediction performance based on SSIM and MSE as functions of the number of previous frames used in the history. Left: Mean Square Error (MSE). Right: Structural Similarity (SSIM)
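The error and similarity measures used in this evaluation (MSE, MAE, PSNR, and SSIM) can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code behind the reported numbers: frames are assumed to be flattened into sequences of floats scaled to [0, 1], and SSIM is computed here from global image statistics with the commonly used constants K1 = 0.01 and K2 = 0.03, whereas standard implementations average SSIM over local windows, so their values will differ. All function names are hypothetical.

```python
import math


def _check(pred, target):
    # Both frames must contain the same, non-zero number of pixels.
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("frames must be non-empty and of equal size")


def mse(pred, target):
    """Mean squared error between two flattened frames."""
    _check(pred, target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error between two flattened frames."""
    _check(pred, target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


def global_ssim(pred, target, max_val=1.0):
    """Single-window SSIM from global means, variances, and covariance."""
    _check(pred, target)
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical 4-pixel frames, for illustration only.
    truth = [0.10, 0.40, 0.70, 0.90]
    guess = [0.12, 0.38, 0.65, 0.95]
    print(mse(guess, truth), mae(guess, truth))
    print(psnr(guess, truth), global_ssim(guess, truth))
```

Lower MSE and MAE indicate a smaller pixel error, while higher PSNR and SSIM indicate a prediction closer to the ground-truth frame.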


Evaluation on KITTI Dataset

Table 1 contains results for PredNet variants built on convolutional and Inception recurrent units, evaluated on the KITTI dataset. This paper evaluates all models in terms of pixel prediction. We used mean squared error (MSE), mean absolute error (MAE), and the structural similarity index measure (SSIM) to compare the performance of the models. The peak signal-to-noise ratio (PSNR) is also a good metric for comparing the actual and predicted images. Unlike MSE, which represents the cumulative squared error between the predicted and the original image, PSNR represents a measure of the peak error. The lower the MSE, the lower the error; conversely, the higher the PSNR, the better the predicted image. We find that the three-layer Inception LSTM, either V1 or V2, outperforms the four-layer convolutional LSTM. Both Inception LSTM versions are able to handle the spatio-temporal relationships of different objects using only convolutions. Figure 9 shows that the two-layer model produces a smoother image; however, the predictions of the four-layer network are similar to the actual image, with more pixel contrast and more visible edges (Fig. 10).

Figure 11 compares prediction outputs of different layers with steps from one to ten frame inputs. Table 1 compares the similarity and error of different Inception models with the convolutional LSTM and GRU. Inception LSTM V1 outperforms the other models. Moreover, Inception LSTM V2 with three layers outperforms the four-layer convolutional LSTM. Inception GRU V2 with three layers outperforms both the three-layer convolutional LSTM and the convolutional GRU. Moreover, Inception GRU is comparable to the four-layer convolutional LSTM. Finally, the two-layer Inception LSTM outperforms the convolutional GRU with two and three layers on the KITTI dataset.

Figure 11 shows MSE and SSIM performance measures on the KITTI dataset as a function of the number of previous frames used in the history. We calculate the mean and 95% confidence intervals. Although the confidence intervals overlap, the Inception LSTM V1 shows a lower mean squared error than the other methods. The SSIM measure shows the structural similarity of the predicted and ground-truth images.

Figure 11b (right) shows the SSIM results. Inception V1 outperforms the other models in terms of SSIM.

Fig. 11  KITTI dataset next-frame prediction performance based on SSIM and MSE as functions of the number of previous frames used in the history. Left: Mean Square Error (MSE). Right: Structural Similarity (SSIM)

Evaluation on KTH Dataset

Table 2 shows experimental results on the KTH dataset to check the overall performance of the Inception recurrent units. The three-layer Inception V1 LSTM outperforms the four-layer convolutional LSTM and GRU. The model also handles the picture's edges and spatio-temporal relationships very well. The Inception LSTM V2 works well, but not as well as the V1 model. Table 1 shows that the four-layer Inception LSTM models (V1 and V2) give the best results when predicting the next frames of the KITTI dataset. However, the three-layer Inception models have lower computational costs and outperform both the convolutional LSTM and GRU models. Table 2 shows that the three-layer Inception LSTM models (V1 and V2) outperform the convolutional LSTM and GRU when predicting the next video frame of the KTH dataset.

Figure 12 shows MSE and SSIM performance measures on the KTH dataset. We calculate the mean and 95% confidence interval. Although the confidence intervals overlap, the mean squared error of the Inception LSTM is lower than that of the other methods. Figure 12b shows the SSIM results as a function of the number of previous frames. The MSE and SSIM results indicate that the model reaches maximum performance after receiving at least five frames of the previous history. Figure 13 compares the MSE and SSIM of different architectures.

Fig. 12  Testing KTH dataset next-frame prediction performance as a function of the number of previous frames used in the history. Left: Mean Square Error (MSE). Right: Structural Similarity (SSIM)

Fig. 13  Comparing performance of different architectures on KITTI and KTH datasets using MSE and SSIM metrics

Evaluation on UCF11 Dataset

We tested the Inception V2 GRU using the UCF11 dataset. We also trained the model on the UCF11 dataset and tested it on KTH to evaluate how well the model performs for transfer learning. Table 3 shows the results of these experiments on the UCF11 and KTH test data.

Comparison with Convolutional Architectures

We compare the performance of our method with convolutional architectures that use the same configuration. Figure 14 provides the next ten predicted frames using different architectures on two datasets with different characteristics. The first row of Fig. 14 shows the actual images, and each predicted image is generated from the n previous images. For example, the leftmost image is predicted without any information about previous frames, while the rightmost image in each model's row is predicted after the model has consumed the actual images to its left. One dataset (KITTI) contains complex moving objects and a moving camera, while the other (KTH) contains only a moving human. The results are summarized in Tables 1 and 2. Both models are computationally expensive, but our model does not need to be trained several times to adjust the hyper-parameters. Moreover, this network benefits from using different kernel sizes, which capture motions with different speeds and scales.

Discussion

This paper studied novel Inception-like recurrent neural networks that use multi-kernel gates, including an Inception LSTM and an Inception GRU, within a predictive coding framework. The key idea was to introduce multiple kernel sizes within a convolutional gate. The performance of several variants of the proposed models was studied and compared with the standard convolutional LSTM for next-frame video prediction. The motivation was to remove the kernel-size hyper-parameter from the model and enable multi-scale analysis. The Inception V1 LSTM variant gave the best accuracy with fewer layers.

Our proposed Inception LSTMs and GRUs do not have all of the characteristics of an Inception module. For instance, the max-pooling component is wholly omitted. In our model, max-pooling is used between layers, so it is not needed within the Inception module. A similar situation holds for the 1 × 1 convolutions.

One of the limitations of the proposed Inception LSTM is that it takes more training time than ConvLSTM. This is because our method has many more trainable parameters than ConvLSTM. However, the model achieves better accuracy in terms of MSE, SSIM, and MAE, eliminates manual tuning of the kernel size, and shows promising results for transfer learning.

Table 4 shows the training times for all proposed architectures on the KTH dataset. Although both the convolutional LSTM and GRU take less time to train, they have to be run several times to fine-tune the kernel sizes, which makes the convolutional architectures more expensive overall. Moreover, Inception architectures remove the need to choose a kernel size, which reduces the development cost of the model.

Limitations and Future Work

The Inception recurrent network architecture is inspired by GoogLeNet and creates a wider network architecture instead of a deeper architecture [14] inside the LSTM and GRU architectures. Convolutional recurrent neural network based approaches require hyper-parameter tuning to improve the performance of video frame prediction. With the Inception recurrent network, tuning of the kernel size is eliminated. The major drawback of this work is that the computational cost of Inception recurrent networks increases compared to convolutional networks (Table 4), but we are able to achieve better accuracy with some sacrifice in training performance.
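To put the extra training cost in concrete terms, the training times reported in Table 4 for the KTH dataset can be expressed relative to the convolutional LSTM. The sketch below only performs that arithmetic; the times are the ones listed in Table 4, while the dictionary layout and the function name are illustrative choices and not part of the published code.

```python
# Training times in seconds on the KTH dataset, as listed in Table 4.
TRAIN_TIME_S = {
    "Convolutional LSTM": 5005,
    "Convolutional GRU": 4188,
    "Inception LSTM V1": 9150,
    "Inception LSTM V2": 5536,
    "Inception GRU V1": 9114,
    "Inception GRU V2": 5344,
}


def relative_cost(times, baseline="Convolutional LSTM"):
    """Return each architecture's training time divided by the baseline's."""
    base = times[baseline]
    return {name: seconds / base for name, seconds in times.items()}


if __name__ == "__main__":
    for name, ratio in relative_cost(TRAIN_TIME_S).items():
        print(f"{name:20s} {ratio:.2f}x")
```

With these figures, Inception LSTM V1 takes roughly 1.83 times as long to train as the convolutional LSTM, whereas Inception LSTM V2 takes roughly 1.11 times as long. This comparison covers a single training run and does not account for repeated runs needed to tune kernel sizes in the convolutional models.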

Table 3  Performance of different models on KTH dataset

Test data   Layers   MAE      MSE      SSIM
UCF11       2        0.0241   0.0036   0.8994
UCF11       3        0.0234   0.0035   0.9041
UCF11       4        0.0235   0.0034   0.9046
KTH         2        0.0117   0.0007   0.9510
KTH         3        0.0113   0.0006   0.9556
KTH         4        0.0114   0.0006   0.9555

The number in the parenthesis indicates the number of layers

Conclusion

This paper presents several novel Inception recurrent neural network architectures to be used with PredNet for video prediction. We evaluated the Inception architectures on three widely used video datasets: KITTI, KTH, and UCF11. It was confirmed that using different kernel sizes in an RNN cell can improve the accuracy of video prediction. The proposed model worked well for both the GRU and LSTM RNN


architectures. Inception LSTM V1 shows better results than the other convolutional and Inception RNNs. Moreover, we observed that Inception GRU V2 had prediction performance close to that of Inception LSTM V1 while lowering the computational cost.

Perhaps the most significant aspect of this work is reducing the hyper-parameter count in the model and thereby reducing the model development complexity. A limitation of the approach is that it has not been tested within other frameworks. This also provides an opportunity. These methods can be extended to a wide range of other significant problems where RNN architectures were shown to perform well, such as anomaly detection, activity recognition, large-scale video classification, and action detection.

Fig. 14  Comparing the output of the convolutional and Inception LSTMs on the KITTI and KTH datasets. The actual frame versus prediction using Inception-inspired LSTM V1, Inception-inspired LSTM V2, convolutional LSTM, and convolutional GRU with three layers

Table 4  Train times of different architectures on KTH dataset

Architecture          Time (s)
Convolutional LSTM    5005
Convolutional GRU     4188
Inception LSTM V1     9150
Inception LSTM V2     5536
Inception GRU V1      9114
Inception GRU V2      5344

Author Contributions  Conceptualization: MH, SH, AM; Methodology: MH, SH, and AM; Validation: MH, SH, RG, and AM; Formal analysis: MH, SH, and AM; Investigation: MH, SH, RG, and AM; Resources: RG and AM; draft preparation: MH, SH, RG, and AM; Writing (review and editing): MH, SH, RG, AM; Visualization: MH, SH, AM; Supervision: AM and RG; Project administration: AM; Funding acquisition: RG and AM

Funding  This research was partially funded by NSF grant CNS-1451916

Declarations

Conflict of interest  The authors declare that they have no competing interests.

Availability of data and materials  The source code used for the new model, data and evaluation is made available at GitHub https://github.com/matinhosseiny/Inception-inspired-LSTM-for-Video-frame-Prediction.


References

1. Finn C, Goodfellow I, Levine S. Unsupervised learning for physical interaction through video prediction. Adv Neural Inf Process Syst. 2016;29.
2. Liu W, Luo W, Lian D, Gao S. Future frame prediction for anomaly detection–a new baseline, In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 6536–6545.
3. Hosseini M, Salehi MA, Gottumukkala R. Enabling interactive video streaming for public safety monitoring through batch scheduling, In: 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), IEEE; 2017. p. 474–481.
4. Ishihara Y, Takahashi M. Empirical study of future image prediction for image-based mobile robot navigation. Robot Auton Syst. 2022. p. 104018.
5. Hosseini M, Sohrab F, Gottumukkala R, Bhupatiraju RT, Katragadda S, Raitoharju J, Iosifidis A, Gabbouj M. Empathicschool: A multimodal dataset for real-time facial expressions and physiological data analysis under different stress conditions, arXiv preprint arXiv:2209.13542; 2022.
6. Liu Z, Yeh RA, Tang X, Liu Y, Agarwala A. Video frame synthesis using deep voxel flow, In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 4463–4471.
7. Medel JR, Savakis A. Anomaly detection in video using predictive convolutional long short-term memory networks, arXiv preprint arXiv:1612.00390; 2016.
8. Xu H, Gao Y, Yu F, Darrell T. End-to-end learning of driving models from large-scale video datasets, In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE; 2017. p. 3530–3538.
9. Finn C, Levine S. Deep visual foresight for planning robot motion, In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE; 2017. p. 2786–2793.
10. Friston K. A theory of cortical responses. Phil Trans R Soc B. 2005;360:815–36.
11. Lotter W, Kreiman G, Cox D. Deep predictive coding networks for video prediction and unsupervised learning, arXiv preprint arXiv:1605.08104; 2016.
12. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks, In: Advances in neural information processing systems; 2012. p. 1097–1105.
13. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556; 2014.
14. Zagoruyko S, Komodakis N. Wide residual networks, arXiv preprint arXiv:1605.07146; 2016.
15. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions, In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 1–9.
16. Srivastava N, Mansimov E, Salakhudinov R. Unsupervised learning of video representations using lstms, In: International conference on machine learning, PMLR; 2015. p. 843–852.
17. Hosseini M, Katragadda S, Wojtkiewicz J, Gottumukkala R, Maida A, Chambers TL. Direct normal irradiance forecasting using multivariate gated recurrent units. Energies. 2020;13:3914.
18. Lotter W, Kreiman G, Cox D. Unsupervised learning of visual structure using predictive generative networks, arXiv preprint arXiv:1511.06380; 2015.
19. Xingjian S, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-c. Convolutional lstm network: A machine learning approach for precipitation nowcasting, In: Advances in neural information processing systems; 2015. p. 802–810.
20. Villegas R, Yang J, Hong S, Lin X, Lee H. Decomposing motion and content for natural video sequence prediction, arXiv preprint arXiv:1706.08033; 2017.
21. Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999;2:79.
22. Rane RP, Szügyi E, Saxena V, Ofner A, Stober S. Prednet and predictive coding: A critical review, In: Proceedings of the 2020 International Conference on Multimedia Retrieval; 2020. p. 233–241.
23. Rane R, Szügyi E, Saxena V, Ofner A, Stober S. Prednet and predictive coding: A critical review, arXiv preprint arXiv:1906.11902; 2019.
24. Hosseini M, Maida A. Hierarchical predictive coding models in a deep-learning framework, arXiv preprint arXiv:2005.03230; 2020.
25. Geiger A, Lenz P, Stiller C, Urtasun R. Vision meets robotics: The kitti dataset. Int J Robot Res. 2013;32:1231–7.
26. Laptev I, Caputo B, et al. Recognizing human actions: a local svm approach, In: null, IEEE; 2004. p. 32–36.
27. Jang Y, Kim G, Song Y. Video prediction with appearance and motion conditions, In: International Conference on Machine Learning, PMLR; 2018. p. 2225–2234.
28. Liang X, Lee L, Dai W, Xing EP. Dual motion gan for future-flow embedded video prediction, In: Proceedings of the IEEE international conference on computer vision; 2017. p. 1744–1752.
29. Ho Y-H, Cho C-Y, Peng W-H, Jin G-L. Sme-net: Sparse motion estimation for parametric video prediction through reinforcement learning, In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 10462–10470.
30. Zhou Y, Berg TL. Learning temporal transformations from time-lapse videos, CoRR arXiv:1608.07724; 2016.
31. Hsieh J-T, Liu B, Huang D-A, Fei-Fei LF, Niebles JC. Learning to decompose and disentangle representations for video prediction, In: Advances in Neural Information Processing Systems; 2018. p. 517–526.
32. Zhao M, Zhuang C, Wang Y, Lee TS. Predictive encoding of contextual relationships for perceptual inference, interpolation and prediction, arXiv preprint arXiv:1411.3815; 2014.
33. Goroshin R, Mathieu M, LeCun Y. Learning to linearize under uncertainty, CoRR arXiv:1506.03011; 2015.
34. Wojtkiewicz J, Hosseini M, Gottumukkala R, Chambers TL. Hour-ahead solar irradiance forecasting using multivariate gated recurrent units. Energies. 2019;12:4055.
35. Wang Y, Long M, Wang J, Gao Z, Philip SY. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms, In: Advances in Neural Information Processing Systems; 2017. p. 879–888.
36. Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-C. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, editors. Advances in Neural Information Processing Systems, vol. 28. Curran Associates Inc; 2015. https://proceedings.neurips.cc/paper/2015/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf.
37. Jozefowicz R, Zaremba W, Sutskever I. An empirical exploration of recurrent network architectures, In: International Conference on Machine Learning; 2015. p. 2342–2350.
38. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555; 2014.
39. Xu Z, Wang Y, Long M, Wang J, KLiss M. Predcnn: Predictive learning with cascade convolutions, In: IJCAI; 2018. p. 2940–2947.


40. Kumar M, Babaeizadeh M, Erhan D, Finn C, Levine S, Dinh L, Kingma D. Videoflow: A conditional flow-based model for stochastic video generation, arXiv preprint arXiv:1903.01434; 2019.
41. Reda FA, Liu G, Shih KJ, Kirby R, Barker J, Tarjan D, Tao A, Catanzaro B. Sdc-net: Video prediction using spatially-displaced convolution, In: Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 718–733.
42. Alom M, Hasan M, Yakopcic C, Tarek M, Taha T. Inception recurrent convolutional neural network for object recognition. arXiv preprint arXiv:1704.07709; 2017.
43. Hosseini M, Maida AS, Hosseini M, Raju G. Inception-inspired lstm for next-frame video prediction, arXiv preprint arXiv:1909.05622; 2019.
44. Heidari M, Rafatirad S. Using transfer learning approach to implement convolutional neural network model to recommend airline tickets by using online reviews, In: 2020 15th International Workshop on Semantic and Social media Adaptation and Personalization. SMA, IEEE; 2020. p. 1–6.
45. Liu J, Luo J, Shah M. Recognizing realistic actions from videos “in the wild”, In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE; 2009. p. 1996–2003.
46. Hosseini M, Maida AS, Hosseini M, Raju G. Inception lstm for next-frame video prediction (student abstract). Proc AAAI Conf Artif Intell. 2020;34:13809–10.
47. Schüldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach, In: Proc. Int. Conf. Pattern Recognition (ICPR’04), Cambridge, U.K; 2004.
48. Reddy KK, Shah M. Recognizing 50 human action categories of web videos. Mach Vis Appl. 2013;24:971–81.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
