Inception Recurrent Neural Network Architecture For Video Frame Prediction
https://doi.org/10.1007/s42979-022-01498-y
ORIGINAL RESEARCH
Received: 10 February 2022 / Accepted: 7 November 2022 / Published online: 27 November 2022
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2022
Abstract
Video frame prediction is needed for various computer-vision-based systems such as self-driving vehicles and video streaming. This paper proposes a novel Inception-based convolutional recurrent neural network (RNN) as an enhancement to a basic gated convolutional RNN. A basic gated convolutional RNN has fixed-size kernels that are hyperparameters of the network. Our model replaces the single-size kernel in the convolutional RNN with Inception-like multi-channel kernels. Multiple kernel sizes allow capturing the spatio-temporal dynamics of multiple objects in the video, compared to one single-sized kernel. Our model is tested within a predictive coding framework to improve video frame prediction. We seek to determine whether multi-kernel convolutional gated RNNs improve performance compared to basic convolutional RNNs. We study different variants of the proposed multi-kernel convolutional RNNs, namely LSTM and GRU, with both Inception V1 and Inception V2 configurations. We observe that the proposed models improve video frame prediction compared to existing PredNet-based methods, at a minor additional cost in training time.
Keywords Video frame prediction · Deep learning · Video prediction · PredNet · Predictive coding
…spatio-temporal dynamics in the video. LSTMs have been applied to video sequence learning by several researchers. Srivastava et al. [16] designed an encoder-decoder LSTM for predicting future frames and current input. Lotter et al. [18] also designed an LSTM-based architecture consisting of a convolution layer, LSTM, and de-convolution for next-frame video prediction. One of the pioneering LSTM articles, by Shi et al. [19], incorporated convolution operations into LSTMs for precipitation nowcasting. This convolutional LSTM has become part of several deep learning architectures. Villegas et al. [20] developed a video frame prediction model using a combination of a Motion Encoder and a Content Encoder built on convolutional LSTM. Lotter et al. [11] proposed the PredNet architecture, the first deep learning model inspired by predictive coding principles [21]. PredNet is not a comprehensive emulation of predictive coding [22, 23], and several improvements are possible. Our prior work [24] introduced a new variation of the predictive coding architecture based on the Rao and Ballard [21] protocols.

The following are the key contributions of the paper:

• We introduce an inception-inspired recurrent neural network architecture for video frame prediction.
• The proposed architecture is validated against two versions of Inception, namely Inception V2, which uses two 3 × 3 kernels in sequence, and Inception V1, which uses 5 × 5 kernels.
• We observe that the proposed 3-layer inception module outperforms 4-layer convolutional recurrent networks on two standard benchmark datasets, namely KITTI [25] and KTH [26].
• We also observe that the proposed model works very well for transfer learning, where the model was trained on the UCF11 dataset and tested on the KTH dataset.

We find that the proposed model automatically trains the best kernel size by deciding whether to forget or remember the previous state, instead of relying on predefined hyper-parameters (such as a manually chosen kernel size). Furthermore, we eliminated max-pooling to reduce the computational cost, because max-pooling is already used across layers in the PredNet architecture. We evaluate the LSTM and GRU models for both Inception V1 and Inception V2 on standard benchmark datasets. A 3 × 3 convolutional kernel placed before the LSTM module covers the role of the 1 × 1 kernel in the inception architecture. We also eliminate the 1 × 1 convolutional kernels before each of the 1 × 1, 3 × 3, and 5 × 5 kernels. Our proposed model offers a performance improvement in video frame prediction compared to convolutional LSTM (ConvLSTM) and GRU (ConvGRU) within the PredNet video frame prediction architecture.

The following is the organization of our paper. The following section provides related work, and the subsequent section provides the methods, namely the general Inception LSTM and GRU architectures, formulations, algorithms, and datasets. Section "Experimental Results and Analysis" provides the experimental results and analysis of the proposed methods on two well-known datasets, namely KITTI and KTH. A brief discussion related to the methods is provided in "Discussion". Finally, we provide conclusions and the scope of future work in "Limitations and Future Work".

Related Work

In general, video frame prediction aims to predict the next few frames given the observed previous frames. Deep learning approaches to video frame prediction roughly fall into three categories. Motion-features-based strategies capture the object trajectory dynamics to forecast future frames [20, 27–30]. These object-centric methods are computationally efficient but do not perform well when multiple objects are in the video. The other two categories are the well-known recurrent and non-recurrent methods. Fully-connected networks and convolutional networks are some of the prominent non-recurrent neural network architectures. Non-recurrent neural networks have low computational overhead but poor prediction performance compared to recurrent neural networks because of their inability to capture sequential dependencies [28, 31, 32]. The autoencoder is a popular non-recurrent architecture that has been used in several models to predict future frames [33].

Recurrent neural networks have become quite popular due to their ability to model sequential tasks and to accumulate and forget selective information. Hsieh et al. [31] used 2-dimensional recurrence as input in the autoencoder for decomposition and disentanglement of motion and video components. Both LSTMs and GRUs are popular gated recurrent neural network models; they differ in the arrangement of their gates [17, 34]. Srivastava et al. [16] used a multi-layer LSTM for video prediction. There are several LSTM-based architectures for video frame prediction. PredRNN [35] and PredRNN++ used spatiotemporal LSTMs with dual memory architectures. [36] proposed replacing the fully connected LSTM with a convolutional LSTM. There are also several hybrid architectures that incorporate convolutions in recurrent network architectures. Convolutional neural networks (CNNs) are widely used for various image classification
tasks such as AlexNet [12], VGG [13], ResNet [14], and GoogLeNet [15]. GoogLeNet proposed Inception CNN layers to create a wider network instead of a deeper one. Lotter et al. [18] developed an architecture that comprises convolution, LSTM, and de-convolution for next-frame video prediction. Shi et al. [19] proposed a convolutional LSTM that uses the convolution operation instead of the dot product. Lotter et al. [11] proposed an architecture known as PredNet that was inspired by predictive coding principles [21]. We used PredNet as the starting point for our implementation. A GRU cell has a similar structure to an LSTM but differs in terms of the parameters, the controlled exposure to memory content, and the input gate's location. Some experiments, such as those by Rafael et al. [37], revealed that LSTM and GRU have comparable performance, with the exception of adding a bias of one to the forget gate. Given the similarities and dissimilarities, it is hard to conclude which type of gating unit performs better [38]. There are other video frame prediction architectures, such as PredCNN [39], VideoFlow [40], and SDC-Net [41]. Still, the scope of analysis in this paper is limited to implementing inception-LSTM units in the PredNet architecture.

Our proposed model integrates the inception modules from GoogLeNet [15] into the recurrent neural network modules. The inception modules are integrated into each gate of a convolutional RNN, thereby transforming the gates from single-kernel to multi-kernel units. One of the recent architectures, from Alom et al. [42], also tested the integration of the inception module into a recurrent convolutional layer for image classification. However, Alom et al. designed an architecture that concatenates the inception unit, recurrent convolutional layers, and residual units. Our proposed design is quite different in that we integrate the inception module inside the RNN module itself. The GoogLeNet architecture in our design helps capture multiple scales in the convolutional layers without the need to introduce new layers. The inception architecture has multiple variations of kernel sizes, namely 1 × 1, 3 × 3, and 5 × 5. The proposed work builds on our prior work [43], where we experimented with the Inception LSTM idea on the KITTI dataset. This paper presents a general inception recurrent neural network architecture that differs from our prior work. Specifically, we provide designs for multiple recurrent neural network modules, namely LSTM and GRU, for different variations of kernel sizes. We evaluated the proposed methods on three datasets: KITTI, KTH, and UCF-11. Finally, we investigate the proposed neural network architecture's effectiveness in learning and transferring knowledge from one dataset to another through transfer learning.

Methods

We first describe the proposed inception recurrent neural network architecture-based method. Then we describe the Inception LSTM and Inception GRU in detail. Finally, we describe how the proposed architecture is applied to video prediction.

Inception-Inspired Recurrent Networks

The convolutional LSTM [44] takes multi-channel images as input and applies convolution at each gate in the LSTM cell. Video has complex spatio-temporal dynamics: objects may have different sizes, and they may also move at varied paces within the video. For example, some objects might move very slowly, while others have faster movements. The embedding of convolution into the LSTM helps capture the spatio-temporal features of images, an essential requirement for video frame prediction with multiple objects that have complex spatio-temporal dynamics [19]. Kernel size selection also plays an essential role in capturing different objects' spatio-temporal dynamics. Smaller kernel sizes are suitable for capturing slower motions, while larger kernels capture faster motions [19]. One limitation of the convolutional LSTM is that the kernel size is fixed, which means the network can work well for objects moving either at a slow pace or at a fast pace, but not both. Introducing the flexibility to include multiple kernels in the recurrent neural network architecture allows the capture of complex spatio-temporal dynamics. We included three convolutions with varying kernel sizes for each gate in one cell (see Fig. 1). The kernels range from a small size of 1 × 1 up to a larger size of 5 × 5. This proposed architecture is sensitive to videos that contain objects of varying sizes.

In the convolutional LSTM, the standard dot product operations are replaced by convolution operations as follows:

$$i_t = \sigma(W_{ix} * x_t + W_{ih} * h_{t-1} + b_i) \tag{1a}$$

$$f_t = \sigma(W_{fx} * x_t + W_{fh} * h_{t-1} + b_f) \tag{1b}$$

$$c'_t = i_t \odot \tanh(W_{cx} * x_t + W_{ch} * h_{t-1} + b_c) \tag{1c}$$

$$c_t = f_t \odot c_{t-1} + c'_t \tag{1d}$$

$$o_t = \sigma(W_{ox} * x_t + W_{oh} * h_{t-1} + b_o) \tag{1e}$$
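As a concrete illustration of Eqs. (1a)–(1e), the following is a minimal PyTorch sketch of a ConvLSTM cell. This is our own illustrative code, not the authors' released implementation; the class and variable names are assumptions, and the hidden-state update h_t = o_t ⊙ tanh(c_t) follows the standard ConvLSTM formulation rather than an equation shown above.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell following Eqs. (1a)-(1e).

    One convolution over the concatenated [x_t, h_{t-1}] produces all
    four gate pre-activations at once; splitting its output channels is
    equivalent to the separate W_ix, W_ih, ... weights in the text.
    """

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2  # 'same' padding keeps the spatial size fixed
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        # Split the stacked pre-activations into the four gates.
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)  # Eqs. (1c)-(1d)
        h_next = o * torch.tanh(c_next)     # standard ConvLSTM output (assumed)
        return h_next, c_next
```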
…convolutional LSTM that uses multi-kernel gates in its architecture. The inputs to all gates are the stacking of convolution operations with different kernel sizes. The equations for Inception LSTM V1 (Fig. 1) are as follows:

$$i_t = \sigma\left(\begin{bmatrix} W_{i_{1\times1}} * [x_t, h_{t-1}] \\ W_{i_{3\times3}} * [x_t, h_{t-1}] \\ W_{i_{5\times5}} * [x_t, h_{t-1}] \end{bmatrix} + b_i\right) \tag{3a}$$

$$f_t = \sigma\left(\begin{bmatrix} W_{f_{1\times1}} * [x_t, h_{t-1}] \\ W_{f_{3\times3}} * [x_t, h_{t-1}] \\ W_{f_{5\times5}} * [x_t, h_{t-1}] \end{bmatrix} + b_f\right) \tag{3b}$$

$$g_t = \sigma\left(\begin{bmatrix} W_{g_{1\times1}} * [x_t, h_{t-1}] \\ W_{g_{3\times3}} * [x_t, h_{t-1}] \\ W_{g_{5\times5}} * [x_t, h_{t-1}] \end{bmatrix} + b_g\right) \tag{3c}$$

$$o_t = \sigma\left(\begin{bmatrix} W_{o_{1\times1}} * [x_t, h_{t-1}] \\ W_{o_{3\times3}} * [x_t, h_{t-1}] \\ W_{o_{5\times5}} * [x_t, h_{t-1}] \end{bmatrix} + b_o\right) \tag{3d}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{3e}$$

Inception LSTM V2 replaces the single 5 × 5 kernel with two 3 × 3 kernels in sequence. This design has the advantage of reducing the number of parameters of that branch from 25n_c to 18n_c (n_c denotes the number of channels). The equation used in Inception LSTM V2 for the input gate is given in Eq. (4a); the other gate equations have the same structure:

$$i_t = \sigma\left(\begin{bmatrix} W_{i_{1\times1}} * [x_t, h_{t-1}] \\ W^1_{i_{3\times3}} * [x_t, h_{t-1}] \\ W^2_{i_{3\times3}} * [W^3_{i_{3\times3}} * [x_t, h_{t-1}]] \end{bmatrix} + b_i\right) \tag{4a}$$

$$f_t = \sigma\left(\begin{bmatrix} W_{f_{1\times1}} * [x_t, h_{t-1}] \\ W^1_{f_{3\times3}} * [x_t, h_{t-1}] \\ W^2_{f_{3\times3}} * [W^3_{f_{3\times3}} * [x_t, h_{t-1}]] \end{bmatrix} + b_f\right) \tag{4b}$$

$$g_t = \sigma\left(\begin{bmatrix} W_{g_{1\times1}} * [x_t, h_{t-1}] \\ W^1_{g_{3\times3}} * [x_t, h_{t-1}] \\ W^2_{g_{3\times3}} * [W^3_{g_{3\times3}} * [x_t, h_{t-1}]] \end{bmatrix} + b_g\right) \tag{4c}$$

$$o_t = \sigma\left(\begin{bmatrix} W_{o_{1\times1}} * [x_t, h_{t-1}] \\ W^1_{o_{3\times3}} * [x_t, h_{t-1}] \\ W^2_{o_{3\times3}} * [W^3_{o_{3\times3}} * [x_t, h_{t-1}]] \end{bmatrix} + b_o\right) \tag{4d}$$
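To make the multi-kernel gates concrete, the sketch below assembles one Inception-style gate in PyTorch. It is a sketch under our own naming assumptions, not the released code: the V1 branch set follows Eq. (3a) (parallel 1 × 1, 3 × 3, and 5 × 5 convolutions over [x_t, h_{t-1}]), and the V2 branch set follows Eq. (4a), where the 5 × 5 branch is replaced by two 3 × 3 convolutions in sequence.

```python
import torch
import torch.nn as nn

class InceptionGate(nn.Module):
    """One multi-kernel gate as in Eqs. (3a)/(4a): parallel convolution
    branches over the concatenated [x_t, h_{t-1}], stacked channel-wise."""

    def __init__(self, in_channels: int, out_channels: int, version: int = 1):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, 1)
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        if version == 1:
            # V1: a single 5x5 branch (25 weights per channel pair).
            self.large = nn.Conv2d(in_channels, out_channels, 5, padding=2)
        else:
            # V2: two 3x3 kernels in sequence (9 + 9 = 18 weights per
            # channel pair) emulate the 5x5 receptive field more cheaply.
            self.large = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=1),
                nn.Conv2d(out_channels, out_channels, 3, padding=1))

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=1)  # [x_t, h_{t-1}]
        return torch.cat([self.conv1x1(xh), self.conv3x3(xh), self.large(xh)], dim=1)
```

Applying σ (or tanh) to such a stacked output plus a bias yields each gate; the 25n_c to 18n_c parameter reduction quoted above is exactly the 5 × 5 branch (25 weights per channel pair) being replaced by two 3 × 3 kernels (18).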
The loss function for the PredNet [11] model is illustrated below:

$$L_{\mathrm{train}} = \sum_t \lambda_t \sum_l \frac{\lambda_l}{n_l} \sum_{n_l} E_t^l \tag{5}$$

where λ_t and λ_l weight the time steps and layers, respectively, and E_t^l denotes the error units of layer l at time t, of which there are n_l.
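A minimal sketch of Eq. (5), assuming the layer-wise errors E_t^l have already been computed and flattened per sample (function and tensor names are ours):

```python
import torch

def prednet_loss(errors, layer_weights, time_weights):
    """Weighted PredNet training loss, Eq. (5).

    errors: nested list indexed [t][l]; each entry is a tensor of shape
            (batch, n_l) holding the n_l error units E_t^l of layer l.
    """
    loss = torch.zeros(())
    for t, lam_t in enumerate(time_weights):
        for l, lam_l in enumerate(layer_weights):
            e = errors[t][l]
            n_l = e.shape[1]  # number of units in layer l
            loss = loss + lam_t * (lam_l / n_l) * e.sum(dim=1).mean()
    return loss
```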
Inception GRU
The Inception GRU uses three different kernel sizes (1 × 1, 3 × 3, and 5 × 5) in each gate, similar to the Inception LSTM:

$$r_t = \sigma\left(\begin{bmatrix} W_{r_{1\times1}} * [x_t, h_{t-1}] \\ W_{r_{3\times3}} * [x_t, h_{t-1}] \\ W_{r_{5\times5}} * [x_t, h_{t-1}] \end{bmatrix} + b_r\right) \tag{6b}$$

$$c_t = \sigma\left(\begin{bmatrix} W_{c_{1\times1}} * [x_t, r_t \odot h_{t-1}] \\ W_{c_{3\times3}} * [x_t, r_t \odot h_{t-1}] \\ W_{c_{5\times5}} * [x_t, r_t \odot h_{t-1}] \end{bmatrix} + b_c\right) \tag{6c}$$

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot c_t \tag{6d}$$

Figures 4 and 5 show the Inception GRU V1 and V2 models.

The inception recurrent neural networks are embedded into the PredNet architecture [11]; the overall architecture is shown in Fig. 3. The system consists of an Inception RNN module that extracts spatio-temporal features of an image using different convolutions. Convolution helps the image prediction through shared weights that implicitly preserve the spatio-temporal dynamics of the image. The use of inception in the architecture means that the kernel size does not have to be included as a hyper-parameter.

The Inception GRU V2 is similar to the Inception LSTM V2: it uses two 3 × 3 kernels in sequence, in parallel with the 1 × 1 and 3 × 3 kernels, instead of a 5 × 5 kernel, to reduce the computational cost. Figure 5 shows the architecture of Inception GRU V2. The Inception GRU V2 formulation combines Inception LSTM V2 and the Inception GRU, as follows:

$$z_t = \sigma\left(\begin{bmatrix} W_{z_{1\times1}} * [x_t, h_{t-1}] \\ W^1_{z_{3\times3}} * [x_t, h_{t-1}] \\ W^2_{z_{3\times3}} * [W^3_{z_{3\times3}} * [x_t, h_{t-1}]] \end{bmatrix} + b_z\right) \tag{7a}$$
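Reusing the hypothetical InceptionGate sketch from above (version=2 selects the 1 × 1 / 3 × 3 / stacked 3 × 3 branches), the update gate of Eq. (7a) could be wired as follows; the tensor shapes are illustrative only:

```python
import torch

# Illustrative wiring of Eq. (7a) with the InceptionGate sketch above.
x_t = torch.randn(1, 3, 64, 64)     # dummy input frame (batch, C_x, H, W)
h_prev = torch.randn(1, 8, 64, 64)  # dummy previous hidden state (batch, C_h, H, W)

z_gate = InceptionGate(in_channels=3 + 8, out_channels=8, version=2)
z_t = torch.sigmoid(z_gate(x_t, h_prev))  # update gate; conv biases play the role of b_z

# Note: the stacked gate output has 3 * out_channels channels, so in a full
# cell the hidden state (and r_t, c_t) must be sized to match that stacking.
```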
Evaluation on KITTI Dataset

Table 1 contains results for additional variations of the PredNet with convolutional and inception units, evaluated on the KITTI dataset. This paper evaluates all models in terms of pixel prediction. We used mean squared error (MSE), mean absolute error (MAE), and the structural similarity index measure (SSIM) to compare the performance of the models. The peak signal-to-noise ratio (PSNR) is a good metric for comparing the actual and predicted images: unlike mean squared error, which represents the cumulative squared error between the predicted and the original image, PSNR represents a measure of the peak error. The lower the MSE, the lower the error; the higher the PSNR, the better the predicted image. We find that the three-layer Inception LSTM, either V1 or V2, outperforms the 4-layer convolutional LSTM. Both Inception LSTM versions are able to handle the spatio-temporal relationships of different objects using only convolutions. Figure 9 shows that the two-layer model produces a smoother image; however, the predictions of the 4-layer network are closer to the actual image, with more pixel contrast, and edges are more visible using the 4-layer network (Fig. 10).

Figure 11 compares prediction outputs of different layers with steps from one to ten input frames. Table 1 compares the similarity and error of the different inception models with the convolutional LSTM and GRU. Inception LSTM V1 outperforms the other models. Moreover, the three-layer Inception LSTM V2 outperforms the four-layer convolutional LSTM. Inception GRU V2 with three layers outperforms both the 3-layer convolutional LSTM and the convolutional GRU. Moreover, the Inception GRU is comparable to the 4-layer convolutional LSTM. Finally, the two-layer Inception LSTM outperforms the convolutional GRU with two and three layers on the KITTI dataset.

Figure 11 shows the MSE and SSIM performance measures on the KITTI dataset as a function of the number of previous frames used in the history. We calculate the mean and 95% confidence intervals. Although the confidence intervals overlap, Inception LSTM V1 shows lower mean squared error than the other methods. The SSIM measure shows the structural similarity of the predicted and ground-truth images. Figure 11b (right) shows the SSIM results; Inception V1 outperforms the other models in terms of SSIM.

Evaluation on KTH Dataset

Table 2 shows experimental results on the KTH dataset to check the inception recurrent units' overall performance. The three-layer Inception V1 LSTM outperforms the four-layer convolutional LSTM and GRU. The model can also handle the picture's edges and spatio-temporal relationships very well. The Inception LSTM V2 works well, but not as well as the V1 model. Table 1 shows that the 4-layer Inception LSTM models (V1 and V2) give the best results when predicting the next frames of the KITTI dataset. However, the 3-layer Inception models have lower computational costs and still outperform both the convolutional LSTM and GRU models. Table 2 shows that the 3-layer Inception LSTM models (V1 and V2) outperform the convolutional LSTM and GRU in predicting the next video frame of the KTH dataset.

Figure 12 shows the MSE and SSIM performance measures on the KTH dataset. We calculate the mean and 95% confidence intervals. Although the confidence intervals overlap, the mean squared error of the Inception LSTM is lower than that of the other methods. Figure 12b shows the SSIM results as a function of the number of previous frames. The MSE and SSIM results indicate that the model reaches maximum performance after receiving at least five frames of the previous history. Figure 13 compares the MSE and SSIM of the different architectures.
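For reference, the frame-quality metrics reported here can be computed with NumPy and scikit-image; below is a minimal example on a pair of grayscale frames scaled to [0, 1] (the function and array names are ours):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_scores(actual, predicted):
    """Return MSE, MAE, PSNR, and SSIM between two frames in [0, 1]."""
    mse = float(np.mean((actual - predicted) ** 2))
    mae = float(np.mean(np.abs(actual - predicted)))
    psnr = peak_signal_noise_ratio(actual, predicted, data_range=1.0)
    ssim = structural_similarity(actual, predicted, data_range=1.0)
    return mse, mae, psnr, ssim

# Example with a dummy frame and a noisy "prediction" of it
actual = np.random.rand(128, 160)
predicted = np.clip(actual + 0.05 * np.random.randn(128, 160), 0.0, 1.0)
print(frame_scores(actual, predicted))
```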
Fig. 11 KITTI dataset next-frame prediction performance based on SSIM and MSE as functions of the number of previous frames used in the history. Left: mean squared error (MSE). Right: structural similarity (SSIM)
Fig. 12 Testing KTH dataset next-frame prediction performance as a function of the number of previous frames used in the history. Left: mean squared error (MSE). Right: structural similarity (SSIM)
Fig. 13 Comparing the performance of different architectures on the KITTI and KTH datasets using the MSE and SSIM metrics
Evaluation on UCF11 Dataset

We tested the Inception V2 GRU using the UCF11 dataset. We also trained the model on the UCF11 dataset and tested it on KTH to evaluate how well the model performs in transfer learning. Table 3 shows the results of these experiments using the UCF11 and KTH test sets.

Comparison with Convolutional Architectures

We compare the performance of our method with convolutional architecture methods using the same configuration. Figure 14 provides the next ten predicted frames using different architectures on two datasets with different characteristics. The first row of Fig. 13 shows the actual images, and each image is predicted using the n previous images. For example, the leftmost image does not have any information about the preceding frames, while the rightmost image of each module has consumed the actual images to its left. One of the datasets contains complex moving objects and a moving camera, while the other has only a complex moving human. The results are summarized in Tables 1 and 2. Both models are computationally expensive, but our model does not need to be trained several times to adjust the hyper-parameters. Moreover, this network benefits from the advantages of using different kernel sizes, which capture different motions with different speeds and scales.

Discussion

This paper studied novel Inception-like recurrent neural networks that use multi-kernel gates, including an Inception LSTM and an Inception GRU, to be used within a predictive coding framework. The key idea was to introduce multiple kernel sizes within a convolutional gate. The performance of several variants of the proposed models was studied and compared with the standard convolutional LSTM for next-frame video prediction. The motivation was to remove the kernel-size hyper-parameter from the model and enable multi-scale analysis. The Inception V1 LSTM variant gave the best accuracy with fewer layers.

Our proposed Inception LSTMs and GRUs do not have all of the characteristics of an Inception module. For instance, the max-pooling component is wholly omitted. In our model, max-pooling was used between layers, so it was not needed within the inception module. A similar situation holds for the 1 × 1 convolutions.

One limitation of the proposed Inception LSTM is that it takes more training time than the ConvLSTM. This is because our method's number of training parameters is much higher than that of the ConvLSTM. But the model achieves better accuracy in terms of MSE, SSIM, and MAE, eliminates manual hyper-parameter tuning, and shows promising results for transfer learning.

Table 4 shows the training times for all the proposed architectures on the KTH dataset. Although both the convolutional LSTM and GRU take less time to train, they have to be run several times to fine-tune the kernel sizes, which makes the convolutional architectures more expensive overall. Moreover, using inception architectures provides more convenience by eliminating the kernel-size hyper-parameter, which effectively reduces the development cost of the model.

Limitations and Future Work

The inception recurrent network architecture is inspired by GoogLeNet, creating a wider network architecture instead of a deeper one [14] inside the LSTM and GRU architectures. Convolutional recurrent neural network based approaches require hyper-parameter tuning to improve video frame prediction performance. As a result of introducing the inception recurrent network, this hyper-parameter tuning is eliminated. The major drawback of this work is that the computational cost of inception recurrent networks increases slightly (compared to convolutional networks), but we are able to achieve better accuracy with some sacrifice in training performance.
Table 4 Training times of different architectures on the KTH dataset

Architecture         Time (s)
Convolutional LSTM   5005
Convolutional GRU    4188
Inception LSTM V1    9150
Inception LSTM V2    5536
Inception GRU V1     9114
Inception GRU V2     5344

Inception LSTM V1 shows better results than the other convolutional and Inception RNNs. Moreover, we observed that Inception GRU V2 had prediction performance close to Inception LSTM V1 while lowering the computational cost.

Perhaps the most significant aspect of this work is reducing the hyper-parameter count in the model and thereby reducing the model development complexity. A limitation of the approach is that it has not been tested within other frameworks. This also provides an opportunity: these methods can be extended to a wide range of other significant problems where RNN architectures were shown to perform well, such as anomaly detection, activity recognition, large-scale video classification, and action detection.

Author Contributions Conceptualization: MH, SH, AM; Methodology: MH, SH, and AM; Validation: MH, SH, RG, and AM; Formal analysis: MH, SH, and AM; Investigation: MH, SH, RG, and AM; Resources: RG and AM; Draft preparation: MH, SH, RG, and AM; Writing, review and editing: MH, SH, RG, AM; Visualization: MH, SH, AM; Supervision: AM and RG; Project administration: AM; Funding acquisition: RG and AM.

Funding This research was partially funded by NSF grant CNS-1451916.

Declarations

Conflict of interest The authors declare that they have no competing interests.

Availability of data and materials The source code used for the new model, data, and evaluation is made available at GitHub: https://github.com/matinhosseiny/Inception-inspired-LSTM-for-Video-frame-Prediction.
40. Kumar M, Babaeizadeh M, Erhan D, Finn C, Levine S, Dinh L, Kingma D. VideoFlow: a conditional flow-based model for stochastic video generation. arXiv preprint arXiv:1903.01434; 2019.
41. Reda FA, Liu G, Shih KJ, Kirby R, Barker J, Tarjan D, Tao A, Catanzaro B. SDC-Net: video prediction using spatially-displaced convolution. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 718–733.
42. Alom M, Hasan M, Yakopcic C, Tarek M, Taha T. Inception recurrent convolutional neural network for object recognition. arXiv preprint arXiv:1704.07709; 2017.
43. Hosseini M, Maida AS, Hosseini M, Raju G. Inception-inspired LSTM for next-frame video prediction. arXiv preprint arXiv:1909.05622; 2019.
44. Heidari M, Rafatirad S. Using transfer learning approach to implement convolutional neural network model to recommend airline tickets by using online reviews. In: 2020 15th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), IEEE; 2020. p. 1–6.
45. Liu J, Luo J, Shah M. Recognizing realistic actions from videos "in the wild". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE; 2009. p. 1996–2003.
46. Hosseini M, Maida AS, Hosseini M, Raju G. Inception LSTM for next-frame video prediction (student abstract). Proc AAAI Conf Artif Intell. 2020;34:13809–10.
47. Schüldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach. In: Proc. Int. Conf. Pattern Recognition (ICPR'04), Cambridge, UK; 2004.
48. Reddy KK, Shah M. Recognizing 50 human action categories of web videos. Mach Vis Appl. 2013;24:971–81.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.