Unsupervised Video Learning with LSTMs
Abstract

We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed-length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations ("percepts") of video frames extracted using a pretrained convolutional net. We explore different design choices, such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

1. Introduction

Understanding temporal sequences is important for solving many problems in AI. Recently, recurrent neural networks using the Long Short Term Memory (LSTM) architecture have been used successfully to perform various supervised sequence learning tasks, such as speech recognition (Graves & Jaitly, 2014), machine translation (Sutskever et al., 2014; Cho et al., 2014), and caption generation for images (Vinyals et al., 2014). They have also been applied to videos for recognizing actions and generating natural language descriptions (Donahue et al., 2014). A general sequence-to-sequence learning framework was described by Sutskever et al. (2014), in which one recurrent network encodes a sequence into a fixed-length representation and another recurrent network decodes a sequence out of that representation. In this work, we apply and extend this framework to learn representations of sequences of images. We choose to work in the unsupervised setting, where we only have access to a dataset of unlabelled videos.

1.1. Why Unsupervised Learning?

Supervised learning has been extremely successful in learning good visual representations that not only produce good results at the task they are trained for, but also transfer well to other tasks and datasets. Therefore, it is natural to extend the same approach to learning video representations. This has led to research in 3D convolutional nets (Ji et al., 2013; Tran et al., 2014), different temporal fusion strategies (Karpathy et al., 2014) and different ways of presenting visual information to convolutional nets (Simonyan & Zisserman, 2014a). However, videos are much higher-dimensional entities than single images. Therefore, it becomes increasingly difficult to do credit assignment and learn long-range structure, unless we collect much more labelled data or do a lot of feature engineering (for example, computing the right kinds of flow features) to keep the dimensionality low. The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way towards solving particular problems, but this is ultimately unsatisfying as a machine learning solution. This highlights the need for unsupervised learning to find and represent structure in videos. Moreover, videos have a lot of structure in them (spatial and temporal regularities), which makes them particularly well suited as a domain for building unsupervised learning models.
1.2. Our Approach

In this paper, we use the LSTM Encoder-Decoder framework to learn video representations. The Encoder LSTM runs through a sequence of frames to come up with a representation. This representation is then decoded through another LSTM to produce a target sequence. We consider different choices of the target sequence. One choice is to predict the same sequence as the input. The motivation is similar to that of autoencoders – we wish to capture all that is needed to reproduce the input, but at the same time go through the inductive biases imposed by the model. Another option is to predict the future frames. Here the motivation is to learn a representation that extracts all that is needed to extrapolate the motion and appearance beyond what has been seen. These two natural choices can also be combined. In this case, there are two decoder LSTMs – one that decodes the representation into the input sequence and another that decodes the same representation to predict the future.

The inputs to the model can, in principle, be any representation of individual video frames. However, for the purposes of evaluation, we limit our attention to two kinds of inputs. The first is image patches. For this we use natural image patches as well as a dataset of moving MNIST digits. The second is the high-level "percepts" extracted by applying a convolutional net pretrained on ImageNet. These percepts are the states of the last (and/or second-to-last) layer of rectified linear hidden units.

In order to evaluate the learned representations we qualitatively analyze the reconstructions and predictions made by the model. For a more quantitative evaluation, we use these LSTMs as initializations for the supervised task of action recognition. If the unsupervised learning model comes up with useful representations, then the classifier should be able to perform better, especially when there are only a few labelled examples. We find that this is indeed the case.

1.3. Related Work

The first approaches to learning representations of videos in an unsupervised way were based on ICA (van Hateren & Ruderman, 1998; Hurri & Hyvärinen, 2003). Mobahi et al. (2009) proposed a regularizer that encourages temporal coherence using a contrastive hinge loss. Le et al. (2011) approached this problem using multiple layers of Independent Subspace Analysis modules. Generative models for understanding transformations between pairs of consecutive images are also well studied (Memisevic, 2013; Memisevic & Hinton, 2010; Susskind et al., 2011). This work was extended recently by Michalski et al. (2014) to model longer sequences.

Recently, Ranzato et al. (2014) proposed a generative model for videos. The model uses a recurrent neural network to predict the next frame or interpolate between frames. This work highlights the importance of choosing the right loss function. It is argued that squared loss in input space is not the right objective because it does not respond well to small distortions in input space. The proposed solution is to quantize the image patches into a large dictionary and train the model to predict the identity of the target patch. This does solve some of the problems of squared loss, but it introduces an arbitrary dictionary size into the picture and altogether removes the idea of patches being similar or dissimilar to one another. Other metrics, such as Structural Similarity (Wang et al., 2004), have also been proposed. Designing a loss function that respects our notion of visual similarity is a very hard problem (in a sense, almost as hard as the modeling problem we want to solve in the first place). Therefore, in this paper, we use the simple squared loss objective function as a starting point and focus on designing an encoder-decoder RNN architecture that can be used with any differentiable loss function.

2. Model Description

In this section, we describe several variants of our LSTM Encoder-Decoder model. The basic unit of our network is the LSTM cell block, which consists of four input terminals, a memory cell and an output unit. Our implementation of LSTMs closely follows the one discussed by Graves (2013).

2.1. LSTM Autoencoder Model

This model consists of two Recurrent Neural Nets, the encoder LSTM and the decoder LSTM, as shown in Fig. 1. The input to the model is a sequence of vectors (image patches or features). The encoder LSTM reads in this sequence. After the last input has been read, the cell state and output state of the encoder are copied over to the decoder LSTM. The decoder outputs a prediction for the target sequence. The target sequence is the same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by exploiting short-range correlations. This is also inspired by the stack representation of lists (for example, in LISP): the encoder creates a list by pushing frames onto the top of the stack, and the decoder unrolls this list by popping frames off the top.

The decoder can be conditional or unconditioned. A conditional decoder receives the last generated output frame as input, i.e., the dotted boxes in Fig. 1 are present. An unconditioned decoder does not receive that input. This is discussed in more detail in Sec. 2.3. The architecture can be extended to multiple layers by stacking LSTMs on top of each other.
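The paper does not include code, but the autoencoder variant is simple enough to sketch. The following is a minimal PyTorch sketch under our own assumptions (single-layer LSTM cells, an unconditioned decoder fed zeros at every step, illustrative layer sizes, and class and variable names of our choosing); it is not the authors' implementation. The conditional variant simply feeds the previously generated frame back into the decoder.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Encoder LSTM reads the input sequence; its final (h, c) state is copied
    into a decoder LSTM that reconstructs the sequence in reverse order."""

    def __init__(self, input_dim, hidden_dim, conditional=False):
        super().__init__()
        self.encoder = nn.LSTMCell(input_dim, hidden_dim)
        self.decoder = nn.LSTMCell(input_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, input_dim)
        self.conditional = conditional

    def forward(self, seq):
        # seq: (T, B, input_dim) -- a sequence of patches or percepts.
        T, B, D = seq.shape
        h = seq.new_zeros(B, self.encoder.hidden_size)
        c = seq.new_zeros(B, self.encoder.hidden_size)
        for t in range(T):                           # read the whole input
            h, c = self.encoder(seq[t], (h, c))
        recon = []
        prev = seq.new_zeros(B, D)                   # unconditioned: zero input
        for _ in range(T):
            h, c = self.decoder(prev, (h, c))        # state copied from encoder
            frame = self.readout(h)
            recon.append(frame)
            if self.conditional:                     # conditional variant feeds
                prev = frame                         # its own last output back in
        return torch.stack(recon)

# Train with squared loss against the time-reversed input sequence.
model = LSTMAutoencoder(input_dim=4096, hidden_dim=2048)
seq = torch.randn(16, 8, 4096)                       # 16 frames, batch of 8
loss = nn.functional.mse_loss(model(seq), torch.flip(seq, dims=[0]))
loss.backward()
```

The only essential ingredients are the state copy from encoder to decoder and the reversed target; everything else (hidden size, loss, conditioning) is a knob that the later experiments vary.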
Figure 3. The Composite Model: The LSTM predicts the future as well as the input sequence. [Diagram omitted; residual labels: W1, copy, W2, Future Prediction, v̂4, v̂5.]

2.4. A Composite Model

The two tasks – reconstructing the input and predicting the future – can be combined to create a composite model, as shown in Fig. 3. Here the encoder LSTM is asked to come up with a state from which we can both predict the next few frames and reconstruct the input.

This composite model tries to overcome the shortcomings that each model suffers from on its own. A high-capacity autoencoder would suffer from the tendency to learn trivial representations that just memorize the inputs. However, this memorization is not useful at all for predicting the future. Therefore, the composite model cannot just memorize information. On the other hand, the future predictor suffers from the tendency to store information only about the last few frames, since those are the most important for predicting the future, i.e., in order to predict $v_t$, the frames $\{v_{t-1}, \ldots, v_{t-k}\}$ are much more important than $v_0$, for some small value of $k$. Therefore the representation at the end of the encoder will have forgotten a large part of the input. But if we ask the model to also predict all of the input sequence, then it cannot just pay attention to the last few frames.

3. Experiments

We design experiments to accomplish the following objectives:
• Get a qualitative understanding of what the LSTM learns to do.
• Measure the benefit of initializing networks for supervised learning tasks with the weights found by unsupervised learning.

3.2. Datasets

We use the UCF-101 and HMDB-51 datasets for supervised tasks. The UCF-101 dataset (Soomro et al., 2012) contains 13,320 videos with an average length of 6.2 seconds belonging to 101 different action categories. The dataset has 3 standard train/test splits, with the training set containing around 9,500 videos in each split (the rest are test). The HMDB-51 dataset (Kuehne et al., 2011) contains 5,100 videos belonging to 51 different action categories, with a mean video length of 3.2 seconds. It also has 3 train/test splits, with 3,570 videos in the training set and the rest in the test set.

To train the unsupervised models, we used a subset of the YouTube videos from the Sports-1M dataset (Karpathy et al., 2014). Even though this dataset is labelled for actions, we did not do any supervised experiments on it because of the logistical constraints of working with such a huge dataset. We instead collected 300 hours of video by randomly sampling 10-second clips. We also used the supervised datasets (UCF-101 and HMDB-51) for unsupervised training. However, we found that using them did not give any significant advantage over just using the YouTube videos. Percepts were extracted using the convolutional neural net model of Simonyan & Zisserman (2014b). The videos have a resolution of 240 × 320 and were sampled at 30 frames per second. The central 224 × 224 patch from each frame was forward propagated to obtain the RGB percepts. We used only a single patch for simplicity of experimentation, although the performance can probably be improved by taking multiple patches, doing horizontal flips and other distortions. We also computed flow percepts by training the temporal stream convolutional network described by Simonyan & Zisserman (2014a). We found that the fc6 features worked better than fc7 for single-frame classification using both RGB and flow percepts. Therefore, we used the 4096-dimensional fc6 layer as the input representation of our data. Besides these percepts, we also
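To make the composite model of Sec. 2.4 concrete, here is a minimal sketch of an encoder whose final state is shared by an input-reconstruction decoder and a future-prediction decoder, with the two squared losses simply summed. It is a PyTorch sketch under our own assumptions (unconditioned decoders, illustrative layer sizes, names such as CompositeLSTM are ours), not the authors' implementation; the 16-input / 13-future split mirrors the percept experiments reported below.

```python
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    """One encoder LSTM; its final state seeds two decoder LSTMs, one that
    reconstructs the (time-reversed) input and one that predicts the future."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTMCell(input_dim, hidden_dim)
        self.dec_recon = nn.LSTMCell(input_dim, hidden_dim)
        self.dec_future = nn.LSTMCell(input_dim, hidden_dim)
        self.out_recon = nn.Linear(hidden_dim, input_dim)
        self.out_future = nn.Linear(hidden_dim, input_dim)

    def _unroll(self, cell, readout, state, steps, dim):
        # Unconditioned decoder: a zero vector is fed at every step.
        h, c = state
        prev = h.new_zeros(h.size(0), dim)
        outputs = []
        for _ in range(steps):
            h, c = cell(prev, (h, c))
            outputs.append(readout(h))
        return torch.stack(outputs)

    def forward(self, seq, future_steps):
        # seq: (T, B, D) sequence of pixel patches or convnet percepts.
        T, B, D = seq.shape
        h = seq.new_zeros(B, self.encoder.hidden_size)
        c = seq.new_zeros(B, self.encoder.hidden_size)
        for t in range(T):
            h, c = self.encoder(seq[t], (h, c))
        recon = self._unroll(self.dec_recon, self.out_recon, (h, c), T, D)
        future = self._unroll(self.dec_future, self.out_future, (h, c),
                              future_steps, D)
        return recon, future

# 16 percept frames in, 13 future frames out; squared loss on the
# time-reversed input (as in Sec. 2.1) plus squared loss on the future.
model = CompositeLSTM(input_dim=4096, hidden_dim=2048)
past, future = torch.randn(16, 8, 4096), torch.randn(13, 8, 4096)
recon, pred = model(past, future_steps=13)
loss = (nn.functional.mse_loss(recon, torch.flip(past, dims=[0]))
        + nn.functional.mse_loss(pred, future))
loss.backward()
```

Because both branches back-propagate into the same encoder state, the representation is pushed to retain the whole input while still supporting extrapolation, which is exactly the trade-off the composite model is designed to enforce.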
[Figures omitted: example input sequences with ground-truth future frames, and plots of classification accuracy (y-axis: Classification Accuracy) on UCF-101 and HMDB-51.]
Table 1. Summary of Results on Action Recognition.

Model                             | UCF-101 (RGB) | UCF-101 (1-frame flow) | HMDB-51 (RGB)
Single Frame                      | 72.2          | 72.2                   | 40.1
LSTM classifier                   | 74.5          | 74.3                   | 42.8
Composite LSTM Model + Finetuning | 75.8          | 74.9                   | 44.1

We further ran similar experiments on the optical flow percepts extracted from the UCF-101 dataset. A temporal stream convolutional net, similar to the one proposed by Simonyan & Zisserman (2014a), was trained on single-frame optical flows as well as on stacks of 10 optical flows. This gave accuracies of 72.2% and 77.5%, respectively. Here again, our models took 16 frames as input, reconstructed them and predicted 13 frames into the future. LSTMs with 128 hidden units improved the accuracy by 2.1% to 74.3% for the single-frame case. Bigger LSTMs did not improve the results. By pretraining the LSTM, we were able to further improve the classification accuracy to 74.9% (±0.1). For stacks of 10 frames we improved very slightly to 77.7%. These results are summarized in Table 1.

3.5. Comparison of Different Model Variants

The aim of this set of experiments is to compare the different variants of the model proposed in this paper. Since it is always possible to get a lower reconstruction error by copying the inputs, we cannot use input reconstruction error as a measure of how well a model is doing. However, we can use the error in predicting the future as a reasonable measure of performance. We can also use the performance on supervised tasks as a proxy for how well the unsupervised model is doing. In this section, we present results from these two analyses.

Table 2. Future prediction results on MNIST and image patches. All models use 2 layers of LSTMs.

Model                                            | Cross Entropy on MNIST | Squared loss on image patches
Future Predictor                                 | 350.2                  | 225.2
Composite Model                                  | 344.9                  | 210.7
Conditional Future Predictor                     | 343.5                  | 221.3
Composite Model with Conditional Future Predictor | 341.2                  | 208.1

Future prediction results are summarized in Table 2. For MNIST we compute the cross entropy of the predictions with respect to the ground truth. For natural image patches, we compute the squared loss. We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor, forcing the model to remember more about the inputs, actually helps predict the future better. Next, we compare each model with its conditional variant. Here, we find that the conditional models perform slightly better, as was also noted in Fig. 4.

The performance on action recognition achieved by finetuning different unsupervised learning models is summarized in Table 3. Besides running the experiments on the full UCF-101 and HMDB-51 datasets, we also ran the experiments on small subsets of these datasets, where the effects of pretraining would be more pronounced. We find that all unsupervised models improve over the baseline LSTM, which is itself well-regularized using dropout. The Autoencoder model seems to perform consistently better than the Future Predictor. The Composite Model, which combines the two, does better than either one alone. Conditioning on the generated inputs does not seem to give a clear advantage over not doing so. The Composite Model with a conditional future predictor works the best, although its performance is almost the same as that of the Composite Model without conditioning.
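For reference, the two numbers reported per model in Table 2 can be computed as sketched below. This is our reading of the metrics (binary cross entropy on the binarized moving-MNIST pixels and summed squared error on image patches); the exact normalization behind the reported values is not spelled out here, so treat the reductions as illustrative rather than authoritative.

```python
import torch
import torch.nn.functional as F

def mnist_cross_entropy(pred, target):
    """Cross entropy of predicted frames against binary moving-MNIST frames.

    pred, target: tensors of shape (T, B, H*W); pred holds probabilities in
    [0, 1], target holds 0/1 pixel values. Returns the loss summed over time
    and pixels, averaged over the batch (assumed reduction).
    """
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    return bce.sum(dim=(0, 2)).mean()

def patch_squared_loss(pred, target):
    """Squared loss for natural image patch predictions, same reduction."""
    return ((pred - target) ** 2).sum(dim=(0, 2)).mean()
```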
Table 3. Comparison of different unsupervised pretraining methods. UCF-101 small is a subset containing 10 videos per class. HMDB-51 small contains 4 videos per class. [Table body omitted.]

3.6. Comparison with Action Recognition Benchmarks

Finally, we compare our models to the state-of-the-art action recognition results. The performance is summarized in Table 4. The table is divided into three sets. The first set compares models that use only RGB data (single or multiple frames). The second set compares models that use only explicitly computed flow features. Models in the third set use both.

Table 4. Comparison with state-of-the-art action recognition models.

Method                                                  | UCF-101 | HMDB-51
Spatial Convolutional Net (Simonyan & Zisserman, 2014a) | 73.0    | 40.5
C3D (Tran et al., 2014)                                 | 72.3    | -
C3D + fc6 (Tran et al., 2014)                           | 76.4    | -
LRCN (Donahue et al., 2014)                             | 71.1    | -
Composite LSTM Model                                    | 75.8    | 44.0

Temporal Convolutional Net (Simonyan & Zisserman, 2014a) | 83.7   | 54.6
LRCN (Donahue et al., 2014)                              | 77.0   | -
Composite LSTM Model                                     | 77.7   | -

LRCN (Donahue et al., 2014)                                 | 82.9 | -
Two-stream Convolutional Net (Simonyan & Zisserman, 2014a)  | 88.0 | 59.4
Multi-skip feature stacking (Lan et al., 2014)              | 89.1 | 65.1
Composite LSTM Model                                        | 84.3 | -

On RGB data, our model performs on par with the best deep models. It performs 4.7% better than the LRCN model that also used LSTMs on top of conv net features.¹ Our model performs better than the C3D features that use a 3D convolutional net. However, when the C3D features are concatenated with fc6 percepts, they do slightly better than our model.

The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is partly because the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover. Another contributing factor is that the temporal stream convolutional net used to extract flow percepts overfits very readily (in the sense that it gets almost zero training error but much higher test error) in spite of strong regularization. Therefore, the statistics of the percepts might differ between the training and test sets. This is not the case for the RGB percepts because the network there was trained on an entirely different dataset (ImageNet).

When we combine predictions from the RGB and flow models, we obtain 84.3% accuracy on UCF-101. We believe further improvements can be made by running the model over different patch locations and mirroring the patches. Also, our model can be applied deeper inside the conv net instead of just at the top level.

¹ However, the improvement is only partially from unsupervised learning, since we used a better conv net model.

4. Conclusions

We proposed models based on LSTMs that can learn good video representations. We compared them and analyzed their properties through visualizations. More detailed analysis can be found in the expanded version of this paper (Srivastava et al., 2015). There we found that on the moving MNIST digits dataset, the model was able to generate persistent motion over long periods of time into the future, even though it was trained on much shorter time scales. The learned features at the encoder and decoder, when visualized, show some important qualitative differences. In terms of performance on supervised tasks, we managed to get only modest improvements. The best performing model was the Composite Model that combined an autoencoder and a future predictor. The conditional variants did not give any significant improvements in terms of classification accuracy after finetuning; however, they did give slightly lower prediction errors. More powerful decoders which incorporate some form of stochasticity are required to further address this question.

To get improvements on supervised tasks, the model can be extended by applying it convolutionally across patches of the video and stacking multiple layers of such models. In our future work, we plan to build temporal models from the bottom up instead of using them only to model high-level percepts. We will also use more powerful decoders that can model multimodal target distributions.
References

Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.

Hurri, Jarmo and Hyvärinen, Aapo. Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3):663–691, 2003.

Ji, Shuiwang, Xu, Wei, Yang, Ming, and Yu, Kai. 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–231, Jan 2013.

Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, and Fei-Fei, Li. Large-scale video classification with convolutional neural networks. In CVPR, 2014.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

Ranzato, Marc'Aurelio, Szlam, Arthur, Bruna, Joan, Mathieu, Michaël, Collobert, Ronan, and Chopra, Sumit. Video (language) modeling: a baseline for generative models of natural videos. CoRR, abs/1412.6604, 2014.

Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014a.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014b.

Soomro, K., Roshan Zamir, A., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.

Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.

Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.