
Unsupervised Learning of Video Representations using LSTMs

Nitish Srivastava (nitish@cs.toronto.edu)
Elman Mansimov (emansim@cs.toronto.edu)
Ruslan Salakhutdinov (rsalakhu@cs.toronto.edu)
University of Toronto, 6 King's College Road, Toronto, ON M5S 3G4, Canada

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations (“percepts”) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.

1. Introduction

Understanding temporal sequences is important for solving many problems in the AI-set. Recently, recurrent neural networks using the Long Short Term Memory (LSTM) architecture have been used successfully to perform various supervised sequence learning tasks, such as speech recognition (Graves & Jaitly, 2014), machine translation (Sutskever et al., 2014; Cho et al., 2014), and caption generation for images (Vinyals et al., 2014). They have also been applied on videos for recognizing actions and generating natural language descriptions (Donahue et al., 2014).

A general sequence to sequence learning framework was described by Sutskever et al. (2014) in which a recurrent network is used to encode a sequence into a fixed length representation, and then another recurrent network is used to decode a sequence out of that representation. In this work, we apply and extend this framework to learn representations of sequences of images. We choose to work in the unsupervised setting where we only have access to a dataset of unlabelled videos.

1.1. Why Unsupervised Learning?

Supervised learning has been extremely successful in learning good visual representations that not only produce good results at the task they are trained for, but also transfer well to other tasks and datasets. Therefore, it is natural to extend the same approach to learning video representations. This has led to research in 3D convolutional nets (Ji et al., 2013; Tran et al., 2014), different temporal fusion strategies (Karpathy et al., 2014) and exploring different ways of presenting visual information to convolutional nets (Simonyan & Zisserman, 2014a). However, videos are much higher dimensional entities compared to single images. Therefore, it becomes increasingly difficult to do credit assignment and learn long range structure, unless we collect much more labelled data or do a lot of feature engineering (for example computing the right kinds of flow features) to keep the dimensionality low. The costly work of collecting more labelled data and the tedious work of doing more clever engineering can go a long way in solving particular problems, but this is ultimately unsatisfying as a machine learning solution. This highlights the need for using unsupervised learning to find and represent structure in videos. Moreover, videos have a lot of structure in them (spatial and temporal regularities) which makes them particularly well suited as a domain for building unsupervised learning models.
1.2. Our Approach

In this paper, we use the LSTM Encoder-Decoder framework to learn video representations. The Encoder LSTM runs through a sequence of frames to come up with a representation. This representation is then decoded through another LSTM to produce a target sequence. We consider different choices of the target sequence. One choice is to predict the same sequence as the input. The motivation is similar to that of autoencoders – we wish to capture all that is needed to reproduce the input but at the same time go through the inductive biases imposed by the model. Another option is to predict the future frames. Here the motivation is to learn a representation that extracts all that is needed to extrapolate the motion and appearance beyond what has been seen. These two natural choices can also be combined. In this case, there are two decoder LSTMs – one that decodes the representation into the input sequence and another that decodes the same representation to predict the future.

The inputs to the model can, in principle, be any representation of individual video frames. However, for the purposes of evaluation, we limit our attention to two kinds of inputs. The first is image patches. For this we use natural image patches as well as a dataset of moving MNIST digits. The second is the high-level “percepts” extracted by applying a convolutional net pretrained on ImageNet. These percepts are the states of the last (and/or second-to-last) layer of rectified linear hidden units.

In order to evaluate the learned representations we qualitatively analyze the reconstructions and predictions made by the model. For a more quantitative evaluation, we use these LSTMs as initializations for the supervised task of action recognition. If the unsupervised learning model comes up with useful representations then the classifier should be able to perform better, especially when there are only a few labelled examples. We find that this is indeed the case.

1.3. Related Work

The first approaches to learning representations of videos in an unsupervised way were based on ICA (van Hateren & Ruderman, 1998; Hurri & Hyvärinen, 2003). Mobahi et al. (2009) proposed a regularizer that encourages temporal coherence using a contrastive hinge loss. Le et al. (2011) approached this problem using multiple layers of Independent Subspace Analysis modules. Generative models for understanding transformations between pairs of consecutive images are also well studied (Memisevic, 2013; Memisevic & Hinton, 2010; Susskind et al., 2011). This work was extended recently by Michalski et al. (2014) to model longer sequences.

Recently, Ranzato et al. (2014) proposed a generative model for videos. The model uses a recurrent neural network to predict the next frame or interpolate between frames. This work highlights the importance of choosing the right loss function. It is argued that squared loss in input space is not the right objective because it does not respond well to small distortions in input space. The proposed solution is to quantize the image patches into a large dictionary and train the model to predict the identity of the target patch. This does solve some of the problems of squared loss but it introduces an arbitrary dictionary size into the picture and altogether removes the idea of patches being similar or dissimilar to one another. Other metrics such as Structural Similarity (Wang et al., 2004) have also been proposed. Designing a loss function that respects our notion of visual similarity is a very hard problem (in a sense, almost as hard as the modeling problem we want to solve in the first place). Therefore, in this paper, we use the simple squared loss objective function as a starting point and focus on designing an encoder-decoder RNN architecture that can be used with any differentiable loss function.

2. Model Description

In this section, we describe several variants of our LSTM Encoder-Decoder model. The basic unit of our network is the LSTM cell block that consists of four input terminals, a memory cell and an output unit. Our implementation of LSTMs follows closely the one discussed by Graves (2013).

2.1. LSTM Autoencoder Model

This model consists of two Recurrent Neural Nets, the encoder LSTM and the decoder LSTM, as shown in Fig. 1. The input to the model is a sequence of vectors (image patches or features). The encoder LSTM reads in this sequence. After the last input has been read, the cell state and output state of the encoder are copied over to the decoder LSTM. The decoder outputs a prediction for the target sequence. The target sequence is the same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low range correlations. This is also inspired by a stack representation of lists (for example in LISP). The encoder creates a list by pushing frames on top of the stack and the decoder unrolls this list by removing frames from the top.

The decoder can be conditional or unconditioned. A conditional decoder receives the last generated output frame as input, i.e., the dotted boxes in Fig. 1 are present. An unconditioned decoder does not receive that input. This is discussed in more detail in Sec. 2.3. The architecture can be extended to multiple layers by stacking LSTMs on top of each other.
Figure 1. LSTM Autoencoder Model.

Figure 2. LSTM Future Predictor Model.


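As a minimal sketch of the data flow in Fig. 1 (assuming PyTorch; the class and variable names are illustrative rather than taken from our implementation, and the squared-error loss is only a stand-in for whichever output loss fits the data, e.g., logistic outputs with cross entropy for MNIST pixels):

```python
# Minimal sketch of the (unconditioned) LSTM Autoencoder; assumes PyTorch,
# illustrative names, squared-error loss as a stand-in for the output loss.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, frame_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames):                      # frames: (batch, T, frame_dim)
        _, state = self.encoder(frames)             # state after the last input frame
        # The encoder state is copied over to the decoder (the "copy" arrow in Fig. 1).
        # An unconditioned decoder receives no frame input, so we feed fixed zeros.
        zeros = torch.zeros_like(frames)
        decoded, _ = self.decoder(zeros, state)
        return self.readout(decoded)                # one reconstructed frame per step

# The target is the input sequence in reverse order, as described in Sec. 2.1.
model = LSTMAutoencoder(frame_dim=64 * 64, hidden_dim=2048)
frames = torch.rand(8, 10, 64 * 64)                 # toy batch of 10-frame sequences
loss = nn.functional.mse_loss(model(frames), torch.flip(frames, dims=[1]))
loss.backward()
```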
Why should this learn good features?
The state of the encoder LSTM after the last input frame has been read is the representation of the input video. The decoder LSTM is being asked to reconstruct back the input sequence from this representation. In order to do so, the representation must retain information about the appearance of the objects and the background as well as any motion contained in the video. This is exactly the information that we would like the representation to contain. However, an important question for any autoencoder-style model is what prevents it from learning a trivial identity mapping by effectively copying the input to the output. Two factors prevent this behaviour. First, the fact that there are only a fixed number of hidden units makes it unlikely that the model can learn trivial mappings for arbitrary length input sequences. Second, the same dynamics must be applied recursively on the representation. This further makes it hard for the model to learn an identity mapping.

2.2. LSTM Future Predictor Model

Another natural unsupervised learning task for sequences is predicting the future. This is the approach used in language models. The design of the Future Predictor Model is the same as that of the Autoencoder Model, except that the decoder LSTM in this case predicts frames of the video that come just after the input sequence (Fig. 2). Ranzato et al. (2014) use a similar model but predict only the next frame at each time step. This model, on the other hand, predicts a long sequence into the future. Here again we consider two variants of the decoder – conditional and unconditioned.

Why should this learn good features?
In order to predict the next few frames correctly, the model needs information about which objects are present and how they are moving so that the motion can be extrapolated. The hidden state coming out from the encoder will try to capture this information.

2.3. Conditional Decoder

For each of these two models, we consider two possibilities – one in which the decoder LSTM is conditioned on the last generated frame and the other in which it is not. In the experimental section, we explore these choices quantitatively. Here we briefly discuss the modeling issues involved. A conditional decoder helps model multiple modes in the target sequence distribution. Without that, we would end up averaging the multiple modes in the low-level input space. In order to fully exploit such a conditional decoder, we would need some stochasticity in the way we generate a frame. In the case of machine translation (Sutskever et al., 2014) this was achieved by sampling from the predicted multinomial distribution over words. Analogously, we could consider adding noise to the predicted frames in our model. However, simple ways of doing this, for example adding independent Gaussian noise, are unlikely to be effective since they only serve to corrupt the data and take it away from the data manifold. We need to stay on the manifold in order to generate plausible alternatives. For example, if based on some input video a ball is equally likely to move left or right, we need a stochastic process that can predict the ball moving either to the left or to the right, but not a ball in the middle plus Gaussian noise. This requires adding some noise in the higher layers of the model and then having a deterministic process generate the frame. Variational autoencoders and related methods (Kingma & Welling, 2013; Gregor et al., 2015) are promising ways of doing this. We intend to explore these techniques in the future. In this work, however, we use a completely deterministic conditional decoder. At test time we feed in the generated frame from the previous step without adding any noise. At training time we feed in the ground truth.

Modeling multiple modes is an issue only if we expect multiple modes in the target sequence distribution. For the LSTM Autoencoder, there is only one correct target and hence the target distribution can be considered almost unimodal. But for the LSTM Future Predictor there is a possibility of multiple targets given an input. It should be noted that for videos the source of uncertainty about the future is often completely external to the input. For example, there is often no way to predict that a new object might come into the scene, or what kind of background will come into view as the camera moves. Therefore the future is reasonably predictable only for a very short time, making multiple modes in the target distribution less of a concern.
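The difference between the two decoder variants comes down to what is fed in at each decoding step. A minimal sketch of such a decoder step loop follows (assuming PyTorch; function and argument names are illustrative): at training time the conditional decoder receives the ground-truth previous frame (teacher forcing), while at test time it receives its own previous prediction with no added noise; an unconditioned decoder would simply keep its frame input at zero.

```python
# Sketch of a conditional decoder loop (assumes PyTorch; illustrative names).
import torch
import torch.nn as nn

def decode(cell, readout, state, first_frame, n_steps, targets=None):
    """cell: nn.LSTMCell; readout: nn.Linear mapping hidden state to a frame.
    state: (h, c) copied from the encoder; targets: ground-truth frames (train only)."""
    h, c = state
    frame, outputs = first_frame, []
    for t in range(n_steps):
        h, c = cell(frame, (h, c))
        pred = readout(h)
        outputs.append(pred)
        if targets is not None:
            frame = targets[:, t]        # training: condition on the ground truth
        else:
            frame = pred                 # test: condition on the generated frame
    return torch.stack(outputs, dim=1)   # (batch, n_steps, frame_dim)
```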
Figure 3. The Composite Model: The LSTM predicts the future as well as the input sequence.

2.4. A Composite Model

The two tasks – reconstructing the input and predicting the future – can be combined to create a composite model as shown in Fig. 3. Here the encoder LSTM is asked to come up with a state from which we can both predict the next few frames as well as reconstruct the input.

This composite model tries to overcome the shortcomings that each model suffers on its own. A high-capacity autoencoder would suffer from the tendency to learn trivial representations that just memorize the inputs. However, this memorization is not useful at all for predicting the future. Therefore, the composite model cannot just memorize information. On the other hand, the future predictor suffers from the tendency to store information only about the last few frames since those are most important for predicting the future, i.e., in order to predict v_t, the frames {v_{t-1}, ..., v_{t-k}} are much more important than v_0, for some small value of k. Therefore the representation at the end of the encoder will have forgotten about a large part of the input. But if we ask the model to also predict all of the input sequence, then it cannot just pay attention to the last few frames.

3. Experiments

We design experiments to accomplish the following objectives:
• Get a qualitative understanding of what the LSTM learns to do.
• Measure the benefit of initializing networks for supervised learning tasks with the weights found by unsupervised learning, especially with very few training examples.
• Compare the different proposed models – Autoencoder, Future Predictor and Composite models and their conditional variants.
• Compare with state-of-the-art action recognition benchmarks.

3.1. Training

The proposed models were trained by backpropagation. RMSProp gave much faster convergence than well-tuned stochastic gradient descent with momentum. More details about the training, weight initialization and other hyperparameters can be found in the expanded version of this paper (Srivastava et al., 2015).

3.2. Datasets

We use the UCF-101 and HMDB-51 datasets for supervised tasks. The UCF-101 dataset (Soomro et al., 2012) contains 13,320 videos with an average length of 6.2 seconds belonging to 101 different action categories. The dataset has 3 standard train/test splits with the training set containing around 9,500 videos in each split (the rest are test). The HMDB-51 dataset (Kuehne et al., 2011) contains 5,100 videos belonging to 51 different action categories. Mean length of the videos is 3.2 s. This also has 3 train/test splits with 3,570 videos in the training set and the rest in test.

To train the unsupervised models, we used a subset of the YouTube videos from the Sports-1M dataset (Karpathy et al., 2014). Even though this dataset is labelled for actions, we did not do any supervised experiments on it because of logistical constraints of working with such a huge dataset. We instead collected 300 hours of video by randomly sampling 10 second clips. We also used the supervised datasets (UCF-101 and HMDB-51) for unsupervised training. However, we found that using them did not give any significant advantage over just using the YouTube videos. Percepts were extracted using the convolutional neural net model of Simonyan & Zisserman (2014b). The videos have a resolution of 240 × 320 and were sampled at 30 frames per second. The central 224 × 224 patch from each frame was forward propagated to obtain the RGB percepts. We used only a single patch for simplicity of doing the experiments, although the performance can probably be improved by taking multiple patches, doing horizontal flips and other distortions. We also computed flow percepts by training the temporal stream convolutional network as described by Simonyan & Zisserman (2014a). We found that the fc6 features worked better than fc7 for single frame classification using both RGB and flow percepts. Therefore, we used the 4096-dimensional fc6 layer as the input representation of our data. Besides these percepts, we also trained the proposed models on 32 × 32 patches of pixels.
Figure 4. Reconstruction and future prediction obtained from the Composite Model on a dataset of moving MNIST digits. (Panels show the input sequence, ground truth future, input reconstruction and future prediction for a one layer Composite Model, a two layer Composite Model, and a two layer Composite Model with a conditional future predictor.)
3.3. Visualization and Qualitative Analysis

The aim of this set of experiments is to visualize the properties of the proposed models. We first trained the models on a dataset of moving MNIST digits. Each video was 20 frames long and consisted of 2 digits moving inside a 64 × 64 patch. The digits were chosen randomly from the training set and placed initially at random locations inside the patch. Each digit was assigned a velocity whose direction was chosen uniformly randomly on a unit circle and whose magnitude was also chosen uniformly at random over a fixed range. The digits bounced off the edges of the 64 × 64 frame and overlapped if they were at the same location. The reason for working with this dataset is that it is infinite in size and can be generated quickly on the fly. This makes it possible to explore the model without expensive disk accesses or overfitting issues. Even though it is simple to generate, this dataset has non-trivial properties because the digits occlude each other and bounce off walls.

We first trained a one layer Composite Model. The LSTM had 2048 units. The encoder took 10 frames as input. The decoder tried to reconstruct these 10 frames and the future predictor attempted to predict the next 10 frames. We used logistic output units with a cross entropy loss function. Fig. 4 shows two examples of running this model. The true sequences are shown in the first two rows. The next two rows show the reconstruction and future prediction from the one layer Composite Model. The model figures out how to separate superimposed digits and can model them as they pass through each other. This shows some evidence of disentangling the two independent factors of variation in this sequence. The model can also correctly predict the motion after the digits bounce off the walls. In order to see if adding depth helps, we trained a two layer Composite Model, with each layer having 2048 units. We can see that adding depth helps the model make better predictions. Next, we changed the future predictor by making it conditional. We can see that this model makes even better predictions. More experiments and analysis, including visualization of learned features and the evolution of the LSTM state, can be found in the expanded version of this paper (Srivastava et al., 2015).

Next, we tried to see if our models can also work with natural image patches. For this, we trained the models using a conditional future predictor on sequences of 32 × 32 image patches extracted from the UCF-101 dataset. In this case, we used linear output units and the squared error loss function. The input was 16 frames. The model was asked to reconstruct these 16 frames and predict the future 13 frames. Fig. 5 shows the reconstructions obtained from a two layer Composite model with 2048 units. We found that the future predictions quickly blur out but the input reconstructions look better. We then trained a bigger model with 4096 units. Even in this case, the future blurred out quickly. However, the reconstructions look sharper. We believe that models that look at bigger contexts and use more powerful stochastic decoders are required to get better future predictions.
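For concreteness, the Composite Model trained in this section (e.g., the one layer moving MNIST model above) can be written as the following minimal sketch, assuming PyTorch, with illustrative names and unconditioned decoders: one encoder LSTM whose final state seeds both a reconstruction decoder trained against the reversed input and a future-prediction decoder trained against the next frames, with logistic outputs and a cross entropy loss on the binary pixels.

```python
# Minimal sketch of the Composite Model (Sec. 2.4) on moving MNIST; assumes PyTorch,
# illustrative names, unconditioned decoders, 2048 units, 10 input / 10 future frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositeModel(nn.Module):
    def __init__(self, frame_dim=64 * 64, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.future_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)      # shared logit layer

    def forward(self, frames, n_future):
        batch, n_in, frame_dim = frames.shape
        _, state = self.encoder(frames)                      # copied to both decoders
        recon, _ = self.recon_decoder(frames.new_zeros(batch, n_in, frame_dim), state)
        future, _ = self.future_decoder(frames.new_zeros(batch, n_future, frame_dim), state)
        return self.readout(recon), self.readout(future)

model = CompositeModel()
past = torch.rand(8, 10, 64 * 64)                            # toy pixel sequences
future = torch.rand(8, 10, 64 * 64)
recon_logits, future_logits = model(past, n_future=10)
loss = (F.binary_cross_entropy_with_logits(recon_logits, torch.flip(past, dims=[1]))
        + F.binary_cross_entropy_with_logits(future_logits, future))
loss.backward()
```

Making the future predictor conditional amounts to replacing its zero inputs with the previously generated frame (or the ground-truth frame at training time), as in the decoder loop sketched after Sec. 2.3.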
Figure 5. Reconstruction and future prediction obtained from the Composite Model on a dataset of natural image patches. The first two rows show ground truth sequences (input sequence and ground truth future). The model takes 16 frames as inputs. Only the last 10 frames of the input sequence are shown here. The next 13 frames are the ground truth future. In the rows that follow, we show the reconstructed and predicted frames for two instances of the model: a two layer Composite Model with 2048 LSTM units and a two layer Composite Model with 4096 LSTM units.
3.4. Action Recognition on UCF-101/HMDB-51

The aim of this set of experiments is to see if the features learned by unsupervised learning can help improve performance on supervised tasks.

Figure 6. LSTM Classifier.

We used a two layer Composite Model with 2048 hidden units with no conditioning on either decoder. The model was trained on percepts extracted from 300 hours of YouTube data. It was trained to autoencode 16 frames and predict the next 13 frames. We initialize an LSTM classifier with the weights learned by the encoder LSTM from this model. The model is shown in Fig. 6. The output from each LSTM goes into a softmax classifier that makes a prediction about the action being performed at each time step. Since only one action is being performed in each video in the datasets we consider, the target is the same at each time step. At test time, the predictions made at each time step are averaged. To get a prediction for the entire video, we average the predictions from all 16 frame blocks in the video with a stride of 8 frames. Using a smaller stride did not improve results.

The baseline for comparing these models is an identical LSTM classifier but with randomly initialized weights. All classifiers used dropout regularization, where we dropped activations as they were communicated across layers but not through time within the same LSTM, as proposed in Zaremba et al. (2014). We emphasize that this is a very strong baseline and does significantly better than just using single frames. Using dropout was crucial in order to train good baseline models with very few training examples.

Fig. 7 compares three models – a single frame classifier, the baseline LSTM classifier and the LSTM classifier initialized with weights from the Composite Model. The number of labelled videos per class is varied. Note that having one labelled video means having many labelled 16 frame sequences. We can see that for the case of very few training examples, unsupervised learning gives a substantial improvement. For example, for UCF-101, the performance improves from 29.6% to 34.3% when training on only one labelled video. As the size of the labelled dataset grows, the improvement becomes smaller. Even for the full UCF-101 dataset we get a considerable improvement from 74.5% to 75.8%. On HMDB-51, the improvement is from 42.8% to 44.0% for the full dataset (70 videos per class) and 14.4% to 19.1% for one video per class. Although the improvement in classification by using unsupervised learning was not as big as we expected, we still managed to yield an additional improvement over a strong baseline.
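The video-level prediction rule described above can be sketched as follows, assuming PyTorch and a classifier that maps a block of percepts to per-time-step class probabilities; the function and argument names are illustrative.

```python
# Sketch of video-level prediction: cut the video into 16-frame blocks with a stride
# of 8 frames, average the classifier's per-time-step predictions within each block,
# then average the block scores over the whole video. Assumes PyTorch.
import torch

def video_prediction(classifier, percepts, block_len=16, stride=8):
    """percepts: (T, feature_dim) fc6 percepts of one video; classifier maps a
    (1, block_len, feature_dim) block to (1, block_len, n_classes) probabilities."""
    block_scores = []
    for start in range(0, percepts.shape[0] - block_len + 1, stride):
        block = percepts[start:start + block_len].unsqueeze(0)
        probs = classifier(block)
        block_scores.append(probs.mean(dim=1))            # average over time steps
    return torch.cat(block_scores).mean(dim=0).argmax()   # average over blocks
```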
Figure 7. Effect of pretraining on action recognition with change in the size of the labelled training set. The error bars are over 10 different samples of training sets. (Both panels plot classification accuracy against the number of training examples per class for the Single Frame, LSTM, and LSTM + Pretraining models: (a) UCF-101 RGB, (b) HMDB-51 RGB.)

Model                              | UCF-101 RGB | UCF-101 1-frame flow | HMDB-51 RGB
Single Frame                       | 72.2        | 72.2                 | 40.1
LSTM classifier                    | 74.5        | 74.3                 | 42.8
Composite LSTM Model + Finetuning  | 75.8        | 74.9                 | 44.1

Table 1. Summary of Results on Action Recognition.

Model                                              | Cross Entropy on MNIST | Squared loss on image patches
Future Predictor                                   | 350.2                  | 225.2
Composite Model                                    | 344.9                  | 210.7
Conditional Future Predictor                       | 343.5                  | 221.3
Composite Model with Conditional Future Predictor  | 341.2                  | 208.1

Table 2. Future prediction results on MNIST and image patches. All models use 2 layers of LSTMs.

We further ran similar experiments on the optical flow percepts extracted from the UCF-101 dataset. A temporal stream convolutional net, similar to the one proposed by Simonyan & Zisserman (2014a), was trained on single frame optical flows as well as on stacks of 10 optical flows. This gave an accuracy of 72.2% and 77.5% respectively. Here again, our models took 16 frames as input, reconstructed them and predicted 13 frames into the future. LSTMs with 128 hidden units improved the accuracy by 2.1% to 74.3% for the single frame case. Bigger LSTMs did not improve results. By pretraining the LSTM, we were able to further improve the classification to 74.9% (±0.1). For stacks of 10 frames we improved very slightly to 77.7%. These results are summarized in Table 1.

3.5. Comparison of Different Model Variants

The aim of this set of experiments is to compare the different variants of the model proposed in this paper. Since it is always possible to get lower reconstruction error by copying the inputs, we cannot use input reconstruction error as a measure of how well a model is doing. However, we can use the error in predicting the future as a reasonable measure of performance. We can also use the performance on supervised tasks as a proxy for how well the unsupervised model is doing. In this section, we present results from these two analyses.

Future prediction results are summarized in Table 2. For MNIST we compute the cross entropy of the predictions with respect to the ground truth. For natural image patches, we compute the squared loss. We see that the Composite Model always does a better job of predicting the future compared to the Future Predictor. This indicates that having the autoencoder along with the future predictor to force the model to remember more about the inputs actually helps predict the future better. Next, we compare each model with its conditional variant. Here, we find that the conditional models perform slightly better, as was also noted in Fig. 4.

The performance on action recognition achieved by finetuning different unsupervised learning models is summarized in Table 3. Besides running the experiments on the full UCF-101 and HMDB-51 datasets, we also ran the experiments on small subsets of these datasets where the effects of pretraining would be more pronounced. We find that all unsupervised models improve over the baseline LSTM, which is itself well-regularized using dropout. The Autoencoder model seems to perform consistently better than the Future Predictor. The Composite model, which combines the two, does better than either one alone. Conditioning on the generated inputs does not seem to give a clear advantage over not doing so. The Composite Model with a conditional future predictor works the best, although its performance is almost the same as that of the Composite Model without conditioning.
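For reference, the two future-prediction metrics reported in Table 2 can be written as below. This is a minimal sketch assuming PyTorch; the per-sequence sum reduction over time steps and pixels is an assumption, not necessarily the exact normalization behind the numbers in the table.

```python
# Sketch of the future-prediction metrics of Table 2 (assumes PyTorch; the
# per-sequence sum reduction is an assumption about normalization).
import torch
import torch.nn.functional as F

def mnist_future_cross_entropy(pred_probs, target):
    """pred_probs, target: (T, 64*64); logistic-unit predictions, binary pixel targets."""
    return F.binary_cross_entropy(pred_probs, target, reduction="sum")

def patch_future_squared_loss(pred, target):
    """pred, target: (T, patch_dim) real-valued natural image patches."""
    return ((pred - target) ** 2).sum()
```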
Method                                             | UCF-101 small | UCF-101 | HMDB-51 small | HMDB-51
Baseline LSTM                                      | 63.7          | 74.5    | 25.3          | 42.8
Autoencoder                                        | 66.2          | 75.1    | 28.6          | 44.0
Future Predictor                                   | 64.9          | 74.9    | 27.3          | 43.1
Conditional Autoencoder                            | 65.8          | 74.8    | 27.9          | 43.1
Conditional Future Predictor                       | 65.1          | 74.9    | 27.4          | 43.4
Composite Model                                    | 67.0          | 75.8    | 29.1          | 44.1
Composite Model with Conditional Future Predictor  | 67.1          | 75.8    | 29.2          | 44.0

Table 3. Comparison of different unsupervised pretraining methods. UCF-101 small is a subset containing 10 videos per class. HMDB-51 small contains 4 videos per class.
3.6. Comparison with Action Recognition Benchmarks

Finally, we compare our models to the state-of-the-art action recognition results. The performance is summarized in Table 4. The table is divided into three sets. The first set compares models that use only RGB data (single or multiple frames). The second set compares models that use explicitly computed flow features only. Models in the third set use both.

Method                                                      | UCF-101 | HMDB-51
Spatial Convolutional Net (Simonyan & Zisserman, 2014a)     | 73.0    | 40.5
C3D (Tran et al., 2014)                                     | 72.3    | -
C3D + fc6 (Tran et al., 2014)                               | 76.4    | -
LRCN (Donahue et al., 2014)                                 | 71.1    | -
Composite LSTM Model                                        | 75.8    | 44.0
Temporal Convolutional Net (Simonyan & Zisserman, 2014a)    | 83.7    | 54.6
LRCN (Donahue et al., 2014)                                 | 77.0    | -
Composite LSTM Model                                        | 77.7    | -
LRCN (Donahue et al., 2014)                                 | 82.9    | -
Two-stream Convolutional Net (Simonyan & Zisserman, 2014a)  | 88.0    | 59.4
Multi-skip feature stacking (Lan et al., 2014)              | 89.1    | 65.1
Composite LSTM Model                                        | 84.3    | -

Table 4. Comparison with state-of-the-art action recognition models.

On RGB data, our model performs at par with the best deep models. It performs 4.7% better than the LRCN model that also used LSTMs on top of conv net features¹. Our model performs better than C3D features that use a 3D convolutional net. However, when the C3D features are concatenated with fc6 percepts, they do slightly better than our model.

The improvement for flow features over using a randomly initialized LSTM network is quite small. We believe this is partly due to the fact that the flow percepts already capture a lot of the motion information that the LSTM would otherwise discover. Another contributing factor is that the temporal stream convolutional net that is used to extract flow percepts overfits very readily (in the sense that it gets almost zero training error but much higher test error) in spite of strong regularization. Therefore the statistics of the percepts might be different between the training and test sets. This is not the case for RGB percepts because the network there was trained on an entirely different dataset (ImageNet).

When we combine predictions from the RGB and flow models, we obtain 84.3% accuracy on UCF-101. We believe further improvements can be made by running the model over different patch locations and mirroring the patches. Also, our model can be applied deeper inside the conv net instead of just at the top-level.

¹ However, the improvement is only partially from unsupervised learning, since we used a better conv net model.

4. Conclusions

We proposed models based on LSTMs that can learn good video representations. We compared them and analyzed their properties through visualizations. More detailed analysis can be found in the expanded version of this paper (Srivastava et al., 2015). There we found that on the moving MNIST digits dataset, the model was able to generate persistent motion over long periods of time into the future even though it was trained for much shorter time scales. The learned features at the encoder and decoder, when visualized, show some important qualitative differences. In terms of performance on supervised tasks, we managed to get modest improvements only. The best performing model was the Composite Model that combined an autoencoder and a future predictor. The conditional variants did not give any significant improvements in terms of classification accuracy after finetuning; however, they did give slightly lower prediction errors. More powerful decoders which incorporate some form of stochasticity are required to further address this question.

To get improvements on supervised tasks, the model can be extended by applying it convolutionally across patches of the video and stacking multiple layers of such models. In our future work, we plan to build temporal models from the bottom up instead of using them only to model high-level percepts. We will also use more powerful decoders that can model multimodal target distributions.
Acknowledgments

We gratefully acknowledge support from the IARPA ALADDIN project, ONR Grant N00014-14-1-0232, Samsung, and NVIDIA Corporation with the donation of a GPU used for this research.

References

Cho, Kyunghyun, van Merrienboer, Bart, Gülçehre, Çaglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1724–1734, 2014.

Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.

Graves, Alex. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

Graves, Alex and Jaitly, Navdeep. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.

Hurri, Jarmo and Hyvärinen, Aapo. Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3):663–691, 2003.

Ji, Shuiwang, Xu, Wei, Yang, Ming, and Yu, Kai. 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–231, Jan 2013.

Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, and Fei-Fei, Li. Large-scale video classification with convolutional neural networks. In CVPR, 2014.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

Lan, Zhen-Zhong, Lin, Ming, Li, Xuanchong, Hauptmann, Alexander G., and Raj, Bhiksha. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, abs/1411.6660, 2014.

Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.

Memisevic, Roland. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.

Memisevic, Roland and Hinton, Geoffrey E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, June 2010.

Michalski, Vincent, Memisevic, Roland, and Konda, Kishore. Modeling deep temporal dependencies with recurrent grammar cells. In Advances in Neural Information Processing Systems 27, pp. 1925–1933. Curran Associates, Inc., 2014.

Mobahi, Hossein, Collobert, Ronan, and Weston, Jason. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 737–744, New York, NY, USA, 2009. ACM.

Ranzato, Marc'Aurelio, Szlam, Arthur, Bruna, Joan, Mathieu, Michaël, Collobert, Ronan, and Chopra, Sumit. Video (language) modeling: a baseline for generative models of natural videos. CoRR, abs/1412.6604, 2014.

Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014a.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014b.

Soomro, K., Roshan Zamir, A., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.

Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.

Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pp. 3104–3112, 2014.

Tran, Du, Bourdev, Lubomir D., Fergus, Rob, Torresani, Lorenzo, and Paluri, Manohar. C3D: Generic features for video analysis. CoRR, abs/1412.0767, 2014.

van Hateren, J. H. and Ruderman, D. L. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings: Biological Sciences / The Royal Society, 265(1412):2315–2320, 1998.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.

Wang, Zhou, Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, 2004.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014.
