probabilistic max-pooling, a novel technique that allows higher-layer units to cover larger areas of the input in a probabilistically sound way. To the best of our knowledge, ours is the first translation invariant hierarchical generative model which supports both top-down and bottom-up probabilistic inference and scales to realistic image sizes. The first, second, and third layers of our network learn edge detectors, object parts, and objects respectively. We show that these representations achieve excellent performance on several visual recognition tasks and allow "hidden" object parts to be inferred from high-level object information.

2. Preliminaries

2.1. Restricted Boltzmann machines

The restricted Boltzmann machine (RBM) is a two-layer, bipartite, undirected graphical model with a set of binary hidden units h, a set of (binary or real-valued) visible units v, and symmetric connections between these two layers represented by a weight matrix W. The probabilistic semantics for an RBM is defined by its energy function as follows:

P(v, h) = \frac{1}{Z} \exp(-E(v, h)),

where Z is the partition function. If the visible units are binary-valued, we define the energy function as:

E(v, h) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i,

where b_j are hidden unit biases and c_i are visible unit biases. If the visible units are real-valued, we can define the energy function as:

E(v, h) = \frac{1}{2} \sum_i v_i^2 - \sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i.

From the energy function, it is clear that the hidden units are conditionally independent of one another given the visible layer, and vice versa. In particular, the units of a binary layer (conditioned on the other layer) are independent Bernoulli random variables. If the visible layer is real-valued, the visible units (conditioned on the hidden layer) are Gaussian with diagonal covariance. Therefore, we can perform efficient block Gibbs sampling by alternately sampling each layer's units (in parallel) given the other layer. We will often refer to a unit's expected value as its activation.

In principle, the RBM parameters can be optimized by performing stochastic gradient ascent on the log-likelihood of training data. Unfortunately, computing the exact gradient of the log-likelihood is intractable. Instead, one typically uses the contrastive divergence approximation (Hinton, 2002), which has been shown to work well in practice.

2.2. Deep belief networks

The RBM by itself is limited in what it can represent. Its real power emerges when RBMs are stacked to form a deep belief network, a generative model consisting of many layers. In a DBN, each layer comprises a set of binary or real-valued units. Two adjacent layers have a full set of connections between them, but no two units in the same layer are connected. Hinton et al. (2006) proposed an efficient algorithm for training deep belief networks, by greedily training each layer (from lowest to highest) as an RBM using the previous layer's activations as inputs. This procedure works well in practice.

3. Algorithms

RBMs and DBNs both ignore the 2-D structure of images, so weights that detect a given feature must be learned separately for every location. This redundancy makes it difficult to scale these models to full images. (However, see also Raina et al. (2009).) In this section, we introduce our model, the convolutional DBN, whose weights are shared among all locations in an image. This model scales well because inference can be done efficiently using convolution.

3.1. Notation

For notational convenience, we will make several simplifying assumptions. First, we assume that all inputs to the algorithm are N_V × N_V images, even though there is no requirement that the inputs be square, equally sized, or even two-dimensional. We also assume that all units are binary-valued, while noting that it is straightforward to extend the formulation to real-valued visible units (see Section 2.1). We use ∗ to denote convolution,¹ and • to denote element-wise product followed by summation, i.e., A • B = tr(A^T B). We place a tilde above an array (Ã) to denote flipping the array horizontally and vertically.

3.2. Convolutional RBM

First, we introduce the convolutional RBM (CRBM). Intuitively, the CRBM is similar to the RBM, but the weights between the hidden and visible layers are shared among all locations in an image. The basic CRBM consists of two layers: an input layer V and a hidden layer H (corresponding to the lower two layers in Figure 1). The input layer consists of an N_V × N_V array of binary units. The hidden layer consists of K "groups", where each group is an N_H × N_H array of binary units, resulting in N_H^2 K hidden units. Each of the K groups is associated with an N_W × N_W filter

¹ The convolution of an m × m array with an n × n array may result in an (m + n − 1) × (m + n − 1) array or an (m − n + 1) × (m − n + 1) array. Rather than invent a cumbersome notation to distinguish these cases, we let it be determined by context.
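The operator conventions above can be made concrete with a short, illustrative sketch. This is ours, not the paper's: it assumes NumPy and SciPy, uses scipy.signal.convolve2d for the "full" and "valid" convolution sizes mentioned in footnote 1, and implements the • and Ã operators of Section 3.1.

```python
# Illustrative sketch (not from the paper) of the Section 3.1 notation.
import numpy as np
from scipy.signal import convolve2d

V = np.random.rand(8, 8)   # an N_V x N_V "image" (N_V = 8 here, for illustration)
W = np.random.rand(3, 3)   # an N_W x N_W filter

full  = convolve2d(V, W, mode='full')    # (m + n - 1) x (m + n - 1) result
valid = convolve2d(V, W, mode='valid')   # (m - n + 1) x (m - n + 1) result
assert full.shape == (10, 10) and valid.shape == (6, 6)

def dot_op(A, B):
    """A • B = tr(A^T B): element-wise product followed by summation."""
    return np.sum(A * B)

def flip(A):
    """Ã: flip the array horizontally and vertically."""
    return A[::-1, ::-1]

# Sanity check: A • B equals tr(A^T B) for same-shaped arrays.
A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.isclose(dot_op(A, B), np.trace(A.T @ B))
```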
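Returning to Section 2.1: the block Gibbs sampler described there can also be sketched in a few lines. This is an illustration under our own assumptions (variable names and toy sizes are ours); the sigmoid form of the conditionals follows from the binary-unit energy function given above and is not spelled out in this excerpt.

```python
# Illustrative sketch of block Gibbs sampling in a binary RBM (Section 2.1).
# Each layer's units are independent Bernoulli variables given the other layer,
# so all of them can be sampled in parallel.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c):
    """One block Gibbs step: sample h | v, then v | h (all units in parallel)."""
    h_act = sigmoid(v @ W + b)            # hidden activations (expected values)
    h = (rng.random(h_act.shape) < h_act).astype(float)
    v_act = sigmoid(h @ W.T + c)          # visible activations
    v_new = (rng.random(v_act.shape) < v_act).astype(float)
    return v_new, h

# Toy example: 6 visible units, 4 hidden units.
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(4), np.zeros(6)
v = (rng.random(6) < 0.5).astype(float)
for _ in range(10):
    v, h = gibbs_step(v, W, b, c)
```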
Table 1. Classification accuracy for the Caltech-101 data

Training Size                  15            30
CDBN (first layer)             53.2±1.2%     60.5±1.1%
CDBN (first+second layers)     57.7±1.5%     65.4±0.5%
Raina et al. (2007)            46.6%         -
Ranzato et al. (2007)          -             54.0%
Mutch and Lowe (2006)          51.0%         56.0%
Lazebnik et al. (2006)         54.0%         64.6%
Zhang et al. (2006)            59.0±0.56%    66.2±0.5%

…tirely from natural scenes, which are completely unrelated to the classification task. Hence, the strong performance of these features implies that our CDBN learned a highly general representation of images.

4.3. Handwritten digit classification

We further evaluated the performance of our model on the MNIST handwritten digit classification task, a widely-used benchmark for testing hierarchical representations. We trained 40 first layer bases from MNIST digits, each 12x12 pixels, and 40 second layer bases, each 6x6. The pooling ratio C was 2 for both layers. The first layer bases learned "strokes" that comprise the digits, and the second layer bases learned bigger digit-parts that combine the strokes. We constructed feature vectors by concatenating the first and second (pooling) layer activations, and used an SVM for classification using these features. For each labeled training set size, we report the test error averaged over 10 randomly chosen training sets, as shown in Table 2. For the full training set, we obtained 0.8% test error. Our result is comparable to the state-of-the-art (Ranzato et al., 2007; Weston et al., 2008).⁸

4.4. Unsupervised learning of object parts

We now show that our algorithm can learn hierarchical object-part representations in an unsupervised setting. Building on the first layer representation learned from natural images, we trained two additional CDBN layers using unlabeled images from single Caltech-101 categories.⁹ As shown in Figure 3, the second layer learned features corresponding to object parts, even though the algorithm was not given any labels specifying the locations of either the objects or their parts. The third layer learned to combine the second layer's part representations into more complex, higher-level features. Our model successfully learned hierarchical object-part representations of most of the other Caltech-101 categories as well. We note that some of these categories (such as elephants and chairs) have fairly high intra-class appearance variation, due to deformable shapes or different viewpoints. Despite this, our model still learns hierarchical, part-based representations fairly robustly.

Higher layers in the CDBN learn features which are not only higher level, but also more specific to particular object categories. We now quantitatively measure the specificity of each layer by determining how indicative each individual feature is of object categories. (This contrasts with most work in object classification, which focuses on the informativeness of the entire feature set, rather than individual features.) More specifically, we consider three CDBNs trained on faces, motorbikes, and cars, respectively. For each CDBN, we test the informativeness of individual features from each layer for distinguishing among these three categories. For each feature,¹⁰ we computed area under the precision-recall curve (larger means more specific).¹¹ As shown in Figure 4, the higher-level representations are more selective for the specific object class.

We further tested if the CDBN can learn hierarchical object-part representations when trained on images from several object categories (rather than just one). We trained the second and third layer representations using unlabeled images randomly selected from four object categories (cars, faces, motorbikes, and airplanes). As shown in Figure 3 (far right), the second layer learns class-specific as well as shared parts, and the third layer learns more object-specific representations. (The training examples were unlabeled, so in a sense, this means the third layer implicitly clusters the images by object category.) As before, we quantitatively measured the specificity of each layer's individual features to object categories. Because the training was completely unsupervised, whereas the AUC-PR statistic requires knowing which specific object or object parts the learned bases should represent, we instead computed conditional entropy.¹² Informally speaking, conditional entropy measures the entropy of

…formance of their convolutional neural network model.
⁸ We note that Hinton and Salakhutdinov (2006)'s method is non-convolutional.
⁹ The images were unlabeled in that the position of the object is unspecified. Training was on up to 100 images, and testing was on different images than the training set. The pooling ratio for the first layer was set as 3. The second layer contained 40 bases, each 10x10, and the third layer contained 24 bases, each 14x14. The pooling ratio in both cases was 2.
¹⁰ For a given image, we computed the layerwise activations using our algorithm, partitioned the activation into L × L regions for each group, and computed the q% highest quantile activation for each region and each group. If the q% highest quantile activation in region i is γ, we then define a Bernoulli random variable X_{i,L,q} with probability γ of being 1. To measure the informativeness between a feature and the class label, we computed the mutual information between X_{i,L,q} and the class label. Results reported are using (L, q) values that maximized the average mutual information (averaging over i).
¹¹ For each feature, by comparing its values over positive examples and negative examples, we obtained the precision-recall curve for each classification problem.
¹² We computed the quantile features γ for each layer as previously described, and measured conditional entropy H(class|γ > 0.95).
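Footnotes 10–12 describe how the per-feature statistics were computed. The sketch below is a rough illustration of one reading of that procedure; the array layout, helper names, and the use of scikit-learn's average_precision_score as a stand-in for area under the precision-recall curve are our assumptions, not the authors' code.

```python
# Rough sketch of the quantile features described in footnotes 10-12.
# Assumptions (ours): activations arrive as a (K, N, N) array of per-group
# activation maps, and AUC-PR is approximated with scikit-learn.
import numpy as np
from sklearn.metrics import average_precision_score

def quantile_features(activations, L=4, q=95):
    """Partition each group's N x N activation map into an L x L grid and
    take the q-th percentile of activation within each region (gamma)."""
    K, N, _ = activations.shape
    step = N // L
    gamma = np.empty((K, L, L))
    for k in range(K):
        for i in range(L):
            for j in range(L):
                region = activations[k,
                                     i * step:(i + 1) * step,
                                     j * step:(j + 1) * step]
                gamma[k, i, j] = np.percentile(region, q)
    return gamma.reshape(K, -1)   # one gamma per (group, region)

def per_feature_auc_pr(gammas_per_image, labels):
    """Per-feature area under the precision-recall curve (footnote 11).
    gammas_per_image: (num_images, num_features); labels: 1 for positives."""
    return np.array([average_precision_score(labels, gammas_per_image[:, f])
                     for f in range(gammas_per_image.shape[1])])
```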
Table 2. Test error for MNIST dataset

Labeled training samples           1,000         2,000         3,000         5,000         60,000
CDBN                               2.62±0.12%    2.13±0.10%    1.91±0.09%    1.59±0.11%    0.82%
Ranzato et al. (2007)              3.21%         2.53%         -             1.52%         0.64%
Hinton and Salakhutdinov (2006)    -             -             -             -             1.20%
Weston et al. (2008)               2.73%         -             1.83%         -             1.50%
Figure 3. Columns 1-4: the second layer bases (top) and the third layer bases (bottom) learned from specific object categories (faces, cars, elephants, chairs). Column 5: the second layer bases (top) and the third layer bases (bottom) learned from a mixture of four object categories (faces, cars, airplanes, motorbikes).
[Figure 4: three panels (Faces, Motorbikes, Cars) comparing first layer, second layer, and third layer features by area under the PR curve (AUC).]
…ever, with full Gibbs sampling, the bottom-up inputs combine with the context provided by the third layer which has detected the object. This combined evidence significantly improves the second layer representation. Selected examples are shown in Figure 6.

5. Conclusion

We presented the convolutional deep belief network, a scalable generative model for learning hierarchical representations from unlabeled images, and showed that our model performs well in a variety of visual recognition tasks. We believe our approach holds promise as a scalable algorithm for learning hierarchical representations from high-dimensional, complex data.

Acknowledgment

We give warm thanks to Daniel Oblinger and Rajat Raina for helpful discussions. This work was supported by the DARPA transfer learning program under contract number FA8750-05-2-0249.

References

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems.

Berg, A. C., Berg, T. L., & Malik, J. (2005). Shape matching and object recognition using low distortion correspondence. IEEE Conference on Computer Vision and Pattern Recognition (pp. 26–33).

Desjardins, G., & Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision (Technical Report).

Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop on Generative-Model Based Vision.

Grosse, R., Raina, R., Kwong, H., & Ng, A. (2007). Shift-invariant sparse coding for audio classification. Proceedings of the Conference on Uncertainty in AI.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Hinton, G. E., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.

Ito, M., & Komatsu, H. (2004). Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. Journal of Neuroscience, 24, 3313–3324.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.

Lee, H., Ekanadham, C., & Ng, A. Y. (2008). Sparse deep belief network model for visual area V2. Advances in Neural Information Processing Systems.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Mutch, J., & Lowe, D. G. (2006). Multiclass object recognition with sparse, localized features. IEEE Conference on Computer Vision and Pattern Recognition.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning (pp. 759–766).

Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. International Conference on Machine Learning.

Ranzato, M., Huang, F.-J., Boureau, Y.-L., & LeCun, Y. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. IEEE Conference on Computer Vision and Pattern Recognition.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems (pp. 1137–1144).

Taylor, G., Hinton, G. E., & Roweis, S. (2007). Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems.

Varma, M., & Ray, D. (2007). Learning the discriminative power-invariance trade-off. International Conference on Computer Vision.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. International Conference on Machine Learning.

Yu, K., Xu, W., & Gong, Y. (2009). Deep learning with kernel regularization for visual recognition. Advances in Neural Information Processing Systems.

Zhang, H., Berg, A. C., Maire, M., & Malik, J. (2006). SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. IEEE Conference on Computer Vision and Pattern Recognition.