
Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

Honglak Lee [email protected]


Roger Grosse [email protected]
Rajesh Ranganath [email protected]
Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305, USA

Abstract

There has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks. Scaling such models to full-sized, high-dimensional images remains a difficult problem. To address this problem, we present the convolutional deep belief network, a hierarchical generative model which scales to realistic image sizes. This model is translation-invariant and supports efficient bottom-up and top-down probabilistic inference. Key to our approach is probabilistic max-pooling, a novel technique which shrinks the representations of higher layers in a probabilistically sound way. Our experiments show that the algorithm learns useful high-level visual features, such as object parts, from unlabeled images of objects and natural scenes. We demonstrate excellent performance on several visual recognition tasks and show that our model can perform hierarchical (bottom-up and top-down) inference over full-sized images.

1. Introduction

The visual world can be described at many levels: pixel intensities, edges, object parts, objects, and beyond. The prospect of learning hierarchical models which simultaneously represent multiple levels has recently generated much interest. Ideally, such "deep" representations would learn hierarchies of feature detectors, and further be able to combine top-down and bottom-up processing of an image. For instance, lower layers could support object detection by spotting low-level features indicative of object parts. Conversely, information about objects in the higher layers could resolve lower-level ambiguities in the image or infer the locations of hidden object parts.

Deep architectures consist of feature detector units arranged in layers. Lower layers detect simple features and feed into higher layers, which in turn detect more complex features. There have been several approaches to learning deep networks (LeCun et al., 1989; Bengio et al., 2006; Ranzato et al., 2006; Hinton et al., 2006). In particular, the deep belief network (DBN) (Hinton et al., 2006) is a multilayer generative model where each layer encodes statistical dependencies among the units in the layer below it; it is trained to (approximately) maximize the likelihood of its training data. DBNs have been successfully used to learn high-level structure in a wide variety of domains, including handwritten digits (Hinton et al., 2006) and human motion capture data (Taylor et al., 2007). We build upon the DBN in this paper because we are interested in learning a generative model of images which can be trained in a purely unsupervised manner.

While DBNs have been successful in controlled domains, scaling them to realistic-sized (e.g., 200x200 pixel) images remains challenging for two reasons. First, images are high-dimensional, so the algorithms must scale gracefully and be computationally tractable even when applied to large images. Second, objects can appear at arbitrary locations in images; thus it is desirable that representations be invariant at least to local translations of the input. We address these issues by incorporating translation invariance. Like LeCun et al. (1989) and Grosse et al. (2007), we learn feature detectors which are shared among all locations in an image, because features which capture useful information in one part of an image can pick up the same information elsewhere. Thus, our model can represent large images using only a small number of feature detectors.

(Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)
This paper presents the convolutional deep belief network, a hierarchical generative model that scales to full-sized images. Another key to our approach is probabilistic max-pooling, a novel technique that allows higher-layer units to cover larger areas of the input in a probabilistically sound way. To the best of our knowledge, ours is the first translation invariant hierarchical generative model which supports both top-down and bottom-up probabilistic inference and scales to realistic image sizes. The first, second, and third layers of our network learn edge detectors, object parts, and objects respectively. We show that these representations achieve excellent performance on several visual recognition tasks and allow "hidden" object parts to be inferred from high-level object information.

2. Preliminaries

2.1. Restricted Boltzmann machines

The restricted Boltzmann machine (RBM) is a two-layer, bipartite, undirected graphical model with a set of binary hidden units h, a set of (binary or real-valued) visible units v, and symmetric connections between these two layers represented by a weight matrix W. The probabilistic semantics for an RBM is defined by its energy function as follows:

$$P(v, h) = \frac{1}{Z} \exp(-E(v, h)),$$

where Z is the partition function. If the visible units are binary-valued, we define the energy function as:

$$E(v, h) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i,$$

where $b_j$ are hidden unit biases and $c_i$ are visible unit biases. If the visible units are real-valued, we can define the energy function as:

$$E(v, h) = \frac{1}{2}\sum_i v_i^2 - \sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i.$$

From the energy function, it is clear that the hidden units are conditionally independent of one another given the visible layer, and vice versa. In particular, the units of a binary layer (conditioned on the other layer) are independent Bernoulli random variables. If the visible layer is real-valued, the visible units (conditioned on the hidden layer) are Gaussian with diagonal covariance. Therefore, we can perform efficient block Gibbs sampling by alternately sampling each layer's units (in parallel) given the other layer. We will often refer to a unit's expected value as its activation.

In principle, the RBM parameters can be optimized by performing stochastic gradient ascent on the log-likelihood of training data. Unfortunately, computing the exact gradient of the log-likelihood is intractable. Instead, one typically uses the contrastive divergence approximation (Hinton, 2002), which has been shown to work well in practice.
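The paper gives no reference implementation; purely as an illustration of these two conditionals and the contrastive divergence update, here is a minimal NumPy sketch for a binary RBM (the function names, array shapes, and single-example CD-1 update are our own assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b):
    # P(h_j = 1 | v) = sigmoid((v^T W)_j + b_j): given v, the hidden
    # units are independent Bernoulli random variables.
    p = sigmoid(v @ W + b)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, c):
    # P(v_i = 1 | h) = sigmoid((W h)_i + c_i), by the symmetry of E(v, h).
    p = sigmoid(h @ W.T + c)
    return p, (rng.random(p.shape) < p).astype(float)

def cd1_update(v0, W, b, c, lr=0.1):
    # Contrastive divergence with one Gibbs step (CD-1): approximate the
    # log-likelihood gradient using the reconstruction v0 -> h0 -> v1 -> h1.
    p_h0, h0 = sample_hidden(v0, W, b)
    p_v1, v1 = sample_visible(h0, W, c)
    p_h1, _ = sample_hidden(v1, W, b)
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b += lr * (p_h0 - p_h1)
    c += lr * (v0 - v1)
```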
2.2. Deep belief networks

The RBM by itself is limited in what it can represent. Its real power emerges when RBMs are stacked to form a deep belief network, a generative model consisting of many layers. In a DBN, each layer comprises a set of binary or real-valued units. Two adjacent layers have a full set of connections between them, but no two units in the same layer are connected. Hinton et al. (2006) proposed an efficient algorithm for training deep belief networks, by greedily training each layer (from lowest to highest) as an RBM using the previous layer's activations as inputs. This procedure works well in practice.
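Continuing the sketch above (and reusing its hypothetical `cd1_update` and `sigmoid` helpers), the greedy layer-wise procedure might look as follows; propagating the hidden-unit probabilities ("activations") upward is one common choice:

```python
def train_dbn(data, layer_sizes, epochs=10, lr=0.1):
    # Greedy layer-wise training: fit each RBM with contrastive
    # divergence, freeze it, and feed its activations to the next layer.
    layers = []
    inputs = data                          # shape: (n_examples, n_visible)
    for n_hidden in layer_sizes:
        n_visible = inputs.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        b, c = np.zeros(n_hidden), np.zeros(n_visible)
        for _ in range(epochs):
            for v in inputs:
                cd1_update(v, W, b, c, lr)
        layers.append((W, b, c))
        inputs = sigmoid(inputs @ W + b)   # expected values ("activations")
    return layers
```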
3. Algorithms

RBMs and DBNs both ignore the 2-D structure of images, so weights that detect a given feature must be learned separately for every location. This redundancy makes it difficult to scale these models to full images. (However, see also Raina et al. (2009).) In this section, we introduce our model, the convolutional DBN, whose weights are shared among all locations in an image. This model scales well because inference can be done efficiently using convolution.

3.1. Notation

For notational convenience, we will make several simplifying assumptions. First, we assume that all inputs to the algorithm are $N_V \times N_V$ images, even though there is no requirement that the inputs be square, equally sized, or even two-dimensional. We also assume that all units are binary-valued, while noting that it is straightforward to extend the formulation to real-valued visible units (see Section 2.1). We use $\ast$ to denote convolution,¹ and $\bullet$ to denote element-wise product followed by summation, i.e., $A \bullet B = \mathrm{tr}(A^T B)$. We place a tilde above an array ($\tilde{A}$) to denote flipping the array horizontally and vertically.

¹ The convolution of an $m \times m$ array with an $n \times n$ array may result in an $(m+n-1) \times (m+n-1)$ array or an $(m-n+1) \times (m-n+1)$ array. Rather than invent a cumbersome notation to distinguish these cases, we let it be determined by context.

3.2. Convolutional RBM

First, we introduce the convolutional RBM (CRBM). Intuitively, the CRBM is similar to the RBM, but the weights between the hidden and visible layers are shared among all locations in an image. The basic CRBM consists of two layers: an input layer V and a hidden layer H (corresponding to the lower two layers in Figure 1). The input layer consists of an $N_V \times N_V$ array of binary units. The hidden layer consists of K "groups", where each group is an $N_H \times N_H$ array of binary units, resulting in $N_H^2 K$ hidden units. Each of the K groups is associated with an $N_W \times N_W$ filter ($N_W \triangleq N_V - N_H + 1$); the filter weights are shared across all the hidden units within the group. In addition, each hidden group has a bias $b_k$ and all visible units share a single bias $c$.

[Figure 1: Convolutional RBM with probabilistic max-pooling. For simplicity, only group k of the detection layer and the pooling layer are shown. The basic CRBM corresponds to a simplified structure with only the visible layer and the detection (hidden) layer. See text for details.]

We define the energy function $E(v, h)$ as:

$$P(v, h) = \frac{1}{Z} \exp(-E(v, h))$$

$$E(v, h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} \sum_{r,s=1}^{N_W} h^k_{ij} W^k_{rs} v_{i+r-1,\, j+s-1} - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h^k_{ij} - c \sum_{i,j=1}^{N_V} v_{ij}. \qquad (1)$$

Using the operators defined previously,

$$E(v, h) = -\sum_{k=1}^{K} h^k \bullet (\tilde{W}^k \ast v) - \sum_{k=1}^{K} b_k \sum_{i,j} h^k_{i,j} - c \sum_{i,j} v_{ij}.$$

As with standard RBMs (Section 2.1), we can perform block Gibbs sampling using the following conditional distributions:

$$P(h^k_{ij} = 1 \mid v) = \sigma((\tilde{W}^k \ast v)_{ij} + b_k)$$
$$P(v_{ij} = 1 \mid h) = \sigma\big(\big(\textstyle\sum_k W^k \ast h^k\big)_{ij} + c\big),$$

where $\sigma$ is the sigmoid function. Gibbs sampling forms the basis of our inference and learning algorithms.
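These conditionals map directly onto off-the-shelf 2-D convolutions. A sketch under our own conventions (`W` is a stack of K filters; `correlate2d` with `mode='valid'` realizes the flipped-filter convolution $(\tilde{W}^k \ast v)$, and `mode='full'` the size-restoring convolution of footnote 1):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_sample_hidden(v, W, b):
    # P(h^k_ij = 1 | v) = sigmoid((~W^k * v)_ij + b_k). Convolving with the
    # flipped filter ~W^k is exactly a 'valid' cross-correlation with W^k.
    signal = np.stack([correlate2d(v, W[k], mode='valid') + b[k]
                       for k in range(len(W))])
    p = sigmoid(signal)
    return p, (rng.random(p.shape) < p).astype(float)

def crbm_sample_visible(h, W, c):
    # P(v_ij = 1 | h) = sigmoid((sum_k W^k * h^k)_ij + c); the 'full'
    # convolution restores the N_V x N_V size of the visible layer.
    total = sum(convolve2d(h[k], W[k], mode='full') for k in range(len(W)))
    p = sigmoid(total + c)
    return p, (rng.random(p.shape) < p).astype(float)
```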
3.3. Probabilistic max-pooling

In order to learn high-level representations, we stack CRBMs into a multilayer architecture analogous to DBNs. This architecture is based on a novel operation that we call probabilistic max-pooling.

In general, higher-level feature detectors need information from progressively larger input regions. Existing translation-invariant representations, such as convolutional networks, often involve two kinds of layers in alternation: "detection" layers, whose responses are computed by convolving a feature detector with the previous layer, and "pooling" layers, which shrink the representation of the detection layers by a constant factor. More specifically, each unit in a pooling layer computes the maximum activation of the units in a small region of the detection layer. Shrinking the representation with max-pooling allows higher-layer representations to be invariant to small translations of the input and reduces the computational burden.

Max-pooling was intended only for feed-forward architectures. In contrast, we are interested in a generative model of images which supports both top-down and bottom-up inference. Therefore, we designed our generative model so that inference involves max-pooling-like behavior.

To simplify the notation, we consider a model with a visible layer V, a detection layer H, and a pooling layer P, as shown in Figure 1. The detection and pooling layers both have K groups of units, and each group of the pooling layer has $N_P \times N_P$ binary units. For each $k \in \{1, \ldots, K\}$, the pooling layer $P^k$ shrinks the representation of the detection layer $H^k$ by a factor of C along each dimension, where C is a small integer such as 2 or 3. I.e., the detection layer $H^k$ is partitioned into blocks of size $C \times C$, and each block $\alpha$ is connected to exactly one binary unit $p^k_\alpha$ in the pooling layer (i.e., $N_P = N_H / C$). Formally, we define $B_\alpha \triangleq \{(i, j) : h_{ij} \text{ belongs to block } \alpha\}$.

The detection units in the block $B_\alpha$ and the pooling unit $p_\alpha$ are connected in a single potential which enforces the following constraints: at most one of the detection units may be on, and the pooling unit is on if and only if a detection unit is on. Equivalently, we can consider these $C^2 + 1$ units as a single random variable which may take on one of $C^2 + 1$ possible values: one value for each of the detection units being on, and one value indicating that all units are off.

We formally define the energy function of this simplified probabilistic max-pooling-CRBM as follows:

$$E(v, h) = -\sum_k \sum_{i,j} \left( h^k_{i,j} (\tilde{W}^k \ast v)_{i,j} + b_k h^k_{i,j} \right) - c \sum_{i,j} v_{i,j}$$
$$\text{subject to} \quad \sum_{(i,j) \in B_\alpha} h^k_{i,j} \leq 1, \quad \forall k, \alpha.$$

We now discuss sampling the detection layer H and the pooling layer P given the visible layer V. Group k receives the following bottom-up signal from layer V:

$$I(h^k_{ij}) \triangleq b_k + (\tilde{W}^k \ast v)_{ij}. \qquad (2)$$

Now, we sample each block independently as a multinomial function of its inputs. Suppose $h^k_{i,j}$ is a hidden unit contained in block $\alpha$ (i.e., $(i,j) \in B_\alpha$); the increase in energy caused by turning on unit $h^k_{i,j}$ is $-I(h^k_{i,j})$, and the conditional probability is given by:

$$P(h^k_{i,j} = 1 \mid v) = \frac{\exp(I(h^k_{i,j}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i',j'}))}$$
$$P(p^k_\alpha = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i',j'}))}.$$

Sampling the visible layer V given the hidden layer H can be performed in the same way as described in Section 3.2.
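Concretely, each block can be sampled as one multinomial draw over its $C^2 + 1$ states. A minimal sketch (the leading logit of 0 encodes the "all units off" state, so the softmax reproduces the two conditionals above; names are ours):

```python
import numpy as np

def sample_block(I_block, rng):
    # I_block: C x C array of bottom-up signals I(h_ij) for one block (eq. 2).
    # The block is one multinomial over C^2 + 1 states: one detection unit
    # on (pooling unit on), or everything off (pooling unit off).
    logits = np.append(0.0, I_block.ravel())   # state 0 = all units off
    logits -= logits.max()                      # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    choice = rng.choice(logits.size, p=probs)
    h = np.zeros(I_block.size)
    if choice > 0:
        h[choice - 1] = 1.0                     # at most one unit is on
    return h.reshape(I_block.shape), float(choice > 0)

def sample_detection_and_pooling(I, C, rng):
    # Sample every C x C block of one detection group independently.
    n = I.shape[0]
    h = np.zeros_like(I)
    p = np.zeros((n // C, n // C))
    for a in range(0, n, C):
        for b in range(0, n, C):
            h[a:a+C, b:b+C], p[a // C, b // C] = \
                sample_block(I[a:a+C, b:b+C], rng)
    return h, p
```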
3.4. Training via sparsity regularization

Our model is overcomplete in that the size of the representation is much larger than the size of the inputs. In fact, since the first hidden layer of the network contains K groups of units, each roughly the size of the image, it is overcomplete roughly by a factor of K. In general, overcomplete models run the risk of learning trivial solutions, such as feature detectors representing single pixels. One common solution is to force the representation to be "sparse," in that only a tiny fraction of the units should be active in relation to a given stimulus (Olshausen & Field, 1996; Lee et al., 2008). In our approach, like Lee et al. (2008), we regularize the objective function (data log-likelihood) to encourage each of the hidden units to have a mean activation close to some small constant ρ. For computing the gradient of the sparsity regularization term, we followed Lee et al. (2008)'s method.
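The exact gradient of Lee et al. (2008) is not reproduced here; a commonly used simplification, sketched below, nudges each hidden group's bias so that its mean activation drifts toward the target ρ (the learning rate, the default ρ, and the mini-batch convention are our assumptions):

```python
import numpy as np

def sparsity_update(b, hidden_probs, rho=0.01, lr_sparse=0.1):
    # hidden_probs: (K, N_H, N_H) activation probabilities P(h^k_ij = 1 | v),
    # averaged over a mini-batch. Raising or lowering the bias b_k pushes
    # each group's mean activation toward the small target constant rho.
    mean_act = hidden_probs.mean(axis=(1, 2))   # one mean per group k
    return b + lr_sparse * (rho - mean_act)
```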
3.5. Convolutional deep belief network

Finally, we are ready to define the convolutional deep belief network (CDBN), our hierarchical generative model for full-sized images. Analogously to DBNs, this architecture consists of several max-pooling-CRBMs stacked on top of one another. The network defines an energy function by summing together the energy functions for all of the individual pairs of layers. Training is accomplished with the same greedy, layer-wise procedure described in Section 2.2: once a given layer is trained, its weights are frozen, and its activations are used as input to the next layer.

3.6. Hierarchical probabilistic inference

Once the parameters have all been learned, we compute the network's representation of an image by sampling from the joint distribution over all of the hidden layers conditioned on the input image. To sample from this distribution, we use block Gibbs sampling, where the units of each layer are sampled in parallel (see Sections 2.1 & 3.3).

To illustrate the algorithm, we describe a case with one visible layer V, a detection layer H, a pooling layer P, and another, subsequently-higher detection layer H′. Suppose H′ has K′ groups of nodes, and there is a set of shared weights $\Gamma = \{\Gamma^{1,1}, \ldots, \Gamma^{K,K'}\}$, where $\Gamma^{k,\ell}$ is a weight matrix connecting pooling unit $P^k$ to detection unit $H'^\ell$. The definition can be extended to deeper networks in a straightforward way.

Note that an energy function for this sub-network consists of two kinds of potentials: unary terms for each of the groups in the detection layers, and interaction terms between V and H and between P and H′:

$$E(v, h, p, h') = -\sum_k v \bullet (W^k \ast h^k) - \sum_k b_k \sum_{i,j} h^k_{ij} - \sum_{k,\ell} p^k \bullet (\Gamma^{k\ell} \ast h'^\ell) - \sum_\ell b'_\ell \sum_{i,j} h'^\ell_{ij}.$$

To sample the detection layer H and pooling layer P, note that the detection layer $H^k$ receives the following bottom-up signal from layer V:

$$I(h^k_{ij}) \triangleq b_k + (\tilde{W}^k \ast v)_{ij}, \qquad (3)$$

and the pooling layer $P^k$ receives the following top-down signal from layer H′:

$$I(p^k_\alpha) \triangleq \sum_\ell (\Gamma^{k\ell} \ast h'^\ell)_\alpha. \qquad (4)$$

Now, we sample each of the blocks independently as a multinomial function of their inputs, as in Section 3.3. If $(i,j) \in B_\alpha$, the conditional probability is given by:

$$P(h^k_{i,j} = 1 \mid v, h') = \frac{\exp(I(h^k_{i,j}) + I(p^k_\alpha))}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i',j'}) + I(p^k_\alpha))}$$
$$P(p^k_\alpha = 0 \mid v, h') = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(I(h^k_{i',j'}) + I(p^k_\alpha))}.$$

As an alternative to block Gibbs sampling, mean-field can be used to approximate the posterior distribution.²

² In all our experiments except for Section 4.5, we used the mean-field approximation to estimate the hidden layer activations given the input images. We found that five mean-field iterations sufficed.
3.7. Discussion

Our model used undirected connections between layers. This contrasts with Hinton et al. (2006), which used undirected connections between the top two layers, and top-down directed connections for the layers below. Hinton et al. (2006) proposed approximating the posterior distribution using a single bottom-up pass. This feed-forward approach often can effectively estimate the posterior when the image contains no occlusions or ambiguities, but the higher layers cannot help resolve ambiguities in the lower layers. Although Gibbs sampling may more accurately estimate the posterior in this network, applying block Gibbs sampling would be difficult because the nodes in a given layer are not conditionally independent of one another given the layers above and below. In contrast, our treatment using undirected edges enables combining bottom-up and top-down information more efficiently, as shown in Section 4.5.

In our approach, probabilistic max-pooling helps to address scalability by shrinking the higher layers; weight-sharing (convolutions) further speeds up the algorithm. For example, inference in a three-layer network (with 200x200 input images) using weight-sharing but without max-pooling was about 10 times slower. Without weight-sharing, it was more than 100 times slower.

In work that was contemporary to and done independently of ours, Desjardins and Bengio (2008) also applied convolutional weight-sharing to RBMs and experimented on small image patches. Our work, however, develops more sophisticated elements such as probabilistic max-pooling to make the algorithm more scalable.
4. Experimental results

4.1. Learning hierarchical representations from natural images

We first tested our model's ability to learn hierarchical representations of natural images. Specifically, we trained a CDBN with two hidden layers from the Kyoto natural image dataset.³ The first layer consisted of 24 groups (or "bases")⁴ of 10x10 pixel filters, while the second layer consisted of 100 bases, each one 10x10 as well.⁵ As shown in Figure 2 (top), the learned first layer bases are oriented, localized edge filters; this result is consistent with much prior work (Olshausen & Field, 1996; Bell & Sejnowski, 1997; Ranzato et al., 2006). We note that the sparsity regularization during training was necessary for learning these oriented edge filters; when this term was removed, the algorithm failed to learn oriented edges.

The learned second layer bases are shown in Figure 2 (bottom), and many of them empirically responded selectively to contours, corners, angles, and surface boundaries in the images. This result is qualitatively consistent with previous work (Ito & Komatsu, 2004; Lee et al., 2008).

[Figure 2: The first layer bases (top) and the second layer bases (bottom) learned from natural images. Each second layer basis (filter) was visualized as a weighted linear combination of the first layer bases.]

³ http://www.cnbc.cmu.edu/cplab/data_kyoto.html
⁴ We will call one hidden group's weights a "basis."
⁵ Since the images were real-valued, we used Gaussian visible units for the first-layer CRBM. The pooling ratio C for each layer was 2, so the second-layer bases cover roughly twice as large an area as the first-layer ones.

4.2. Self-taught learning for object recognition

Raina et al. (2007) showed that large unlabeled data can help in supervised learning tasks, even when the unlabeled data do not share the same class labels, or the same generative distribution, as the labeled data. This framework, where generic unlabeled data improve performance on a supervised learning task, is known as self-taught learning. In their experiments, they used sparse coding to train a single-layer representation, and then used the learned representation to construct features for supervised learning tasks.

We used a similar procedure to evaluate our two-layer CDBN, described in Section 4.1, on the Caltech-101 object classification task.⁶ The results are shown in Table 1. First, we observe that combining the first and second layers significantly improves the classification accuracy relative to the first layer alone. Overall, we achieve 57.7% test accuracy using 15 training images per class, and 65.4% test accuracy using 30 training images per class. Our result is competitive with state-of-the-art results using highly-specialized single features, such as SIFT, geometric blur, and shape-context (Lazebnik et al., 2006; Berg et al., 2005; Zhang et al., 2006).⁷

⁶ Details: Given an image from the Caltech-101 dataset (Fei-Fei et al., 2004), we scaled the image so that its longer side was 150 pixels, and computed the activations of the first and second (pooling) layers of our CDBN. We repeated this procedure after reducing the input image by half and concatenated all the activations to construct features. We used an SVM with a spatial pyramid matching kernel for classification, and the parameters of the SVM were cross-validated. We randomly selected 15/30 training set and 15/30 test set images respectively, and normalized the result such that classification accuracy for each class was equally weighted (following the standard protocol). We report results averaged over 10 random trials.

⁷ Varma and Ray (2007) reported better performance than ours (87.82% for 15 training images/class), but they combined many state-of-the-art features (or kernels) to improve the performance. In another approach, Yu et al. (2009) used kernel regularization using a (previously published) state-of-the-art kernel matrix to improve the performance of their convolutional neural network model.
Table 1. Classification accuracy for the Caltech-101 data

Training Size                  15            30
CDBN (first layer)             53.2±1.2%     60.5±1.1%
CDBN (first+second layers)     57.7±1.5%     65.4±0.5%
Raina et al. (2007)            46.6%         -
Ranzato et al. (2007)          -             54.0%
Mutch and Lowe (2006)          51.0%         56.0%
Lazebnik et al. (2006)         54.0%         64.6%
Zhang et al. (2006)            59.0±0.56%    66.2±0.5%

Recall that the CDBN was trained entirely from natural scenes, which are completely unrelated to the classification task. Hence, the strong performance of these features implies that our CDBN learned a highly general representation of images.

4.3. Handwritten digit classification

We further evaluated the performance of our model on the MNIST handwritten digit classification task, a widely-used benchmark for testing hierarchical representations. We trained 40 first layer bases from MNIST digits, each 12x12 pixels, and 40 second layer bases, each 6x6. The pooling ratio C was 2 for both layers. The first layer bases learned "strokes" that comprise the digits, and the second layer bases learned bigger digit-parts that combine the strokes. We constructed feature vectors by concatenating the first and second (pooling) layer activations, and used an SVM for classification using these features. For each labeled training set size, we report the test error averaged over 10 randomly chosen training sets, as shown in Table 2. For the full training set, we obtained 0.8% test error. Our result is comparable to the state-of-the-art (Ranzato et al., 2007; Weston et al., 2008).⁸

⁸ We note that Hinton and Salakhutdinov (2006)'s method is non-convolutional.

4.4. Unsupervised learning of object parts

We now show that our algorithm can learn hierarchical object-part representations in an unsupervised setting. Building on the first layer representation learned from natural images, we trained two additional CDBN layers using unlabeled images from single Caltech-101 categories.⁹ As shown in Figure 3, the second layer learned features corresponding to object parts, even though the algorithm was not given any labels specifying the locations of either the objects or their parts. The third layer learned to combine the second layer's part representations into more complex, higher-level features. Our model successfully learned hierarchical object-part representations of most of the other Caltech-101 categories as well. We note that some of these categories (such as elephants and chairs) have fairly high intra-class appearance variation, due to deformable shapes or different viewpoints. Despite this, our model still learns hierarchical, part-based representations fairly robustly.

Higher layers in the CDBN learn features which are not only higher level, but also more specific to particular object categories. We now quantitatively measure the specificity of each layer by determining how indicative each individual feature is of object categories. (This contrasts with most work in object classification, which focuses on the informativeness of the entire feature set, rather than individual features.) More specifically, we consider three CDBNs trained on faces, motorbikes, and cars, respectively. For each CDBN, we test the informativeness of individual features from each layer for distinguishing among these three categories. For each feature,¹⁰ we computed the area under the precision-recall curve (larger means more specific).¹¹ As shown in Figure 4, the higher-level representations are more selective for the specific object class.

We further tested whether the CDBN can learn hierarchical object-part representations when trained on images from several object categories (rather than just one). We trained the second and third layer representations using unlabeled images randomly selected from four object categories (cars, faces, motorbikes, and airplanes). As shown in Figure 3 (far right), the second layer learns class-specific as well as shared parts, and the third layer learns more object-specific representations. (The training examples were unlabeled, so in a sense, this means the third layer implicitly clusters the images by object category.) As before, we quantitatively measured the specificity of each layer's individual features to object categories. Because the training was completely unsupervised, whereas the AUC-PR statistic requires knowing which specific object or object parts the learned bases should represent, we instead computed conditional entropy.¹²

⁹ The images were unlabeled in that the position of the object is unspecified. Training was on up to 100 images, and testing was on different images than the training set. The pooling ratio for the first layer was set as 3. The second layer contained 40 bases, each 10x10, and the third layer contained 24 bases, each 14x14. The pooling ratio in both cases was 2.

¹⁰ For a given image, we computed the layerwise activations using our algorithm, partitioned the activation into LxL regions for each group, and computed the q% highest quantile activation for each region and each group. If the q% highest quantile activation in region i is γ, we then define a Bernoulli random variable X_{i,L,q} with probability γ of being 1. To measure the informativeness between a feature and the class label, we computed the mutual information between X_{i,L,q} and the class label. Results reported are using (L, q) values that maximized the average mutual information (averaging over i).

¹¹ For each feature, by comparing its values over positive examples and negative examples, we obtained the precision-recall curve for each classification problem.

¹² We computed the quantile features γ for each layer as previously described, and measured the conditional entropy H(class | γ > 0.95).
Table 2. Test error for the MNIST dataset

Labeled training samples           1,000        2,000        3,000        5,000        60,000
CDBN                               2.62±0.12%   2.13±0.10%   1.91±0.09%   1.59±0.11%   0.82%
Ranzato et al. (2007)              3.21%        2.53%        -            1.52%        0.64%
Hinton and Salakhutdinov (2006)    -            -            -            -            1.20%
Weston et al. (2008)               2.73%        -            1.83%        -            1.50%

[Figure 3: Columns 1-4: the second layer bases (top) and the third layer bases (bottom) learned from specific object categories (faces, cars, elephants, chairs). Column 5: the second layer bases (top) and the third layer bases (bottom) learned from a mixture of four object categories (faces, cars, airplanes, motorbikes).]

[Figure 4: (top) Histogram of the area under the precision-recall curve (AUC-PR) for three classification problems (Faces, Motorbikes, Cars) using class-specific object-part representations, with first-, second-, and third-layer features. (bottom) Average AUC-PR for each classification problem:

Features        Faces        Motorbikes   Cars
First layer     0.39±0.17    0.44±0.21    0.43±0.19
Second layer    0.86±0.13    0.69±0.22    0.72±0.23
Third layer     0.95±0.03    0.81±0.13    0.87±0.15]

[Figure 5: Histogram of conditional entropy for the representation learned from the mixture of four object classes (first-, second-, and third-layer features).]

Informally speaking, conditional entropy measures the entropy of the posterior over class labels when a feature is active. Since lower conditional entropy corresponds to a more peaked posterior, it indicates greater specificity. As shown in Figure 5, the higher-layer features have progressively less conditional entropy, suggesting that they activate more selectively to specific object classes.
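As a concrete reading of footnote 12, one can estimate H(class | γ > 0.95) from the images on which a feature's quantile activation γ is "active"; the plug-in entropy estimator below is our choice, not necessarily the authors' exact procedure:

```python
import numpy as np

def conditional_entropy(gamma, labels, threshold=0.95):
    # gamma: quantile feature value of one basis per image (footnote 10);
    # labels: the class label of each image. Condition on the feature
    # being "active" (gamma > 0.95) and take the entropy of the empirical
    # class posterior; lower entropy means a more class-specific feature.
    gamma, labels = np.asarray(gamma), np.asarray(labels)
    active = gamma > threshold
    if not active.any():
        return float("nan")            # undefined if the feature never fires
    _, counts = np.unique(labels[active], return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```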
4.5. Hierarchical probabilistic inference

Lee and Mumford (2003) proposed that the human visual cortex can conceptually be modeled as performing "hierarchical Bayesian inference." For example, if you observe a face image with its left half in dark illumination, you can still recognize the face and further infer the darkened parts by combining the image with your prior knowledge of faces. In this experiment, we show that our model can tractably perform such (approximate) hierarchical probabilistic inference in full-sized images. More specifically, we tested the network's ability to infer the locations of hidden object parts.

To generate the examples for evaluation, we used Caltech-101 face images (distinct from the ones the network was trained on). For each image, we simulated an occlusion by zeroing out the left half of the image. We then sampled from the joint posterior over all of the hidden layers by performing Gibbs sampling. Figure 6 shows a visualization of these samples. To ensure that the filling-in required top-down information, we compare with a "control" condition where only a single upward pass was performed.

[Figure 6: Hierarchical probabilistic inference. For each column: (top) input image; (middle) reconstruction from the second layer units after a single bottom-up pass, by projecting the second layer activations into the image space; (bottom) reconstruction from the second layer units after 20 iterations of block Gibbs sampling.]

In the control (upward-pass only) condition, since there is no evidence from the first layer, the second layer does not respond much to the left side.
However, with full Gibbs sampling, the bottom-up inputs combine with the context provided by the third layer which has detected the object. This combined evidence significantly improves the second layer representation. Selected examples are shown in Figure 6.
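The evaluation protocol itself is short. A sketch, where `gibbs_sweep` stands in for one pass of the layer-wise block Gibbs samplers of Sections 3.3 and 3.6 (assumed here, not defined):

```python
import numpy as np

def infer_occluded(image, gibbs_sweep, n_iters=20):
    # Zero out the left half of the image to simulate an occlusion, then
    # run block Gibbs sampling over all hidden layers so that top-down
    # context from higher layers can fill in the hidden object parts.
    occluded = image.copy()
    occluded[:, : image.shape[1] // 2] = 0.0
    state = None
    for _ in range(n_iters):            # 20 iterations, as in Figure 6
        state = gibbs_sweep(occluded, state)
    return state                        # samples of the hidden layers
```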
5. Conclusion

We presented the convolutional deep belief network, a scalable generative model for learning hierarchical representations from unlabeled images, and showed that our model performs well in a variety of visual recognition tasks. We believe our approach holds promise as a scalable algorithm for learning hierarchical representations from high-dimensional, complex data.

Acknowledgment

We give warm thanks to Daniel Oblinger and Rajat Raina for helpful discussions. This work was supported by the DARPA transfer learning program under contract number FA8750-05-2-0249.
References

Bell, A. J., & Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems.

Berg, A. C., Berg, T. L., & Malik, J. (2005). Shape matching and object recognition using low distortion correspondence. IEEE Conference on Computer Vision and Pattern Recognition (pp. 26–33).

Desjardins, G., & Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision (Technical Report).

Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop on Generative-Model Based Vision.

Grosse, R., Raina, R., Kwong, H., & Ng, A. Y. (2007). Shift-invariant sparse coding for audio classification. Proceedings of the Conference on Uncertainty in AI.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Hinton, G. E., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504–507.

Ito, M., & Komatsu, H. (2004). Representation of angles embedded within contour stimuli in area V2 of macaque monkeys. Journal of Neuroscience, 24, 3313–3324.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.

Lee, H., Ekanadham, C., & Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems.

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.

Mutch, J., & Lowe, D. G. (2006). Multiclass object recognition with sparse, localized features. IEEE Conference on Computer Vision and Pattern Recognition.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning (pp. 759–766).

Raina, R., Madhavan, A., & Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. International Conference on Machine Learning.

Ranzato, M., Huang, F.-J., Boureau, Y.-L., & LeCun, Y. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. IEEE Conference on Computer Vision and Pattern Recognition.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems (pp. 1137–1144).

Taylor, G., Hinton, G. E., & Roweis, S. (2007). Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems.

Varma, M., & Ray, D. (2007). Learning the discriminative power-invariance trade-off. International Conference on Computer Vision.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. International Conference on Machine Learning.

Yu, K., Xu, W., & Gong, Y. (2009). Deep learning with kernel regularization for visual recognition. Advances in Neural Information Processing Systems.

Zhang, H., Berg, A. C., Maire, M., & Malik, J. (2006). SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. IEEE Conference on Computer Vision and Pattern Recognition.
