Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
April 15, 2015
Today:
• Artificial neural networks
• Backpropagation
• Recurrent networks
• Convolutional networks
• Deep belief networks
• Deep Boltzmann machines
Reading:
• Mitchell: Chapter 4
• Bishop: Chapter 5
• Quoc Le tutorial
• Ruslan Salakhutdinov tutorial
Artificial Neural Networks to learn f: X → Y
• f might be a non-linear function
• X (vector of) continuous and/or discrete variables
• Y (vector of) continuous and/or discrete variables
• Represent f by a network of logistic units (sketch below)
• Each unit is a logistic function
• MLE: train the weights of all units to minimize the sum of squared
errors of the predicted network outputs
• MAP: train to minimize the sum of squared errors plus a penalty on weight
magnitudes
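To make "a network of logistic units" concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes and random weights are illustrative assumptions, not anything from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: each unit squashes its net input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass through one hidden layer of logistic units
    and a single logistic output unit."""
    h = sigmoid(W1 @ x + b1)     # hidden-unit activations
    o = sigmoid(W2 @ h + b2)     # network output
    return h, o

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(1, 4)), np.zeros(1)
h, o = forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(o)
```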
ALVINN
[Pomerleau 1993]
M(C)LE Training for Neural Networks
• Consider the regression problem f: X → Y, for scalar Y
  y = f(x) + ε, with f deterministic and noise ε ~ N(0, σε) i.i.d.
• Let's maximize the conditional data likelihood:
  W ← arg max_W ln ∏_d P(t_d | x_d, W) = arg min_W ∑_d (t_d − o_d)²
[Figure: the learned neural network]
MAP Training for Neural Networks
• Consider the regression problem f: X → Y, for scalar Y
  y = f(x) + ε, with f deterministic and noise ε ~ N(0, σε)
• Gaussian prior on weights: P(W) = N(0, σI)
  The log prior contributes a penalty term c ∑_i w_i², so MAP training minimizes
  c ∑_i w_i² + ∑_d (t_d − o_d)²   (a gradient-descent sketch follows the notation below)
Notation: x_d = input, t_d = target output, o_d = observed unit output,
w_i = weight i, w_ij = weight from unit i to unit j, w_0 = bias weight
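As a rough illustration of the MLE and MAP objectives above, the sketch below trains a one-hidden-layer network of logistic units by gradient descent on the sum of squared errors, with an optional weight-decay term c·∑ w² for the MAP case. The hidden-layer size, learning rate, penalty constant, and the XOR toy data are arbitrary choices for the example, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, T, n_hidden=8, lr=0.1, c=0.0, epochs=2000, seed=0):
    """Gradient-descent training of a 1-hidden-layer logistic network.

    Minimizes sum_d (t_d - o_d)^2 + c * sum_i w_i^2, i.e. the MLE
    objective when c == 0 and the MAP / weight-decay objective when c > 0.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(1, n_hidden));    b2 = np.zeros(1)

    for _ in range(epochs):
        # forward pass over the whole training set
        H = sigmoid(X @ W1.T + b1)                    # (N, n_hidden)
        O = sigmoid(H @ W2.T + b2)                    # (N, 1)

        # backprop of the squared-error (+ weight-decay) objective
        d_out = -2.0 * (T - O) * O * (1 - O)          # (N, 1)
        gW2 = d_out.T @ H + 2 * c * W2
        gb2 = d_out.sum(axis=0)
        d_hid = (d_out @ W2) * H * (1 - H)            # (N, n_hidden)
        gW1 = d_hid.T @ X + 2 * c * W1
        gb1 = d_hid.sum(axis=0)

        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

# Toy usage: try to learn XOR (a non-linear f), targets in {0, 1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
params = train(X, T, c=1e-4)
```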
Semantic Memory Model Based on ANNs
[McClelland & Rogers, Nature 2003]
No hierarchy given.
Train with assertions,
e.g., Can(Canary,Fly)
Training Networks on Time Series: Recurrent Networks
• Suppose we want to predict the next state of the world
  – and it depends on a history of unknown length
  – e.g., a robot with forward-facing sensors trying to predict the next
    sensor reading as it moves and turns
• Idea: use a hidden layer in the network to capture the state history
Recurrent Networks on Time Series
How can we train a recurrent net? (See the sketch below.)
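One standard answer, not spelled out on the slide, is backpropagation through time: unroll the recurrent network over the observed sequence so it becomes a deep feedforward network with the same weights copied at every step, then backpropagate as usual. A minimal forward-pass sketch of the unrolling idea, with made-up sensor and hidden-layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unrolled_forward(x_seq, W_in, W_rec, W_out):
    """Run a simple recurrent (Elman-style) network over a sequence.

    The hidden layer h_t summarizes the history; the same weights are
    reused ("copied") at every time step, which is what lets the net be
    trained by ordinary backpropagation on the unrolled graph (BPTT).
    """
    h = np.zeros(W_rec.shape[0])
    predictions = []
    for x_t in x_seq:                      # one copy of the net per step
        h = sigmoid(W_in @ x_t + W_rec @ h)
        predictions.append(W_out @ h)      # predicted next sensor reading
    return np.array(predictions)

# Hypothetical sizes: 3 sensor inputs, 5 hidden units, 3 outputs.
rng = np.random.default_rng(0)
W_in, W_rec, W_out = (rng.normal(scale=0.1, size=s)
                      for s in [(5, 3), (5, 5), (3, 5)])
x_seq = rng.normal(size=(10, 3))           # a length-10 sensor sequence
print(unrolled_forward(x_seq, W_in, W_rec, W_out).shape)  # (10, 3)
```

Gradients then flow backward through all the unrolled copies, and the updates for each copy of W_in, W_rec, and W_out are summed because they are the same shared parameters.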
Convolutional Neural Nets for Image Recognition
[Le Cun, 1992]
• specialized architecture: a mix of different unit types, not
  completely connected, motivated by the primate visual cortex
• many shared parameters, stochastic gradient training (sketch below)
• very successful! now many specialized architectures for
  vision, speech, translation, …
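To illustrate the "many shared parameters" point: a convolutional feature detector applies one small weight kernel at every image location, so a single set of weights is reused across the whole image. A rough NumPy sketch, where the 28x28 image and the hand-made edge kernel are assumptions for illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared weight kernel over the image ('valid' positions only).

    Every output unit uses the *same* kernel weights; this parameter
    sharing is what makes convolutional nets so compact.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(28, 28))    # e.g. a digit image
edge_kernel = np.array([[1., 0., -1.],                    # a hand-made
                        [1., 0., -1.],                    # vertical-edge
                        [1., 0., -1.]])                   # detector
feature_map = conv2d_valid(image, edge_kernel)            # shape (26, 26)
```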
Deep Belief Networks [Hinton & Salakhutdinov, 2006]
• Problem: training networks with many hidden layers
  doesn't work very well
  – local minima, very slow training if initialized with zero weights
• Deep belief networks
  – autoencoder networks to learn low-dimensional encodings (sketch below)
  – but more layers, to learn better encodings
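The autoencoder idea in the last bullet: train a network to reproduce its own input through a narrow hidden layer, so the hidden layer is forced to learn a low-dimensional encoding. A minimal tied-weight sketch, with arbitrary sizes and learning rate (not the architecture from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_code=2, lr=0.05, epochs=500, seed=0):
    """Tied-weight autoencoder: encode with W, decode with W.T,
    minimize squared reconstruction error by gradient descent."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_code, X.shape[1]))
    for _ in range(epochs):
        code = sigmoid(X @ W.T)              # low-dimensional encoding
        recon = code @ W                     # linear decoder
        err = recon - X                      # reconstruction error
        # gradient of 0.5*||err||^2 wrt W, via decoder and encoder paths
        g_dec = code.T @ err
        g_enc = ((err @ W.T) * code * (1 - code)).T @ X
        W -= lr * (g_dec + g_enc) / len(X)
    return W

X = np.random.default_rng(1).normal(size=(100, 10))        # toy data
W = train_autoencoder(X)
codes = sigmoid(X @ W.T)                                    # 2-D encodings
```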
Deep Belief Networks [Hinton & Salakhutdinov, 2006]
[Figure: original images; reconstructions from a 2000-1000-500-30 DBN; reconstructions from 2000-300 linear PCA]
[Hinton & Salakhutdinov, 2006]
Deep Belief Networks: Training
[Figure: encoding of digit images in two dimensions — 784-2 linear encoding (PCA) vs. 784-1000-500-250-2 DBN]
[Hinton & Salakhutdinov, 2006]
Very Large Scale Use of DBNs [Quoc Le et al., ICML 2012]
Data: 10 million 200x200 unlabeled images, sampled from YouTube
Training: 1,000 machines (16,000 cores) for 1 week
Learned network: 3 multi-stage layers, 1.15 billion parameters
Achieves 15.8% accuracy (previous best: 9.5%) classifying 1 of 20k ImageNet items
[Figure: real images that most excite a learned feature, and an image synthesized to most excite that feature]
Restricted Boltzmann Machine
• Bipartite graph, logistic activation
• Inference: fill in any subset of nodes, estimate the remaining nodes (sketch below)
• Here, consider v_i, h_j to be Boolean variables
[Figure: hidden units h1, h2, h3 fully connected to visible units v1, v2, …, vn]
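Why "fill in any nodes, estimate other nodes" is easy: because the graph is bipartite, the hidden units are conditionally independent given the visibles, and vice versa, so each conditional is a single logistic computation. A small sketch with made-up weights and sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical RBM with 6 visible and 3 hidden Boolean units.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))   # bipartite weights
a = np.zeros(n_hidden)                                   # hidden biases
b = np.zeros(n_visible)                                  # visible biases

def p_hidden_given_visible(v):
    """P(h_j = 1 | v) = logistic(sum_i W_ij v_i + a_j) for each j."""
    return sigmoid(v @ W + a)

def p_visible_given_hidden(h):
    """P(v_i = 1 | h) = logistic(sum_j W_ij h_j + b_i) for each i."""
    return sigmoid(W @ h + b)

v = np.array([1, 0, 1, 1, 0, 0])        # fill in the visible nodes ...
print(p_hidden_given_visible(v))         # ... and estimate the hidden ones
```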
Impact of Deep Learning
• Speech Recognition
• Computer Vision
• Recommender Systems
• Language Understanding
• Drug Discovery and Medical Image Analysis
[Courtesy of R. Salakhutdinov]
Feature Representations: Traditionally
Data → Feature extraction → Learning algorithm
Image → vision features → Object detection / Recognition
Audio → audio features → Audio classification / Speaker identification
[Courtesy of R. Salakhutdinov]
Computer Vision Features
SIFT, Textons, HoG, RIFT, GIST
[Courtesy, R. Salakhutdinov]
Audio Features
Spectrogram, MFCC, Flux, ZCR, Rolloff
[Courtesy, R. Salakhutdinov]
Audio Features: Representation Learning
Can we automatically learn these representations (Spectrogram, MFCC, Flux, ZCR, Rolloff)?
[Courtesy, R. Salakhutdinov]
Restricted Boltzmann Machines
Graphical Models: a powerful framework for representing dependency structure
between random variables (Markov random fields, Boltzmann machines, log-linear models).
An RBM is a Markov Random Field with:
• Stochastic binary visible variables (the image)
• Stochastic binary hidden variables (acting as feature detectors)
• Bipartite connections
Joint distribution, with a pair-wise term and unary terms:
  P_θ(v, h) = (1/Z(θ)) exp( ∑_{i,j} W_ij v_i h_j + ∑_i b_i v_i + ∑_j a_j h_j )
[Figure: hidden variables connected bipartitely to visible variables (image pixels)]
[Courtesy, R. Salakhutdinov]
Learning Features
Observed Data: a subset of 25,000 handwritten characters
Learned W ("edges"): a subset of 1,000 features
New Image: sparse representation as a combination of learned features = ….
Logistic Function: suitable for modeling binary images
[Courtesy, R. Salakhutdinov]
Model Learning
Given a set of i.i.d. training examples D = {v^(1), …, v^(N)} (hidden units, visible units = image),
we want to learn the model parameters θ = {W, a, b}.
Maximize the log-likelihood objective:
  L(θ) = (1/N) ∑_n log P_θ(v^(n))
Derivative of the log-likelihood:
  ∂L/∂W_ij = E_{data}[v_i h_j] − E_{model}[v_i h_j]
The model expectation is difficult to compute: it sums over exponentially many
configurations (an approximation is sketched below).
[Courtesy, R. Salakhutdinov]
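Because the model expectation is intractable, in practice the gradient is approximated; contrastive divergence (CD-1) is the standard approximation, replacing the model expectation with a one-step Gibbs reconstruction. CD is not named on this slide, so take the sketch below as one common training recipe rather than the exact method used here; sizes and learning rate are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, a, b, lr=0.05, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step on a batch of binary data V.

    Approximates  dL/dW = E_data[v h] - E_model[v h]  by replacing the
    intractable model expectation with a one-step Gibbs reconstruction.
    """
    # positive phase: hidden probabilities driven by the data
    ph_data = sigmoid(V @ W + a)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # negative phase: reconstruct the visibles, then the hiddens again
    pv_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(pv_recon @ W + a)
    # approximate gradient and parameter update
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
    a += lr * (ph_data - ph_recon).mean(axis=0)
    b += lr * (V - pv_recon).mean(axis=0)
    return W, a, b

# Toy usage: 8-pixel binary "images", 4 hidden feature detectors.
rng = np.random.default_rng(1)
V = (rng.random((100, 8)) < 0.3).astype(float)
W = rng.normal(scale=0.1, size=(8, 4)); a = np.zeros(4); b = np.zeros(8)
for _ in range(200):
    W, a, b = cd1_update(V, W, a, b)
```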
RBMs for Real-valued Data
Gaussian-Bernoulli RBM:
• Stochastic real-valued visible variables (e.g., image pixels)
• Stochastic binary hidden variables
• Bipartite connections
(The joint distribution again has a pair-wise term and unary terms; conditionals are sketched below.)
[Figure: hidden variables connected bipartitely to visible variables (image)]
[Courtesy, R. Salakhutdinov]
(Salakhutdinov & Hinton, NIPS 2007; Salakhutdinov & Murray, ICML 2008)
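For the Gaussian-Bernoulli RBM above, only the visible side changes: hidden units stay binary/logistic, while each visible unit is Gaussian with a mean set by the hidden layer. A sketch of the commonly used unit-variance form; the weights, sizes, and the σ = 1 simplification are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical Gaussian-Bernoulli RBM (unit-variance visibles for simplicity).
rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
W = rng.normal(scale=0.3, size=(n_visible, n_hidden))
a, b = np.zeros(n_hidden), np.zeros(n_visible)

def p_hidden_given_visible(v):
    """Hidden units are still binary/logistic: P(h_j = 1 | v) = sigma(v.W + a)."""
    return sigmoid(v @ W + a)

def sample_visible_given_hidden(h):
    """Visible units are real-valued: v_i | h ~ N(b_i + sum_j W_ij h_j, 1)."""
    mean = b + W @ h
    return mean + rng.normal(size=n_visible)

v = np.array([0.2, -1.3, 0.7, 0.0])          # real-valued input (e.g. pixel values)
h = (rng.random(n_hidden) < p_hidden_given_visible(v)).astype(float)
print(sample_visible_given_hidden(h))
```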
RBMs for Real-valued Data
Learned features (a subset out of 10,000), trained on 4 million unlabeled images.
[Figure: learned image features]
[Courtesy, R. Salakhutdinov]
A new image can then be expressed as a weighted combination of the learned features:
  New Image = 0.9 × (feature) + 0.8 × (feature) + 0.6 × (feature) + …
[Courtesy, R. Salakhutdinov]
RBMs for Word Counts
Replicated Softmax Model: an undirected topic model with
• Stochastic 1-of-K visible variables (word counts)
• Stochastic binary hidden variables
• Bipartite connections
Joint distribution, with a pair-wise term and unary terms (conditional sketched below):

  P_θ(v, h) = (1/Z(θ)) exp( ∑_{i=1..D} ∑_{k=1..K} ∑_{j=1..F} W_ij^k v_i^k h_j
                            + ∑_{i=1..D} ∑_{k=1..K} v_i^k b_i^k + ∑_{j=1..F} h_j a_j )

  P_θ(v_i^k = 1 | h) = exp( b_i^k + ∑_{j=1..F} h_j W_ij^k )
                       / ∑_{q=1..K} exp( b_i^q + ∑_{j=1..F} h_j W_ij^q )

[Courtesy, R. Salakhutdinov]
(Salakhutdinov & Hinton, NIPS 2010, Srivastava & Salakhutdinov, NIPS 2012)
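The softmax conditional P(v_i^k = 1 | h) above can be computed directly; in the replicated softmax model the same weights are shared across all word positions i in a document. A sketch with an invented vocabulary size K and number of hidden topics F:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical replicated-softmax RBM: vocabulary of K words, F hidden topics.
rng = np.random.default_rng(0)
K, F = 10, 4
W = rng.normal(scale=0.3, size=(K, F))   # W[k, j]: weight between word k and topic j
b = np.zeros(K)                           # visible (word) biases

def p_word_given_hidden(h):
    """P(v_i^k = 1 | h) = exp(b_k + sum_j h_j W_kj) / sum_q exp(b_q + sum_j h_j W_qj).

    The same distribution is 'replicated' at every word position i in the
    document, which is what makes the model a replicated softmax.
    """
    return softmax(b + W @ h)

h = np.array([1., 0., 1., 0.])           # a configuration of hidden topics
print(p_word_given_hidden(h))            # distribution over the K vocabulary words
```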
RBMs for Word Counts
Learned features: "topics" (Reuters dataset: 804,414 unlabeled newswire stories, bag-of-words representation)

  russian    clinton     computer   trade     stock     …
  russia     house       system     country   wall
  moscow     president   product    import    street
  yeltsin    bill        software   world     point
  soviet     congress    develop    economy   dow

[Courtesy, R. Salakhutdinov]
Different Data Modalities
• Binary, Gaussian, and Softmax RBMs all have binary hidden variables, but
  use them to model different kinds of visible data:
  – Binary
  – Real-valued
  – 1-of-K (e.g., a one-hot word vector such as [0 0 0 1 0])
• In every case it is easy to infer the states of the hidden variables:
  each hidden unit's posterior is a logistic function of its weighted input
  from the visible variables.
[Courtesy, R. Salakhutdinov]
Product of Experts
The joint distribution is given by the RBM model above. Marginalizing over the
binary hidden variables gives a product of terms, one per hidden unit, so each
hidden unit acts as an "expert" that multiplicatively raises or lowers the
probability of a visible configuration.
Product of Experts
Learned topics (top words per topic):

  government   clinton     bribery      oil      stock    …
  authority    house       corruption   barrel   wall
  power        president   dishonesty   exxon    street
  empire       bill        putin        putin    point
  putin        congress    fraud        drill    dow

Topics "government", "corruption" and "oil" can combine to give very high
probability to the word "Putin".
[Courtesy, R. Salakhutdinov]
(Srivastava & Salakhutdinov, NIPS 2012)
Deep Boltzmann Machines
Low-level features ("edges") are built from unlabeled inputs.
Input: pixels (image).
[Figure: image pixels feeding a layer of learned edge features]
[Courtesy, R. Salakhutdinov]
(Salakhutdinov & Hinton, Neural Computation 2012)
Deep Boltzmann Machines
Learn simpler representations, then compose more complex ones:
• Input: pixels (image)
• Low-level features: edges, built from unlabeled inputs
• Higher-level features: combinations of edges
[Courtesy, R. Salakhutdinov]
(Salakhutdinov 2008, Salakhutdinov & Hinton 2012)
Model Formulation
• Same model parameters as RBMs (weight matrices W1, W2, W3)
• Dependencies between hidden variables
• All connections are undirected
• Inference is both bottom-up and top-down: each hidden layer receives input
  from the layer below and the layer above (see the sketch below)
• Requires approximate inference to train, but it can be done… and it scales
  to millions of examples
[Figure: input v connected to hidden layers h1, h2, h3 through weights W1, W2, W3]
[Courtesy, R. Salakhutdinov]
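To make the bottom-up + top-down point concrete: in a DBM, the conditional for a middle hidden layer combines input from the layer below and the layer above, e.g. P(h1_j = 1 | v, h2) = logistic(∑_i W1_ij v_i + ∑_k W2_jk h2_k). The sketch below runs a few alternating mean-field-style updates with invented sizes; it illustrates the idea rather than the exact training procedure from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-hidden-layer DBM: v -- W1 -- h1 -- W2 -- h2.
rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 6, 4, 3
W1 = rng.normal(scale=0.3, size=(n_v, n_h1))
W2 = rng.normal(scale=0.3, size=(n_h1, n_h2))

def update_h1(v, h2):
    """Middle layer gets bottom-up input from v AND top-down input from h2."""
    return sigmoid(v @ W1 + W2 @ h2)

def update_h2(h1):
    """Top layer gets only bottom-up input from h1."""
    return sigmoid(h1 @ W2)

# A few alternating updates approximate the posterior over the hidden layers.
v = np.array([1., 0., 1., 1., 0., 0.])
h1, h2 = np.full(n_h1, 0.5), np.full(n_h2, 0.5)
for _ in range(10):
    h1 = update_h1(v, h2)
    h2 = update_h2(h1)
print(h1, h2)
```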
Samples Generated by the Model
[Figure: training data (handwritten digits) vs. model-generated samples]
[Courtesy, R. Salakhutdinov]
Handwriting Recognition

MNIST Dataset (60,000 examples of 10 digits):
  Learning Algorithm                        Error
  Logistic regression                       12.0%
  K-NN                                       3.09%
  Neural Net (Platt 2005)                    1.53%
  SVM (Decoste et al. 2002)                  1.40%
  Deep Autoencoder (Bengio et al. 2007)      1.40%
  Deep Belief Net (Hinton et al. 2006)       1.20%
  DBM                                        0.95%

Optical Character Recognition (42,152 examples of 26 English letters):
  Learning Algorithm                        Error
  Logistic regression                       22.14%
  K-NN                                      18.92%
  Neural Net                                14.62%
  SVM (Larochelle et al. 2009)               9.70%
  Deep Autoencoder (Bengio et al. 2007)     10.05%
  Deep Belief Net (Larochelle et al. 2009)   9.68%
  DBM                                        8.40%

Permutation-invariant version.
[Courtesy, R. Salakhutdinov]
3-D Object Recognition

NORB Dataset: 24,000 examples
  Learning Algorithm                     Error
  Logistic regression                    22.5%
  K-NN (LeCun 2004)                      18.92%
  SVM (Bengio & LeCun 2007)              11.6%
  Deep Belief Net (Nair & Hinton 2009)    9.0%
  DBM                                     7.2%

Pattern Completion
[Figure: partially occluded inputs completed by the model]
[Courtesy, R. Salakhutdinov]
Learning Shared Representations Across Sensory Modalities
[Figure: an image and its text tags mapping to a shared "concept" representation;
example tags: sunset, pacific ocean, baker beach, seashore, ocean]
[Courtesy, R. Salakhutdinov]
A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: the inputs (real-valued image features vs. 1-of-K word counts)
  have very different statistical properties.
• Difficult to learn cross-modal features.
[Courtesy, R. Salakhutdinov]
Multimodal DBM
• Image pathway: Gaussian model over dense, real-valued image features
• Text pathway: Replicated Softmax over word counts
• The two pathways are joined by higher hidden layers, with both
  bottom-up and top-down connections.
[Courtesy, R. Salakhutdinov]
(Srivastava & Salakhutdinov, NIPS 2012, JMLR 2014)
Text Generated from Images
• Given: [image]  Generated: dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
• Given: [image]  Generated: insect, butterfly, insects, bug, butterflies, lepidoptera
• Given: [image]  Generated: sea, france, boat, mer, beach, river, bretagne, plage, brittany
• Given: [image]  Generated: graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
• Given: [image]  Generated: portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
• Given: [image]  Generated: canada, nature, sunrise, ontario, fog, mist, bc, morning
[Courtesy, R. Salakhutdinov]
Text Generated from Images
• Given: [image]  Generated: portrait, women, army, soldier, mother, postcard, soldiers
• Given: [image]  Generated: obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
• Given: [image]  Generated: water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
Images Generated from Text
• Given: water, red, sunset → Retrieved: [images]
• Given: nature, flower, red, green → Retrieved: [images]
• Given: blue, green, yellow, colors → Retrieved: [images]
• Given: chocolate, cake → Retrieved: [images]
[Courtesy, R. Salakhutdinov]
MIR-Flickr Dataset (Huiskes et al.)
• 1 million images along with user-assigned tags.
[Figure: example images with user-assigned tags such as nikon, abigfave, goldstaraward, d80, nikond80;
food, cupcake, vegan; sculpture, beauty, stone; anawesomeshot, theperfectphotographer, flash,
damniwishidtakenthat, spiritofphotography; nikon, green, light, photoshop, apple, d70; white, yellow,
abstract, lines, bus, graphic; sky, geotagged, reflection, cielo, bilbao, reflejo]
[Courtesy, R. Salakhutdinov]
Results
• Logistic regression on the top-level representation.
• Multimodal inputs. Metrics: Mean Average Precision (MAP) and Precision@50.

  Learning Algorithm            MAP     Precision@50
  Random                        0.124   0.124
  Labeled 25K examples:
    LDA [Huiskes et al.]        0.492   0.754
    SVM [Huiskes et al.]        0.475   0.758
    DBM-Labelled                0.526   0.791
  + 1 Million unlabelled:
    Deep Belief Net             0.638   0.867
    Autoencoder                 0.638   0.875
    DBM                         0.641   0.873

[Courtesy, R. Salakhutdinov]
Artificial Neural Networks: Summary
• Highly non-linear regression/classification
• Hidden layers learn intermediate representations
• Potentially millions of parameters to estimate
• Stochastic gradient descent, local minima problems
• Deep networks have produced real progress in many fields
– computer vision
– speech recognition
– mapping images to text
– recommender systems
– …
• They learn very useful non-linear representations