Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
April 15, 2015
Today:
• Artificial neural networks
• Backpropagation
• Recurrent networks
• Convolutional networks
• Deep belief networks
• Deep Boltzmann machines
Reading:
• Mitchell: Chapter 4
• Bishop: Chapter 5
• Quoc Le tutorial
• Ruslan Salakhutdinov tutorial
Artificial Neural Networks to learn f: X → Y
• f might be a non-linear function
• X (vector of) continuous and/or discrete variables
• Y (vector of) continuous and/or discrete variables
• Represent f by a network of logistic units (sketch below)
• Each unit is a logistic function
• MLE: train the weights of all units to minimize the sum of squared
errors of the predicted network outputs
• MAP: train to minimize the sum of squared errors plus a penalty on weight
magnitudes
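To make "a network of logistic units" concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes and random weights are illustrative assumptions, not anything from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: each unit squashes its net input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass through one hidden layer of logistic units
    and a single logistic output unit."""
    h = sigmoid(W1 @ x + b1)     # hidden-unit activations
    o = sigmoid(W2 @ h + b2)     # network output
    return h, o

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(1, 4)), np.zeros(1)
h, o = forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(o)
```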
ALVINN
[Pomerleau 1993]
M(C)LE Training for Neural Networks
• Consider the regression problem f: X → Y, for scalar Y
  y = f(x) + ε, with f deterministic and noise ε ~ N(0, σε) i.i.d.
• Let's maximize the conditional data likelihood:
  W ← arg max_W ln ∏_d P(t_d | x_d, W) = arg min_W ∑_d (t_d − o_d)²
[Figure: the learned neural network]
MAP Training for Neural Networks
• Consider the regression problem f: X → Y, for scalar Y
  y = f(x) + ε, with f deterministic and noise ε ~ N(0, σε)
• Gaussian prior on weights: P(W) = N(0, σI)
  The log prior contributes a penalty term c ∑_i w_i², so MAP training minimizes
  c ∑_i w_i² + ∑_d (t_d − o_d)²   (a gradient-descent sketch follows the notation below)
Notation: x_d = input, t_d = target output, o_d = observed unit output,
w_i = weight i, w_ij = weight from unit i to unit j, w_0 = bias weight
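As a rough illustration of the MLE and MAP objectives above, the sketch below trains a one-hidden-layer network of logistic units by gradient descent on the sum of squared errors, with an optional weight-decay term c·∑ w² for the MAP case. The hidden-layer size, learning rate, penalty constant, and the XOR toy data are arbitrary choices for the example, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, T, n_hidden=8, lr=0.1, c=0.0, epochs=2000, seed=0):
    """Gradient-descent training of a 1-hidden-layer logistic network.

    Minimizes sum_d (t_d - o_d)^2 + c * sum_i w_i^2, i.e. the MLE
    objective when c == 0 and the MAP / weight-decay objective when c > 0.
    """
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(1, n_hidden));    b2 = np.zeros(1)

    for _ in range(epochs):
        # forward pass over the whole training set
        H = sigmoid(X @ W1.T + b1)                    # (N, n_hidden)
        O = sigmoid(H @ W2.T + b2)                    # (N, 1)

        # backprop of the squared-error (+ weight-decay) objective
        d_out = -2.0 * (T - O) * O * (1 - O)          # (N, 1)
        gW2 = d_out.T @ H + 2 * c * W2
        gb2 = d_out.sum(axis=0)
        d_hid = (d_out @ W2) * H * (1 - H)            # (N, n_hidden)
        gW1 = d_hid.T @ X + 2 * c * W1
        gb1 = d_hid.sum(axis=0)

        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

# Toy usage: try to learn XOR (a non-linear f), targets in {0, 1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
params = train(X, T, c=1e-4)
```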
Semantic Memory Model Based on ANNs
[McClelland & Rogers, Nature 2003]
No hierarchy given.
Train with assertions,
e.g., Can(Canary,Fly)
Training Networks on Time Series: Recurrent Networks
• Suppose we want to predict the next state of the world
  – and it depends on a history of unknown length
  – e.g., a robot with forward-facing sensors trying to predict the next
    sensor reading as it moves and turns
• Idea: use a hidden layer in the network to capture the state history
Recurrent Networks on Time Series
How can we train a recurrent net? (See the sketch below.)
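One standard answer, not spelled out on the slide, is backpropagation through time: unroll the recurrent network over the observed sequence so it becomes a deep feedforward network with the same weights copied at every step, then backpropagate as usual. A minimal forward-pass sketch of the unrolling idea, with made-up sensor and hidden-layer sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unrolled_forward(x_seq, W_in, W_rec, W_out):
    """Run a simple recurrent (Elman-style) network over a sequence.

    The hidden layer h_t summarizes the history; the same weights are
    reused ("copied") at every time step, which is what lets the net be
    trained by ordinary backpropagation on the unrolled graph (BPTT).
    """
    h = np.zeros(W_rec.shape[0])
    predictions = []
    for x_t in x_seq:                      # one copy of the net per step
        h = sigmoid(W_in @ x_t + W_rec @ h)
        predictions.append(W_out @ h)      # predicted next sensor reading
    return np.array(predictions)

# Hypothetical sizes: 3 sensor inputs, 5 hidden units, 3 outputs.
rng = np.random.default_rng(0)
W_in, W_rec, W_out = (rng.normal(scale=0.1, size=s)
                      for s in [(5, 3), (5, 5), (3, 5)])
x_seq = rng.normal(size=(10, 3))           # a length-10 sensor sequence
print(unrolled_forward(x_seq, W_in, W_rec, W_out).shape)  # (10, 3)
```

Gradients then flow backward through all the unrolled copies, and the updates for each copy of W_in, W_rec, and W_out are summed because they are the same shared parameters.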
Convolutional Neural Nets for Image Recognition
[Le Cun, 1992]
• specialized architecture: a mix of different unit types, not
  completely connected, motivated by the primate visual cortex
• many shared parameters, stochastic gradient training (sketch below)
• very successful! now many specialized architectures for
  vision, speech, translation, …
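To illustrate the "many shared parameters" point: a convolutional feature detector applies one small weight kernel at every image location, so a single set of weights is reused across the whole image. A rough NumPy sketch, where the 28x28 image and the hand-made edge kernel are assumptions for illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared weight kernel over the image ('valid' positions only).

    Every output unit uses the *same* kernel weights; this parameter
    sharing is what makes convolutional nets so compact.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(28, 28))    # e.g. a digit image
edge_kernel = np.array([[1., 0., -1.],                    # a hand-made
                        [1., 0., -1.],                    # vertical-edge
                        [1., 0., -1.]])                   # detector
feature_map = conv2d_valid(image, edge_kernel)            # shape (26, 26)
```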
Deep Belief Networks [Hinton & Salakhutdinov, 2006]
• Problem: training networks with many hidden layers
  doesn't work very well
  – local minima, very slow training if initialized with zero weights
• Deep belief networks
  – autoencoder networks to learn low-dimensional encodings (sketch below)
  – but more layers, to learn better encodings
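The autoencoder idea in the last bullet: train a network to reproduce its own input through a narrow hidden layer, so the hidden layer is forced to learn a low-dimensional encoding. A minimal tied-weight sketch, with arbitrary sizes and learning rate (not the architecture from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_code=2, lr=0.05, epochs=500, seed=0):
    """Tied-weight autoencoder: encode with W, decode with W.T,
    minimize squared reconstruction error by gradient descent."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_code, X.shape[1]))
    for _ in range(epochs):
        code = sigmoid(X @ W.T)              # low-dimensional encoding
        recon = code @ W                     # linear decoder
        err = recon - X                      # reconstruction error
        # gradient of 0.5*||err||^2 wrt W, via decoder and encoder paths
        g_dec = code.T @ err
        g_enc = ((err @ W.T) * code * (1 - code)).T @ X
        W -= lr * (g_dec + g_enc) / len(X)
    return W

X = np.random.default_rng(1).normal(size=(100, 10))        # toy data
W = train_autoencoder(X)
codes = sigmoid(X @ W.T)                                    # 2-D encodings
```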
Deep Belief Networks [Hinton & Salakhutdinov, 2006]
[Figure: original images; reconstructions from a 2000-1000-500-30 DBN; reconstructions from 2000-300 linear PCA]
[Hinton & Salakhutdinov, 2006]
Deep Belief Networks: Training
[Figure: encoding of digit images in two dimensions — 784-2 linear encoding (PCA) vs. 784-1000-500-250-2 DBN]
[Hinton & Salakhutdinov, 2006]
Very Large Scale Use of DBNs [Quoc Le et al., ICML 2012]
Data: 10 million 200x200 unlabeled images, sampled from YouTube
Training: 1,000 machines (16,000 cores) for 1 week
Learned network: 3 multi-stage layers, 1.15 billion parameters
Achieves 15.8% accuracy (previous best: 9.5%) classifying 1 of 20k ImageNet items
[Figure: real images that most excite a learned feature, and an image synthesized to most excite that feature]
Restricted Boltzmann Machine
• Bipartite graph, logistic activation
• Inference: fill in any subset of nodes, estimate the remaining nodes (sketch below)
• Here, consider v_i, h_j to be Boolean variables
[Figure: hidden units h1, h2, h3 fully connected to visible units v1, v2, …, vn]
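Why "fill in any nodes, estimate other nodes" is easy: because the graph is bipartite, the hidden units are conditionally independent given the visibles, and vice versa, so each conditional is a single logistic computation. A small sketch with made-up weights and sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical RBM with 6 visible and 3 hidden Boolean units.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))   # bipartite weights
a = np.zeros(n_hidden)                                   # hidden biases
b = np.zeros(n_visible)                                  # visible biases

def p_hidden_given_visible(v):
    """P(h_j = 1 | v) = logistic(sum_i W_ij v_i + a_j) for each j."""
    return sigmoid(v @ W + a)

def p_visible_given_hidden(h):
    """P(v_i = 1 | h) = logistic(sum_j W_ij h_j + b_i) for each i."""
    return sigmoid(W @ h + b)

v = np.array([1, 0, 1, 1, 0, 0])        # fill in the visible nodes ...
print(p_hidden_given_visible(v))         # ... and estimate the hidden ones
```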
Impact of Deep Learning
• Speech Recognition
• Computer Vision
• Recommender Systems
• Language Understanding
• Drug Discovery and Medical Image Analysis
[Courtesy of R. Salakhutdinov]
Feature Representations: Traditionally
Data → Feature extraction → Learning algorithm
Image → vision features → Object detection / Recognition
Audio → audio features → Audio classification / Speaker identification
[Courtesy of R. Salakhutdinov]
Computer Vision Features
SIFT, Textons, HoG, RIFT, GIST
[Courtesy, R. Salakhutdinov]
Audio Features
Spectrogram, MFCC, Flux, ZCR, Rolloff
[Courtesy, R. Salakhutdinov]
Audio Features: Representation Learning
Can we automatically learn these representations (Spectrogram, MFCC, Flux, ZCR, Rolloff)?
[Courtesy, R. Salakhutdinov]
Restricted Boltzmann Machines
Graphical Models: a powerful framework for representing dependency structure
between random variables (Markov random fields, Boltzmann machines, log-linear models).
An RBM is a Markov Random Field with:
• Stochastic binary visible variables (the image)
• Stochastic binary hidden variables (acting as feature detectors)
• Bipartite connections
Joint distribution, with a pair-wise term and unary terms:
  P_θ(v, h) = (1/Z(θ)) exp( ∑_{i,j} W_ij v_i h_j + ∑_i b_i v_i + ∑_j a_j h_j )
[Figure: hidden variables connected bipartitely to visible variables (image pixels)]
[Courtesy, R. Salakhutdinov]
Learning Features
Observed Data: a subset of 25,000 handwritten characters
Learned W ("edges"): a subset of 1,000 features
New Image: sparse representation as a combination of learned features = ….
Logistic Function: suitable for modeling binary images
[Courtesy, R. Salakhutdinov]
Model Learning
Given a set of i.i.d. training examples D = {v^(1), …, v^(N)} (hidden units, visible units = image),
we want to learn the model parameters θ = {W, a, b}.
Maximize the log-likelihood objective:
  L(θ) = (1/N) ∑_n log P_θ(v^(n))
Derivative of the log-likelihood:
  ∂L/∂W_ij = E_{data}[v_i h_j] − E_{model}[v_i h_j]
The model expectation is difficult to compute: it sums over exponentially many
configurations (an approximation is sketched below).
[Courtesy, R. Salakhutdinov]
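Because the model expectation is intractable, in practice the gradient is approximated; contrastive divergence (CD-1) is the standard approximation, replacing the model expectation with a one-step Gibbs reconstruction. CD is not named on this slide, so take the sketch below as one common training recipe rather than the exact method used here; sizes and learning rate are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, W, a, b, lr=0.05, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step on a batch of binary data V.

    Approximates  dL/dW = E_data[v h] - E_model[v h]  by replacing the
    intractable model expectation with a one-step Gibbs reconstruction.
    """
    # positive phase: hidden probabilities driven by the data
    ph_data = sigmoid(V @ W + a)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # negative phase: reconstruct the visibles, then the hiddens again
    pv_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(pv_recon @ W + a)
    # approximate gradient and parameter update
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
    a += lr * (ph_data - ph_recon).mean(axis=0)
    b += lr * (V - pv_recon).mean(axis=0)
    return W, a, b

# Toy usage: 8-pixel binary "images", 4 hidden feature detectors.
rng = np.random.default_rng(1)
V = (rng.random((100, 8)) < 0.3).astype(float)
W = rng.normal(scale=0.1, size=(8, 4)); a = np.zeros(4); b = np.zeros(8)
for _ in range(200):
    W, a, b = cd1_update(V, W, a, b)
```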
RBMs for Real-valued Data
Gaussian-Bernoulli RBM:
• Stochastic real-valued visible variables (e.g., image pixels)
• Stochastic binary hidden variables
• Bipartite connections
(The joint distribution again has a pair-wise term and unary terms; conditionals are sketched below.)
[Figure: hidden variables connected bipartitely to visible variables (image)]
[Courtesy, R. Salakhutdinov]
(Salakhutdinov & Hinton, NIPS 2007; Salakhutdinov & Murray, ICML 2008)
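For the Gaussian-Bernoulli RBM above, only the visible side changes: hidden units stay binary/logistic, while each visible unit is Gaussian with a mean set by the hidden layer. A sketch of the commonly used unit-variance form; the weights, sizes, and the σ = 1 simplification are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical Gaussian-Bernoulli RBM (unit-variance visibles for simplicity).
rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
W = rng.normal(scale=0.3, size=(n_visible, n_hidden))
a, b = np.zeros(n_hidden), np.zeros(n_visible)

def p_hidden_given_visible(v):
    """Hidden units are still binary/logistic: P(h_j = 1 | v) = sigma(v.W + a)."""
    return sigmoid(v @ W + a)

def sample_visible_given_hidden(h):
    """Visible units are real-valued: v_i | h ~ N(b_i + sum_j W_ij h_j, 1)."""
    mean = b + W @ h
    return mean + rng.normal(size=n_visible)

v = np.array([0.2, -1.3, 0.7, 0.0])          # real-valued input (e.g. pixel values)
h = (rng.random(n_hidden) < p_hidden_given_visible(v)).astype(float)
print(sample_visible_given_hidden(h))
```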
RBMs for Real-valued Data
Learned features (a subset out of 10,000), trained on 4 million unlabeled images.
[Figure: learned image features]
[Courtesy, R. Salakhutdinov]
A new image can then be expressed as a weighted combination of the learned features:
  New Image = 0.9 × (feature) + 0.8 × (feature) + 0.6 × (feature) + …
[Courtesy, R. Salakhutdinov]
RBMs for Word Counts
Replicated Softmax Model: an undirected topic model with
• Stochastic 1-of-K visible variables (word counts)
• Stochastic binary hidden variables
• Bipartite connections
Joint distribution, with a pair-wise term and unary terms (conditional sketched below):

  P_θ(v, h) = (1/Z(θ)) exp( ∑_{i=1..D} ∑_{k=1..K} ∑_{j=1..F} W_ij^k v_i^k h_j
                            + ∑_{i=1..D} ∑_{k=1..K} v_i^k b_i^k + ∑_{j=1..F} h_j a_j )

  P_θ(v_i^k = 1 | h) = exp( b_i^k + ∑_{j=1..F} h_j W_ij^k )
                       / ∑_{q=1..K} exp( b_i^q + ∑_{j=1..F} h_j W_ij^q )

[Courtesy, R. Salakhutdinov]
(Salakhutdinov & Hinton, NIPS 2010, Srivastava & Salakhutdinov, NIPS 2012)
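The softmax conditional P(v_i^k = 1 | h) above can be computed directly; in the replicated softmax model the same weights are shared across all word positions i in a document. A sketch with an invented vocabulary size K and number of hidden topics F:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical replicated-softmax RBM: vocabulary of K words, F hidden topics.
rng = np.random.default_rng(0)
K, F = 10, 4
W = rng.normal(scale=0.3, size=(K, F))   # W[k, j]: weight between word k and topic j
b = np.zeros(K)                           # visible (word) biases

def p_word_given_hidden(h):
    """P(v_i^k = 1 | h) = exp(b_k + sum_j h_j W_kj) / sum_q exp(b_q + sum_j h_j W_qj).

    The same distribution is 'replicated' at every word position i in the
    document, which is what makes the model a replicated softmax.
    """
    return softmax(b + W @ h)

h = np.array([1., 0., 1., 0.])           # a configuration of hidden topics
print(p_word_given_hidden(h))            # distribution over the K vocabulary words
```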
RBMs for Word Counts
Learned features: "topics" (Reuters dataset: 804,414 unlabeled newswire stories, bag-of-words representation)

  russian    clinton     computer   trade     stock     …
  russia     house       system     country   wall
  moscow     president   product    import    street
  yeltsin    bill        software   world     point
  soviet     congress    develop    economy   dow

[Courtesy, R. Salakhutdinov]
Different Data Modalities
• Binary, Gaussian, and Softmax RBMs all have binary hidden variables, but
  use them to model different kinds of visible data:
  – Binary
  – Real-valued
  – 1-of-K (e.g., a one-hot word vector such as [0 0 0 1 0])
• In every case it is easy to infer the states of the hidden variables:
  each hidden unit's posterior is a logistic function of its weighted input
  from the visible variables.
[Courtesy, R. Salakhutdinov]
Product of Experts
The joint distribution is given by the RBM model above. Marginalizing over the
binary hidden variables gives a product of terms, one per hidden unit, so each
hidden unit acts as an "expert" that multiplicatively raises or lowers the
probability of a visible configuration.
Product of Experts
Learned topics (top words per topic):

  government   clinton     bribery      oil      stock    …
  authority    house       corruption   barrel   wall
  power        president   dishonesty   exxon    street
  empire       bill        putin        putin    point
  putin        congress    fraud        drill    dow

Topics "government", "corruption" and "oil" can combine to give very high
probability to the word "Putin".
[Courtesy, R. Salakhutdinov]
(Srivastava & Salakhutdinov, NIPS 2012)
Deep Boltzmann Machines
Low-level features ("edges") are built from unlabeled inputs.
Input: pixels (image).
[Figure: image pixels feeding a layer of learned edge features]
[Courtesy, R. Salakhutdinov]
(Salakhutdinov & Hinton, Neural Computation 2012)
Deep Boltzmann Machines
Learn simpler representations, then compose more complex ones:
• Input: pixels (image)
• Low-level features: edges, built from unlabeled inputs
• Higher-level features: combinations of edges
[Courtesy, R. Salakhutdinov]
(Salakhutdinov 2008, Salakhutdinov & Hinton 2012)
Model Formulation
• Same model parameters as RBMs (weight matrices W1, W2, W3)
• Dependencies between hidden variables
• All connections are undirected
• Inference is both bottom-up and top-down: each hidden layer receives input
  from the layer below and the layer above (see the sketch below)
• Requires approximate inference to train, but it can be done… and it scales
  to millions of examples
[Figure: input v connected to hidden layers h1, h2, h3 through weights W1, W2, W3]
[Courtesy, R. Salakhutdinov]
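To make the bottom-up + top-down point concrete: in a DBM, the conditional for a middle hidden layer combines input from the layer below and the layer above, e.g. P(h1_j = 1 | v, h2) = logistic(∑_i W1_ij v_i + ∑_k W2_jk h2_k). The sketch below runs a few alternating mean-field-style updates with invented sizes; it illustrates the idea rather than the exact training procedure from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-hidden-layer DBM: v -- W1 -- h1 -- W2 -- h2.
rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 6, 4, 3
W1 = rng.normal(scale=0.3, size=(n_v, n_h1))
W2 = rng.normal(scale=0.3, size=(n_h1, n_h2))

def update_h1(v, h2):
    """Middle layer gets bottom-up input from v AND top-down input from h2."""
    return sigmoid(v @ W1 + W2 @ h2)

def update_h2(h1):
    """Top layer gets only bottom-up input from h1."""
    return sigmoid(h1 @ W2)

# A few alternating updates approximate the posterior over the hidden layers.
v = np.array([1., 0., 1., 1., 0., 0.])
h1, h2 = np.full(n_h1, 0.5), np.full(n_h2, 0.5)
for _ in range(10):
    h1 = update_h1(v, h2)
    h2 = update_h2(h1)
print(h1, h2)
```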
Samples Generated by the Model
[Figure: training data (handwritten digits) vs. model-generated samples]
[Courtesy, R. Salakhutdinov]
Handwriting Recognition

MNIST Dataset (60,000 examples of 10 digits):
  Learning Algorithm                        Error
  Logistic regression                       12.0%
  K-NN                                       3.09%
  Neural Net (Platt 2005)                    1.53%
  SVM (Decoste et al. 2002)                  1.40%
  Deep Autoencoder (Bengio et al. 2007)      1.40%
  Deep Belief Net (Hinton et al. 2006)       1.20%
  DBM                                        0.95%

Optical Character Recognition (42,152 examples of 26 English letters):
  Learning Algorithm                        Error
  Logistic regression                       22.14%
  K-NN                                      18.92%
  Neural Net                                14.62%
  SVM (Larochelle et al. 2009)               9.70%
  Deep Autoencoder (Bengio et al. 2007)     10.05%
  Deep Belief Net (Larochelle et al. 2009)   9.68%
  DBM                                        8.40%

Permutation-invariant version.
[Courtesy, R. Salakhutdinov]
3-D Object Recognition

NORB Dataset: 24,000 examples
  Learning Algorithm                     Error
  Logistic regression                    22.5%
  K-NN (LeCun 2004)                      18.92%
  SVM (Bengio & LeCun 2007)              11.6%
  Deep Belief Net (Nair & Hinton 2009)    9.0%
  DBM                                     7.2%

Pattern Completion
[Figure: partially occluded inputs completed by the model]
[Courtesy, R. Salakhutdinov]
Learning Shared Representations Across Sensory Modalities
[Figure: an image and its text tags mapping to a shared "concept" representation;
example tags: sunset, pacific ocean, baker beach, seashore, ocean]
[Courtesy, R. Salakhutdinov]
A Simple Multimodal Model
• Use a joint binary hidden layer.
• Problem: the inputs (real-valued image features vs. 1-of-K word counts)
  have very different statistical properties.
• Difficult to learn cross-modal features.
[Courtesy, R. Salakhutdinov]
Multimodal DBM
• Image pathway: Gaussian model over dense, real-valued image features
• Text pathway: Replicated Softmax over word counts
• The two pathways are joined by higher hidden layers, with both
  bottom-up and top-down connections.
[Courtesy, R. Salakhutdinov]
(Srivastava & Salakhutdinov, NIPS 2012, JMLR 2014)
Text Generated from Images
• Given: [image]  Generated: dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
• Given: [image]  Generated: insect, butterfly, insects, bug, butterflies, lepidoptera
• Given: [image]  Generated: sea, france, boat, mer, beach, river, bretagne, plage, brittany
• Given: [image]  Generated: graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
• Given: [image]  Generated: portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
• Given: [image]  Generated: canada, nature, sunrise, ontario, fog, mist, bc, morning
[Courtesy, R. Salakhutdinov]
Text Generated from Images
• Given: [image]  Generated: portrait, women, army, soldier, mother, postcard, soldiers
• Given: [image]  Generated: obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
• Given: [image]  Generated: water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
Images Generated from Text
• Given: water, red, sunset → Retrieved: [images]
• Given: nature, flower, red, green → Retrieved: [images]
• Given: blue, green, yellow, colors → Retrieved: [images]
• Given: chocolate, cake → Retrieved: [images]
[Courtesy, R. Salakhutdinov]
MIR-Flickr Dataset (Huiskes et al.)
• 1 million images along with user-assigned tags.
[Figure: example images with user-assigned tags such as nikon, abigfave, goldstaraward, d80, nikond80;
food, cupcake, vegan; sculpture, beauty, stone; anawesomeshot, theperfectphotographer, flash,
damniwishidtakenthat, spiritofphotography; nikon, green, light, photoshop, apple, d70; white, yellow,
abstract, lines, bus, graphic; sky, geotagged, reflection, cielo, bilbao, reflejo]
[Courtesy, R. Salakhutdinov]
Results
• Logistic regression on the top-level representation.
• Multimodal inputs. Metrics: Mean Average Precision (MAP) and Precision@50.

  Learning Algorithm            MAP     Precision@50
  Random                        0.124   0.124
  Labeled 25K examples:
    LDA [Huiskes et al.]        0.492   0.754
    SVM [Huiskes et al.]        0.475   0.758
    DBM-Labelled                0.526   0.791
  + 1 Million unlabelled:
    Deep Belief Net             0.638   0.867
    Autoencoder                 0.638   0.875
    DBM                         0.641   0.873

[Courtesy, R. Salakhutdinov]
Artificial Neural Networks: Summary
• Highly non-linear regression/classification
• Hidden layers learn intermediate representations
• Potentially millions of parameters to estimate
• Stochastic gradient descent, local minima problems
• Deep networks have produced real progress in many fields
– computer vision
– speech recognition
– mapping images to text
– recommender systems
– …
• They learn very useful non-linear representations