Machine Learning 10-601

Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University

April 15, 2015

Today:
•  Artificial neural networks
•  Backpropagation
•  Recurrent networks
•  Convolutional networks
•  Deep belief networks
•  Deep Boltzmann machines

Reading:
•  Mitchell: Chapter 4
•  Bishop: Chapter 5
•  Quoc Le tutorial
•  Ruslan Salakhutdinov tutorial
Artificial Neural Networks to learn f: X → Y
•  f might be a non-linear function
•  X: (vector of) continuous and/or discrete variables
•  Y: (vector of) continuous and/or discrete variables

•  Represent f by a network of logistic units

•  Each unit is a logistic function

•  MLE: train the weights of all units to minimize the sum of squared errors of the predicted network outputs
•  MAP: train to minimize the sum of squared errors plus a penalty on the weight magnitudes (a minimal sketch of this objective follows below)
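A minimal sketch (not from the slides) of what the MAP objective looks like for a single logistic unit; all names here are illustrative, and setting the penalty to zero recovers the MLE objective described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_objective(w, X, t, lam):
    """Sum of squared errors of the unit's outputs plus an L2 weight penalty.

    X: (n_examples, n_inputs), t: (n_examples,), w: (n_inputs,), lam: penalty strength.
    lam = 0 gives the MLE (pure squared-error) objective.
    """
    o = sigmoid(X @ w)                       # predicted outputs of the logistic unit
    return np.sum((t - o) ** 2) + lam * np.sum(w ** 2)
```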
ALVINN
[Pomerleau 1993]
M(C)LE Training for Neural Networks

•  Consider the regression problem f: X → Y, for scalar Y:
   y = f(x) + ε, with f deterministic and noise ε ~ N(0, σ_ε), assumed iid

•  Let's maximize the conditional data likelihood

(figure: the learned neural network)
MAP Training for Neural Networks

•  Consider the regression problem f: X → Y, for scalar Y:
   y = f(x) + ε, with f deterministic and noise ε ~ N(0, σ_ε)

•  Gaussian prior on the weights: P(W) = N(0, σI),
   so ln P(W) contributes (up to an additive constant) the penalty term c Σ_i w_i² to the error being minimized

Notation for the (image-only) MLE gradient rule:
   x_d = input
   t_d = target output
   o_d = observed unit output
   w_i = weight i

Notation for the (image-only) backpropagation rule:
   x_d = input
   t_d = target output
   o_d = observed unit output
   w_ij = weight from unit i to unit j
   w_0 = bias weight
   (example network output labels: left, straight, right, up)
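The update rules this notation refers to appear only as images in the scanned slides; for a single sigmoid output unit trained on squared error they take the standard gradient-descent form (as in Mitchell, Ch. 4), written here as a hedged reconstruction:

$$\Delta w_i = \eta \sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d} \qquad \text{(MLE)}$$

$$\Delta w_i = \eta \sum_d (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d} \;-\; 2\eta c\, w_i \qquad \text{(MAP, i.e. weight decay)}$$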
Semantic Memory Model Based on ANNs
[McClelland & Rogers, Nature 2003]

No hierarchy given.

Train with assertions, e.g., Can(Canary, Fly)
Training Networks on Time Series (Recurrent Networks)
•  Suppose we want to predict the next state of the world
   –  and it depends on a history of unknown length
   –  e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns

•  Idea: use a hidden layer in the network to capture the state history
Recurrent Networks on Time Series

How can we train a recurrent net?
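The standard answer is backpropagation through time: unroll the recurrence over the sequence and apply ordinary backpropagation to the unrolled network. A minimal forward-pass sketch of the recurrent architecture described above, with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(X, W_xh, W_hh, W_hy, h0):
    """Run a simple recurrent net over a sequence.

    X: (T, n_in) sequence of inputs; h0: (n_hidden,) initial state.
    The hidden layer at step t sees both the current input and its own
    previous value, so it can summarize an arbitrarily long history.
    """
    h, outputs, states = h0, [], []
    for x in X:
        h = sigmoid(W_xh @ x + W_hh @ h)   # hidden state carries history forward
        states.append(h)
        outputs.append(W_hy @ h)           # prediction, e.g. the next sensor reading
    return np.array(outputs), states
```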


Convolutional Neural Nets for Image Recognition
[Le Cun, 1992]

•  specialized architecture: mixes different types of units, not completely connected, motivated by the primate visual cortex
•  many shared parameters, stochastic gradient training (a weight-sharing sketch follows below)
•  very successful! now many specialized architectures for vision, speech, translation, …
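A minimal sketch, with illustrative names, of the weight-sharing idea: one small filter is slid over the whole image, so the same few parameters are reused at every location (written as cross-correlation, as most deep-learning libraries do).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: one small kernel (shared weights) slides over
    the whole image, giving far fewer parameters than a fully connected layer."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a single 3x3 edge-like filter applied to a 28x28 image
feature_map = conv2d_valid(np.random.rand(28, 28), np.array([[1, 0, -1]] * 3))
```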
Deep Belief Networks [Hinton & Salakhutdinov, 2006]

•  Problem: training networks with many hidden layers doesn't work very well
   –  local minima, very slow training if initialized with zero weights

•  Deep belief networks
   –  autoencoder networks to learn low-dimensional encodings (a one-layer sketch follows below)
   –  but more layers, to learn better encodings
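A minimal sketch, with illustrative names, of the autoencoder building block mentioned above: encode the input into a low-dimensional code, then reconstruct it; training minimizes the reconstruction error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W_enc, b_enc, W_dec, b_dec):
    """One-hidden-layer autoencoder: the hidden code is a low-dimensional
    encoding of x, and training minimizes ||x - x_hat||^2.
    Deep belief nets stack several such encoding layers."""
    code = sigmoid(W_enc @ x + b_enc)      # low-dimensional encoding
    x_hat = sigmoid(W_dec @ code + b_dec)  # reconstruction of the input
    return code, x_hat
```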


Deep Belief Networks [Hinton & Salakhutdinov, 2006]

(figure: original image; reconstruction from a 2000-1000-500-30 DBN; reconstruction from 2000-300 linear PCA, for comparison)
[Hinton & Salakhutdinov, 2006]
Deep Belief Networks: Training
Encoding of digit images in two dimensions
[Hinton & Salakhutdinov, 2006]

(figure: 784-2 linear encoding (PCA) vs. 784-1000-500-250-2 DBNet)


Very Large Scale Use of DBN’s [Quoc Le, et al., ICML, 2012]
Data: 10 million 200x200 unlabeled images, sampled from YouTube
Training: use 1000 machines (16000 cores) for 1 week
Learned network: 3 multi-stage layers, 1.15 billion parameters
Achieves 15.8% accuracy (previous best: 9.5%) classifying among 20k ImageNet categories

(figure: real images that most excite a learned feature, and an image synthesized to most excite that feature)
Restricted Boltzmann Machine
•  Bipartite graph, logistic activation
•  Inference: fill in any subset of nodes, estimate the remaining nodes (inference sketch below)
•  consider v_i, h_j to be boolean variables

(figure: hidden units h1, h2, h3 connected to visible units v1, v2, …, vn)
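A minimal sketch of the "fill in any nodes" inference, with illustrative names and an assumed weight convention (W of shape n_hidden × n_visible): because the graph is bipartite, each side is conditionally independent given the other, and the conditionals are logistic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_infer_hidden(v, W, a):
    """P(h_j = 1 | v) for a binary RBM: given the visibles, the hidden
    units are conditionally independent, each with a logistic activation."""
    return sigmoid(W @ v + a)

def rbm_infer_visible(h, W, b):
    """P(v_i = 1 | h): the same logistic form in the other direction."""
    return sigmoid(W.T @ h + b)
```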
Impact of Deep Learning
•  Speech Recognition
•  Computer Vision
•  Recommender Systems
•  Language Understanding
•  Drug Discovery and Medical Image Analysis
[Courtesy of R. Salakhutdinov]
Feature Representations: Traditionally

(diagram: Data → Feature extraction → Learning algorithm)

Image → vision features → Object detection, Recognition
Audio → audio features → Audio classification, Speaker identification

[Courtesy of R. Salakhutdinov]
Computer Vision Features

(examples of hand-engineered features: SIFT, Textons, HoG, RIFT, GIST)

[Courtesy, R. Salakhutdinov]


Audio Features

(examples of hand-engineered features: Spectrogram, MFCC, Flux, ZCR, Rolloff)

Representation Learning: can we automatically learn these representations?

[Courtesy, R. Salakhutdinov]
Restricted Boltzmann Machines

Graphical Models: a powerful framework for representing dependency structure between random variables.

(figure: hidden variables, the "feature detectors", above visible variables, the image pixels; the energy has pair-wise and unary terms, reconstructed below)

RBM is a Markov Random Field with:
•  Stochastic binary visible variables
•  Stochastic binary hidden variables
•  Bipartite connections.

Related: Markov random fields, Boltzmann machines, log-linear models.
[Courtesy, R. Salakhutdinov]
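The energy function shown as an image on this slide has the standard binary-RBM form (a hedged reconstruction): the pair-wise term couples visibles to hiddens, and the unary terms are the biases.

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i,j} W_{ij}\, v_i h_j \;-\; \sum_i b_i v_i \;-\; \sum_j a_j h_j, \qquad P_\theta(\mathbf{v},\mathbf{h}) = \frac{1}{Z(\theta)}\exp\!\big(-E(\mathbf{v},\mathbf{h})\big)$$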
Learning Features

(figure: Observed Data, a subset of 25,000 handwritten characters, and the learned W: "edge"-like filters, a subset of 1,000 features)

A new image is decomposed into a sparse representation over the learned features.

Logistic function: suitable for modeling binary images.

[Courtesy, R. Salakhutdinov]
Model Learning

(figure: hidden units above visible units, the image pixels)

Given a set of i.i.d. training examples, we want to learn the model parameters θ.
Maximize the log-likelihood objective.

Derivative of the log-likelihood: difficult to compute, because the model expectation runs over exponentially many configurations (a common approximation is sketched below).

[Courtesy, R. Salakhutdinov]
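The derivative shown as an image has the standard RBM form, a data-driven term minus a model term; the model term is the intractable part. A common workaround, not spelled out on the slide, is contrastive divergence: approximate the model expectation with a single Gibbs step started at the data. A minimal CD-1 sketch with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, a, b, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step for a binary RBM: approximate the intractable model
    expectation with a single Gibbs step started at the data vector v0.
    W: (n_hidden, n_visible), a: (n_hidden,), b: (n_visible,)."""
    ph0 = sigmoid(W @ v0 + a)                     # data-driven hidden probabilities
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(W.T @ h0 + b)                   # one-step reconstruction
    ph1 = sigmoid(W @ pv1 + a)
    W += lr * (np.outer(ph0, v0) - np.outer(ph1, pv1))   # positive minus negative phase
    a += lr * (ph0 - ph1)
    b += lr * (v0 - pv1)
    return W, a, b
```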


RBMs for Real-valued Data

(figure: binary hidden variables above real-valued visible variables, the image pixels; pair-wise and unary energy terms)

Gaussian-Bernoulli RBM:
•  Stochastic real-valued visible variables
•  Stochastic binary hidden variables
•  Bipartite connections.

[Courtesy, R. Salakhutdinov] (Salakhutdinov & Hinton, NIPS 2007; Salakhutdinov & Murray, ICML 2008)
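As a hedged reminder, one common Gaussian-Bernoulli parameterization (not necessarily the exact one on the slide) keeps the hidden conditionals logistic while each visible becomes Gaussian given the hiddens:

$$P(h_j = 1 \mid \mathbf{v}) = \sigma\Big(a_j + \sum_i W_{ij}\,\frac{v_i}{\sigma_i}\Big), \qquad P(v_i \mid \mathbf{h}) = \mathcal{N}\Big(b_i + \sigma_i \sum_j W_{ij}\, h_j,\; \sigma_i^2\Big)$$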
RBMs for Real-valued Data

Learned features (a subset, out of 10,000), trained on 4 million unlabelled images.

(figure: a new image decomposed as a weighted combination of learned features, e.g. ≈ 0.9 · [feature] + 0.8 · [feature] + 0.6 · [feature] + …)

[Courtesy, R. Salakhutdinov]
RBMs for Word Counts

$$P_\theta(\mathbf{v},\mathbf{h}) = \frac{1}{Z(\theta)} \exp\!\left( \sum_{i=1}^{D}\sum_{j=1}^{F}\sum_{k=1}^{K} W_{ij}^{k}\, v_i^{k} h_j \;+\; \sum_{i=1}^{D}\sum_{k=1}^{K} v_i^{k} b_i^{k} \;+\; \sum_{j=1}^{F} h_j a_j \right)$$

$$P_\theta(v_i^{k} = 1 \mid \mathbf{h}) = \frac{\exp\!\left( b_i^{k} + \sum_{j=1}^{F} h_j W_{ij}^{k} \right)}{\sum_{q=1}^{K} \exp\!\left( b_i^{q} + \sum_{j=1}^{F} h_j W_{ij}^{q} \right)}$$

(figure: visible variables are 1-of-K word indicators; the energy has pair-wise and unary terms)

Replicated Softmax Model: an undirected topic model with
•  Stochastic 1-of-K visible variables
•  Stochastic binary hidden variables
•  Bipartite connections.

[Courtesy, R. Salakhutdinov] (Salakhutdinov & Hinton, NIPS 2010; Srivastava & Salakhutdinov, NIPS 2012)
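A minimal sketch, with illustrative names, of the visible conditional above: for each word position the model is a softmax over the K vocabulary entries, with logits b_k + Σ_j h_j W_jk.

```python
import numpy as np

def softmax_visible(h, W, b):
    """P(v = k | h) for one word position in the replicated softmax model.
    W: (F, K) hidden-to-word weights, b: (K,) word biases, h: (F,) hidden states."""
    logits = b + h @ W
    e = np.exp(logits - logits.max())   # numerically stabilized softmax
    return e / e.sum()
```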
RBMs for Word Counts

(same Replicated Softmax model as above)

Learned features ("topics") from the Reuters dataset (804,414 unlabeled newswire stories, bag-of-words input):
•  russian: russia, moscow, yeltsin, soviet
•  clinton: house, president, bill, congress
•  computer: system, product, software, develop
•  trade: country, import, world, economy
•  stock: wall, street, point, dow

[Courtesy, R. Salakhutdinov]
Different Data Modalities
•  Binary/Gaussian/Softmax RBMs: all have binary hidden variables but use them to model different kinds of data.

(figure: binary, real-valued, and 1-of-K visible units)

•  It is easy to infer the states of the hidden variables: given the visibles, each hidden unit is conditionally independent with a logistic activation.

[Courtesy, R. Salakhutdinov]


Product of Experts

The joint distribution is given by the model above; marginalizing over the hidden variables yields a Product of Experts (the closed form is given after this slide).

Learned topics (top words per topic):
•  government: authority, power, empire, putin
•  clinton: house, president, bill, congress
•  bribery: corruption, dishonesty, putin, fraud
•  oil: barrel, exxon, putin, drill
•  stock: wall, street, point, dow
•  …

Topics "government", "corruption" and "oil" can combine to give very high probability to the word "Putin".

[Courtesy, R. Salakhutdinov] (Srivastava & Salakhutdinov, NIPS 2012)
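For a binary RBM, the marginalization mentioned above can be carried out in closed form, which is where the "product of experts" reading comes from: each hidden unit contributes one multiplicative expert.

$$P_\theta(\mathbf{v}) = \sum_{\mathbf{h}} P_\theta(\mathbf{v},\mathbf{h}) = \frac{1}{Z(\theta)}\,\exp\Big(\sum_i b_i v_i\Big) \prod_{j=1}^{F}\Big(1 + \exp\Big(a_j + \sum_i W_{ij}\, v_i\Big)\Big)$$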


Deep Boltzmann Machines

Learn simpler representations, then compose more complex ones:
•  Input: pixels (image)
•  Low-level features: edges
•  Higher-level features: combinations of edges

Built from unlabeled inputs.

[Courtesy, R. Salakhutdinov] (Salakhutdinov 2008; Salakhutdinov & Hinton, Neural Computation 2012)
Model Formulation

(figure: input v and hidden layers h1, h2, h3, coupled by weight matrices W1, W2, W3; influences run both bottom-up and top-down)

Same as RBMs: the model parameters are the weight matrices W1, W2, W3.
•  Dependencies between hidden variables.
•  All connections are undirected.
•  Bottom-up and top-down: each hidden unit receives input from the layer below and the layer above.

Requires approximate inference to train, but it can be done, and it scales to millions of examples (the joint energy is sketched below).

[Courtesy, R. Salakhutdinov]
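A hedged reconstruction of the joint energy implied by the layer diagram (bias terms omitted): each pair of adjacent layers is coupled by one undirected weight matrix, which is why inference combines bottom-up and top-down messages.

$$E(\mathbf{v},\mathbf{h}^1,\mathbf{h}^2,\mathbf{h}^3) = -\,\mathbf{v}^{\top} W^{1}\mathbf{h}^1 \;-\; {\mathbf{h}^1}^{\top} W^{2}\mathbf{h}^2 \;-\; {\mathbf{h}^2}^{\top} W^{3}\mathbf{h}^3$$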


Samples Generated by the Model

(figure: training data vs. model-generated samples)

[Courtesy, R. Salakhutdinov]


Handwriting Recognition

MNIST Dataset (60,000 examples of 10 digits), permutation-invariant version:
  Logistic regression: 12.0%
  K-NN: 3.09%
  Neural Net (Platt 2005): 1.53%
  SVM (Decoste et al. 2002): 1.40%
  Deep Autoencoder (Bengio et al. 2007): 1.40%
  Deep Belief Net (Hinton et al. 2006): 1.20%
  DBM: 0.95%

Optical Character Recognition (42,152 examples of 26 English letters):
  Logistic regression: 22.14%
  K-NN: 18.92%
  Neural Net: 14.62%
  SVM (Larochelle et al. 2009): 9.70%
  Deep Autoencoder (Bengio et al. 2007): 10.05%
  Deep Belief Net (Larochelle et al. 2009): 9.68%
  DBM: 8.40%

(Error rates; lower is better.)
[Courtesy, R. Salakhutdinov]
3-D Object Recognition

NORB Dataset: 24,000 examples
  Logistic regression: 22.5%
  K-NN (LeCun 2004): 18.92%
  SVM (Bengio & LeCun 2007): 11.6%
  Deep Belief Net (Nair & Hinton 2009): 9.0%
  DBM: 7.2%

(figure: pattern completion examples)

[Courtesy, R. Salakhutdinov]


Learning Shared Representations Across Sensory Modalities

(figure: a shared "concept" layer linking an image with its tags: sunset, pacific ocean, baker beach, seashore, ocean)

[Courtesy, R. Salakhutdinov]


A Simple Multimodal Model
•  Use a joint binary hidden layer.
•  Problem: the inputs have very different statistical properties.
•  Difficult to learn cross-modal features.

(figure: real-valued image features and 1-of-K word inputs feeding a single hidden layer)

[Courtesy, R. Salakhutdinov]


Multimodal DBM

(figure: a Gaussian model over dense, real-valued image features and a Replicated Softmax model over word counts, joined by higher hidden layers)

[Courtesy, R. Salakhutdinov] (Srivastava & Salakhutdinov, NIPS 2012; JMLR 2014)
Multimodal DBM

Bottom-up + top-down inference across both modalities.

(figure: the same architecture, with bottom-up and top-down message passing)

[Courtesy, R. Salakhutdinov] (Srivastava & Salakhutdinov, NIPS 2012; JMLR 2014)
Text Generated from Images

Given (image) → Generated tags:
•  dog, cat, pet, kitten, puppy, ginger, tongue, kitty, dogs, furry
•  insect, butterfly, insects, bug, butterflies, lepidoptera
•  sea, france, boat, mer, beach, river, bretagne, plage, brittany
•  graffiti, streetart, stencil, sticker, urbanart, graff, sanfrancisco
•  portrait, child, kid, ritratto, kids, children, boy, cute, boys, italy
•  canada, nature, sunrise, ontario, fog, mist, bc, morning

[Courtesy, R. Salakhutdinov]


Text Generated from Images

Given (image) → Generated tags:
•  portrait, women, army, soldier, mother, postcard, soldiers
•  obama, barackobama, election, politics, president, hope, change, sanfrancisco, convention, rally
•  water, glass, beer, bottle, drink, wine, bubbles, splash, drops, drop
Images Generated from Text

Given tags → Retrieved images:
•  water, red, sunset
•  nature, flower, red, green
•  blue, green, yellow, colors
•  chocolate, cake

[Courtesy,  R.  Salakhutdinov]  


MIR-Flickr Dataset
•  1 million images along with user-assigned tags. (Huiskes et al.)

(figure: example images with user-assigned tags, roughly grouped as:
  sculpture, beauty, stone;
  nikon, abigfave, d80, nikond80;
  food, cupcake, goldstaraward, d80, vegan;
  anawesomeshot, theperfectphotographer, flash, damniwishidtakenthat, spiritofphotography;
  nikon, green, light, photoshop, apple, d70;
  white, yellow, abstract, lines, bus, graphic;
  sky, geotagged, reflection, cielo, bilbao, reflejo)

[Courtesy, R. Salakhutdinov]
Results
•  Logistic regression on the top-level representation.
•  Multimodal inputs; metrics: Mean Average Precision (MAP) and Precision@50.

  Random: MAP 0.124, Prec@50 0.124
  LDA [Huiskes et al.]: MAP 0.492, Prec@50 0.754
  SVM [Huiskes et al.]: MAP 0.475, Prec@50 0.758
  DBM-Labelled: MAP 0.526, Prec@50 0.791
  Deep Belief Net: MAP 0.638, Prec@50 0.867
  Autoencoder: MAP 0.638, Prec@50 0.875
  DBM: MAP 0.641, Prec@50 0.873

(LDA, SVM and DBM-Labelled use the 25K labeled examples; Deep Belief Net, Autoencoder and DBM additionally use 1 million unlabelled examples.)

[Courtesy, R. Salakhutdinov]


Artificial Neural Networks: Summary
•  Highly non-linear regression/classification
•  Hidden layers learn intermediate representations
•  Potentially millions of parameters to estimate
•  Stochastic gradient descent, local minima problems

•  Deep networks have produced real progress in many fields


–  computer vision
–  speech recognition
–  mapping images to text
–  recommender systems
–  …
•  They learn very useful non-linear representations
