Neural Networks Example
Marcello Pelillo
University of Venice, Italy
Artificial Intelligence
a.y. 2017/18
DARPA Neural Network Study (1989)
“Over the history of computing science, two advances have matured: High
speed numerical processing and knowledge processing (Artificial Intelligence).
Neural networks seem to offer the next necessary ingredient for intelligent
machines − namely, knowledge formation and organization.”
DARPA Neural Network Study (1989)
Transition (1960-1980)
• Widrow – Hoff (LMS rule)
• Anderson (Associative memories)
• Amari
Resurgence (1980-1990’s)
• Hopfield (Ass. mem. / Optimization)
• Rumelhart et al. (Back-prop)
• Kohonen (Self-organizing maps)
• Hinton, Sejnowski (Boltzmann machine)
10^14 – 10^15 connections
Dendrites: Receive incoming signals from other nerve axons via synapse
Neural Dynamics
The transmission of a signal in the cerebral cortex is a complex process. Simplifying:
The SYNAPSE is the relay point where information is conveyed by chemical transmitters from neuron
to neuron. A synapse consists of two parts: the knoblike tip of an axon terminal and the receptor
region on the surface of another neuron. The membranes are separated by a synaptic cleft some
200 nanometers across. Molecules of chemical transmitter, stored in vesicles in the axon terminal,
are released into the cleft by arriving nerve impulses. The transmitter changes the electrical state of the
receiving neuron, making it either more likely or less likely to fire an impulse.
Synaptic Efficacy
It is the amount of current that enters the post-synaptic neuron,
relative to the action potential of the pre-synaptic neuron.
Weights wij represent the strength of the synapse between neuron j and neuron i
Properties of McCulloch-Pitts Networks
By properly combining MP neurons one can simulate the behavior of any
Boolean circuit.
Three elementary logical operations (a) negation, (b) and, (c) or. In each diagram
the states of the neurons on the left are at time t and those on the right at time t +1.
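A minimal sketch (not from the slides; the weights and thresholds are one possible choice) of how McCulloch-Pitts threshold units realize the three elementary operations:

```python
# McCulloch-Pitts unit: fires (outputs 1) iff the weighted input sum reaches the threshold.
def mp_neuron(inputs, weights, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def NOT(x):     return mp_neuron([x], [-1], 0)       # fires only when x = 0
def AND(x, y):  return mp_neuron([x, y], [1, 1], 2)  # fires only when both inputs are 1
def OR(x, y):   return mp_neuron([x, y], [1, 1], 1)  # fires when at least one input is 1

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "| NOT x =", NOT(x), "| AND =", AND(x, y), "| OR =", OR(x, y))
```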
Given:
1) some "features": f1, f2, …, fn
2) some "classes": c1, …, cm
Problem :
To classify an “object” according to its features
Example #1
To classify an "object" as:
c1 = "watermelon"
c2 = "apple"
c3 = "orange"
Example:
weight = 80 g, color = green, size = 10 cm³  →  "apple"
Example #2
• (Potential) Features:
  f1: Body temperature
  f2: Headache? (yes/no)
  f3: Throat is red? (yes/no/medium)
  f4: …
Example #3
Hand-written digit recognition
Example #4:
Face Detection
Example #5:
Spam Detection
Geometric Interpretation
Example:
Classes = {0, 1}
Features = x, y, both taking values in [0, +∞)
Examples:
Linear Separability
A classification problem is said to be linearly separable if the decision regions
can be separated by a hyperplane.
Example: AND
X Y X AND Y
0 0 0
0 1 0
1 0 0
1 1 1
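A concrete check: the half-plane x + y > 1.5 contains only the point (1, 1), so the line x + y = 1.5 is one possible separating hyperplane for AND (a threshold unit with weights (1, 1) and threshold 1.5 computes it); no single line can separate the two classes of XOR.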
Limitations of Perceptrons
It has been shown that perceptrons can only solve linearly separable
problems.
• Add "hidden" layers between the input and output layer. A network
with just one hidden layer can represent any Boolean function, including
XOR (see the sketch after this list)
• The power of multilayer networks was known long ago, but algorithms for
training them, e.g. the back-propagation method, became available
only much later (invented several times, popularized in 1986)
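As an illustration of the representational claim above, here is a minimal hand-crafted sketch (weights chosen by inspection, not learned): XOR computed as "OR but not AND" with one hidden layer of threshold units.

```python
# XOR with one hidden layer of threshold units: XOR(x, y) = OR(x, y) AND NOT AND(x, y).
def step(z):
    return 1 if z >= 0 else 0

def xor(x, y):
    h1 = step(x + y - 0.5)      # hidden unit 1 computes OR(x, y)
    h2 = step(x + y - 1.5)      # hidden unit 2 computes AND(x, y)
    return step(h1 - h2 - 0.5)  # output unit fires iff OR but not AND

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "->", xor(x, y))   # prints 0, 1, 1, 0
```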
Hyperbolic tangent
Back-propagation Learning Algorithm
• An algorithm for learning the weights in a feed-forward network,
given a training set of input-output pairs
• The algorithm is based on the gradient-descent method.
Supervised Learning
Supervised learning algorithms require the presence of a “teacher” who
provides the right answers to the input questions.
L = { (x^1, y^1), …, (x^p, y^p) }

The error function to be minimized is:

E = (1/2) Σ_μ Σ_k ( y_k^μ − O_k^μ )²

where O_k^μ is the output provided by output unit k when the network is
given example μ as input.
Back-Propagation
To minimize the error function E we can use the classic gradient-
descent algorithm:

Δw = − η ∂E/∂w        η = "learning rate"

Hidden unit j receives the net input

h_j = Σ_k w_jk x_k

and produces as output:

V_j = g(h_j) = g( Σ_k w_jk x_k )
Back-Prop:
Updating Hidden-to-Output Weights
E = (1/2) Σ_μ Σ_k ( y_k^μ − O_k^μ )²

ΔW_ij = − η ∂E/∂W_ij

      = − η ∂/∂W_ij [ (1/2) Σ_μ Σ_k ( y_k^μ − O_k^μ )² ]

      = η Σ_μ Σ_k ( y_k^μ − O_k^μ ) ∂O_k^μ/∂W_ij

      = η Σ_μ ( y_i^μ − O_i^μ ) ∂O_i^μ/∂W_ij

      = η Σ_μ ( y_i^μ − O_i^μ ) g′(h_i^μ) V_j^μ
Back-Prop:
Updating Input-to-Hidden Weights (1)

For the input-to-hidden weights w_jk, the same error function

E = (1/2) Σ_μ Σ_i ( y_i^μ − O_i^μ )²

gives

∂E/∂w_jk = − Σ_μ Σ_i ( y_i^μ − O_i^μ ) ∂O_i^μ/∂w_jk

Since O_i^μ = g(h_i^μ) = g( Σ_l W_il V_l^μ ), the chain rule gives

∂O_i^μ/∂w_jk = g′(h_i^μ) W_ij ∂V_j^μ/∂w_jk = g′(h_i^μ) W_ij g′(h_j^μ) ∂h_j^μ/∂w_jk
Back-Prop:
Updating Input-to-Hidden Weights (2)
Since

h_j^μ = Σ_m w_jm x_m^μ

we have ∂h_j^μ/∂w_jk = x_k^μ. Hence, we get:

Δw_jk = η Σ_{μ,i} ( y_i^μ − O_i^μ ) g′(h_i^μ) W_ij g′(h_j^μ) x_k^μ

      = η Σ_{μ,i} δ_i^μ W_ij g′(h_j^μ) x_k^μ

      = η Σ_μ δ̂_j^μ x_k^μ

where δ_i^μ = ( y_i^μ − O_i^μ ) g′(h_i^μ) and δ̂_j^μ = g′(h_j^μ) Σ_i W_ij δ_i^μ.
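A minimal NumPy sketch of the update rules derived above: batch gradient descent on a network with one hidden layer of tanh units, trained on XOR. The architecture (2-4-1), learning rate, number of epochs, and the trick of appending a constant input to absorb the biases are illustrative choices, not part of the slides.

```python
import numpy as np

# Training set: XOR. A constant 1 is appended to each input so that biases
# can be treated as ordinary weights, keeping the update rules above unchanged.
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])  # (P, n_in+1)
Y = np.array([[0.], [1.], [1.], [0.]])                                   # (P, n_out)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 1                          # n_in includes the constant "bias" input
w = rng.normal(scale=0.5, size=(n_hid, n_in))         # input-to-hidden weights  w_jk
W = rng.normal(scale=0.5, size=(n_out, n_hid + 1))    # hidden-to-output weights W_ij (+ bias)
eta = 0.3                                             # learning rate

g = np.tanh
g_prime = lambda z: 1.0 - np.tanh(z) ** 2

for epoch in range(20000):
    # Forward pass
    h = X @ w.T                                   # h_j^mu = sum_k w_jk x_k^mu
    V = np.hstack([g(h), np.ones((len(X), 1))])   # V_j^mu = g(h_j^mu), plus a bias unit
    a = V @ W.T                                   # net input to the output units
    O = g(a)                                      # O_i^mu

    # Backward pass (the deltas from the derivation above)
    delta_out = (Y - O) * g_prime(a)                       # delta_i^mu
    delta_hid = (delta_out @ W[:, :n_hid]) * g_prime(h)    # delta-hat_j^mu

    # Gradient-descent updates, summed over the P training examples
    W += eta * delta_out.T @ V                    # Delta W_ij = eta * sum_mu delta_i^mu V_j^mu
    w += eta * delta_hid.T @ X                    # Delta w_jk = eta * sum_mu delta-hat_j^mu x_k^mu

print(np.round(O, 2))  # outputs should approach the XOR targets 0, 1, 1, 0
```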
Simple remedy:
Leave-one-out: using as many test folds as there are examples (size of test fold = 1)
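A brief sketch of the protocol with toy data and a 1-nearest-neighbour rule (both made up for illustration): P folds, each containing a single held-out example.

```python
# Leave-one-out cross-validation: as many folds as examples, each fold of size 1.
data = [([0.0], 0), ([0.2], 0), ([0.9], 1), ([1.1], 1)]   # toy (feature, label) pairs

def predict_1nn(train, x):
    # Label of the training example closest to x.
    return min(train, key=lambda ex: abs(ex[0][0] - x[0]))[1]

correct = 0
for i, (x, y) in enumerate(data):
    train = data[:i] + data[i + 1:]          # all examples except the i-th
    correct += (predict_1nn(train, x) == y)  # test on the single held-out example

print("leave-one-out accuracy:", correct / len(data))
```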
Overfitting
(a) A good fit to noisy data. (b) Overfitting of the same data: the fit is perfect on the
"training set" (x's), but is likely to be poor on the "test set" represented by the circle.
Early Stopping
Size Matters
Advantages:
IDEA: Remove unit h (and its in/out connections) and adjust the
remaining weights so that the I/O behavior is the same
G. Castellano, A. M. Fanelli, and M. Pelillo, An iterative pruning algorithm for feedforward neural networks, IEEE
Transactions on Neural Networks 8(3):519-531, 1997.
An Iterative Pruning Algorithm
This is equivalent to solving the system:

Σ_{j=1..n_h} w_ij y_j^(μ)  =  Σ_{j=1..n_h, j≠h} ( w_ij + δ_ij ) y_j^(μ),      i = 1 … n_O,  μ = 1 … P

(left-hand side: before pruning; right-hand side: after pruning)

which is equivalent to the following linear system (in the unknown δ's):

Σ_{j≠h} δ_ij y_j^(μ)  =  w_ih y_h^(μ),      i = 1 … n_O,  μ = 1 … P
An Iterative Pruning Algorithm
Ax = b,   where A ∈ ℝ^(P·n_O × n_O·(n_h − 1))

Least-squares solution:

min_x ‖ Ax − b ‖
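A minimal sketch of the least-squares step with an off-the-shelf routine; here A and b are random placeholders standing in for the matrices built from the hidden-unit outputs, and the dimensions follow the slide.

```python
import numpy as np

# Placeholder dimensions: P patterns, n_O output units, n_h hidden units.
P, n_O, n_h = 50, 3, 10

rng = np.random.default_rng(0)
A = rng.normal(size=(P * n_O, n_O * (n_h - 1)))  # stand-in for the system matrix
b = rng.normal(size=P * n_O)                     # stand-in for the w_ih * y_h^(mu) terms

# Least-squares solution min_x ||Ax - b|| (in the paper this is computed iteratively,
# so that the decrease of the residual can be monitored to detect excessive units).
x, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print("residual norm:", np.linalg.norm(A @ x - b))
```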
Detecting Excessive Units
The least-squares problem is solved iteratively: at step k the approximate solution x_k has residual r_k = b − A x_k, whose norm should decrease from step to step: ‖r_{k+1}‖ ≤ ‖r_k‖.

1) Starting point: x_0 = 0, r_0 = b
2) Repeat: …
Pruned nets: 5 hidden nodes on average (nine 4-5-1 networks, one 4-4-1 network)
[Figure: recognition rate (%) and MSE vs. number of hidden units (from 10 down to 1); the minimum network is marked.]
Example: 4-bit symmetry
[Figure: recognition rate (%) and MSE vs. number of hidden units (from 10 down to 1) for the 4-bit symmetry problem; the minimum network is marked.]
Deep Neural Networks
The Age of “Deep Learning”
The Deep Learning “Philosophy”
From: R. E. Turner
Performance Improves with More Data
Old Idea… Why Now?
Predict a single label (or a distribution over labels as shown here to indicate our confidence)
for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x
Height x 3. The 3 represents the three color channels Red, Green, Blue.
From: A. Karpathy
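A small sketch of that representation with a synthetic image (the 4×4 size is arbitrary):

```python
import numpy as np

# A synthetic 4x4 RGB "image": integers 0..255, one 2-D grid per colour channel.
img = np.random.default_rng(0).integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(img.shape)        # (height, width, 3): the 3 channels are Red, Green, Blue
print(img[0, 0])        # the RGB triple of the top-left pixel
print(img[..., 0])      # the red channel as a 4x4 array
```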
Challenges
From: A. Karpathy
The Data-Driven Approach
From: R. E. Turner
Hierarchy of Visual Areas
From: D. Zoccolan
The Retina
The Retina
Receptive Fields
“The region of the visual field in which light stimuli evoke responses of
a given neuron.”
Cellular Recordings
Kuffler, Hubel, Wiesel, … (1968)
Retinal Ganglion Cell Response
Beyond the Retina
Simple Cells
Orientation selectivity: Most V1 neurons are orientation selective, meaning that they
respond strongly to lines, bars, or edges of a particular orientation (e.g., vertical) but
not to the orthogonal orientation (e.g., horizontal).
Complex Cells
Hypercomplex Cells (end-stopping)
Take-Home Message:
Visual System as a Hierarchy of Feature Detectors
Convolution
Convolution
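A minimal sketch of 2-D filtering of a grayscale image with a small kernel (valid mode, cross-correlation form; for symmetric kernels such as the mean and Gaussian filters below this coincides with convolution). The 5×5 toy image is made up for illustration.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take the weighted sum at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 grayscale image
mean_filter = np.ones((3, 3)) / 9.0                # 3x3 mean (box) filter
print(convolve2d(image, mean_filter))              # smoothed 3x3 output
```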
Mean Filters
Gaussian Filters
Gaussian Filters
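A sketch of how a Gaussian kernel can be built; the function name, kernel size, and the two sigma values are illustrative choices. The width sigma controls the scale of the smoothing, as the following slides show.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """size x size Gaussian filter, normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

print(np.round(gaussian_kernel(5, 1.0), 3))   # narrow kernel: light smoothing
print(np.round(gaussian_kernel(5, 3.0), 3))   # wider kernel: heavier smoothing
```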
The Effect of Gaussian Filters
The Effect of Gaussian Filters
Kernel Width Affects Scale
Edge detection
Edge detection
Using Convolution for Edge Detection
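A sketch of edge detection by convolving with the Sobel kernels, assuming SciPy is available; the 6×6 toy image with a single vertical step edge is made up for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels: horizontal and vertical intensity gradients.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Toy image: dark left half, bright right half -> a vertical edge in the middle.
image = np.zeros((6, 6))
image[:, 3:] = 255.0

gx = convolve2d(image, sobel_x, mode="valid")
gy = convolve2d(image, sobel_y, mode="valid")
edges = np.hypot(gx, gy)          # gradient magnitude: large along the edge
print(np.round(edges))
```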
A Variety of Image Filters
From: M. Sebag
Traditional vs Deep Learning Approach
From: M. Sebag
Convolutional Neural Networks (CNNs)
(LeCun 1998)
From: M. A. Ranzato
Using Several Trainable Filters
• 8 layers total
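A sketch of what a bank of trainable filters produces in one convolutional layer: K kernels applied to the same input give K feature maps. The random 3×3 kernels here stand in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))           # one grayscale input "image"
filters = rng.normal(size=(4, 3, 3))      # K = 4 trainable 3x3 kernels (random stand-ins)

# Apply every filter to the image (valid mode) -> K feature maps.
K, fh, fw = filters.shape
oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
feature_maps = np.zeros((K, oh, ow))
for k in range(K):
    for r in range(oh):
        for c in range(ow):
            feature_maps[k, r, c] = np.sum(image[r:r + fh, c:c + fw] * filters[k])

print(feature_maps.shape)   # (4, 6, 6): one 6x6 map per filter
```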
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
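A schematic, runnable version of this loop on a toy linear least-squares model (the dataset, model, batch size, and learning rate are placeholders; in practice the forward/backward steps are provided by frameworks such as those listed at the end).

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 10))                                  # toy dataset
y_train = X_train @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)  # toy targets

params = np.zeros(10)            # a toy linear model, for illustration
lr, batch_size = 0.1, 32

for step in range(200):
    # 1. Sample a batch of data
    idx = rng.choice(len(X_train), size=batch_size, replace=False)
    xb, yb = X_train[idx], y_train[idx]
    # 2. Forward prop it through the model, get the loss
    pred = xb @ params
    loss = np.mean((pred - yb) ** 2)
    # 3. Backprop: gradient of the loss w.r.t. the parameters
    grad = 2.0 * xb.T @ (pred - yb) / batch_size
    # 4. Update the parameters using the gradient
    params -= lr * grad

print("final batch loss:", round(loss, 4))
```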
Data Augmentation
The neurons which are “dropped out” in this way do not contribute to
the forward pass and do not participate in backpropagation.
From: B. Biggio
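A sketch of the (inverted) dropout variant commonly used in practice: during training each unit is zeroed with probability p and the survivors are rescaled by 1/(1−p), so dropped units contribute nothing to the forward pass and receive no gradient; at test time the layer is left unchanged. The function name and toy activations are illustrative.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    rescale the rest by 1/(1-p); identity at test time."""
    if not training or p == 0.0:
        return activations
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask   # dropped units are exactly zero

rng = np.random.default_rng(0)
h = np.ones(10)                          # some hidden-layer activations
print(dropout(h, p=0.5, rng=rng))        # roughly half the units zeroed, rest scaled to 2.0
```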
Layer 1
• Chop the network at the desired layer and use its output as a feature
representation to train an SVM on some other dataset (Zeiler-Fergus 2013):
• Speech recognition
• Autonomous driving
• …
References
• http://neuralnetworksanddeeplearning.com
• http://deeplearning.stanford.edu/tutorial/
• http://www.deeplearningbook.org/
• http://deeplearning.net/
Platforms:
• Theano
• Torch
• TensorFlow
• …