Lecture 11: Neural networks
Spatial Statistics and Image Analysis

David Bolin
University of Gothenburg
Gothenburg, May 13, 2019

Neural nets
• A problem with the methods for image classification from last time is the need for feature selection.
• Neural networks are a class of methods that can be used to design classifiers without the need to select features.
• Let us start with the binary classification problem: We have an image $x$ with pixels $x_1, \ldots, x_p$, which can belong to one of two classes.
• Model:
  $$y_1 = P(z = 0 \mid x) = f(x; \theta), \qquad y_2 = P(z = 1 \mid x) = 1 - f(x; \theta)$$
  for some non-linear function $f$ of the pixel values.
• Likelihood for training the model from $M$ images:
  $$\ell(\theta) = \prod_{i=1}^{M} f(x_i; \theta)^{z_i} \left(1 - f(x_i; \theta)\right)^{1 - z_i}$$

A single-layer neural net
• The idea of neural nets is to approximate $f(x)$ as a sequence of "simple" non-linear functions.
• Let's look at a single-layer model first.
• Start by forming $p_1$ different linear combinations of the data:
  $$W_1^{(1)} \cdot x,\; W_2^{(1)} \cdot x,\; \ldots,\; W_{p_1}^{(1)} \cdot x$$
  where $W_k^{(1)}$ are weights and $W_k^{(1)} \cdot x = w_{k0}^{(1)} + \sum_{i=1}^{p} w_{ki}^{(1)} x_i$.
• To each linear combination, apply a non-linear function $g^{(1)}$:
  $$\alpha_1 = g^{(1)}(W_1^{(1)} \cdot x),\; \alpha_2 = g^{(1)}(W_2^{(1)} \cdot x),\; \ldots,\; \alpha_{p_1} = g^{(1)}(W_{p_1}^{(1)} \cdot x)$$
• Finally, approximate $y_1$ and $y_2$ as transformed linear combinations of these values:
  $$y_1 = g_1^{(2)}(W_1^{(2)} \cdot \alpha,\, W_2^{(2)} \cdot \alpha), \qquad y_2 = g_2^{(2)}(W_1^{(2)} \cdot \alpha,\, W_2^{(2)} \cdot \alpha)$$
• To make sure that $y_1, y_2$ are probabilities, we take $g^{(2)}$ as the softmax function:
  $$g_k^{(2)}(x_1, x_2) = \frac{e^{x_k}}{\sum_{l=1}^{2} e^{x_l}}$$
• We can represent this model as a network:
  [Figure: network with input layer $L_1$ (nodes $x_1, x_2, x_3$), hidden layer $L_2$ (neurons $\alpha$, weights $W^{(1)}$), and output layer $L_3$ (nodes $y_1, y_2$, weights $W^{(2)}$).]
• This is a feed-forward network since information only flows forward in the network.
• The nodes in the hidden layer are called neurons.
• The functions $g^{(1)}$ and $g^{(2)}$ are called activation functions.
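As a concrete illustration, here is a minimal numpy sketch of the single-layer model above: it forms the $p_1$ linear combinations, applies a hidden activation $g^{(1)}$, and passes the two output scores through the softmax. The sigmoid choice for $g^{(1)}$, the separate intercept vectors, and the array shapes are assumptions made for the sketch, not part of the slides.

```python
import numpy as np

def sigmoid(v):
    # One possible choice of hidden activation g^(1)
    return 1.0 / (1.0 + np.exp(-v))

def softmax(z):
    # g^(2): turns the two scores into probabilities that sum to one
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def single_layer_net(x, W1, b1, W2, b2):
    """Single-layer model: x (p,) -> alpha (p1,) -> probabilities (2,).

    W1: (p1, p) weights w_ki^(1), b1: (p1,) intercepts w_k0^(1)
    W2: (2, p1) weights w_ki^(2), b2: (2,) intercepts w_k0^(2)
    """
    alpha = sigmoid(W1 @ x + b1)   # alpha_k = g^(1)(W_k^(1) . x)
    z = W2 @ alpha + b2            # two linear combinations of alpha
    return softmax(z)              # (y1, y2)

# Toy usage with random weights (p = 4 pixels, p1 = 3 hidden neurons)
rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = single_layer_net(x, rng.normal(size=(3, 4)), np.zeros(3),
                     rng.normal(size=(2, 3)), np.zeros(2))
print(y, y.sum())  # two class probabilities summing to 1
```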
A single-layer neural net
• Our model for $y_1 = f(x)$ is thus
  $$f(x) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}, \qquad z_k = w_{k0}^{(2)} + \sum_{i=1}^{p_1} w_{ki}^{(2)}\, g^{(1)}\!\left(w_{i0}^{(1)} + \sum_{j=1}^{p} w_{ij}^{(1)} x_j\right)$$
  where all the weights $w$ should be estimated to give a good fit.
• The main idea of neural networks is that we should be able to approximate any function $f(x)$ in this way:

The universal approximation theorem
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.

Example for binary classification
[Figure: feed-forward network with input layer $L_1$ ($x_1, x_2, x_3$, values $\alpha^{(1)}$), hidden layers $L_2$, $L_3$, $L_4$ (values $\alpha^{(2)}, \alpha^{(3)}, \alpha^{(4)}$), and output layer $L_5$ ($y_1, y_2$, values $\alpha^{(5)}$), with weights $W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}$ between consecutive layers.]
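As a small illustration of the flexibility behind the universal approximation theorem, the sketch below represents the continuous function $|x|$ exactly with just two rectified-linear hidden neurons, since $|x| = \max(0, x) + \max(0, -x)$. The choice of target function and of ReLU units is mine, purely for illustration.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def abs_via_two_neurons(x):
    # Hidden layer: alpha_1 = relu(+1 * x), alpha_2 = relu(-1 * x)
    alpha = relu(np.array([1.0, -1.0]) * x)
    # Output layer: 1 * alpha_1 + 1 * alpha_2 (no output non-linearity needed here)
    return alpha.sum()

xs = np.linspace(-2, 2, 9)
print([abs_via_two_neurons(x) for x in xs])  # matches |x| exactly
print(np.abs(xs))
```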
General feed-forward neural nets for classification
• Input data $x_1, \ldots, x_p$. Output: probabilities for $K$ classes.
• In total $L - 1$ hidden layers in the model.
• We can allow for a non-linear transformation of the input data in the input layer, giving $\alpha_k^{(1)} = g^{(0)}(x_k)$.
• Usually we set $g^{(0)}$ to the identity function but keep the notation $\alpha_k^{(1)} = x_k$ to simplify the formulas.
• At layer $l$ in the model, define linear combinations of the neurons in the previous layer, and new neuron values:
  $$z_k^{(l)} = w_{k0}^{(l-1)} + \sum_{j=1}^{p_{l-1}} w_{kj}^{(l-1)} \alpha_j^{(l-1)}, \quad \text{i.e. } z^{(l)} = W^{(l-1)} \alpha^{(l-1)},$$
  $$\alpha^{(l)} = g^{(l)}(z^{(l)})$$
  for $l = 2, \ldots, L$, where $p_1 = p$ and $\alpha^{(1)} = x$.

Comments
• The output probabilities are given by $\alpha^{(L)}$.
• Common activation functions for the internal layers:
  • Rectified linear: $g(v) = \max(0, v)$. Sometimes called a Rectified Linear Unit (ReLU).
  • Sigmoid function: $g(v) = \frac{1}{1 + e^{-v}}$. Sometimes called a radial basis function (RBF network).
  • $g(v) = \tanh(v)$.
• Common activation function for the output layer for classification:
  • Softmax: $g_i(v_1, \ldots, v_K) = \frac{\exp(v_i)}{\sum_{k=1}^{K} \exp(v_k)}$.
  • A symmetric version of the logit link used for logistic regression.
• The neural network is nothing else than a hierarchically specified non-linear regression. Compare with logistic regression.
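A minimal sketch of the general forward pass described above, with ReLU activations in the internal layers and softmax at the output. Representing the weights as a list of matrices with the intercepts stored separately is an implementation choice of mine, not notation from the slides.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, weights, biases):
    """Forward pass: alpha^(1) = x, z^(l) = W^(l-1) alpha^(l-1), alpha^(l) = g^(l)(z^(l)).

    weights[l] has shape (p_{l+1}, p_l); biases[l] has shape (p_{l+1},).
    ReLU is used for the hidden layers and softmax for the output layer.
    """
    alpha = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ alpha + b
        is_output = (l == len(weights) - 1)
        alpha = softmax(z) if is_output else relu(z)
    return alpha  # alpha^(L): the K class probabilities

# Toy usage: p = 4 inputs, two hidden layers (5 and 3 neurons), K = 2 classes
rng = np.random.default_rng(1)
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))
```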
Parameter estimation
• The neural network defines a non-linear function $f(x, W)$ of the input variables $x$, depending on the unknown weights $W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\}$.
• To estimate $W$ from some input data $\{x_i, y_i\}_{i=1}^{M}$, we can define a loss function $R(y, f(x, W))$ and compute
  $$\hat{W} = \arg\min_{W} \sum_{i=1}^{M} R(y_i, f(x_i, W))$$
• Simple examples of $R$:
  • For regression: squared loss $R(y, f(x, W)) = \frac{1}{2} \| y - f(x, W) \|^2$.
  • For classification: cross-entropy loss $R(y, f(x, W)) = -\sum_{k=1}^{K} 1(y = k) \log f_k(x, W)$.
• Estimate $W$ using gradient descent.

Backpropagation
The gradient of $R$ can be computed using the chain rule.
1 Feed-forward pass: Compute $\alpha_k^{(l)}$ for each layer $l$ and each node $k$ based on the current estimate of $W$.
2 For the output layer, compute
  $$\delta_k^{(L)} = \frac{\partial R}{\partial z_k^{(L)}} = \frac{\partial R}{\partial \alpha_k^{(L)}} \frac{\partial \alpha_k^{(L)}}{\partial z_k^{(L)}} = \frac{\partial R}{\partial \alpha_k^{(L)}} \, \dot{g}^{(L)}(z_k^{(L)})$$
3 For $l = L - 1, \ldots, 2$, compute
  $$\delta_k^{(l)} = \left( \sum_{j=1}^{p_{l+1}} w_{jk}^{(l)} \delta_j^{(l+1)} \right) \dot{g}^{(l)}(z_k^{(l)})$$
4 Compute $\dfrac{\partial R}{\partial w_{kj}^{(l)}} = \alpha_j^{(l)} \delta_k^{(l+1)}$.
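To make steps 1–4 concrete, here is a minimal sketch of backpropagation for a network with one sigmoid hidden layer and a softmax output trained with the cross-entropy loss. For that particular output/loss combination the output-layer delta simplifies to $\delta^{(L)} = \alpha^{(L)} - \mathbf{1}(y)$; that simplification, and the finite-difference check at the end, are standard devices I have added, not statements from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    z2 = W1 @ x + b1          # step 1: feed-forward pass
    a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2
    a3 = softmax(z3)
    return z2, a2, a3

def cross_entropy(a3, y):
    return -np.log(a3[y])     # R = -sum_k 1(y = k) log f_k

def backprop(x, y, W1, b1, W2, b2):
    z2, a2, a3 = forward(x, W1, b1, W2, b2)
    onehot = np.zeros_like(a3); onehot[y] = 1.0
    delta3 = a3 - onehot                       # step 2 (softmax + cross-entropy)
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)   # step 3: (sum_j w_jk delta_j) g'(z_k)
    dW2 = np.outer(delta3, a2); db2 = delta3   # step 4: dR/dw_kj = alpha_j delta_k
    dW1 = np.outer(delta2, x);  db1 = delta2
    return dW1, db1, dW2, db2

# Check one gradient entry against a finite difference
rng = np.random.default_rng(2)
x, y = rng.normal(size=4), 1
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
dW1, *_ = backprop(x, y, W1, b1, W2, b2)
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
num = (cross_entropy(forward(x, Wp, b1, W2, b2)[2], y)
       - cross_entropy(forward(x, W1, b1, W2, b2)[2], y)) / eps
print(dW1[0, 0], num)   # the two values should agree to several decimals
```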
Regularization
• Neural networks in general have too many parameters and will overfit the data.
• An early solution to this problem was to stop the gradient-based estimation before convergence.
• A validation dataset can be used to determine when to stop.
• A more explicit method for regularization is to include a penalty on the weights in the loss function:
  $$\hat{W} = \arg\min_{W} \sum_{i=1}^{M} R(y_i, f(x_i, W)) + \lambda J(W)$$
• A common example is the weight-decay penalty $J(W) = \sum_{j,l,k} \big(w_{kj}^{(l)}\big)^2$, which will pull the weights towards zero.
• $\lambda$ is a tuning parameter: estimate it using cross-validation.

Gradient descent
• Update $w_{kj}^{(l)}$ using a gradient-descent step. Assuming the weight-decay penalty:
  $$w_{kj}^{(l)} \leftarrow w_{kj}^{(l)} - \gamma \left( \frac{\partial R}{\partial w_{kj}^{(l)}} + \lambda w_{kj}^{(l)} \right)$$
  where $\gamma$ is the step length.
• We need a lot of data to estimate these models, and for large datasets the computation of $\frac{\partial R}{\partial w_{kj}^{(l)}}$ is expensive: for $M$ training images with $p$ pixels and a network with $N$ hidden units, $O(pMN)$ operations are needed.
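A small sketch of the penalized update above. Applying the penalty to every parameter block (including intercepts) and the particular values of the step length and of $\lambda$ are illustrative choices of mine; the commented usage reuses the hypothetical objects from the backpropagation sketch above.

```python
def weight_decay_step(params, grads, gamma, lam):
    """One penalized gradient-descent step for each parameter block (numpy arrays):
    w <- w - gamma * (dR/dw + lambda * w)."""
    return [w - gamma * (g + lam * w) for w, g in zip(params, grads)]

# Illustrative usage with the backprop sketch above (hypothetical objects):
# for _ in range(100):
#     grads = backprop(x, y, W1, b1, W2, b2)
#     W1, b1, W2, b2 = weight_decay_step([W1, b1, W2, b2], grads,
#                                        gamma=0.1, lam=1e-3)
```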
Stochastic gradient descent
To speed up the estimation, it is common to replace the exact gradient by a stochastic estimate:
• Option 1: Define $G(W) = \frac{1}{s} \sum_{i=1}^{M} J_i \frac{\partial R_i}{\partial W^{(l)}}$, where the $J_i$ are independent $\mathrm{Be}(s)$ random variables and $R_i = R(y_i, f(x_i, W))$. Thus, we are randomly selecting (on average) $100s\%$ of the images in each iteration. Then
  $$\mathrm{E}(G(W)) = \frac{1}{s} \sum_{i=1}^{M} \mathrm{E}(J_i) \frac{\partial R_i}{\partial W^{(l)}} = \sum_{i=1}^{M} \frac{\partial R_i}{\partial W^{(l)}}$$
• Option 2: Divide the training data into $m$ batches and randomly sample one of the batches in each iteration.
There are several other tricks to speed up convergence, such as momentum updates.
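A sketch of Option 1, where each image enters the gradient estimate with probability $s$ via independent Bernoulli draws; Option 2 would instead iterate over pre-defined batches. The per-image gradient callback `grad_i` is a hypothetical stand-in for the backpropagation computation, and all default parameter values are arbitrary.

```python
import numpy as np

def sgd_option1(W, grad_i, M, s=0.1, gamma=0.01, iters=1000, seed=0):
    """Option 1 stochastic gradient descent.

    grad_i(i, W) is assumed to return dR_i/dW (an array like W) for image i.
    E(G(W)) equals the full-data gradient sum_i dR_i/dW.
    """
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        J = rng.random(M) < s                      # J_i ~ Be(s), independent
        idx = np.flatnonzero(J)
        if idx.size == 0:
            continue                               # nothing sampled this iteration
        G = sum(grad_i(i, W) for i in idx) / s     # G(W) = (1/s) sum_i J_i dR_i/dW
        W = W - gamma * G
    return W
```

In practice `grad_i` would wrap the backpropagation sketch shown earlier, with `W` collecting all weight blocks of the network.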
Convolutional neural networks
• Often, a fully connected network simply has too many parameters: for the first single-layer network for binary classification, we have $p p_1 + 2 p_1$ unknown weights. $p = p_1 = 1000$ thus gives 1,002,000 unknown parameters!
• The problem is that we have a separate weight between each pixel and each hidden node.
• The idea of convolutional neural networks is to reduce the number of parameters by assuming that most of the weights are zero, and that the non-zero weights have a common structure.
• A CNN assumes that the input data has a lattice structure, like an image.
• It consists of a special type of layers called convolution layers, which are based on filtering the image with a kernel.

Convolution layers
A convolution layer has three stages:
1 Convolution stage: Convolve each input image with $f$ different linear filters, with kernels of size $q \times q$, producing $f$ output images.
2 Detector stage: Apply a non-linear function to each image. Typically the rectified linear function $g(v) = \max(0, v)$.
3 Pooling stage: For each image, reduce each non-overlapping block of $r \times r$ pixels to one single value, by for example taking the largest value in the block.
[Figure: a convolution layer — input image → convolution stage → detector stage → pooling stage → output images.]

Comments
• One could view the convolution stage as a regular layer where most of the weights are zero: a pixel in the output image only depends on the $q \times q$ nearest pixels in the input image.
• The different nodes share parameters, since we use the same convolution kernel across the entire image.
• As a result, a convolution layer has $f q^2$ parameters, which is much less than a corresponding fully connected layer with $(p\, p_l)^2$ parameters.
• Since pooling reduces the image size, we can in the next stage use more filters without increasing the total number of nodes.
• Pooling makes the output less sensitive to small translations of the input.
• Another variant of pooling is to take the max across different learned features. This can make the output invariant to other things, such as rotations.
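A minimal numpy sketch of the three stages for a single input image and a single $q \times q$ kernel (so $f = 1$ here). The "valid" handling of the image border, the cropping inside the pooling stage, and the fact that the kernel is not flipped (as is common in CNN implementations) are choices made for this sketch; a real implementation would use a deep-learning framework or scipy.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Convolution stage: 'valid' 2-D filtering of img with a q x q kernel."""
    q = kernel.shape[0]
    H, W = img.shape
    out = np.zeros((H - q + 1, W - q + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + q, j:j + q] * kernel)
    return out

def relu(v):
    """Detector stage: rectified linear non-linearity."""
    return np.maximum(0.0, v)

def max_pool(img, r):
    """Pooling stage: max over non-overlapping r x r blocks
    (image cropped to a multiple of r)."""
    H, W = (img.shape[0] // r) * r, (img.shape[1] // r) * r
    blocks = img[:H, :W].reshape(H // r, r, W // r, r)
    return blocks.max(axis=(1, 3))

# One pass through a convolution layer with q = 3, r = 2
rng = np.random.default_rng(3)
image = rng.normal(size=(16, 16))
kernel = rng.normal(size=(3, 3))
out = max_pool(relu(conv2d_valid(image, kernel)), r=2)
print(out.shape)   # (7, 7): 16 - 3 + 1 = 14, then pooled by 2
```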
Example of a CNN
[Figure: input image ($256 \times 256$) → convolution layer ($64 \times 64 \times 2$) → convolution layer ($16 \times 16 \times 8$) → convolution layer ($4 \times 4 \times 32$) → vectorize ($512 \times 1$) → output layer $L_5$.]
• The first layer has $f = 2$ filters, the second has $f = 4$, the third has $f = 4$.
• Each pooling stage uses $r = 4$.
• The final hidden layer is a usual fully connected layer.
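A quick sanity check of the sizes in the example, assuming the convolution stage preserves the image size (so only the pooling with $r = 4$ changes the side length; that assumption is mine, matching the sizes shown in the figure).

```python
# Track (side length, number of images) through the example CNN
side, n_images = 256, 1
for f in (2, 4, 4):            # filters per input image in each convolution layer
    n_images *= f              # each input image produces f filtered images
    side //= 4                 # pooling with r = 4 shrinks each side by 4
    print(side, n_images)      # 64x64x2, 16x16x8, 4x4x32
print(side * side * n_images)  # 512 values after vectorizing
```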
Comments
• A CNN is a method for image classification using filtered images as features, but where we do not need to specify the features manually.
• Using CNNs for image classification re-popularized neural networks around 2010, and "Deep learning" was coined as a flashy name for using "deep" neural networks with more than one hidden layer.
• For further details on neural networks, see for example:
  1 Computer Age Statistical Inference by Efron and Hastie
  2 deeplearningbook.org
  3 MATLAB guides: Create Simple Deep Learning Network for Classification