
Chapter 8, Part 3: Modules and Architectures

Advanced Topics in Statistical Machine Learning

Tom Rainforth
Hilary 2022
[email protected]
Complicated Architectures from Simple Building Blocks

Ops, Modules, and Factories

To help with exposition, we define three kinds of objects used to construct a computation graph:

Op       A node in a computation graph. Given an input, its output will be a block of (hidden) unit values. We can think of ops as evaluations of functions; they have no parameters of their own.

Module   A function that can be applied multiple times within the graph. It may be parameterized, in which case the parameters are shared across uses. Applying a module creates an op.

Factory  A procedure that generates modules. This allows us to produce multiple modules that have separate sets of parameters.
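To make these definitions concrete, here is a minimal Python sketch of the three kinds of objects (our own illustrative code, not a standard API; numpy stands in for the underlying tensor operations):

import numpy as np

# Minimal sketch: an "op" is simply the value produced by applying a module to an input.

class Module:
    """A reusable function; its parameters are shared across every application."""
    def __init__(self, W, sigma):
        self.W = W          # parameters, shared across all uses of this module
        self.sigma = sigma  # fixed nonlinearity (no parameters of its own)

    def __call__(self, x):
        # Applying the module to an input creates an op: a block of (hidden) unit values.
        return self.sigma(self.W @ x)

def factory(n, sigma, rng=np.random.default_rng(0)):
    """A factory: each call produces a fresh module with its own separate parameters."""
    return Module(rng.standard_normal((n, n)) * 0.1, sigma)

For example, module = factory(4, np.tanh) builds a module, and op = module(np.ones(4)) is an op in the graph.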
A Simple Example

Consider a factory Factory(n, σ) that generates functions of the form σ(Wx), where W ∈ R^{n×n} and σ is an element-wise nonlinearity. We can then construct the computation graph:

module1 ∼ Factory(n, σ)
module2 ∼ Factory(n, σ)
op1 = module1(x)
op2 = module1(op1)
op3 = module2(op2)
op4 = module2(op3)

module1 and module2 have the same form but different parameters W1 and W2. The computation graph equates to the computation

f(x) = σ(W2 σ(W2 σ(W1 σ(W1 x))))
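Continuing in the same illustrative Python, the graph above can be written out explicitly; note that op1 and op2 reuse W1, while op3 and op4 reuse W2 (this is our own sketch of the construction, not code from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 4, np.tanh

# Two modules of the same form but with different parameters W1 and W2.
W1 = rng.standard_normal((n, n)) * 0.1
W2 = rng.standard_normal((n, n)) * 0.1
module1 = lambda v: sigma(W1 @ v)
module2 = lambda v: sigma(W2 @ v)

x = rng.standard_normal(n)
op1 = module1(x)    # sigma(W1 x)
op2 = module1(op1)  # sigma(W1 sigma(W1 x))                    -- W1 is shared
op3 = module2(op2)  # sigma(W2 sigma(W1 sigma(W1 x)))
op4 = module2(op3)  # f(x) = sigma(W2 sigma(W2 sigma(W1 sigma(W1 x))))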
Linear Factories

One of the simplest and most important kinds of modules are those generated by a linear factory

module ∼ Linear(m, n)
module(x) = Wx + b

which has parameters corresponding to a weight matrix W ∈ R^{m×n} and a bias vector b ∈ R^m. Here m and n correspond to the number of output and input units respectively.

In the most common case, W will be a dense matrix, for which the factory is typically known as fully-connected or dense, but it is also possible to have factories that produce sparse weight matrices.
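A hedged sketch of a linear factory in numpy (our own code); the direct counterpart in PyTorch is torch.nn.Linear(n, m), which likewise stores a dense weight matrix and a bias:

import numpy as np

def linear_factory(m, n, rng=np.random.default_rng(0)):
    """Each call produces a linear module with its own W (m x n) and b (m,)."""
    W = rng.standard_normal((m, n)) / np.sqrt(n)  # one simple initialisation choice among many
    b = np.zeros(m)
    return lambda x: W @ x + b

module = linear_factory(m=3, n=5)
print(module(np.ones(5)).shape)  # (3,): m outputs from n inputs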
Nonlinearity Modules

Fixed nonlinearities are a class of modules that have no learnable parameters.

They are most commonly element-wise nonlinearities, such that they form activation functions:

sigmoid(x) = 1/(1 + exp(−x))
tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))
ReLU(x) = max(0, x)
softplus(x) = log(1 + exp(x))

There are also some non-element-wise nonlinearities, often used for calculating output layers, e.g. a softmax

softmax([x_1, ..., x_d]) = [ exp(x_1) / Σ_{i=1}^{d} exp(x_i), ..., exp(x_d) / Σ_{i=1}^{d} exp(x_i) ]
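Each of these is a one-liner in numpy (our own sketch); the only practical subtlety is that softmax is usually computed after subtracting max(x), which leaves the result unchanged but stops exp from overflowing:

import numpy as np

sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh     = np.tanh
relu     = lambda x: np.maximum(0.0, x)
softplus = lambda x: np.log1p(np.exp(x))  # log(1 + exp(x))

def softmax(x):
    z = np.exp(x - np.max(x))  # subtracting the max does not change the result
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0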
Max Pooling

A max pooling module subsamples its input, taking the largest value in some surrounding area. For example, max pooling the first two dimensions of a 3D input:

x'_{i'j'k'} = max_{i=1,...,d_1} max_{j=1,...,d_2} x_{i+d_1(i'−1), j+d_2(j'−1), k'}

This can be used to reduce the number of parameters in a model as fewer connections are needed on the next layer.

Figure Credit: http://cs231n.github.io/convolutional-networks/
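A direct, unoptimised numpy transcription of the formula above (our own sketch; library versions such as torch.nn.MaxPool2d are heavily optimised):

import numpy as np

def max_pool(x, d1, d2):
    """Max pool the first two dimensions of a 3D array with a (d1, d2) window and matching stride."""
    n1, n2, n3 = x.shape
    out = np.empty((n1 // d1, n2 // d2, n3))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[d1 * i: d1 * (i + 1), d2 * j: d2 * (j + 1), :]
            out[i, j, :] = patch.max(axis=(0, 1))  # largest value in the surrounding area
    return out

x = np.random.default_rng(0).standard_normal((6, 6, 3))
print(max_pool(x, 2, 2).shape)  # (3, 3, 3): a subsampled version of the input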
Convolutional Neural Networks

• Convolutional Neural Networks (CNNs or ConvNets) are one of the key workhorses of deep learning, particularly when dealing with high-dimensional inputs like images
• Though many of the core ideas stem back to the 80s and 90s, current state-of-the-art approaches in many application areas are still based on CNNs and their derivatives
• Their key feature, convolutional modules, have sparse connections with many shared weights; they require far fewer parameters for the same number of hidden units than MLPs
• They form a principled means of designing large networks while retaining a tractable number of parameters¹

¹ Tractable here is used in quite a loose sense: some modern incarnations have 50M+ parameters. Nonetheless this is dwarfed by the current record for transformer-based networks of 1.6 trillion parameters.
Convolutions (Technically Cross Correlations)

[Figure: a 3x3 filter, e.g.

  1 -1 -1
 -1  1 -1
 -1 -1  1

is slid over a 6x6 binary image; at each position the dot product between the filter and the 3x3 image patch beneath it is taken, producing a 4x4 output (the "convolution").]

Credit: https://cs.uwaterloo.ca/~mli/Deep-Learning-2017-Lecture5CNN.ppt
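The sliding dot product in the figure is easy to reproduce; the image and filter values below are read off the slide as best they can be recovered, so treat them as indicative (our own sketch):

import numpy as np

# 6x6 binary image and a 3x3 filter that responds to a "diagonal of ones" pattern
# (values transcribed from the figure as best they can be recovered).
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

def cross_correlate(x, w):
    """Slide w over x (no padding, stride 1), taking a dot product at each position."""
    H = x.shape[0] - w.shape[0] + 1
    V = x.shape[1] - w.shape[1] + 1
    out = np.empty((H, V))
    for i in range(H):
        for j in range(V):
            out[i, j] = np.sum(x[i:i + w.shape[0], j:j + w.shape[1]] * w)
    return out

print(cross_correlate(image, filt))  # 4x4 output; the top-left entry is 3, where the diagonal pattern matches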
Convolutions for Image Processing

[Figure: image convolution examples: a Gaussian convolution (blurring), an emboss filter, and a sharpen applied to the blue channel of an image (a 3D convolution). Interactive demos: http://setosa.io/ev/image-kernels]

Credit: Frank Wood
Convolutional Layers

[Figure: a 3D convolution with learnt filters maps the input layer to hidden layer 1 (before applying activations).]
Convolutions as Sparse Connections

We can think of a convolution as a sparse matrix multiplication with shared parameters.

Fully connected layer:

W = [ w_11 w_12 w_13 w_14 w_15
      w_21 w_22 w_23 w_24 w_25
      w_31 w_32 w_33 w_34 w_35
      w_41 w_42 w_43 w_44 w_45
      w_51 w_52 w_53 w_54 w_55 ]

Convolutional layer (1D convolution with filter [w_1, w_2, w_3]):

W = [ w_2 w_3 0   0   0
      w_1 w_2 w_3 0   0
      0   w_1 w_2 w_3 0
      0   0   w_1 w_2 w_3
      0   0   0   w_1 w_2 ]

Convolution is thus equivalent to multiplication by a sparse matrix in which the same few parameters are repeated along the bands.
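A quick numpy check of this equivalence (our own sketch): applying the banded matrix above to a length-5 input gives the same answer as sliding the filter [w1, w2, w3] over the zero-padded input.

import numpy as np

rng = np.random.default_rng(0)
w1, w2, w3 = rng.standard_normal(3)
x = rng.standard_normal(5)

# Banded weight matrix from the slide: every row reuses the same three parameters.
W = np.array([[w2, w3, 0., 0., 0.],
              [w1, w2, w3, 0., 0.],
              [0., w1, w2, w3, 0.],
              [0., 0., w1, w2, w3],
              [0., 0., 0., w1, w2]])

# The same computation as a sliding window over the zero-padded input.
x_pad = np.concatenate(([0.0], x, [0.0]))
y = np.array([w1 * x_pad[i] + w2 * x_pad[i + 1] + w3 * x_pad[i + 2] for i in range(5)])

print(np.allclose(W @ x, y))  # True: convolution = sparse matrix multiply with shared weights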
Convolutional Modules

The most common convolutional factory produces 2D convolutional modules that have 3D inputs and outputs, where the third dimension is a number of channels that are summed over:

module ∼ Conv2D(c_in, c_out, d_1, d_2)
x' = module(x)
x'_{i'j'k'} = Σ_{i=1}^{d_1} Σ_{j=1}^{d_2} Σ_{k=1}^{c_in} w_{ijkk'} x_{i'+i−1, j'+j−1, k}   ∀ i', j', k'

There are a number of variants on this, such as including a bias for each output channel, padding the edges of the input (e.g. with zeros) so that the output is the same size, and introducing a stride, wherein the filter is moved by multiple indices at a time.

A depthwise spatial convolution instead applies a separate convolution to each channel (i.e. w_{ijkk'} = 0 if k ≠ k').
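A direct, slow numpy transcription of the Conv2D formula above (our own sketch, with no bias, padding, or stride); the corresponding PyTorch factory is torch.nn.Conv2d(c_in, c_out, (d_1, d_2)), though note that it orders dimensions as (batch, channels, height, width) rather than putting channels last:

import numpy as np

def conv2d(x, w):
    """x: (H, W, c_in) input; w: (d1, d2, c_in, c_out) filter bank."""
    H, Wd, c_in = x.shape
    d1, d2, _, c_out = w.shape
    out = np.zeros((H - d1 + 1, Wd - d2 + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + d1, j:j + d2, :]  # (d1, d2, c_in) window of the input
            # Sum over the filter extent and the input channels for every output channel k'.
            out[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

x = np.random.default_rng(0).standard_normal((8, 8, 3))
w = np.random.default_rng(1).standard_normal((3, 3, 3, 16))
print(conv2d(x, w).shape)  # (6, 6, 16)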
CNNs

The traditional CNN setup has a mixture of convolutional and max pooling layers to learn features, before finishing with one or more fully connected layers to do the final prediction.²

Compared with MLPs, this reduces the number of parameters, thereby reducing both memory and computational costs; ultimately this allows us to train deeper networks.

[Figure: a typical convnet; both filter banks and layers are 4D tensors (arrays of numbers). Source: SB2b/SM4 Deep Learning (ywteh)]

² Some modern large CNNs forgo the pooling layers.
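As a concrete illustration of this layout (a hedged sketch of our own, not an architecture from the lecture), a small PyTorch model for 28x28 single-channel images alternates convolution, ReLU, and max pooling before a fully connected head:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # (N, 1, 28, 28) -> (N, 16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (N, 16, 14, 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> (N, 32, 14, 14)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (N, 32, 7, 7)
    nn.Flatten(),                                 # -> (N, 32 * 7 * 7)
    nn.Linear(32 * 7 * 7, 128),                   # fully connected layers for the final prediction
    nn.ReLU(),
    nn.Linear(128, 10),                           # e.g. 10 class logits
)

print(model(torch.zeros(4, 1, 28, 28)).shape)     # torch.Size([4, 10])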
Spatial Invariances

The other main motivation for using a mix of convolutional and max pooling layers is that it can naturally induce spatial invariances to where objects are in an image.

[Figure: an "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters.]

Figure Credit: https://cs.uwaterloo.ca/~mli/Deep-Learning-2017-Lecture5CNN.ppt
Improving CNNs

Example Architecture: GoogleNet

[Figure: the GoogLeNet (Inception) architecture. Going Deeper With Convolutions, Szegedy et al., CVPR 2015]
Getting Too Deep: ResNets and DenseNets

Making networks too deep can cause difficulties with training; we can end up with worse empirical risks at train (and test) time.

ResNets and DenseNets use skip connections to help alleviate this issue and allow deeper networks to be trained effectively.

[Figure: left, a regular block; right, a residual block. Source: https://www.d2l.ai/chapter_convolutional-modern/resnet.html]

DenseNet blocks replace the addition with a concatenation. (Image Credit: Rowel Atienza)
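A minimal residual block in PyTorch (our own simplification: the channel count and spatial size are kept fixed, and the batch normalisation used in the original ResNet is omitted):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes relu(x + F(x)): the skip connection adds the input back onto the block's output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + out)  # a DenseNet block would concatenate x and out instead of adding

block = ResidualBlock(16)
print(block(torch.zeros(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])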
Recurrent Neural Networks (RNNs)

• Data is often sequential and exploiting this sequential structure can be essential for accurate prediction
  • Time series, text, audio, video
• We may also need to make predictions "online," such that we must predict each output given only the sequence so far
• Even for non-sequential, fixed-dimensional data, we may want to use sequential reasoning processes
  • "Attention" for image data
• Recurrent neural networks (RNNs) are a class of architectures that allow us to deal with such sequential settings by processing inputs and outputs in a sequence
• They have a very wide range of applications and are particularly prominent in natural language processing
Basic RNN Framework

Imagine we have some input sequence x_1, x_2, ..., x_τ and want to predict a corresponding sequence of outputs y_1, y_2, ..., y_τ. Further assume that we will only have access to x_{1:t} when predicting y_t.

RNNs introduce a sequence of hidden states h_t that represent the sequence history x_{1:t} and which can be calculated recursively as

h_t = f_e(h_{t−1}, x_t)

Here f_e is known as an encoder module and its parameters are generally shared between points in the sequence, i.e. the same f_e is used for each t.

Prediction is then performed from the hidden state at each step in the sequence using a decoder f_d, such that ŷ_t = f_d(h_t).
A Simple Example RNN Architecture

f_e ∼ MLP
f_d ∼ MLP
h_1 = f_e(0, x_1)
h_t = f_e(h_{t−1}, x_t)   ∀t ∈ {2, ..., τ}
ŷ_t = f_d(h_t)            ∀t ∈ {1, ..., τ}

Image Credit: fdeloche, https://commons.wikimedia.org/w/index.php?curid=60109157
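The recursion is a short loop; below is a numpy sketch with a single-layer encoder and a linear decoder standing in for the MLPs (our own illustrative parameterisation; in PyTorch, torch.nn.RNN plays the role of f_e):

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y, tau = 3, 8, 2, 5

# Encoder f_e and decoder f_d parameters are created once and shared across all time steps.
W_xh = rng.standard_normal((d_h, d_x)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
W_hy = rng.standard_normal((d_y, d_h)) * 0.1

f_e = lambda h, x: np.tanh(W_hh @ h + W_xh @ x)  # h_t = f_e(h_{t-1}, x_t)
f_d = lambda h: W_hy @ h                         # y_hat_t = f_d(h_t)

xs = rng.standard_normal((tau, d_x))
h = np.zeros(d_h)                                # h_0 = 0
y_hats = []
for t in range(tau):
    h = f_e(h, xs[t])                            # the hidden state summarises x_{1:t}
    y_hats.append(f_d(h))

print(np.stack(y_hats).shape)                    # (5, 2): one prediction per step, using only the sequence so far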
Bidirectional RNN

Sometimes we want predictions to depend on all the inputs rather than just the inputs so far.

We can deal with this by using a bidirectional RNN that has two sets of hidden states, one in each direction.

Image Credit: http://colah.github.io/posts/2015-09-NN-Types-FP/
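A sketch of the bidirectional case (our own code): a second set of hidden states is computed by running over the sequence in reverse, and the prediction at step t can then see the whole input x_{1:τ}:

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, tau = 3, 4, 6
xs = rng.standard_normal((tau, d_x))

W_fwd = rng.standard_normal((d_h, d_h + d_x)) * 0.1  # forward encoder parameters
W_bwd = rng.standard_normal((d_h, d_h + d_x)) * 0.1  # backward encoder parameters

def run(W, seq):
    h, hs = np.zeros(d_h), []
    for x in seq:
        h = np.tanh(W @ np.concatenate([h, x]))
        hs.append(h)
    return hs

h_fwd = run(W_fwd, xs)              # h_t depends on x_{1:t}
h_bwd = run(W_bwd, xs[::-1])[::-1]  # h'_t depends on x_{t:tau}

# A decoder at step t would act on both directions together, i.e. on all of the inputs.
features = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(features[0].shape)            # (8,)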
Dealing with Varying Input and Output Lengths

Input data and predictions can also be of varying size
• Text translation/generation, incomplete data, videos/audio of varying length, identifying multiple objects in an image

Careful setups of RNNs allow us to deal with this, e.g. using "end of sentence" as a possible input and output for text data which triggers changepoint behavior.

Image Credit: Andrej Karpathy
Example Application: Text Translation

[Figure: an RNN-based text translation example. Credit: https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html]
Further Reading

• Interactive and up-to-date online book on deep learning with code examples etc: https://www.d2l.ai/index.html
• Chapters 9 and 10 of Goodfellow, Bengio, and Courville, Deep Learning
• Deep learning software tutorials (note that you can play around with these directly in your browser without installing anything)
  • PyTorch: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
  • TensorFlow: https://www.tensorflow.org/tutorials
• Stanford course on deep learning: https://youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv (Lectures 5, 9, and 10 of particular relevance for this lecture)
