
Chapter 8, Part 3: Modules and Architectures

Advanced Topics in Statistical Machine Learning

Tom Rainforth
Hilary 2022
[email protected]
Complicated Architectures from Simple Building Blocks

Ops, Modules, and Factories

To help with exposition, we define three kinds of objects used to construct a computation graph:

Op       A node in a computation graph. Given an input, its output will be a block of (hidden) unit values. We can think of ops as evaluations of functions; they have no parameters of their own.

Module   A function that can be applied multiple times within the graph. It may be parameterized, in which case the parameters are shared across uses. Applying a module creates an op.

Factory  A procedure that generates modules. This allows us to produce multiple modules that have separate sets of parameters.
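To make these definitions concrete, here is a minimal Python sketch of the three kinds of objects (our own illustrative code, not a standard API; numpy stands in for the underlying tensor operations):

import numpy as np

# Minimal sketch: an "op" is simply the value produced by applying a module to an input.

class Module:
    """A reusable function; its parameters are shared across every application."""
    def __init__(self, W, sigma):
        self.W = W          # parameters, shared across all uses of this module
        self.sigma = sigma  # fixed nonlinearity (no parameters of its own)

    def __call__(self, x):
        # Applying the module to an input creates an op: a block of (hidden) unit values.
        return self.sigma(self.W @ x)

def factory(n, sigma, rng=np.random.default_rng(0)):
    """A factory: each call produces a fresh module with its own separate parameters."""
    return Module(rng.standard_normal((n, n)) * 0.1, sigma)

For example, module = factory(4, np.tanh) builds a module, and op = module(np.ones(4)) is an op in the graph.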
A Simple Example

Consider a factory Factory(n, σ) that generates functions of the form σ(Wx), where W ∈ R^{n×n} and σ is an element-wise nonlinearity. We can then construct the computation graph:

module1 ∼ Factory(n, σ)
module2 ∼ Factory(n, σ)
op1 = module1(x)
op2 = module1(op1)
op3 = module2(op2)
op4 = module2(op3)

module1 and module2 have the same form but different parameters W1 and W2. The computation graph equates to the computation

f(x) = σ(W2 σ(W2 σ(W1 σ(W1 x))))
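Continuing in the same illustrative Python, the graph above can be written out explicitly; note that op1 and op2 reuse W1, while op3 and op4 reuse W2 (this is our own sketch of the construction, not code from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 4, np.tanh

# Two modules of the same form but with different parameters W1 and W2.
W1 = rng.standard_normal((n, n)) * 0.1
W2 = rng.standard_normal((n, n)) * 0.1
module1 = lambda v: sigma(W1 @ v)
module2 = lambda v: sigma(W2 @ v)

x = rng.standard_normal(n)
op1 = module1(x)    # sigma(W1 x)
op2 = module1(op1)  # sigma(W1 sigma(W1 x))                    -- W1 is shared
op3 = module2(op2)  # sigma(W2 sigma(W1 sigma(W1 x)))
op4 = module2(op3)  # f(x) = sigma(W2 sigma(W2 sigma(W1 sigma(W1 x))))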
Linear Factories

One of the simplest and most important kinds of modules are those generated by a linear factory

module ∼ Linear(m, n)
module(x) = Wx + b

which has parameters corresponding to a weight matrix W ∈ R^{m×n} and a bias vector b ∈ R^m. Here m and n correspond to the number of output and input units respectively.

In the most common case, W will be a dense matrix, for which the factory is typically known as fully-connected or dense, but it is also possible to have factories that produce sparse weight matrices.
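A hedged sketch of a linear factory in numpy (our own code); the direct counterpart in PyTorch is torch.nn.Linear(n, m), which likewise stores a dense weight matrix and a bias:

import numpy as np

def linear_factory(m, n, rng=np.random.default_rng(0)):
    """Each call produces a linear module with its own W (m x n) and b (m,)."""
    W = rng.standard_normal((m, n)) / np.sqrt(n)  # one simple initialisation choice among many
    b = np.zeros(m)
    return lambda x: W @ x + b

module = linear_factory(m=3, n=5)
print(module(np.ones(5)).shape)  # (3,): m outputs from n inputs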
Nonlinearity Modules

Fixed nonlinearities are a class of modules that have no learnable parameters.

They are most commonly element-wise nonlinearities, such that they form activation functions:

sigmoid(x) = 1/(1 + exp(−x))
tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x))
ReLU(x) = max(0, x)
softplus(x) = log(1 + exp(x))

There are also some non-element-wise nonlinearities, often used for calculating output layers, e.g. a softmax

softmax([x_1, ..., x_d]) = [ exp(x_1) / Σ_{i=1}^{d} exp(x_i), ..., exp(x_d) / Σ_{i=1}^{d} exp(x_i) ]
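Each of these is a one-liner in numpy (our own sketch); the only practical subtlety is that softmax is usually computed after subtracting max(x), which leaves the result unchanged but stops exp from overflowing:

import numpy as np

sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh     = np.tanh
relu     = lambda x: np.maximum(0.0, x)
softplus = lambda x: np.log1p(np.exp(x))  # log(1 + exp(x))

def softmax(x):
    z = np.exp(x - np.max(x))  # subtracting the max does not change the result
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0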
Max Pooling

A max pooling module subsamples its input, taking the largest value in some surrounding area. For example, max pooling the first two dimensions of a 3D input:

x'_{i'j'k'} = max_{i=1,...,d_1} max_{j=1,...,d_2} x_{i+d_1(i'−1), j+d_2(j'−1), k'}

This can be used to reduce the number of parameters in a model as fewer connections are needed on the next layer.

Figure Credit: http://cs231n.github.io/convolutional-networks/
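A direct, unoptimised numpy transcription of the formula above (our own sketch; library versions such as torch.nn.MaxPool2d are heavily optimised):

import numpy as np

def max_pool(x, d1, d2):
    """Max pool the first two dimensions of a 3D array with a (d1, d2) window and matching stride."""
    n1, n2, n3 = x.shape
    out = np.empty((n1 // d1, n2 // d2, n3))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[d1 * i: d1 * (i + 1), d2 * j: d2 * (j + 1), :]
            out[i, j, :] = patch.max(axis=(0, 1))  # largest value in the surrounding area
    return out

x = np.random.default_rng(0).standard_normal((6, 6, 3))
print(max_pool(x, 2, 2).shape)  # (3, 3, 3): a subsampled version of the input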
Convolutional Neural Networks

• Convolutional Neural Networks (CNNs or ConvNets) are one of the key workhorses of deep learning, particularly when dealing with high-dimensional inputs like images
• Though many of the core ideas stem back to the 80s and 90s, current state-of-the-art approaches in many application areas are still based on CNNs and their derivatives
• Their key feature, convolutional modules, have sparse connections with many shared weights; they require far fewer parameters for the same number of hidden units than MLPs
• They form a principled means of designing large networks while retaining a tractable number of parameters¹

¹ Tractable here is used in quite a loose sense: some modern incarnations have 50M+ parameters. Nonetheless this is dwarfed by the current record for transformer-based networks of 1.6 trillion parameters.
Convolutions (Technically Cross Correlations)

[Figure: a 3x3 filter, e.g.

  1 -1 -1
 -1  1 -1
 -1 -1  1

is slid over a 6x6 binary image; at each position the dot product between the filter and the 3x3 image patch beneath it is taken, producing a 4x4 output (the "convolution").]

Credit: https://cs.uwaterloo.ca/~mli/Deep-Learning-2017-Lecture5CNN.ppt
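The sliding dot product in the figure is easy to reproduce; the image and filter values below are read off the slide as best they can be recovered, so treat them as indicative (our own sketch):

import numpy as np

# 6x6 binary image and a 3x3 filter that responds to a "diagonal of ones" pattern
# (values transcribed from the figure as best they can be recovered).
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

def cross_correlate(x, w):
    """Slide w over x (no padding, stride 1), taking a dot product at each position."""
    H = x.shape[0] - w.shape[0] + 1
    V = x.shape[1] - w.shape[1] + 1
    out = np.empty((H, V))
    for i in range(H):
        for j in range(V):
            out[i, j] = np.sum(x[i:i + w.shape[0], j:j + w.shape[1]] * w)
    return out

print(cross_correlate(image, filt))  # 4x4 output; the top-left entry is 3, where the diagonal pattern matches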
Convolutions for Image Processing

[Figure: image convolution examples: a Gaussian convolution (blurring), an emboss filter, and a sharpen applied to the blue channel of an image (a 3D convolution). Interactive demos: http://setosa.io/ev/image-kernels]

Credit: Frank Wood
Convolutional Layers

[Figure: a 3D convolution with learnt filters maps the input layer to hidden layer 1 (before applying activations).]
Convolutions as Sparse Connections

We can think of a convolution as a sparse matrix multiplication with shared parameters.

Fully connected layer:

W = [ w_11 w_12 w_13 w_14 w_15
      w_21 w_22 w_23 w_24 w_25
      w_31 w_32 w_33 w_34 w_35
      w_41 w_42 w_43 w_44 w_45
      w_51 w_52 w_53 w_54 w_55 ]

Convolutional layer (1D convolution with filter [w_1, w_2, w_3]):

W = [ w_2 w_3 0   0   0
      w_1 w_2 w_3 0   0
      0   w_1 w_2 w_3 0
      0   0   w_1 w_2 w_3
      0   0   0   w_1 w_2 ]

Convolution is thus equivalent to multiplication by a sparse matrix in which the same few parameters are repeated along the bands.
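A quick numpy check of this equivalence (our own sketch): applying the banded matrix above to a length-5 input gives the same answer as sliding the filter [w1, w2, w3] over the zero-padded input.

import numpy as np

rng = np.random.default_rng(0)
w1, w2, w3 = rng.standard_normal(3)
x = rng.standard_normal(5)

# Banded weight matrix from the slide: every row reuses the same three parameters.
W = np.array([[w2, w3, 0., 0., 0.],
              [w1, w2, w3, 0., 0.],
              [0., w1, w2, w3, 0.],
              [0., 0., w1, w2, w3],
              [0., 0., 0., w1, w2]])

# The same computation as a sliding window over the zero-padded input.
x_pad = np.concatenate(([0.0], x, [0.0]))
y = np.array([w1 * x_pad[i] + w2 * x_pad[i + 1] + w3 * x_pad[i + 2] for i in range(5)])

print(np.allclose(W @ x, y))  # True: convolution = sparse matrix multiply with shared weights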
Convolutional Modules

The most common convolutional factory produces 2D convolutional modules that have 3D inputs and outputs, where the third dimension is a number of channels that are summed over:

module ∼ Conv2D(c_in, c_out, d_1, d_2)
x' = module(x)
x'_{i'j'k'} = Σ_{i=1}^{d_1} Σ_{j=1}^{d_2} Σ_{k=1}^{c_in} w_{ijkk'} x_{i'+i−1, j'+j−1, k}   ∀ i', j', k'

There are a number of variants on this, such as including a bias for each output channel, padding the edges of the input (e.g. with zeros) so that the output is the same size, and introducing a stride, wherein the filter is moved by multiple indices at a time.

A depthwise spatial convolution instead applies a separate convolution to each channel (i.e. w_{ijkk'} = 0 if k ≠ k').
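A direct, slow numpy transcription of the Conv2D formula above (our own sketch, with no bias, padding, or stride); the corresponding PyTorch factory is torch.nn.Conv2d(c_in, c_out, (d_1, d_2)), though note that it orders dimensions as (batch, channels, height, width) rather than putting channels last:

import numpy as np

def conv2d(x, w):
    """x: (H, W, c_in) input; w: (d1, d2, c_in, c_out) filter bank."""
    H, Wd, c_in = x.shape
    d1, d2, _, c_out = w.shape
    out = np.zeros((H - d1 + 1, Wd - d2 + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + d1, j:j + d2, :]  # (d1, d2, c_in) window of the input
            # Sum over the filter extent and the input channels for every output channel k'.
            out[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

x = np.random.default_rng(0).standard_normal((8, 8, 3))
w = np.random.default_rng(1).standard_normal((3, 3, 3, 16))
print(conv2d(x, w).shape)  # (6, 6, 16)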
CNNs

The traditional CNN setup has a mixture of convolutional and max pooling layers to learn features, before finishing with one or more fully connected layers to do the final prediction.²

Compared with MLPs, this reduces the number of parameters, thereby reducing both memory and computational costs; ultimately this allows us to train deeper networks.

[Figure: a typical convnet; both filter banks and layers are 4D tensors (arrays of numbers). Source: SB2b/SM4 Deep Learning (ywteh)]

² Some modern large CNNs forgo the pooling layers.
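As a concrete illustration of this layout (a hedged sketch of our own, not an architecture from the lecture), a small PyTorch model for 28x28 single-channel images alternates convolution, ReLU, and max pooling before a fully connected head:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # (N, 1, 28, 28) -> (N, 16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (N, 16, 14, 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> (N, 32, 14, 14)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (N, 32, 7, 7)
    nn.Flatten(),                                 # -> (N, 32 * 7 * 7)
    nn.Linear(32 * 7 * 7, 128),                   # fully connected layers for the final prediction
    nn.ReLU(),
    nn.Linear(128, 10),                           # e.g. 10 class logits
)

print(model(torch.zeros(4, 1, 28, 28)).shape)     # torch.Size([4, 10])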
Spatial Invariances

The other main motivation for using a mix of convolutional and max pooling layers is that it can naturally induce spatial invariances to where objects are in an image.

[Figure: an "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters.]

Figure Credit: https://cs.uwaterloo.ca/~mli/Deep-Learning-2017-Lecture5CNN.ppt
Improving CNNs

Example Architecture: GoogleNet

[Figure: the GoogLeNet (Inception) architecture. Going Deeper With Convolutions, Szegedy et al., CVPR 2015]
Getting Too Deep: ResNets and DenseNets

Making networks too deep can cause difficulties with training; we can end up with worse empirical risks at train (and test) time.

ResNets and DenseNets use skip connections to help alleviate this issue and allow deeper networks to be trained effectively.

[Figure: left, a regular block; right, a residual block. Source: https://www.d2l.ai/chapter_convolutional-modern/resnet.html]

DenseNet blocks replace the addition with a concatenation. (Image Credit: Rowel Atienza)
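A minimal residual block in PyTorch (our own simplification: the channel count and spatial size are kept fixed, and the batch normalisation used in the original ResNet is omitted):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes relu(x + F(x)): the skip connection adds the input back onto the block's output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + out)  # a DenseNet block would concatenate x and out instead of adding

block = ResidualBlock(16)
print(block(torch.zeros(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])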
Recurrent Neural Networks (RNNs)

• Data is often sequential and exploiting this sequential structure can be essential for accurate prediction
  • Time series, text, audio, video
• We may also need to make predictions "online," such that we must predict each output given only the sequence so far
• Even for non-sequential, fixed-dimensional data, we may want to use sequential reasoning processes
  • "Attention" for image data
• Recurrent neural networks (RNNs) are a class of architectures that allow us to deal with such sequential settings by processing inputs and outputs in a sequence
• They have a very wide range of applications and are particularly prominent in natural language processing
Basic RNN Framework

Imagine we have some input sequence x_1, x_2, ..., x_τ and want to predict a corresponding sequence of outputs y_1, y_2, ..., y_τ. Further assume that we will only have access to x_{1:t} when predicting y_t.

RNNs introduce a sequence of hidden states h_t that represent the sequence history x_{1:t} and which can be calculated recursively as

h_t = f_e(h_{t−1}, x_t)

Here f_e is known as an encoder module and its parameters are generally shared between points in the sequence, i.e. the same f_e is used for each t.

Prediction is then performed from the hidden state at each step in the sequence using a decoder f_d, such that ŷ_t = f_d(h_t).
A Simple Example RNN Architecture

f_e ∼ MLP
f_d ∼ MLP
h_1 = f_e(0, x_1)
h_t = f_e(h_{t−1}, x_t)   ∀t ∈ {2, ..., τ}
ŷ_t = f_d(h_t)            ∀t ∈ {1, ..., τ}

Image Credit: fdeloche, https://commons.wikimedia.org/w/index.php?curid=60109157
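The recursion is a short loop; below is a numpy sketch with a single-layer encoder and a linear decoder standing in for the MLPs (our own illustrative parameterisation; in PyTorch, torch.nn.RNN plays the role of f_e):

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y, tau = 3, 8, 2, 5

# Encoder f_e and decoder f_d parameters are created once and shared across all time steps.
W_xh = rng.standard_normal((d_h, d_x)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
W_hy = rng.standard_normal((d_y, d_h)) * 0.1

f_e = lambda h, x: np.tanh(W_hh @ h + W_xh @ x)  # h_t = f_e(h_{t-1}, x_t)
f_d = lambda h: W_hy @ h                         # y_hat_t = f_d(h_t)

xs = rng.standard_normal((tau, d_x))
h = np.zeros(d_h)                                # h_0 = 0
y_hats = []
for t in range(tau):
    h = f_e(h, xs[t])                            # the hidden state summarises x_{1:t}
    y_hats.append(f_d(h))

print(np.stack(y_hats).shape)                    # (5, 2): one prediction per step, using only the sequence so far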
Bidirectional RNN

Sometimes we want predictions to depend on all the inputs rather than just the inputs so far.

We can deal with this by using a bidirectional RNN that has two sets of hidden states, one in each direction.

Image Credit: http://colah.github.io/posts/2015-09-NN-Types-FP/
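A sketch of the bidirectional case (our own code): a second set of hidden states is computed by running over the sequence in reverse, and the prediction at step t can then see the whole input x_{1:τ}:

import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, tau = 3, 4, 6
xs = rng.standard_normal((tau, d_x))

W_fwd = rng.standard_normal((d_h, d_h + d_x)) * 0.1  # forward encoder parameters
W_bwd = rng.standard_normal((d_h, d_h + d_x)) * 0.1  # backward encoder parameters

def run(W, seq):
    h, hs = np.zeros(d_h), []
    for x in seq:
        h = np.tanh(W @ np.concatenate([h, x]))
        hs.append(h)
    return hs

h_fwd = run(W_fwd, xs)              # h_t depends on x_{1:t}
h_bwd = run(W_bwd, xs[::-1])[::-1]  # h'_t depends on x_{t:tau}

# A decoder at step t would act on both directions together, i.e. on all of the inputs.
features = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(features[0].shape)            # (8,)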
Dealing with Varying Input and Output Lengths

Input data and predictions can also be of varying size
• Text translation/generation, incomplete data, videos/audio of varying length, identifying multiple objects in an image

Careful setups of RNNs allow us to deal with this, e.g. using "end of sentence" as a possible input and output for text data which triggers changepoint behavior.

Image Credit: Andrej Karpathy
Example Application: Text Translation

[Figure: an RNN-based text translation example. Credit: https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html]
Further Reading

• Interactive and up-to-date online book on deep learning with code examples etc: https://www.d2l.ai/index.html
• Chapters 9 and 10 of Goodfellow, Bengio, and Courville, Deep Learning
• Deep learning software tutorials (note that you can play around with these directly in your browser without installing anything)
  • PyTorch: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
  • TensorFlow: https://www.tensorflow.org/tutorials
• Stanford course on deep learning: https://youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv (Lectures 5, 9, and 10 of particular relevance for this lecture)
