Chapter 8, Part 5: Training and Practicalities for Deep Learning Models
Advanced Topics in Statistical Machine Learning
Tom Rainforth
Hilary 2022
[email protected]
Gradient Descent
Using the backpropagation/automatic differentiation techniques
from the last lecture yields the loss gradient ∇θ Li for any
datapoint, from which we can construct the empirical risk gradient
∇θ R̂(θ) = (1/n) Σ_{i=1}^n ∇θ Li + λ∇θ r(θ)
The simplest approach to optimize the network parameters would
now be to just use vanilla gradient descent by taking updates
θt = θt−1 − αt ∇θ R̂(θt−1 )
where αt > 0 is a scalar step–size that is decreased over time.
Given appropriate control of αt , this will converge to a stationary point¹ of R̂(θ) in the limit of a large number of steps
¹ This could be a local minimum or a saddle point
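To make this concrete, here is a minimal sketch of such a full–batch gradient descent loop in PyTorch; the linear model, squared–error loss, synthetic data X, y, regularization weight lam, and step–size schedule below are all illustrative assumptions rather than anything prescribed by the lecture.

    import torch

    # Illustrative setup (assumed, not from the lecture): a toy model, loss, and full dataset
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    X, y = torch.randn(1000, 10), torch.randn(1000, 1)
    lam = 1e-4  # regularization weight lambda

    for t in range(1, 101):
        alpha_t = 0.1 / (1 + t)  # decreasing scalar step size alpha_t
        model.zero_grad()
        # Empirical risk: mean loss over all n datapoints plus the L2 regularizer
        risk = (loss_fn(model(X), y)
                + lam * sum((p ** 2).sum() for p in model.parameters()) / 2)
        risk.backward()  # backpropagation gives the full gradient of the risk
        with torch.no_grad():
            for p in model.parameters():
                p -= alpha_t * p.grad  # theta_t = theta_{t-1} - alpha_t * gradient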
Stochastic Gradients
However, each such step costs O(n), which can be very inefficient for large datasets: we do not generally need to ensure we move in exactly the direction of steepest descent
The gradient will generally not point directly towards the optimum anyway
Taking more noisy steps can converge faster than taking fewer, more accurate steps
Potential for huge gains for large datasets if we can use steps whose cost is ≪ O(n)
Source: https://wikidocs.net/3413
Stochastic Gradient Descent by Minibatching
We can perform stochastic gradient descent (SGD) by taking
different randomized minibatches of the data B ⊂ {1, . . . , n} at
each iteration and updating using only these datapoints:
θt = θt−1 − αt ∇̂θ R̂(θt−1 )    where    ∇̂θ R̂(θ) = (1/|B|) Σ_{i∈B} ∇θ Li + λ∇θ r(θ)
The minibatches are usually generated by randomizing the order of
the data and taking the next |B| samples each iteration, looping
over the dataset multiple times with each loop known as an epoch
The cost of each update is now only O(|B|), which is something
we can directly control by choosing the batch size |B|
Though there is a trade–off (the gradient variance is O(1/|B|)), the optimal batch size is usually quite small (and O(1))
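A minimal PyTorch sketch of this minibatching scheme might look as follows; the model, loss, data, batch size, fixed step size, and regularization weight are again illustrative assumptions.

    import torch

    # Illustrative setup (assumed): toy model, loss, and dataset as in the earlier sketch
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    X, y = torch.randn(1000, 10), torch.randn(1000, 1)
    n, batch_size, alpha, lam = X.shape[0], 32, 0.01, 1e-4

    for epoch in range(10):                       # one full pass over the data per epoch
        perm = torch.randperm(n)                  # randomize the order of the data
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # take the next |B| samples
            model.zero_grad()
            loss = (loss_fn(model(X[idx]), y[idx])
                    + lam * sum((p ** 2).sum() for p in model.parameters()) / 2)
            loss.backward()                       # minibatch estimate of the risk gradient
            with torch.no_grad():
                for p in model.parameters():
                    p -= alpha * p.grad           # each update now costs only O(|B|)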
Convergence of SGD
For appropriate minibatching schemes, we can view ∇̂θ R̂(θt−1 ) as an unbiased Monte Carlo estimate of ∇θ R̂(θt−1 )
By linearity, the expected value of θt given θt−1 for SGD is the
same as the deterministic θt that would have been produced by
using standard gradient descent
This can be used to show that SGD also converges to a stationary
point of R̂(θ) (given some weak assumptions) for any batch size if
the step sizes satisfy the Robbins–Monro conditions:
Σ_{t=1}^∞ αt = ∞,    Σ_{t=1}^∞ αt² < ∞    (1)
A traditional choice would be something like αt = a/(b + t) for
positive constants a and b
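As an illustration, such a schedule can be dropped straight into the minibatch loop sketched earlier; the constants a and b here are arbitrary choices, not values from the lecture.

    a, b = 0.5, 10.0  # illustrative positive constants, not values from the lecture

    def step_size(t):
        # alpha_t = a / (b + t): the alpha_t sum to infinity while their squares do not,
        # so the Robbins-Monro conditions in (1) are satisfied
        return a / (b + t)

    # e.g. inside the minibatch loop above, use alpha_t = step_size(t) in the update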
Momentum
Vanilla SGD can be susceptible to moving extremely slowly
through “valleys” where the optimization problem is ill–conditioned
Momentum mitigates this by averaging over previous minibatches:
∆t = β∆t−1 + (1 − β)∇̂θ R̂(θt−1 ),    0 < β < 1
θt = θt−1 − αt ∆t
This can also help to reduce noise and allow larger learning rates
Figure: Standard SGD vs SGD with momentum
Credit: Dive into Deep Learning https://d2l.ai/index.html
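A minimal sketch of this momentum update in PyTorch is given below, reusing the illustrative model from the earlier sketches; note that the momentum option of torch.optim.SGD implements a closely related (but not identical) formulation.

    import torch

    beta = 0.9  # momentum coefficient, 0 < beta < 1
    # `model` is assumed to be defined as in the earlier sketches;
    # keep one running average Delta per parameter, initialized at zero
    deltas = [torch.zeros_like(p) for p in model.parameters()]

    def momentum_step(alpha_t):
        # Assumes the minibatch gradient is already in p.grad via loss.backward()
        with torch.no_grad():
            for p, d in zip(model.parameters(), deltas):
                d.mul_(beta).add_((1 - beta) * p.grad)  # Delta_t = beta Delta_{t-1} + (1-beta) grad
                p -= alpha_t * d                        # theta_t = theta_{t-1} - alpha_t Delta_t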
ADAM
Standard SGD uses a single step size for all parameters, but
parameters may have different scales and need different step sizes
ADAM is a popular optimiser that accounts for this by keeping
weighted moving average estimates for both the gradient (∆t ) and
the squared gradient (Vt ), scaling the step size using the latter
∆t = β1 ∆t−1 + (1 − β1 )∇̂θ R̂(θt−1 ),    ∆̃t = ∆t / (1 − (β1 )^t )
Vt = β2 Vt−1 + (1 − β2 )(∇̂θ R̂(θt−1 ))²,    Ṽt = Vt / (1 − (β2 )^t )
θt = θt−1 − αt ∆̃t / (√Ṽt + ϵ)
Vt is related to the curvature of the loss landscape.
Both ∆0 and V0 are initialized at 0
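The update above can be transcribed almost directly into code; a hedged sketch is shown below, again reusing the illustrative model and using the commonly quoted default choices for β1, β2, and ϵ. In practice one would normally just call torch.optim.Adam, which implements this optimiser.

    import torch

    beta1, beta2, eps = 0.9, 0.999, 1e-8            # common default choices
    params = list(model.parameters())               # `model` as in the earlier sketches
    deltas = [torch.zeros_like(p) for p in params]  # Delta_0 = 0
    vs = [torch.zeros_like(p) for p in params]      # V_0 = 0

    def adam_step(t, alpha_t):
        # Assumes gradients are already populated in p.grad; t starts at 1
        with torch.no_grad():
            for p, d, v in zip(params, deltas, vs):
                g = p.grad
                d.mul_(beta1).add_((1 - beta1) * g)      # Delta_t
                v.mul_(beta2).add_((1 - beta2) * g * g)  # V_t (elementwise squared gradient)
                d_hat = d / (1 - beta1 ** t)             # bias-corrected Delta_t
                v_hat = v / (1 - beta2 ** t)             # bias-corrected V_t
                p -= alpha_t * d_hat / (v_hat.sqrt() + eps)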
Initialization
In general, parameters are initialized randomly, typically with
zero-mean Gaussian distributions
Randomization is important to break symmetry so that each unit
in a layer will learn differently
Variances are chosen to ensure that the pre–activations are in sensible regions that avoid saturating the activation function (e.g. roughly [−2, 2] for tanh or sigmoid)
A linear layer will cause the pre–activations to be O(σℓ² dℓ−1 ) times the activations of the previous layer if the weights ∼ N (0, σℓ² )
We can avoid blow up or collapse by setting σℓ² = O(1/dℓ−1 )
Using σℓ² = 1/dℓ−1 exactly is known as Xavier initialization
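A minimal sketch of this initialization for a single linear layer is shown below (the layer sizes are arbitrary illustrative choices); library helpers such as torch.nn.init.xavier_normal_ use a closely related scaling based on both the input and output widths.

    import torch

    def xavier_init_linear(d_in, d_out):
        # Weights ~ N(0, 1/d_in), i.e. sigma_l^2 = 1/d_{l-1}, so pre-activations stay on
        # roughly the same scale as the previous layer's activations
        layer = torch.nn.Linear(d_in, d_out)
        with torch.no_grad():
            layer.weight.normal_(mean=0.0, std=(1.0 / d_in) ** 0.5)
            layer.bias.zero_()
        return layer

    hidden = xavier_init_linear(256, 128)  # e.g. an illustrative 256 -> 128 hidden layer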
The Loss Landscape
R̂(θ) is a very high–dimensional and highly non–convex function
We are only using local optimizers
However, we rarely actually reach true local optima anyway: we tend to get stuck in regions where it is difficult to make progress
Though deep models tend to be more variable over random seeds than most other ML approaches, they are still remarkably consistent given we only use local optimization
Credit: Visualizing the Loss Landscape of Neural Nets, Li et al, NeurIPS 2018
Modern Networks Can Always Overfit
Modern deep neural networks can effectively always overfit any
training data if the network is sufficiently large, even producing
zero training error on random datasets
Credit: Understanding deep learning requires rethinking generalization. Zhang et al. ICLR, 2017
Double Descent
A strange phenomenon in deep learning is that overparameterized
models typically give the best generalization performance
Credit: Belkin et al 2019 (see notes) and OpenAI blog on “Deep Double Descent”
Regularization
The most common (but still rarely used) classical regularization approach used in deep learning is L2 regularization, r(θ) = ½∥θ∥₂²
This is often known as weight decay as it equates to replacing
θt−1 with θt−1 (1 − λαt ) at each iteration, such that it encourages
a decay towards 0
Instead we typically use algorithmic approaches to regularization:
Dropout: randomly remove hidden units from the network at each training iteration by temporarily fixing each activation to zero with some probability p (typically p = 0.5), sampling the units to be “dropped out” independently at each iteration
Early Stopping: calculate the validation loss for the current
network configuration at regular intervals and stop training
when it starts increasing.
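As an illustration of how these look in code, here is a hedged PyTorch sketch combining dropout with a simple early–stopping loop; the model architecture, data splits, optimiser settings, and patience value are all illustrative assumptions.

    import torch

    # Dropout: each hidden activation is zeroed with probability p during training,
    # with the dropped units resampled independently at every iteration
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 100),
        torch.nn.ReLU(),
        torch.nn.Dropout(p=0.5),
        torch.nn.Linear(100, 1),
    )
    loss_fn = torch.nn.MSELoss()
    X_tr, y_tr = torch.randn(800, 10), torch.randn(800, 1)    # assumed training split
    X_val, y_val = torch.randn(200, 10), torch.randn(200, 1)  # assumed validation split
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Early stopping: monitor the validation loss and stop once it keeps increasing
    best_val, patience, bad_epochs = float("inf"), 5, 0
    for epoch in range(100):
        model.train()                   # dropout is active during training
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()

        model.eval()                    # dropout is disabled when evaluating
        with torch.no_grad():
            val = loss_fn(model(X_val), y_val).item()
        if val < best_val:
            best_val, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation loss has kept increasing: stop
                break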
Computational Considerations
The bottleneck computation involved in training is typically
the large tensor products in the forward pass/backpropagation
Graphics processing units (GPUs) are typically far more
efficient at these than CPUs, sometimes by up to 100+ times
Use of GPUs has thus become essential to the practical training of large, and even medium-sized, networks
Good GPUs tend to be very expensive and energy hungry: access to compute has become a big bottleneck in research and is a driving force in the increased industry presence (Google Colab is a useful way of getting some free access)
Modern packages allow for easy transfer of computation onto
GPUs, but this must still often be carefully managed
More recently, custom–built hardware has started to be
developed, such as Tensor processing units (TPUs)
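For example, in PyTorch moving the computation onto a GPU is largely a matter of placing the model and data on the same device; a minimal sketch (with an illustrative model and data) is:

    import torch

    # Use a GPU if one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(10, 1).to(device)  # move the parameters onto the device
    X = torch.randn(64, 10, device=device)     # data must live on the same device
    y_pred = model(X)                          # the forward pass now runs on the chosen device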
A Full Example Implementation in PyTorch
https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/17a7c7cb80916fcdf921097825a0f562/cifar10_tutorial.ipynb
The Success of Deep Learning
To summarize, the success of deep learning is primarily based on:
The empirical prowess of large, many–layered, neural networks
for problems with a huge amount of high–dimensional data
The flexibility of the general framework to allow highly
customized architectures tailored to specific tasks
Automatic differentiation and deep learning packages making models and their corresponding training schemes easy to construct and run
Effectiveness of stochastic gradient schemes in allowing us
to successfully train huge networks
Suitability of these computations to running on GPUs allowing
for big speed ups
Further Reading
Chapters 11, 12, and 19 of https://d2l.ai/index.html
Chapters 7, 8, and 11 of
https://www.deeplearningbook.org/
Andrew Ng on “Nuts and Bolts of Applying Deep Learning”
https://www.youtube.com/watch?v=F1ka6a13S9I
Tutorials linked to in previous lectures and extensive help pages for packages like TensorFlow and PyTorch
Truly massive body of recent work in the literature
(non–exhaustive list of prominent publication venues:
NeurIPS, ICML, ICLR, JMLR, AISTATS, AAAI, and UAI)