Chapter 8, Part 5: Training and Practicalities for Deep Learning Models
Advanced Topics in Statistical Machine Learning
Tom Rainforth
Hilary 2022
[email protected]
Gradient Descent
Using the backpropagation/automatic differentiation techniques
from the last lecture yields the loss gradient ∇θ Li for any
datapoint, from which we can construct the empirical risk gradient
∇θ R̂(θ) = (1/n) Σ_{i=1}^n ∇θ Li + λ∇θ r(θ)
The simplest approach to optimize the network parameters would
now be to just use vanilla gradient descent by taking updates
θt = θt−1 − αt ∇θ R̂(θt−1 )
where αt > 0 is a scalar step–size that is decreased over time.
Given appropriate control of αt , this will converge to a stationary point¹ of R̂(θ) in the limit of a large number of steps
¹ This could be a local minimum or a saddle point
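To make this concrete, here is a minimal sketch of such a full–batch gradient descent loop in PyTorch; the linear model, squared–error loss, synthetic data X, y, regularization weight lam, and step–size schedule below are all illustrative assumptions rather than anything prescribed by the lecture.

    import torch

    # Illustrative setup (assumed, not from the lecture): a toy model, loss, and full dataset
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    X, y = torch.randn(1000, 10), torch.randn(1000, 1)
    lam = 1e-4  # regularization weight lambda

    for t in range(1, 101):
        alpha_t = 0.1 / (1 + t)  # decreasing scalar step size alpha_t
        model.zero_grad()
        # Empirical risk: mean loss over all n datapoints plus the L2 regularizer
        risk = (loss_fn(model(X), y)
                + lam * sum((p ** 2).sum() for p in model.parameters()) / 2)
        risk.backward()  # backpropagation gives the full gradient of the risk
        with torch.no_grad():
            for p in model.parameters():
                p -= alpha_t * p.grad  # theta_t = theta_{t-1} - alpha_t * gradient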
Stochastic Gradients
However, each such step costs O(n), which can be very inefficient for large datasets: we do not generally need to ensure we move in exactly the direction of steepest descent
The gradient will generally not point directly towards the optimum anyway
Taking more noisy steps can converge faster than taking fewer, more accurate steps
Potential for huge gains for large datasets if we can use steps whose cost is ≪ O(n)
Source: https://wikidocs.net/3413
Stochastic Gradient Descent by Minibatching
We can perform stochastic gradient descent (SGD) by taking
different randomized minibatches of the data B ⊂ {1, . . . , n} at
each iteration and updating using only these datapoints:
θt = θt−1 − αt ∇̂θ R̂(θt−1 )    where    ∇̂θ R̂(θ) = (1/|B|) Σ_{i∈B} ∇θ Li + λ∇θ r(θ)
The minibatches are usually generated by randomizing the order of
the data and taking the next |B| samples each iteration, looping
over the dataset multiple times with each loop known as an epoch
The cost of each update is now only O(|B|), which is something
we can directly control by choosing the batch size |B|
Though there is a trade–off (the gradient variance is O(1/|B|)), the optimal batch size is usually quite small (and O(1))
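A minimal PyTorch sketch of this minibatching scheme might look as follows; the model, loss, data, batch size, fixed step size, and regularization weight are again illustrative assumptions.

    import torch

    # Illustrative setup (assumed): toy model, loss, and dataset as in the earlier sketch
    model = torch.nn.Linear(10, 1)
    loss_fn = torch.nn.MSELoss()
    X, y = torch.randn(1000, 10), torch.randn(1000, 1)
    n, batch_size, alpha, lam = X.shape[0], 32, 0.01, 1e-4

    for epoch in range(10):                       # one full pass over the data per epoch
        perm = torch.randperm(n)                  # randomize the order of the data
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # take the next |B| samples
            model.zero_grad()
            loss = (loss_fn(model(X[idx]), y[idx])
                    + lam * sum((p ** 2).sum() for p in model.parameters()) / 2)
            loss.backward()                       # minibatch estimate of the risk gradient
            with torch.no_grad():
                for p in model.parameters():
                    p -= alpha * p.grad           # each update now costs only O(|B|)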
Convergence of SGD
For appropriate minibatching schemes, we can view ∇̂θ R̂(θt−1 ) as an unbiased Monte Carlo estimate of ∇θ R̂(θt−1 )
By linearity, the expected value of θt given θt−1 for SGD is the
same as the deterministic θt that would have been produced by
using standard gradient descent
This can be used to show that SGD also converges to a stationary
point of R̂(θ) (given some weak assumptions) for any batch size if
the step sizes satisfy the Robbins–Monro conditions:
Σ_{t=1}^∞ αt = ∞,    Σ_{t=1}^∞ αt² < ∞    (1)
A traditional choice would be something like αt = a/(b + t) for
positive constants a and b
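As an illustration, such a schedule can be dropped straight into the minibatch loop sketched earlier; the constants a and b here are arbitrary choices, not values from the lecture.

    a, b = 0.5, 10.0  # illustrative positive constants, not values from the lecture

    def step_size(t):
        # alpha_t = a / (b + t): the alpha_t sum to infinity while their squares do not,
        # so the Robbins-Monro conditions in (1) are satisfied
        return a / (b + t)

    # e.g. inside the minibatch loop above, use alpha_t = step_size(t) in the update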
Momentum
Vanilla SGD can be susceptible to moving extremely slowly
through “valleys” where the optimization problem is ill–conditioned
Momentum mitigates this by averaging over previous minibatches:
∆t = β∆t−1 + (1 − β)∇̂θ R̂(θt−1 ),    0 < β < 1
θt = θt−1 − αt ∆t
This can also help to reduce noise and allow larger learning rates
Figure: Standard SGD vs SGD with momentum
Credit: Dive into Deep Learning https://d2l.ai/index.html
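A minimal sketch of this momentum update in PyTorch is given below, reusing the illustrative model from the earlier sketches; note that the momentum option of torch.optim.SGD implements a closely related (but not identical) formulation.

    import torch

    beta = 0.9  # momentum coefficient, 0 < beta < 1
    # `model` is assumed to be defined as in the earlier sketches;
    # keep one running average Delta per parameter, initialized at zero
    deltas = [torch.zeros_like(p) for p in model.parameters()]

    def momentum_step(alpha_t):
        # Assumes the minibatch gradient is already in p.grad via loss.backward()
        with torch.no_grad():
            for p, d in zip(model.parameters(), deltas):
                d.mul_(beta).add_((1 - beta) * p.grad)  # Delta_t = beta Delta_{t-1} + (1-beta) grad
                p -= alpha_t * d                        # theta_t = theta_{t-1} - alpha_t Delta_t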
ADAM
Standard SGD uses a single step size for all parameters, but
parameters may have different scales and need different step sizes
ADAM is a popular optimiser that accounts for this by keeping
weighted moving average estimates for both the gradient (∆t ) and
the squared gradient (Vt ), scaling the step size using the latter
∆t = β1 ∆t−1 + (1 − β1 )∇̂θ R̂(θt−1 ),    ∆̃t = ∆t / (1 − (β1 )^t )
Vt = β2 Vt−1 + (1 − β2 )(∇̂θ R̂(θt−1 ))²,    Ṽt = Vt / (1 − (β2 )^t )
θt = θt−1 − αt ∆̃t / (√Ṽt + ϵ)
Vt is related to the curvature of the loss landscape.
Both ∆0 and V0 are initialized at 0
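The update above can be transcribed almost directly into code; a hedged sketch is shown below, again reusing the illustrative model and using the commonly quoted default choices for β1, β2, and ϵ. In practice one would normally just call torch.optim.Adam, which implements this optimiser.

    import torch

    beta1, beta2, eps = 0.9, 0.999, 1e-8            # common default choices
    params = list(model.parameters())               # `model` as in the earlier sketches
    deltas = [torch.zeros_like(p) for p in params]  # Delta_0 = 0
    vs = [torch.zeros_like(p) for p in params]      # V_0 = 0

    def adam_step(t, alpha_t):
        # Assumes gradients are already populated in p.grad; t starts at 1
        with torch.no_grad():
            for p, d, v in zip(params, deltas, vs):
                g = p.grad
                d.mul_(beta1).add_((1 - beta1) * g)      # Delta_t
                v.mul_(beta2).add_((1 - beta2) * g * g)  # V_t (elementwise squared gradient)
                d_hat = d / (1 - beta1 ** t)             # bias-corrected Delta_t
                v_hat = v / (1 - beta2 ** t)             # bias-corrected V_t
                p -= alpha_t * d_hat / (v_hat.sqrt() + eps)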
Initialization
In general, parameters are initialized randomly, typically with
zero-mean Gaussian distributions
Randomization is important to break symmetry so that each unit
in a layer will learn differently
Variances are chosen to ensure that the pre–activations are in sensible regions that avoid saturating the activation function (e.g. roughly [−2, 2] for tanh or sigmoid)
A linear layer will cause the pre–activations to be O(σℓ² dℓ−1 ) times the activations of the previous layer if the weights ∼ N (0, σℓ² )
We can avoid blow up or collapse by setting σℓ² = O(1/dℓ−1 )
Using σℓ² = 1/dℓ−1 exactly is known as Xavier initialization
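A minimal sketch of this initialization for a single linear layer is shown below (the layer sizes are arbitrary illustrative choices); library helpers such as torch.nn.init.xavier_normal_ use a closely related scaling based on both the input and output widths.

    import torch

    def xavier_init_linear(d_in, d_out):
        # Weights ~ N(0, 1/d_in), i.e. sigma_l^2 = 1/d_{l-1}, so pre-activations stay on
        # roughly the same scale as the previous layer's activations
        layer = torch.nn.Linear(d_in, d_out)
        with torch.no_grad():
            layer.weight.normal_(mean=0.0, std=(1.0 / d_in) ** 0.5)
            layer.bias.zero_()
        return layer

    hidden = xavier_init_linear(256, 128)  # e.g. an illustrative 256 -> 128 hidden layer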
The Loss Landscape
R̂(θ) is a very high–dimensional and highly non–convex function
We are only using local optimizers
However, we rarely actually reach true local optima anyway: we tend to get stuck in regions where it is difficult to make progress
Though deep models tend to be more variable over random seeds than most other ML approaches, they are still remarkably consistent given we only use local optimization
Credit: Visualizing the Loss Landscape of Neural Nets, Li et al, NeurIPS 2018
Modern Networks Can Always Overfit
Modern deep neural networks can effectively always overfit any
training data if the network is sufficiently large, even producing
zero training error on random datasets
Credit: Understanding deep learning requires rethinking generalization. Zhang et al. ICLR, 2017
Double Descent
A strange phenomenon in deep learning is that overparameterized
models typically give the best generalization performance
Credit: Belkin et al 2019 (see notes) and OpenAI blog on “Deep Double Descent”
Regularization
The most common (but still rarely used) classical regularization approach used in deep learning is L2 regularization, r(θ) = ½∥θ∥₂²
This is often known as weight decay as it equates to replacing
θt−1 with θt−1 (1 − λαt ) at each iteration, such that it encourages
a decay towards 0
Instead we typically use algorithmic approaches to regularization:
Dropout: randomly remove hidden units from the network at each training iteration by temporarily fixing each activation to zero with some probability p (typically p = 0.5), sampling the units to be “dropped out” independently at each iteration
Early Stopping: calculate the validation loss for the current
network configuration at regular intervals and stop training
when it starts increasing.
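As an illustration of how these look in code, here is a hedged PyTorch sketch combining dropout with a simple early–stopping loop; the model architecture, data splits, optimiser settings, and patience value are all illustrative assumptions.

    import torch

    # Dropout: each hidden activation is zeroed with probability p during training,
    # with the dropped units resampled independently at every iteration
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 100),
        torch.nn.ReLU(),
        torch.nn.Dropout(p=0.5),
        torch.nn.Linear(100, 1),
    )
    loss_fn = torch.nn.MSELoss()
    X_tr, y_tr = torch.randn(800, 10), torch.randn(800, 1)    # assumed training split
    X_val, y_val = torch.randn(200, 10), torch.randn(200, 1)  # assumed validation split
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Early stopping: monitor the validation loss and stop once it keeps increasing
    best_val, patience, bad_epochs = float("inf"), 5, 0
    for epoch in range(100):
        model.train()                   # dropout is active during training
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()

        model.eval()                    # dropout is disabled when evaluating
        with torch.no_grad():
            val = loss_fn(model(X_val), y_val).item()
        if val < best_val:
            best_val, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation loss has kept increasing: stop
                break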
Computational Considerations
The bottleneck computation involved in training is typically
the large tensor products in the forward pass/backpropagation
Graphics processing units (GPUs) are typically far more
efficient at these than CPUs, sometimes by up to 100+ times
Use of GPUs has thus become essential to the practical training of large, and even medium-sized, networks
Good GPUs tend to be very expensive and energy hungry: access to compute has become a big bottleneck in research and is a driving force in the increased industry presence (Google Colab is a useful way of getting some free access)
Modern packages allow for easy transfer of computation onto
GPUs, but this must still often be carefully managed
More recently, custom–built hardware has started to be
developed, such as Tensor processing units (TPUs)
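For example, in PyTorch moving the computation onto a GPU is largely a matter of placing the model and data on the same device; a minimal sketch (with an illustrative model and data) is:

    import torch

    # Use a GPU if one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(10, 1).to(device)  # move the parameters onto the device
    X = torch.randn(64, 10, device=device)     # data must live on the same device
    y_pred = model(X)                          # the forward pass now runs on the chosen device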
A Full Example Implementation in PyTorch
https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/17a7c7cb80916fcdf921097825a0f562/cifar10_tutorial.ipynb
The Success of Deep Learning
To summarize, the success of deep learning is primarily based on:
The empirical prowess of large, many–layered, neural networks
for problems with a huge amount of high–dimensional data
The flexibility of the general framework to allow highly
customized architectures tailored to specific tasks
Automatic differentiation and deep learning packages making models and their corresponding training schemes easy to construct and run
Effectiveness of stochastic gradient schemes in allowing us
to successfully train huge networks
Suitability of these computations to running on GPUs allowing
for big speed ups
Further Reading
Chapters 11, 12, and 19 of https://d2l.ai/index.html
Chapters 7, 8, and 11 of
https://www.deeplearningbook.org/
Andrew Ng on “Nuts and Bolts of Applying Deep Learning”
https://www.youtube.com/watch?v=F1ka6a13S9I
Tutorials linked to in previous lectures and extensive help pages for packages like TensorFlow and PyTorch
Truly massive body of recent work in the literature
(non–exhaustive list of prominent publication venues:
NeurIPS, ICML, ICLR, JMLR, AISTATS, AAAI, and UAI)