Optimization
Designing, Visualizing and Understanding Deep Neural Networks
CS W182/282A
Instructor: Sergey Levine
UC Berkeley
How does gradient descent work?
The loss “landscape”
gradient descent update: $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$, where $\alpha$ is some small constant called the “learning rate” or “step size”
Gradient descent
in 1D: negative slope = go to the right, positive slope = go to the left
gradient: $\nabla_\theta L(\theta) = \left[ \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots \right]^T$
in general: for each dimension, go in the direction opposite the slope along that dimension
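To make the update concrete, here is a minimal gradient descent sketch in NumPy on a hypothetical quadratic loss (the loss function, learning rate, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def loss(theta):
    # hypothetical quadratic loss with different curvature in each dimension
    return 0.5 * theta[0] ** 2 + 5.0 * theta[1] ** 2

def grad(theta):
    # gradient of the loss above, computed analytically
    return np.array([theta[0], 10.0 * theta[1]])

theta = np.array([2.0, 1.0])   # initial parameters
alpha = 0.05                   # learning rate ("step size")

for _ in range(200):
    theta = theta - alpha * grad(theta)   # move opposite the slope in every dimension

print(theta, loss(theta))      # theta ends up near the optimum at (0, 0)
```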
Visualizing gradient descent
level set contours
visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/
Demo time!
visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/
What’s going on?
we don’t always move toward the optimum!
the steepest direction is not always best!
more on this later…
Logistic regression:
The loss surface
Negative log-likelihood loss for logistic regression is guaranteed to be convex
(this is not an obvious or trivial statement!)
“all roads lead to Rome”: gradient descent heads toward the optimum from any starting point
Convexity: a function is convex if a line segment between any two points lies entirely “above” the graph
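Written out (a standard definition, stated here for completeness): $f$ is convex if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and all $\lambda \in [0, 1]$.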
This is a very nice loss surface. Why? Well, convex functions are “nice” in the sense that simple algorithms like gradient descent have strong guarantees.
(This doesn’t mean that gradient descent works well for all convex functions!)
Is our loss actually this nice?
The loss surface…
…of a neural network
[figure: layer 1, layer 2]
pretty hard to visualize, because neural networks have very large numbers of parameters, but let’s give it a try!
…though some networks are better!
Oh no… the dragon of local optima and the monster of the plateau!
The geography of a loss landscape
the local optimum, the plateau, the saddle point
Local optima
the most obvious issue with non-convex loss landscapes
one of the big reasons people used to worry about neural networks!
very scary in principle, since gradient descent could converge to a solution that is arbitrarily worse than the global optimum!
a bit surprisingly, this becomes less of an issue as the number of parameters increases!
for big networks, local optima exist, but tend to be not much worse than global optima
Choromanska, Henaff, Mathieu, Ben Arous, LeCun.
The Loss Surface of Multilayer Networks.
Plateaus
Can’t just choose tiny learning rates to prevent oscillation!
Need learning rates to be large enough not to get stuck in a plateau
We’ll learn about momentum, which really helps with this
Saddle points
the gradient here is very small
it takes a long time to get out of saddle points
this seems like a very special structure, does it really happen that often?
Yes! In fact, most critical points in neural net loss landscapes are saddle points
Saddle points
Critical points: $\nabla_\theta L(\theta) = 0$; a critical point can be a (local) minimum, a (local) maximum, or a saddle point
In higher dimensions, look at the Hessian matrix: $\mathbf{H}_{ij} = \frac{\partial^2 L(\theta)}{\partial \theta_i \partial \theta_j}$
only a maximum or minimum if the eigenvalues (the diagonal entries after diagonalizing) are all negative or all positive!
how often is that the case?
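As a small illustration (an example chosen here, not from the lecture): for $L(\theta) = \theta_1^2 - \theta_2^2$ the origin is a critical point, but the Hessian has mixed-sign eigenvalues, so it is a saddle point rather than a minimum or maximum:

```python
import numpy as np

# L(theta) = theta_1^2 - theta_2^2: the gradient [2*theta_1, -2*theta_2] vanishes at the origin
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])     # constant Hessian of this quadratic

eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)                    # [-2.  2.] -> mixed signs, so the origin is a saddle point
```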
Which way do we go?
we don’t always move toward the optimum!
the steepest direction is not always best!
more on this later…
Improvement directions
A better direction…
Newton’s method: $\theta \leftarrow \theta - \alpha\, \mathbf{H}^{-1} \nabla_\theta L(\theta)$
$\mathbf{H} = \nabla_\theta^2 L(\theta)$ is the Hessian, $\nabla_\theta L(\theta)$ is the gradient
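To illustrate why this direction is better, here is a sketch of a single Newton step on a hypothetical poorly conditioned quadratic (the matrix and starting point are illustrative):

```python
import numpy as np

# illustrative quadratic loss L(theta) = 0.5 * theta^T A theta, so A is the Hessian
A = np.array([[1.0, 0.0],
              [0.0, 10.0]])                  # curvature differs by 10x across dimensions
theta = np.array([2.0, 1.0])

gradient = A @ theta                         # gradient of the quadratic at theta
newton_step = np.linalg.solve(A, gradient)   # H^{-1} times the gradient (no explicit inverse)
theta = theta - newton_step                  # one Newton step with alpha = 1

print(theta)                                 # lands exactly at the optimum [0, 0]
```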
Tractable acceleration
Why is Newton’s method not a viable way to improve neural network optimization?
the Hessian has $n \times n$ entries for $n$ parameters: storing it takes $O(n^2)$ memory and inverting it takes $O(n^3)$ time if using the naïve approach, though fancy methods can be much faster if they avoid forming the Hessian explicitly
because of this, we would really prefer methods that don’t require second derivatives, but somehow “accelerate” gradient descent instead
Momentum
averaging together successive gradients seems to yield a much better direction!
Intuition: if successive gradient steps point in different directions, we should cancel off the directions that disagree;
if successive gradient steps point in similar directions, we should go faster in that direction
Momentum
momentum update: $m_k \leftarrow \mu\, m_{k-1} + \nabla_\theta L(\theta_k)$ (“blend in” the previous direction), then step $\theta_{k+1} \leftarrow \theta_k - \alpha\, m_k$
this is a very simple update rule
in practice, it brings some of the benefits of Newton’s method, at virtually no cost
this kind of momentum method has few guarantees
a closely related idea is “Nesterov accelerated gradient,” which does carry very appealing guarantees (in practice we usually just use momentum)
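A minimal sketch of gradient descent with momentum in NumPy, following the update above (the loss, $\mu$, and $\alpha$ are illustrative choices):

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative poorly conditioned quadratic loss
    return np.array([theta[0], 10.0 * theta[1]])

theta = np.array([2.0, 1.0])
m = np.zeros_like(theta)       # running momentum direction
mu, alpha = 0.9, 0.02          # momentum coefficient and learning rate

for _ in range(300):
    m = mu * m + grad(theta)   # "blend in" the previous direction
    theta = theta - alpha * m  # step along the smoothed direction

print(theta)                   # approaches the optimum at (0, 0)
```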
Momentum Demo
visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/
Gradient scale
Intuition: the sign of the gradient tells us which way to go along each dimension, but the magnitude is not so useful: it can be huge when far from the optimum
Even worse: the overall magnitude of the gradient can change drastically over the course of optimization, making learning rates hard to tune
Idea: “normalize” out the magnitude of the gradient along each dimension
Algorithm: RMSProp
$s_k \leftarrow \beta\, s_{k-1} + (1 - \beta)\, (\nabla_\theta L(\theta_k))^2$ (elementwise square: this is roughly the squared length of the gradient along each dimension)
$\theta_{k+1} \leftarrow \theta_k - \alpha\, \nabla_\theta L(\theta_k) / \sqrt{s_k}$ (each dimension is divided by its magnitude)
Algorithm: AdaGrad
AdaGrad: $s_k \leftarrow s_{k-1} + (\nabla_\theta L(\theta_k))^2$, $\theta_{k+1} \leftarrow \theta_k - \alpha\, \nabla_\theta L(\theta_k) / \sqrt{s_k}$ (accumulates without decay)
RMSProp: $s_k \leftarrow \beta\, s_{k-1} + (1 - \beta)\, (\nabla_\theta L(\theta_k))^2$, $\theta_{k+1} \leftarrow \theta_k - \alpha\, \nabla_\theta L(\theta_k) / \sqrt{s_k}$
How do AdaGrad and RMSProp compare?
AdaGrad has some appealing guarantees for convex problems
Learning rate effectively “decreases” over time, which is good for convex problems
But this only works if we find the optimum quickly before the rate decays too much
RMSProp tends to be much better for deep learning (and most non-convex problems)
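A minimal NumPy sketch contrasting the two accumulators on the same illustrative quadratic (the hyperparameters are arbitrary choices for the demo):

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative poorly conditioned quadratic loss
    return np.array([theta[0], 10.0 * theta[1]])

def run(use_rmsprop, steps=500, alpha=0.05, beta=0.9, eps=1e-8):
    theta = np.array([2.0, 1.0])
    s = np.zeros_like(theta)                     # per-dimension squared-magnitude estimate
    for _ in range(steps):
        g = grad(theta)
        if use_rmsprop:
            s = beta * s + (1 - beta) * g ** 2   # RMSProp: decaying average
        else:
            s = s + g ** 2                       # AdaGrad: running sum, keeps growing
        theta = theta - alpha * g / (np.sqrt(s) + eps)
    return theta

print("RMSProp:", run(use_rmsprop=True))
print("AdaGrad:", run(use_rmsprop=False))        # effective step shrinks as s accumulates
```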
Algorithm: Adam
Basic idea: combine momentum and RMSProp
first moment estimate (“momentum-like”): $m_k \leftarrow \beta_1 m_{k-1} + (1 - \beta_1)\, \nabla_\theta L(\theta_k)$
second moment estimate: $v_k \leftarrow \beta_2 v_{k-1} + (1 - \beta_2)\, (\nabla_\theta L(\theta_k))^2$
bias correction: $\hat{m}_k = m_k / (1 - \beta_1^k)$, $\hat{v}_k = v_k / (1 - \beta_2^k)$
why? $m$ and $v$ start at zero, so early on these values will be small, and this correction “blows them up” a bit for small $k$
update: $\theta_{k+1} \leftarrow \theta_k - \alpha\, \hat{m}_k / (\sqrt{\hat{v}_k} + \epsilon)$, where $\epsilon$ is a small number to prevent division by zero
good default settings (the usual Adam defaults): $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
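A minimal Adam sketch in NumPy following the updates above (the loss and step count are illustrative; $\beta_1$, $\beta_2$, $\epsilon$ are the usual defaults):

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative poorly conditioned quadratic loss
    return np.array([theta[0], 10.0 * theta[1]])

theta = np.array([2.0, 1.0])
m = np.zeros_like(theta)       # first moment estimate ("momentum-like")
v = np.zeros_like(theta)       # second moment estimate
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for k in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** k)     # bias correction: "blows up" small early values
    v_hat = v / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)                         # approaches the optimum at (0, 0)
```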
Stochastic optimization
Why is gradient descent expensive?
the loss $\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$ (and hence its gradient) requires summing over all datapoints in the dataset
ILSVRC (ImageNet), 2009: 1.5 million images
could simply use fewer samples, and still have a correct (unbiased) estimator
Stochastic gradient descent with minibatches
draw B datapoints at random from dataset of size N
$\theta \leftarrow \theta - \alpha\, \frac{1}{B} \sum_{i \in \text{minibatch}} \nabla_\theta \ell(\theta; x_i, y_i)$
each iteration samples a different minibatch
can also use momentum, ADAM, etc.
Stochastic gradient descent in practice:
sampling randomly is slow due to random memory access
instead, shuffle the dataset (like a deck of cards…) once, in advance
then just construct batches out of consecutive groups of B datapoints
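A minimal sketch of minibatch SGD with the shuffle-once trick, on a hypothetical linear regression problem (the data, model, and hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, alpha = 10_000, 64, 0.1

# hypothetical dataset: noisy linear regression targets
X = rng.normal(size=(N, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=N)

perm = rng.permutation(N)              # shuffle once, in advance (like a deck of cards)
X, y = X[perm], y[perm]

w = np.zeros(5)
for epoch in range(5):
    for start in range(0, N, B):       # consecutive groups of B datapoints
        xb, yb = X[start:start + B], y[start:start + B]
        g = 2.0 / len(xb) * xb.T @ (xb @ w - yb)   # minibatch gradient of mean squared error
        w = w - alpha * g

print(np.max(np.abs(w - true_w)))      # small: SGD recovers the true weights
```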
Learning rates
Low learning rates can result in convergence to worse values! This is a bit counter-intuitive.
[figure: loss vs. epoch for high, low, and good learning rates]
Decaying learning rates
[figure: AlexNet trained on ImageNet]
Learning rate decay schedules usually needed for best performance with SGD (+momentum)
Often not needed with ADAM
Opinions differ: some people think SGD + momentum is better than ADAM if you want the very best performance (but ADAM is easier to tune)
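One common choice is a step decay schedule; here is an illustrative sketch (the milestones and decay factor are arbitrary, not values from the lecture):

```python
def step_decay_lr(epoch, base_lr=0.1, decay_factor=0.1, milestones=(30, 60, 90)):
    """Learning rate for a given epoch under a simple step decay schedule."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor      # drop the rate at each milestone epoch
    return lr

# the rate drops by 10x at epochs 30, 60, and 90
print([step_decay_lr(e) for e in (0, 29, 30, 60, 90)])
```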
Tuning (stochastic) gradient descent
Hyperparameters:
batch size: larger batches = less noisy gradients, usually “safer” but more expensive
learning rate: best to use the biggest rate that still works, decay over time
momentum: 0.99 is good
ADAM parameters: keep the defaults (usually)
What to tune hyperparameters on?
Technically we want to tune this on the training loss, since it is a parameter of the optimization
Often tuned on validation loss
Relationship between stochastic gradient and regularization is complex: some people consider it to be a good regularizer!
(this suggests we should use validation loss)