Lecture 4: UC Berkeley CS182 Lecture Notes

Optimization

Designing, Visualizing and Understanding Deep Neural Networks

CS W182/282A
Instructor: Sergey Levine
UC Berkeley
How does gradient descent work?

The loss "landscape": the loss $\mathcal{L}(\theta)$ plotted as a function of the parameters $\theta$

Gradient descent: $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$, where $\alpha$ is some small constant called the "learning rate" or "step size"
Gradient descent

negative slope = go to the right
positive slope = go to the left

gradient: $\nabla_\theta \mathcal{L}(\theta) = \left[ \frac{\partial \mathcal{L}}{\partial \theta_1}, \frac{\partial \mathcal{L}}{\partial \theta_2}, \ldots \right]^T$

in general: for each dimension, go in the direction opposite the slope along that dimension, i.e., $\theta_i \leftarrow \theta_i - \alpha \frac{\partial \mathcal{L}}{\partial \theta_i}$, etc.
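A minimal sketch of this update in NumPy (the loss_grad function and the quadratic example are illustrative assumptions, not from the lecture):

import numpy as np

def gradient_descent(loss_grad, theta_init, alpha=0.1, num_steps=100):
    # Repeatedly step opposite the gradient: theta <- theta - alpha * grad
    theta = np.array(theta_init, dtype=float)
    for _ in range(num_steps):
        theta = theta - alpha * loss_grad(theta)
    return theta

# example: minimize the bowl L(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta_star = gradient_descent(lambda th: th, theta_init=[3.0, -2.0])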
Visualizing gradient descent

level set contours

visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/


Demo time!

visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/


What’s going on?

we don’t always move toward the optimum!

the steepest direction is not always best!


more on this later…
Logistic regression: the loss surface

This is a very nice loss surface: "all roads lead to Rome." Why?

The negative likelihood loss for logistic regression is guaranteed to be convex (this is not an obvious or trivial statement!)

Convexity: a function is convex if a line segment between any two points lies entirely "above" the graph

convex functions are "nice" in the sense that simple algorithms like gradient descent have strong guarantees

this doesn't mean that gradient descent works well for all convex functions!

Is our loss actually this nice?
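For reference, the line-segment condition described above can be written out as the standard definition of convexity (a restatement, not taken from the slides):

$f(\lambda x + (1 - \lambda) y) \;\le\; \lambda f(x) + (1 - \lambda) f(y) \qquad \text{for all } x, y \text{ and all } \lambda \in [0, 1]$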
The loss surface…
…of a neural network

pretty hard to visualize, because neural networks have very large numbers of parameters (the plots show the loss over just two of them, e.g., one weight in layer 1 and one in layer 2)

but let's give it a try!

Oh no… the dragon of local optima
the monster of the plateau

…though some networks are better!

The geography of a loss landscape

the local optimum
the plateau
the saddle point
Local optima
the most obvious issue with non-convex loss landscapes
one of the big reasons people used to worry about neural networks!
very scary in principle, since gradient descent could converge to a
solution that is arbitrarily worse than the global optimum!

a bit surprisingly, this becomes less of an issue as the number of parameters increases!

for big networks, local optima exist, but tend to be not much worse than global optima

Choromanska, Henaff, Mathieu, Ben Arous, LeCun. The Loss Surface of Multilayer Networks.
Plateaus

Can’t just choose tiny learning rates to prevent oscillation!

Need learning rates to be large enough not to get stuck in a plateau

We’ll learn about momentum, which really helps with this


Saddle points

the gradient here is very small, so it takes a long time to get out of saddle points

this seems like a very special structure, does it really happen that often?
Yes! in fact, most critical points in neural net loss landscapes are saddle points

Critical points: $\nabla_\theta \mathcal{L}(\theta) = 0$; a critical point can be a (local) minimum, a (local) maximum, or a saddle point

In higher dimensions: look at the Hessian matrix, $\mathbf{H}_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}$

a critical point is only a maximum or minimum if all eigenvalues of the Hessian (its diagonal entries, once diagonalized) are negative or all positive!

how often is that the case?
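A small numerical sketch of this eigenvalue test (not from the lecture; the example Hessian is made up):

import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    # Classify a critical point by the signs of the Hessian's eigenvalues.
    eigvals = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    return "saddle point (or degenerate)"

# example: L(x, y) = x^2 - y^2 has a saddle at the origin
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))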
Which way do we go?

we don’t always move toward the optimum!

the steepest direction is not always best!


more on this later…
Improvement directions

A better direction…?

Newton's method: compute the gradient $\nabla_\theta \mathcal{L}(\theta)$ and the Hessian $\mathbf{H}$, then update $\theta \leftarrow \theta - \mathbf{H}^{-1} \nabla_\theta \mathcal{L}(\theta)$
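A toy sketch of a single Newton step on a quadratic (an assumed example, not the lecture's code):

import numpy as np

# toy quadratic loss: L(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])       # the Hessian of this loss is A
b = np.array([1.0, -2.0])

theta = np.zeros(2)
grad = A @ theta - b                          # gradient at theta
theta = theta - np.linalg.solve(A, grad)      # Newton step: theta - H^{-1} grad

# for a quadratic, one Newton step lands exactly on the minimum (A theta = b)
print(theta, np.allclose(A @ theta, b))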
Tractable acceleration

Why is Newton's method not a viable way to improve neural network optimization?

for a network with $n$ parameters, the Hessian has $n \times n$ entries, and forming and inverting it is far too expensive if using the naïve approach, though fancy methods can be much faster if they avoid forming the Hessian explicitly

because of this, we would really prefer methods that don't require second derivatives, but somehow "accelerate" gradient descent instead
Momentum

averaging together successive gradients seems to yield a much better direction!

Intuition: if successive gradient steps point in different directions, we should cancel off the directions that disagree

if successive gradient steps point in similar directions, we should go faster in that direction
Momentum

update: $g_k = \nabla_\theta \mathcal{L}(\theta_k) + \mu g_{k-1}$, then $\theta_{k+1} = \theta_k - \alpha g_k$, where the $\mu g_{k-1}$ term "blends in" the previous direction

this is a very simple update rule

in practice, it brings some of the benefits of Newton's method, at virtually no cost

this kind of momentum method has few guarantees

a closely related idea is "Nesterov accelerated gradient," which does carry very appealing guarantees (in practice we usually just use momentum)
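A minimal sketch of gradient descent with this kind of momentum (loss_grad and the quadratic example are illustrative assumptions):

import numpy as np

def momentum_gd(loss_grad, theta_init, alpha=0.01, mu=0.9, num_steps=200):
    # Keep a running direction g and blend each new gradient into it.
    theta = np.array(theta_init, dtype=float)
    g = np.zeros_like(theta)
    for _ in range(num_steps):
        g = loss_grad(theta) + mu * g          # blend in previous direction
        theta = theta - alpha * g
    return theta

# example: an ill-conditioned quadratic, where momentum damps the oscillation
A = np.diag([10.0, 1.0])
theta_star = momentum_gd(lambda th: A @ th, theta_init=[1.0, 1.0])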
Momentum Demo

visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/


Gradient scale

Intuition: the sign of the gradient tells us which way to go along each dimension, but its magnitude is not so useful: it can be huge when far from the optimum

Even worse: overall magnitude of the gradient can change drastically over the
course of optimization, making learning rates hard to tune

Idea: “normalize” out the magnitude of the gradient along each dimension
Algorithm: RMSProp

$s_k = \beta s_{k-1} + (1 - \beta) \left( \nabla_\theta \mathcal{L}(\theta_k) \right)^2$ (elementwise; this is roughly the squared length of the gradient along each dimension)

$\theta_{k+1} = \theta_k - \alpha \frac{\nabla_\theta \mathcal{L}(\theta_k)}{\sqrt{s_k}}$ (each dimension is divided by its magnitude)
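A rough sketch of this update in NumPy (loss_grad and the example are assumptions; eps is added to avoid division by zero):

import numpy as np

def rmsprop(loss_grad, theta_init, alpha=0.01, beta=0.9, eps=1e-8, num_steps=1000):
    # Scale each dimension's step by a running estimate of its gradient magnitude.
    theta = np.array(theta_init, dtype=float)
    s = np.zeros_like(theta)                   # running average of squared gradients
    for _ in range(num_steps):
        grad = loss_grad(theta)
        s = beta * s + (1 - beta) * grad ** 2  # per-dimension squared magnitude
        theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta

# example: a quadratic with very different curvature along each dimension
theta_star = rmsprop(lambda th: np.diag([10.0, 0.1]) @ th, theta_init=[1.0, 1.0])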


Algorithm: AdaGrad

AdaGrad: $s_k = s_{k-1} + \left( \nabla_\theta \mathcal{L}(\theta_k) \right)^2$, $\quad \theta_{k+1} = \theta_k - \alpha \frac{\nabla_\theta \mathcal{L}(\theta_k)}{\sqrt{s_k}}$

RMSProp: $s_k = \beta s_{k-1} + (1 - \beta) \left( \nabla_\theta \mathcal{L}(\theta_k) \right)^2$, $\quad \theta_{k+1} = \theta_k - \alpha \frac{\nabla_\theta \mathcal{L}(\theta_k)}{\sqrt{s_k}}$

How do AdaGrad and RMSProp compare?


AdaGrad has some appealing guarantees for convex problems
Learning rate effectively “decreases” over time, which is good for convex problems
But this only works if we find the optimum quickly before the rate decays too much

RMSProp tends to be much better for deep learning (and most non-convex problems)
Algorithm: Adam
Basic idea: combine momentum and RMSProp

first moment estimate ("momentum-like"): $m_k = \beta_1 m_{k-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}(\theta_k)$

second moment estimate: $v_k = \beta_2 v_{k-1} + (1 - \beta_2) \left( \nabla_\theta \mathcal{L}(\theta_k) \right)^2$

bias correction: $\hat{m}_k = m_k / (1 - \beta_1^k)$, $\hat{v}_k = v_k / (1 - \beta_2^k)$. Why? $m_k$ and $v_k$ start at zero, so early on these values will be small, and this correction "blows them up" a bit for small $k$

update: $\theta_{k+1} = \theta_k - \alpha \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}$, where $\epsilon$ is a small number to prevent division by zero

good default settings: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
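A compact sketch of this combined update (loss_grad and the example are placeholders; defaults follow the usual Adam paper values):

import numpy as np

def adam(loss_grad, theta_init, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=5000):
    # Momentum-like first moment + RMSProp-like second moment, with bias correction.
    theta = np.array(theta_init, dtype=float)
    m = np.zeros_like(theta)                   # first moment estimate
    v = np.zeros_like(theta)                   # second moment estimate
    for k in range(1, num_steps + 1):
        grad = loss_grad(theta)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** k)           # bias correction for small k
        v_hat = v / (1 - beta2 ** k)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta_star = adam(lambda th: np.diag([10.0, 0.1]) @ th, theta_init=[1.0, 1.0])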


Stochastic optimization
Why is gradient descent expensive?

computing the gradient requires summing over all datapoints in the dataset (e.g., ILSVRC (ImageNet), 2009: 1.5 million images)

but we could simply use fewer samples, and still have a correct (unbiased) estimator


Stochastic gradient descent with minibatches

draw B datapoints at random from the dataset of size N, and estimate the gradient using only those B points

each iteration samples a different minibatch

can also use momentum, ADAM, etc.


Stochastic gradient descent in practice:
sampling randomly is slow due to random memory access
instead, shuffle the dataset (like a deck of cards…) once, in advance
then just construct batches out of consecutive groups of B datapoints
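A sketch of the shuffled-minibatch loop described above (the dataset X, y, the grad_batch function, and all settings are illustrative placeholders):

import numpy as np

def sgd_shuffled(grad_batch, X, y, theta_init, alpha=0.1, B=32, num_epochs=10, seed=0):
    # Shuffle the dataset once in advance, then take consecutive batches of size B.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))                # shuffle once, like a deck of cards
    X, y = X[perm], y[perm]
    theta = np.array(theta_init, dtype=float)
    for _ in range(num_epochs):
        for start in range(0, len(X), B):
            xb, yb = X[start:start + B], y[start:start + B]   # consecutive group of B datapoints
            theta = theta - alpha * grad_batch(theta, xb, yb)
    return theta

# example: least-squares linear regression on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad_batch = lambda th, xb, yb: xb.T @ (xb @ th - yb) / len(xb)   # batch-averaged gradient
theta_star = sgd_shuffled(grad_batch, X, y, theta_init=np.zeros(3))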
Learning rates

Low learning rates can result in convergence to worse values! This is a bit counter-intuitive

[figure: training loss vs. epoch for a high learning rate, a low learning rate, and a good learning rate; 1 epoch = one full pass through the dataset]
Decaying learning rates

[figure: AlexNet trained on ImageNet]

Learning rate decay schedules usually needed for best performance with SGD (+momentum)

Often not needed with ADAM

Opinions differ; some people think SGD + momentum is better than ADAM if you want the very best performance (but ADAM is easier to tune)
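One simple form such a schedule can take is a step decay; the function below is a generic illustration (the epochs and factors are made-up values, not the AlexNet schedule):

def step_decay(initial_lr, epoch, drop_every=30, factor=0.1):
    # Multiply the learning rate by `factor` every `drop_every` epochs.
    return initial_lr * (factor ** (epoch // drop_every))

# e.g., 0.01 for epochs 0-29, 0.001 for epochs 30-59, 0.0001 afterwards
learning_rates = [step_decay(0.01, e) for e in range(90)]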
Tuning (stochastic) gradient descent
Hyperparameters:

batch size: larger batches = less noisy gradients, usually "safer" but more expensive

learning rate: best to use the biggest rate that still works, decay over time

momentum: 0.99 is good

Adam parameters: keep the defaults (usually)

What to tune hyperparameters on?

Technically we want to tune these on the training loss, since they are parameters of the optimization
Often tuned on validation loss

The relationship between stochastic gradient descent and regularization is complex; some people consider it to be a good regularizer! (this suggests we should use validation loss)
