Optimization
Designing, Visualizing and Understanding Deep Neural Networks
CS W182/282A
Instructor: Sergey Levine
UC Berkeley
How does gradient descent work?
The loss “landscape”
gradient descent update: $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$, where $\alpha$ is some small constant called the “learning rate” or “step size”
Gradient descent
in 1D: negative slope = go to the right, positive slope = go to the left
gradient: $\nabla_\theta L(\theta) = \left[ \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots \right]^T$
in general: for each dimension, go in the direction opposite the slope along that dimension
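To make the update concrete, here is a minimal gradient descent sketch in NumPy on a hypothetical quadratic loss (the loss function, learning rate, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def loss(theta):
    # hypothetical quadratic loss with different curvature in each dimension
    return 0.5 * theta[0] ** 2 + 5.0 * theta[1] ** 2

def grad(theta):
    # gradient of the loss above, computed analytically
    return np.array([theta[0], 10.0 * theta[1]])

theta = np.array([2.0, 1.0])   # initial parameters
alpha = 0.05                   # learning rate ("step size")

for _ in range(200):
    theta = theta - alpha * grad(theta)   # move opposite the slope in every dimension

print(theta, loss(theta))      # theta ends up near the optimum at (0, 0)
```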
Visualizing gradient descent
level set contours
visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/
Demo time!
visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/
What’s going on?
we don’t always move toward the optimum!
the steepest direction is not always best!
more on this later…
Logistic regression:
The loss surface
Negative log-likelihood loss for logistic regression is guaranteed to be convex
(this is not an obvious or trivial statement!)
“all roads lead to Rome”: gradient descent heads toward the optimum from any starting point
Convexity: a function is convex if a line segment between any two points lies entirely “above” the graph
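Written out (a standard definition, stated here for completeness): $f$ is convex if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and all $\lambda \in [0, 1]$.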
This is a very nice loss surface. Why? Well, convex functions are “nice” in the sense that simple algorithms like gradient descent have strong guarantees.
(This doesn’t mean that gradient descent works well for all convex functions!)
Is our loss actually this nice?
The loss surface…
…of a neural network
[figure: layer 1, layer 2]
pretty hard to visualize, because neural networks have very large numbers of parameters, but let’s give it a try!
…though some networks are better!
Oh no… the dragon of local optima and the monster of the plateau!
The geography of a loss landscape
the local optimum, the plateau, the saddle point
Local optima
the most obvious issue with non-convex loss landscapes
one of the big reasons people used to worry about neural networks!
very scary in principle, since gradient descent could converge to a solution that is arbitrarily worse than the global optimum!
a bit surprisingly, this becomes less of an issue as the number of parameters increases!
for big networks, local optima exist, but tend to be not much worse than global optima
Choromanska, Henaff, Mathieu, Ben Arous, LeCun.
The Loss Surface of Multilayer Networks.
Plateaus
Can’t just choose tiny learning rates to prevent oscillation!
Need learning rates to be large enough not to get stuck in a plateau
We’ll learn about momentum, which really helps with this
Saddle points
the gradient here is very small
it takes a long time to get out of saddle points
this seems like a very special structure, does it really happen that often?
Yes! In fact, most critical points in neural net loss landscapes are saddle points
Saddle points
Critical points: $\nabla_\theta L(\theta) = 0$; a critical point can be a (local) minimum, a (local) maximum, or a saddle point
In higher dimensions, look at the Hessian matrix: $\mathbf{H}_{ij} = \frac{\partial^2 L(\theta)}{\partial \theta_i \partial \theta_j}$
only a maximum or minimum if the eigenvalues (the diagonal entries after diagonalizing) are all negative or all positive!
how often is that the case?
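As a small illustration (an example chosen here, not from the lecture): for $L(\theta) = \theta_1^2 - \theta_2^2$ the origin is a critical point, but the Hessian has mixed-sign eigenvalues, so it is a saddle point rather than a minimum or maximum:

```python
import numpy as np

# L(theta) = theta_1^2 - theta_2^2: the gradient [2*theta_1, -2*theta_2] vanishes at the origin
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])     # constant Hessian of this quadratic

eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)                    # [-2.  2.] -> mixed signs, so the origin is a saddle point
```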
Which way do we go?
we don’t always move toward the optimum!
the steepest direction is not always best!
more on this later…
Improvement directions
A better direction…
Newton’s method: $\theta \leftarrow \theta - \alpha\, \mathbf{H}^{-1} \nabla_\theta L(\theta)$
$\mathbf{H} = \nabla_\theta^2 L(\theta)$ is the Hessian, $\nabla_\theta L(\theta)$ is the gradient
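To illustrate why this direction is better, here is a sketch of a single Newton step on a hypothetical poorly conditioned quadratic (the matrix and starting point are illustrative):

```python
import numpy as np

# illustrative quadratic loss L(theta) = 0.5 * theta^T A theta, so A is the Hessian
A = np.array([[1.0, 0.0],
              [0.0, 10.0]])                  # curvature differs by 10x across dimensions
theta = np.array([2.0, 1.0])

gradient = A @ theta                         # gradient of the quadratic at theta
newton_step = np.linalg.solve(A, gradient)   # H^{-1} times the gradient (no explicit inverse)
theta = theta - newton_step                  # one Newton step with alpha = 1

print(theta)                                 # lands exactly at the optimum [0, 0]
```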
Tractable acceleration
Why is Newton’s method not a viable way to improve neural network optimization?
the Hessian has $n \times n$ entries for $n$ parameters: storing it takes $O(n^2)$ memory and inverting it takes $O(n^3)$ time if using the naïve approach, though fancy methods can be much faster if they avoid forming the Hessian explicitly
because of this, we would really prefer methods that don’t require second derivatives, but somehow “accelerate” gradient descent instead
Momentum
averaging together successive gradients seems to yield a much better direction!
Intuition: if successive gradient steps point in different directions, we should cancel off the directions that disagree;
if successive gradient steps point in similar directions, we should go faster in that direction
Momentum
momentum update: $m_k \leftarrow \mu\, m_{k-1} + \nabla_\theta L(\theta_k)$ (“blend in” the previous direction), then step $\theta_{k+1} \leftarrow \theta_k - \alpha\, m_k$
this is a very simple update rule
in practice, it brings some of the benefits of Newton’s method, at virtually no cost
this kind of momentum method has few guarantees
a closely related idea is “Nesterov accelerated gradient,” which does carry very appealing guarantees (in practice we usually just use momentum)
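A minimal sketch of gradient descent with momentum in NumPy, following the update above (the loss, $\mu$, and $\alpha$ are illustrative choices):

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative poorly conditioned quadratic loss
    return np.array([theta[0], 10.0 * theta[1]])

theta = np.array([2.0, 1.0])
m = np.zeros_like(theta)       # running momentum direction
mu, alpha = 0.9, 0.02          # momentum coefficient and learning rate

for _ in range(300):
    m = mu * m + grad(theta)   # "blend in" the previous direction
    theta = theta - alpha * m  # step along the smoothed direction

print(theta)                   # approaches the optimum at (0, 0)
```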
Momentum Demo
visualizations based on Gabriel Goh’s distill.pub article: https://distill.pub/2017/momentum/
Gradient scale
Intuition: the sign of the gradient tells us which way to go along each dimension, but the magnitude is not so useful: it can be huge when far from the optimum
Even worse: the overall magnitude of the gradient can change drastically over the course of optimization, making learning rates hard to tune
Idea: “normalize” out the magnitude of the gradient along each dimension
Algorithm: RMSProp
$s_k \leftarrow \beta\, s_{k-1} + (1 - \beta)\, (\nabla_\theta L(\theta_k))^2$ (elementwise square: this is roughly the squared length of the gradient along each dimension)
$\theta_{k+1} \leftarrow \theta_k - \alpha\, \nabla_\theta L(\theta_k) / \sqrt{s_k}$ (each dimension is divided by its magnitude)
Algorithm: AdaGrad
AdaGrad: $s_k \leftarrow s_{k-1} + (\nabla_\theta L(\theta_k))^2$, $\theta_{k+1} \leftarrow \theta_k - \alpha\, \nabla_\theta L(\theta_k) / \sqrt{s_k}$ (accumulates without decay)
RMSProp: $s_k \leftarrow \beta\, s_{k-1} + (1 - \beta)\, (\nabla_\theta L(\theta_k))^2$, $\theta_{k+1} \leftarrow \theta_k - \alpha\, \nabla_\theta L(\theta_k) / \sqrt{s_k}$
How do AdaGrad and RMSProp compare?
AdaGrad has some appealing guarantees for convex problems
Learning rate effectively “decreases” over time, which is good for convex problems
But this only works if we find the optimum quickly before the rate decays too much
RMSProp tends to be much better for deep learning (and most non-convex problems)
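A minimal NumPy sketch contrasting the two accumulators on the same illustrative quadratic (the hyperparameters are arbitrary choices for the demo):

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative poorly conditioned quadratic loss
    return np.array([theta[0], 10.0 * theta[1]])

def run(use_rmsprop, steps=500, alpha=0.05, beta=0.9, eps=1e-8):
    theta = np.array([2.0, 1.0])
    s = np.zeros_like(theta)                     # per-dimension squared-magnitude estimate
    for _ in range(steps):
        g = grad(theta)
        if use_rmsprop:
            s = beta * s + (1 - beta) * g ** 2   # RMSProp: decaying average
        else:
            s = s + g ** 2                       # AdaGrad: running sum, keeps growing
        theta = theta - alpha * g / (np.sqrt(s) + eps)
    return theta

print("RMSProp:", run(use_rmsprop=True))
print("AdaGrad:", run(use_rmsprop=False))        # effective step shrinks as s accumulates
```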
Algorithm: Adam
Basic idea: combine momentum and RMSProp
first moment estimate (“momentum-like”): $m_k \leftarrow \beta_1 m_{k-1} + (1 - \beta_1)\, \nabla_\theta L(\theta_k)$
second moment estimate: $v_k \leftarrow \beta_2 v_{k-1} + (1 - \beta_2)\, (\nabla_\theta L(\theta_k))^2$
bias correction: $\hat{m}_k = m_k / (1 - \beta_1^k)$, $\hat{v}_k = v_k / (1 - \beta_2^k)$
why? $m$ and $v$ start at zero, so early on these values will be small, and this correction “blows them up” a bit for small $k$
update: $\theta_{k+1} \leftarrow \theta_k - \alpha\, \hat{m}_k / (\sqrt{\hat{v}_k} + \epsilon)$, where $\epsilon$ is a small number to prevent division by zero
good default settings (the usual Adam defaults): $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
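A minimal Adam sketch in NumPy following the updates above (the loss and step count are illustrative; $\beta_1$, $\beta_2$, $\epsilon$ are the usual defaults):

```python
import numpy as np

def grad(theta):
    # gradient of an illustrative poorly conditioned quadratic loss
    return np.array([theta[0], 10.0 * theta[1]])

theta = np.array([2.0, 1.0])
m = np.zeros_like(theta)       # first moment estimate ("momentum-like")
v = np.zeros_like(theta)       # second moment estimate
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for k in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** k)     # bias correction: "blows up" small early values
    v_hat = v / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)                         # approaches the optimum at (0, 0)
```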
Stochastic optimization
Why is gradient descent expensive?
the loss $\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$ (and hence its gradient) requires summing over all datapoints in the dataset
ILSVRC (ImageNet), 2009: 1.5 million images
could simply use fewer samples, and still have a correct (unbiased) estimator
Stochastic gradient descent with minibatches
draw B datapoints at random from dataset of size N
$\theta \leftarrow \theta - \alpha\, \frac{1}{B} \sum_{i \in \text{minibatch}} \nabla_\theta \ell(\theta; x_i, y_i)$
each iteration samples a different minibatch
can also use momentum, ADAM, etc.
Stochastic gradient descent in practice:
sampling randomly is slow due to random memory access
instead, shuffle the dataset (like a deck of cards…) once, in advance
then just construct batches out of consecutive groups of B datapoints
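A minimal sketch of minibatch SGD with the shuffle-once trick, on a hypothetical linear regression problem (the data, model, and hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, alpha = 10_000, 64, 0.1

# hypothetical dataset: noisy linear regression targets
X = rng.normal(size=(N, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=N)

perm = rng.permutation(N)              # shuffle once, in advance (like a deck of cards)
X, y = X[perm], y[perm]

w = np.zeros(5)
for epoch in range(5):
    for start in range(0, N, B):       # consecutive groups of B datapoints
        xb, yb = X[start:start + B], y[start:start + B]
        g = 2.0 / len(xb) * xb.T @ (xb @ w - yb)   # minibatch gradient of mean squared error
        w = w - alpha * g

print(np.max(np.abs(w - true_w)))      # small: SGD recovers the true weights
```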
Learning rates
Low learning rates can result in convergence to worse values! This is a bit counter-intuitive.
[figure: loss vs. epoch for high, low, and good learning rates]
Decaying learning rates
[figure: AlexNet trained on ImageNet]
Learning rate decay schedules usually needed for best performance with SGD (+momentum)
Often not needed with ADAM
Opinions differ: some people think SGD + momentum is better than ADAM if you want the very best performance (but ADAM is easier to tune)
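One common choice is a step decay schedule; here is an illustrative sketch (the milestones and decay factor are arbitrary, not values from the lecture):

```python
def step_decay_lr(epoch, base_lr=0.1, decay_factor=0.1, milestones=(30, 60, 90)):
    """Learning rate for a given epoch under a simple step decay schedule."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay_factor      # drop the rate at each milestone epoch
    return lr

# the rate drops by 10x at epochs 30, 60, and 90
print([step_decay_lr(e) for e in (0, 29, 30, 60, 90)])
```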
Tuning (stochastic) gradient descent
Hyperparameters:
batch size: larger batches = less noisy gradients, usually “safer” but more expensive
learning rate: best to use the biggest rate that still works, decay over time
momentum: 0.99 is good
ADAM parameters: keep the defaults (usually)
What to tune hyperparameters on?
Technically we want to tune this on the training loss, since it is a parameter of the optimization
Often tuned on validation loss
Relationship between stochastic gradient and regularization is complex: some people consider it to be a good regularizer!
(this suggests we should use validation loss)