10-425/625: Introduction to Convex Optimization (Fall 2023)
Lecture 7: Gradient Descent
Instructor: Matt Gormley                                        September 18, 2023
Note: These notes were originally written by Siva Balakrishnan for 10-725 Spring 2023 (original version: here) and were edited and adapted for 10-425/625.
7.1 Gradient Descent
For the next couple of lectures we'll focus on a basic unconstrained optimization problem:

min_{x ∈ R^d} f(x).
For most of today we'll also assume that f is differentiable everywhere. A classical method to solve such optimization problems is gradient descent, i.e. we initialize at some guess x_0 and execute iterations of the form
x_{t+1} = x_t − η∇f(x_t),
for some choice of the step-size η > 0, continuing until we reach some stopping condition.
Algorithm 1 Gradient Descent
1: Choose initial point x_0 ∈ R^d
2: for t = 0, 1, 2, . . . , T − 1 do
3:     Compute the gradient g = ∇f(x_t) ∈ R^d
4:     Update the point: x_{t+1} = x_t − η_t g
5:     Stop when ∥∇f(x_t)∥_2^2 < ϵ for some small ϵ > 0
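To make the pseudocode concrete, here is a minimal NumPy sketch of Algorithm 1 (not part of the original notes). The quadratic used in the usage example, and the constant step-size schedule, are illustrative choices only.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, T=1000, eps=1e-8):
    """Run gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        g = grad_f(x)                      # g = grad f(x_t)
        if np.dot(g, g) < eps:             # stop when ||grad f(x_t)||_2^2 < eps
            break
        x = x - eta * g                    # x_{t+1} = x_t - eta * g
    return x

# Example usage on an illustrative quadratic f(x) = 0.5 * x^T A x (minimum at 0).
A = np.array([[10.0, 0.0], [0.0, 1.0]])
grad_f = lambda x: A @ x
x_star = gradient_descent(grad_f, x0=[20.0, 20.0], eta=0.05)
print(x_star)  # close to [0, 0]
```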
7.1.1 Motivation #0: Moving to the Nearest Valley
Gradient descent is a local optimization algorithm, which means that it converges to a nearby local minimum. Since the gradient ∇f(x) points in the direction of steepest ascent, we take steps in the opposite direction −∇f(x) and gradually move towards such a local minimum.
For a strictly convex function where the minimizer exists and is unique, gradient descent will move towards the same local minimum (which is also the global minimum) regardless of where it begins. Figure 7.1 shows this case.
Figure 7.1: Gradient descent on a convex function with random initializations
For a nonconvex function, our choice of the initial point and step size will
determine which local minimum (or saddle point) we arrive at — and there
may be many, as in the example in Figure 7.2.
Figure 7.2: Gradient descent on a nonconvex function with random initializations
7.1.2 Motivation #1: Descent Directions
There are many ways to motivate this algorithm. One is to notice that if we were at a point x and moved in a direction v with step-size η > 0, then by convexity
f(x + ηv) ≥ f(x) + ηv^T ∇f(x).
So at the very least we'd like to ensure that the second term is negative, i.e. v^T ∇f(x) ≤ 0 (otherwise we're moving to a strictly worse point). Such directions (which make a larger than 90-degree angle with the gradient) are typically called descent directions (for f at x).
So how do we choose v s.t. v^T ∇f(x) ≤ 0?
It should be clear that the negative gradient v = −∇f(x) always gives us a descent direction (and in some sense gives us one for which the term v^T ∇f(x) is most negative, amongst vectors v with a given norm). Why? Recall that for any vector w, ∥w∥_2^2 = w^T w ≥ 0. So v^T ∇f(x) = −∥∇f(x)∥_2^2 ≤ 0.
At first glance it might seem remarkable that gradient descent works at all.
Note that the direction of steepest descent is not necessarily pointing towards
the minimum of the function. And yet if we continue taking little steps in
this direction, we will show that we eventually converge to a local minimum.
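As a quick numerical sanity check (again, not from the lecture), the sketch below picks an arbitrary quadratic and point, and confirms that v = −∇f(x) satisfies v^T ∇f(x) ≤ 0 and that a small step in that direction decreases f.

```python
import numpy as np

# Illustrative function: f(x) = 0.5 * x^T A x + b^T x (A symmetric positive definite).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x + b @ x
grad_f = lambda x: A @ x + b

x = np.array([2.0, -3.0])          # arbitrary point
v = -grad_f(x)                     # candidate descent direction

print(v @ grad_f(x))               # equals -||grad f(x)||^2 <= 0
eta = 1e-3
print(f(x + eta * v) < f(x))       # True: a small step along v decreases f
```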
7.1.3 Motivation #2: Gradient Descent as Minimizing the Local Linear Approximation
A more interesting way to motivate GD (which will also be subsequently useful to motivate mirror descent, the proximal method and Newton's method) is to consider minimizing a linear approximation to our function (locally).
Constrained Version Suppose we approximate our true function f by the linear objective below. Then we optimize the linear objective subject to the constraint that our solution y is not too far away from our current iterate x_t.
x_{t+1} = arg min_{y ∈ R^d} f(x_t) + ∇f(x_t)^T (y − x_t)   s.t.   (1/2)∥y − x_t∥_2^2 ≤ ϵ.
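A quick aside (not spelled out in the notes): because the objective above is linear in y, its minimum over the ball {y : (1/2)∥y − x_t∥_2^2 ≤ ϵ} is attained on the boundary, in the direction opposite the gradient. Assuming ∇f(x_t) ≠ 0, the solution is
x_{t+1} = x_t − sqrt(2ϵ) ∇f(x_t)/∥∇f(x_t)∥_2,
so the constrained version also moves along −∇f(x_t); only the step length is determined differently.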
Unconstrained Version We could instead minimize an unconstrained version of the above problem, where we use a soft constraint. The result is a local quadratic approximation of the function. A picture will be helpful. With a picture in mind, we can view GD as solving the following local minimization problem:
x_{t+1} = arg min_{y ∈ R^d} f(x_t) + ∇f(x_t)^T (y − x_t) + (1/(2η))∥y − x_t∥_2^2,
where the last (quadratic) term behaves as a regularizer to ensure that (for small η) our update remains close to our current iterate x_t. This local optimization problem has a closed form solution: setting the derivative with respect to y to 0 gives ∇f(x_t) + (1/η)(y − x_t) = 0, and this precisely gives us our familiar GD update:
x_{t+1} = x_t − η∇f(x_t).
An example of this local minimization problem is shown in Figure 7.3, where the blue dot is our current iterate x_t and the red dot is x_{t+1}.
Figure 7.3: Local quadratic approximation of a function
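To connect the two views, here is a small sketch (not from the notes) that minimizes the local model above numerically and checks that the minimizer matches the closed-form GD update x_t − η∇f(x_t). The test function, point, and step-size are arbitrary placeholders.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative smooth function and its gradient.
f = lambda x: 0.5 * (10 * x[0]**2 + x[1]**2)
grad_f = lambda x: np.array([10 * x[0], x[1]])

x_t = np.array([2.0, -1.0])
eta = 0.05

# Local model around x_t: linear approximation plus a proximity penalty.
model = lambda y: f(x_t) + grad_f(x_t) @ (y - x_t) + np.sum((y - x_t)**2) / (2 * eta)

y_star = minimize(model, x_t).x           # numerical minimizer of the local model
gd_step = x_t - eta * grad_f(x_t)         # closed-form GD update

print(np.allclose(y_star, gd_step, atol=1e-4))  # True (up to solver tolerance)
```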
7.2 Choosing the Step-Size
In practice, the most important choice to be made is that of the step-size.
We’ll see various theoretical rules/schedules that one might follow based on
what we know about the objective function. Here are some natural possibilities:
1. Fixed Step-Size: Here we simply select a fixed step-size η and run
the algorithm with that fixed step-size. An immediate problem that
you will encounter (in practice) even for very benign problems is that
if you select the step-size too large then GD can diverge, and if you
select it too small it might take a very long time to converge.
You will find pictures of this in the BV textbook, but here is a typical
analytical example to keep in mind.
(a) Suppose we have f(x) = x^2/2, we initialize at x_0 = 1, and we take our step size to be 3 (too large). Then the iterates will be x_t = −2, 4, −8, . . . (i.e. GD will diverge).
(b) For the same function and initialization, if we take our step size to be 0.00001, then GD would take on the order of 10^5 steps to converge.
(c) On the other hand, if we picked the "correct" step-size of 1, we would converge in 1 step.
Figure 7.4 depicts three runs of gradient descent, each with a different
fixed step size η. When the step size is too small, even after 100 steps
it has not reached the minimum. When the step size is too large, after
8 steps it has already begun to diverge. When the step size is “just
right”, it converges after 40 steps.
Figure 7.4: Examples of different step sizes on the function f(x) = (10x_1^2 + x_2^2)/2. One is too small (left), one too large (middle), and one “just right” (right).
In theory, we'd like to understand this issue better (i.e. what properties of a function make certain step-sizes “too big”, “too small”, or “correct”). The correct step-size in many cases may depend on properties of the function that we don't know. In practice, it will often be useful to have at our disposal a few different ways to tune the step-size (and some understanding of how we might diagnose issues with the step-size choice). A short code sketch after this list illustrates these regimes on the function from Figure 7.4.
2. Exact Line-Search: Once we've committed to a direction (in GD this is the direction of the negative gradient), one might consider solving the following 1D optimization problem to determine the best step-size:
η_t = arg min_{η̃ ≥ 0} f(x_t − η̃∇f(x_t)).
It's often computationally cumbersome to solve this optimization problem exactly, so we resort to some approximation of this idea.
3. Backtracking Line-Search: The idea of backtracking line-search, very roughly, is to try an aggressive (large) step-size, and reduce it by some factor if it's too big.
Here is the algorithm: we pick two parameters α ∈ (0, 0.5) and β ∈ (0, 1). At iteration t: initialize η = 1,
(a) If f(x_t − η∇f(x_t)) > f(x_t) − αη∥∇f(x_t)∥_2^2, then reduce η := β × η and go back to step (a).
(b) Otherwise, take a step, i.e. set x_{t+1} = x_t − η∇f(x_t).
Often in practice, taking α = 0.3 and β = 0.5 works reasonably well. This method of backtracking line search uses the Armijo-Goldstein inequality to ensure that we achieve a sufficient decrease.
In Figure 7.5, gradient descent with backtracking line search is applied to the same function we examined before, and it roughly seems to get the right step sizes. (A small code sketch of this procedure appears after this list.)
Figure 7.5: Example of gradient descent with backtracking line search. In this example, it accepts 12 steps and computes 40 steps total.
You will develop some better intuition when we study the main descent lemma for GD, but roughly, if your function is nice (the Hessian term in a Taylor series is ignorable), you should expect to make about η∥∇f(x_t)∥_2^2 worth of progress in one step of GD if η is small enough. The backtracking line search simply says that if you're making up to an α factor of this amount of progress, you should be content and take a step.
Example 7.1 (Backtracking on a Linear Function). As an example, consider the case where f is a linear function. Then it's clear that taking a gradient step −η∇f(x_t) improves the function by exactly f(x_t) − f(x_{t+1}) = η∥∇f(x_t)∥_2^2. If our function is locally linear, then maybe making an improvement of αη∥∇f(x_t)∥_2^2 is good enough.
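Pulling the items of this list together, here is a minimal NumPy sketch (not part of the original notes) of gradient descent on the test function from Figures 7.4 and 7.5, f(x) = (10x_1^2 + x_2^2)/2, run either with a fixed step-size or with the backtracking line search above (α = 0.3, β = 0.5). The particular fixed step-sizes and iteration counts are illustrative choices.

```python
import numpy as np

# Test function from Figures 7.4 and 7.5: f(x) = (10*x1^2 + x2^2) / 2.
f = lambda x: 0.5 * (10 * x[0]**2 + x[1]**2)
grad_f = lambda x: np.array([10 * x[0], x[1]])

def backtracking_step_size(x, alpha=0.3, beta=0.5):
    """Backtracking line search along the negative gradient (Armijo condition)."""
    g = grad_f(x)
    eta = 1.0
    # Shrink eta until the sufficient-decrease condition holds.
    while f(x - eta * g) > f(x) - alpha * eta * g @ g:
        eta *= beta
    return eta

def run_gd(x0, step_size, T=100, eps=1e-10):
    """step_size is either a float (fixed) or a callable x -> eta (line search)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = grad_f(x)
        if g @ g < eps:
            break
        eta = step_size(x) if callable(step_size) else step_size
        x = x - eta * g
    return x

x0 = [20.0, 20.0]
print(run_gd(x0, 0.001))                    # too small: still far from the minimum after 100 steps
print(run_gd(x0, 0.25, T=8))                # too large: the x1 coordinate has already blown up
print(run_gd(x0, backtracking_step_size))   # backtracking: converges towards [0, 0]
```

The two fixed-step runs reproduce the “too small” and “too large” regimes from Figure 7.4, while the backtracking run adapts the step-size automatically and converges towards the minimum at the origin.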
Segue... Next lecture we will evaluate the performance of gradient descent on two example functions and consider our first convergence results.