10-425/625: Introduction to Convex Optimization (Fall 2023)
Lecture 7: Gradient Descent
Instructor: Matt Gormley                                        September 18, 2023
Note: These notes were originally written by Siva Balakrishnan for 10-725 Spring 2023 (original version: here) and were edited and adapted for 10-425/625.
7.1 Gradient Descent
For the next couple of lectures we'll focus on a basic unconstrained optimization problem:

min_{x ∈ R^d} f(x).
For most of today we'll also assume that f is differentiable everywhere. A classical method to solve such optimization problems is gradient descent, i.e. we initialize at some guess x_0 and execute iterations of the form
x_{t+1} = x_t − η∇f(x_t),
for some choice of the step-size η > 0, continuing until we reach some stopping condition.
Algorithm 1 Gradient Descent
1: Choose initial point x_0 ∈ R^d
2: for t = 0, 1, 2, . . . , T − 1 do
3:     Compute the gradient g = ∇f(x_t) ∈ R^d
4:     Update the point: x_{t+1} = x_t − η_t g
5:     Stop when ∥∇f(x_t)∥_2^2 < ϵ for some small ϵ > 0
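To make the pseudocode concrete, here is a minimal NumPy sketch of Algorithm 1 (not part of the original notes). The quadratic used in the usage example, and the constant step-size schedule, are illustrative choices only.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, T=1000, eps=1e-8):
    """Run gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        g = grad_f(x)                      # g = grad f(x_t)
        if np.dot(g, g) < eps:             # stop when ||grad f(x_t)||_2^2 < eps
            break
        x = x - eta * g                    # x_{t+1} = x_t - eta * g
    return x

# Example usage on an illustrative quadratic f(x) = 0.5 * x^T A x (minimum at 0).
A = np.array([[10.0, 0.0], [0.0, 1.0]])
grad_f = lambda x: A @ x
x_star = gradient_descent(grad_f, x0=[20.0, 20.0], eta=0.05)
print(x_star)  # close to [0, 0]
```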
7.1.1 Motivation #0: Moving to the Nearest Valley
Gradient descent is a local optimization algorithm, which means that it converges to a nearby local minimum. Since the gradient ∇f(x) points in the direction of steepest ascent, we take steps in the opposite direction −∇f(x) and gradually move towards such a local minimum.
For a strictly convex function where the minimizer exists and is unique, gradient descent will move towards the same local minimum (which is also the global minimum) regardless of where it begins. Figure 7.1 shows this case.
Figure 7.1: Gradient descent on a convex function with random initializations
For a nonconvex function, our choice of the initial point and step size will
determine which local minimum (or saddle point) we arrive at — and there
may be many, as in the example in Figure 7.2.
Figure 7.2: Gradient descent on a nonconvex function with random initializations
7.1.2 Motivation #1: Descent Directions
There are many ways to motivate this algorithm. One is to notice that if we were at a point x and moved in a direction v with step-size η > 0, then by convexity
f(x + ηv) ≥ f(x) + ηv^T ∇f(x).
So at the very least we'd like to ensure that the second term is negative, i.e. v^T ∇f(x) ≤ 0 (otherwise we're moving to a strictly worse point). Such directions (which make a larger than 90-degree angle with the gradient) are typically called descent directions (for f at x).
So how do we choose v s.t. v^T ∇f(x) ≤ 0?
It should be clear that the negative gradient v = −∇f(x) always gives us a descent direction (and in some sense gives us one for which the term v^T ∇f(x) is most negative, amongst vectors v with a given norm). Why? Recall that for any vector w, ∥w∥_2^2 = w^T w ≥ 0. So v^T ∇f(x) = −∥∇f(x)∥_2^2 ≤ 0.
At first glance it might seem remarkable that gradient descent works at all.
Note that the direction of steepest descent is not necessarily pointing towards
the minimum of the function. And yet if we continue taking little steps in
this direction, we will show that we eventually converge to a local minimum.
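As a quick numerical sanity check (again, not from the lecture), the sketch below picks an arbitrary quadratic and point, and confirms that v = −∇f(x) satisfies v^T ∇f(x) ≤ 0 and that a small step in that direction decreases f.

```python
import numpy as np

# Illustrative function: f(x) = 0.5 * x^T A x + b^T x (A symmetric positive definite).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ A @ x + b @ x
grad_f = lambda x: A @ x + b

x = np.array([2.0, -3.0])          # arbitrary point
v = -grad_f(x)                     # candidate descent direction

print(v @ grad_f(x))               # equals -||grad f(x)||^2 <= 0
eta = 1e-3
print(f(x + eta * v) < f(x))       # True: a small step along v decreases f
```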
7.1.3 Motivation #2: Gradient Descent as Minimizing the Local Linear Approximation
A more interesting way to motivate GD (which will also be subsequently useful to motivate mirror descent, the proximal method and Newton's method) is to consider minimizing a linear approximation to our function (locally).
Constrained Version Suppose we approximate our true function f by the linear objective below. Then we optimize the linear objective subject to the constraint that our solution y is not too far away from our current iterate x_t.
x_{t+1} = arg min_{y ∈ R^d} f(x_t) + ∇f(x_t)^T (y − x_t)   s.t.   (1/2)∥y − x_t∥_2^2 ≤ ϵ.
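A quick aside (not spelled out in the notes): because the objective above is linear in y, its minimum over the ball {y : (1/2)∥y − x_t∥_2^2 ≤ ϵ} is attained on the boundary, in the direction opposite the gradient. Assuming ∇f(x_t) ≠ 0, the solution is
x_{t+1} = x_t − sqrt(2ϵ) ∇f(x_t)/∥∇f(x_t)∥_2,
so the constrained version also moves along −∇f(x_t); only the step length is determined differently.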
Unconstrained Version We could instead minimize an unconstrained version of the above problem, where we use a soft constraint. The result is a local quadratic approximation of the function. A picture will be helpful. With a picture in mind, we can view GD as solving the following local minimization problem:
x_{t+1} = arg min_{y ∈ R^d} f(x_t) + ∇f(x_t)^T (y − x_t) + (1/(2η))∥y − x_t∥_2^2,
where the last (quadratic) term behaves as a regularizer to ensure that (for small η) our update remains close to our current iterate x_t. This local optimization problem has a closed form solution: setting the derivative with respect to y to 0 gives ∇f(x_t) + (1/η)(y − x_t) = 0, and this precisely gives us our familiar GD update:
x_{t+1} = x_t − η∇f(x_t).
An example of this local minimization problem is shown in Figure 7.3, where the blue dot is our current iterate x_t and the red dot is x_{t+1}.
Figure 7.3: Local quadratic approximation of a function
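To connect the two views, here is a small sketch (not from the notes) that minimizes the local model above numerically and checks that the minimizer matches the closed-form GD update x_t − η∇f(x_t). The test function, point, and step-size are arbitrary placeholders.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative smooth function and its gradient.
f = lambda x: 0.5 * (10 * x[0]**2 + x[1]**2)
grad_f = lambda x: np.array([10 * x[0], x[1]])

x_t = np.array([2.0, -1.0])
eta = 0.05

# Local model around x_t: linear approximation plus a proximity penalty.
model = lambda y: f(x_t) + grad_f(x_t) @ (y - x_t) + np.sum((y - x_t)**2) / (2 * eta)

y_star = minimize(model, x_t).x           # numerical minimizer of the local model
gd_step = x_t - eta * grad_f(x_t)         # closed-form GD update

print(np.allclose(y_star, gd_step, atol=1e-4))  # True (up to solver tolerance)
```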
7.2 Choosing the Step-Size
In practice, the most important choice to be made is that of the step-size.
We’ll see various theoretical rules/schedules that one might follow based on
what we know about the objective function. Here are some natural possibilities:
1. Fixed Step-Size: Here we simply select a fixed step-size η and run
the algorithm with that fixed step-size. An immediate problem that
you will encounter (in practice) even for very benign problems is that
if you select the step-size too large then GD can diverge, and if you
select it too small it might take a very long time to converge.
You will find pictures of this in the BV textbook, but here is a typical
analytical example to keep in mind.
(a) Suppose we have f(x) = x^2/2, we initialize at x_0 = 1, and we take our step size to be 3 (too large). Then the iterates will be x_t = −2, 4, −8, . . . (i.e. GD will diverge).
(b) For the same function and initialization, if we take our step size to be 0.00001, then GD would take on the order of 10^5 steps to converge.
(c) On the other hand, if we picked the "correct" step-size of 1, we would converge in 1 step.
Figure 7.4 depicts three runs of gradient descent, each with a different
fixed step size η. When the step size is too small, even after 100 steps
it has not reached the minimum. When the step size is too large, after
8 steps it has already begun to diverge. When the step size is “just
right”, it converges after 40 steps.
Figure 7.4: Examples of different step sizes on the function f(x) = (10x_1^2 + x_2^2)/2. One is too small (left), one too large (middle), and one “just right” (right).
In theory, we'd like to understand this issue better (i.e. what properties of a function make certain step-sizes “too big”, “too small”, or “correct”). The correct step-size in many cases may depend on properties of the function that we don't know. In practice, it will often be useful to have at our disposal a few different ways to tune the step-size (and some understanding of how we might diagnose issues with the step-size choice). A short code sketch after this list illustrates these regimes on the function from Figure 7.4.
2. Exact Line-Search: Once we've committed to a direction (in GD this is the direction of the negative gradient), one might consider solving the following 1D optimization problem to determine the best step-size:
η_t = arg min_{η̃ ≥ 0} f(x_t − η̃∇f(x_t)).
It's often computationally cumbersome to solve this optimization problem exactly, so we resort to some approximation of this idea.
3. Backtracking Line-Search: The idea of backtracking line-search, very roughly, is to try an aggressive (large) step-size, and reduce it by some factor if it's too big.
Here is the algorithm: we pick two parameters α ∈ (0, 0.5) and β ∈ (0, 1). At iteration t: initialize η = 1,
(a) If f(x_t − η∇f(x_t)) > f(x_t) − αη∥∇f(x_t)∥_2^2, then reduce η := β × η and go back to step (a).
(b) Otherwise, take a step, i.e. set x_{t+1} = x_t − η∇f(x_t).
Often in practice, taking α = 0.3 and β = 0.5 works reasonably well. This method of backtracking line search uses the Armijo-Goldstein inequality to ensure that we achieve a sufficient decrease.
In Figure 7.5, gradient descent with backtracking line search is applied to the same function we examined before, and it roughly seems to get the right step sizes. (A small code sketch of this procedure appears after this list.)
Figure 7.5: Example of gradient descent with backtracking line search. In this example, it accepts 12 steps and computes 40 steps total.
You will develop some better intuition when we study the main descent lemma for GD, but roughly, if your function is nice (the Hessian term in a Taylor series is ignorable), you should expect to make about η∥∇f(x_t)∥_2^2 worth of progress in one step of GD if η is small enough. The backtracking line search simply says that if you're making up to an α factor of this amount of progress, you should be content and take a step.
Example 7.1 (Backtracking on a Linear Function). As an example, consider the case where f is a linear function. Then it's clear that taking a gradient step −η∇f(x_t) improves the function by exactly f(x_t) − f(x_{t+1}) = η∥∇f(x_t)∥_2^2. If our function is locally linear, then maybe making an improvement of αη∥∇f(x_t)∥_2^2 is good enough.
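Pulling the items of this list together, here is a minimal NumPy sketch (not part of the original notes) of gradient descent on the test function from Figures 7.4 and 7.5, f(x) = (10x_1^2 + x_2^2)/2, run either with a fixed step-size or with the backtracking line search above (α = 0.3, β = 0.5). The particular fixed step-sizes and iteration counts are illustrative choices.

```python
import numpy as np

# Test function from Figures 7.4 and 7.5: f(x) = (10*x1^2 + x2^2) / 2.
f = lambda x: 0.5 * (10 * x[0]**2 + x[1]**2)
grad_f = lambda x: np.array([10 * x[0], x[1]])

def backtracking_step_size(x, alpha=0.3, beta=0.5):
    """Backtracking line search along the negative gradient (Armijo condition)."""
    g = grad_f(x)
    eta = 1.0
    # Shrink eta until the sufficient-decrease condition holds.
    while f(x - eta * g) > f(x) - alpha * eta * g @ g:
        eta *= beta
    return eta

def run_gd(x0, step_size, T=100, eps=1e-10):
    """step_size is either a float (fixed) or a callable x -> eta (line search)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = grad_f(x)
        if g @ g < eps:
            break
        eta = step_size(x) if callable(step_size) else step_size
        x = x - eta * g
    return x

x0 = [20.0, 20.0]
print(run_gd(x0, 0.001))                    # too small: still far from the minimum after 100 steps
print(run_gd(x0, 0.25, T=8))                # too large: the x1 coordinate has already blown up
print(run_gd(x0, backtracking_step_size))   # backtracking: converges towards [0, 0]
```

The two fixed-step runs reproduce the “too small” and “too large” regimes from Figure 7.4, while the backtracking run adapts the step-size automatically and converges towards the minimum at the origin.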
Segue... Next lecture we will evaluate the performance of gradient descent on two example functions and consider our first convergence results.