Gradient

The document summarizes key concepts related to gradient methods for minimizing convex functions. It discusses: 1) The gradient method which iteratively updates the current point by moving in the negative gradient direction with a step size determined by line search. 2) Properties of convex functions including Jensen's inequality and the gradient being a monotone mapping. 3) The definition of a Lipschitz continuous gradient, which bounds the change in gradient as a function of distance between points. This property implies a quadratic upper bound on the function.


L. Vandenberghe, ECE236C (Spring 2019)

1. Gradient method

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

1.1
Gradient method

to minimize a convex differentiable function $f$: choose an initial point $x_0$ and repeat

$$ x_{k+1} = x_k - t_k \nabla f(x_k), \qquad k = 0, 1, \ldots $$

the step size $t_k$ is constant or determined by line search
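A minimal sketch of this iteration in plain Python, assuming a constant step size. The name `gradient_method` and the test function f(x) = ½(x1² + 4·x2²) are illustrative assumptions, not part of the lecture.

```python
def gradient_method(grad, x0, step, num_iters):
    """Run x_{k+1} = x_k - step * grad(x_k) with a constant step size."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical test problem: f(x) = 0.5 * (x1^2 + 4 * x2^2), minimized at x = 0.
def grad_f(x):
    return [x[0], 4.0 * x[1]]


# The largest Hessian eigenvalue is 4, so step = 1/4 is a safe constant step.
x_final = gradient_method(grad_f, [4.0, 1.0], step=0.25, num_iters=100)
```

With this step size both coordinates shrink toward the minimizer at the origin.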

Advantages

• every iteration is inexpensive


• does not require second derivatives

Notation

• $x_k$ can refer to the $k$th element of a sequence, or to the $k$th component of the vector $x$

• to avoid confusion, we sometimes use $x^{(k)}$ to denote elements of a sequence

Gradient method 1.2


Quadratic example

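$$ f(x) = \frac{1}{2}\,(x_1^2 + \gamma x_2^2) \qquad (\text{with } \gamma > 1) $$

with exact line search and starting point $x^{(0)} = (\gamma, 1)$

$$ \frac{\|x^{(k)} - x^\star\|_2}{\|x^{(0)} - x^\star\|_2} = \left( \frac{\gamma - 1}{\gamma + 1} \right)^k $$

where $x^\star = 0$

[Figure: plot in the (x1, x2) plane; x1 from −10 to 10, x2 from −4 to 4]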
gradient method is often slow; convergence very dependent on scaling
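The convergence ratio in this example can be checked numerically. The sketch below assumes the quadratic f(x) = ½(x1² + γ·x2²), for which the exact line search step has the closed form t = gᵀg / (g1² + γ·g2²). The function name `distance_ratios` and the value γ = 10 are illustrative assumptions.

```python
import math


def distance_ratios(gamma, num_iters):
    """Gradient method with exact line search on f(x) = 0.5*(x1^2 + gamma*x2^2).

    Returns ||x^(k)||_2 / ||x^(0)||_2 for k = 1..num_iters (here x* = 0).
    """
    x1, x2 = gamma, 1.0
    norm0 = math.hypot(x1, x2)
    ratios = []
    for _ in range(num_iters):
        g1, g2 = x1, gamma * x2                      # gradient of f at x
        # exact minimizer of f(x - t*g) over t for this quadratic
        t = (g1 * g1 + g2 * g2) / (g1 * g1 + gamma * g2 * g2)
        x1, x2 = x1 - t * g1, x2 - t * g2
        ratios.append(math.hypot(x1, x2) / norm0)
    return ratios


gamma = 10.0
ratios = distance_ratios(gamma, 20)
predicted = [((gamma - 1) / (gamma + 1)) ** k for k in range(1, 21)]
```

For large γ the factor (γ − 1)/(γ + 1) is close to 1, which is the slow convergence the slide refers to.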

Gradient method 1.3


Nondifferentiable example

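$$ f(x) = \sqrt{x_1^2 + \gamma x_2^2} \ \text{ if } |x_2| \le x_1, \qquad f(x) = \frac{x_1 + \gamma |x_2|}{\sqrt{1 + \gamma}} \ \text{ if } |x_2| > x_1 $$

with exact line search and starting point $x^{(0)} = (\gamma, 1)$, the method converges to a non-optimal point

[Figure: plot in the (x1, x2) plane]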

gradient method does not handle nondifferentiable problems

Gradient method 1.4


First-order methods

address one or both shortcomings of the gradient method

Methods for nondifferentiable or constrained problems

• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods

Methods with improved convergence

• conjugate gradient method


• accelerated gradient method
• quasi-Newton methods

Gradient method 1.5



Convex function

a function f is convex if dom f is a convex set and Jensen’s inequality holds:

f (θ x + (1 − θ)y) ≤ θ f (x) + (1 − θ) f (y) for all x, y ∈ dom f , θ ∈ [0, 1]

First-order condition

for (continuously) differentiable f , Jensen’s inequality can be replaced with

$$ f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in \operatorname{dom} f $$

Second-order condition

for twice differentiable $f$, Jensen’s inequality can be replaced with

$$ \nabla^2 f(x) \succeq 0 \quad \text{for all } x \in \operatorname{dom} f $$

Gradient method 1.6


Strictly convex function

f is strictly convex if dom f is a convex set and

$$ f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y) \quad \text{for all } x, y \in \operatorname{dom} f,\ x \ne y,\ \text{and } \theta \in (0, 1) $$

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f , strict Jensen’s inequality can be replaced with

$$ f(y) > f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in \operatorname{dom} f,\ x \ne y $$

Second-order condition

note that $\nabla^2 f(x) \succ 0$ is not necessary for strict convexity (cf. $f(x) = x^4$)

Gradient method 1.7


Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

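$$ (\nabla f(x) - \nabla f(y))^T (x - y) \ge 0 \quad \text{for all } x, y \in \operatorname{dom} f $$

i.e., the gradient $\nabla f : \mathbf{R}^n \to \mathbf{R}^n$ is a monotone mapping

a differentiable function $f$ is strictly convex if and only if $\operatorname{dom} f$ is convex and

$$ (\nabla f(x) - \nabla f(y))^T (x - y) > 0 \quad \text{for all } x, y \in \operatorname{dom} f,\ x \ne y $$

i.e., the gradient $\nabla f : \mathbf{R}^n \to \mathbf{R}^n$ is a strictly monotone mapping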

Gradient method 1.8


Proof

• if $f$ is differentiable and convex, then

$$ f(y) \ge f(x) + \nabla f(x)^T (y - x), \qquad f(x) \ge f(y) + \nabla f(y)^T (x - y) $$

combining the inequalities gives $(\nabla f(x) - \nabla f(y))^T (x - y) \ge 0$

• if $\nabla f$ is monotone, then $g'(t) \ge g'(0)$ for $t \ge 0$ and $t \in \operatorname{dom} g$, where

$$ g(t) = f(x + t(y - x)), \qquad g'(t) = \nabla f(x + t(y - x))^T (y - x) $$

hence

$$ f(y) = g(1) = g(0) + \int_0^1 g'(t)\,dt \ge g(0) + g'(0) = f(x) + \nabla f(x)^T (y - x) $$

this is the first-order condition for convexity

Gradient method 1.9



Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

$$ \|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\| \quad \text{for all } x, y \in \operatorname{dom} f $$

• functions f with this property are also called L -smooth

• the definition does not assume convexity of f (and holds for − f if it holds for f )

• in the definition, $\|\cdot\|$ and $\|\cdot\|_*$ are a pair of dual norms:

$$ \|u\|_* = \sup_{v \ne 0} \frac{u^T v}{\|v\|} = \sup_{\|v\| = 1} u^T v $$

this implies a generalized Cauchy–Schwarz inequality

$$ |u^T v| \le \|u\|_* \|v\| \quad \text{for all } u, v $$

Gradient method 1.10


Choice of norm

Equivalence of norms

• for any two norms $\|\cdot\|_a$, $\|\cdot\|_b$, there exist positive constants $c_1$, $c_2$ such that

$$ c_1 \|x\|_b \le \|x\|_a \le c_2 \|x\|_b \quad \text{for all } x $$

• constants depend on dimension; for example, for $x \in \mathbf{R}^n$,

$$ \|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad \frac{1}{\sqrt{n}} \|x\|_2 \le \|x\|_\infty \le \|x\|_2 $$

Norm in definition of Lipschitz continuity

• without loss of generality we can use the Euclidean norm $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$


• the parameter L depends on choice of norm
• in complexity bounds, choice of norm can simplify dependence on dimensions

Gradient method 1.11


Quadratic upper bound
suppose ∇ f is Lipschitz continuous with parameter L

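• this implies (from the generalized Cauchy–Schwarz inequality) that

$$ (\nabla f(x) - \nabla f(y))^T (x - y) \le L \|x - y\|^2 \quad \text{for all } x, y \in \operatorname{dom} f \tag{1} $$

• if $\operatorname{dom} f$ is convex, (1) is equivalent to

$$ f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2 \quad \text{for all } x, y \in \operatorname{dom} f \tag{2} $$

[Figure: graph of f(y) and the quadratic upper bound at the point (x, f(x))]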

Gradient method 1.12


Proof (of the equivalence of (1) and (2) if dom f is convex)
• consider arbitrary $x, y \in \operatorname{dom} f$ and define $g(t) = f(x + t(y - x))$
• $g(t)$ is defined for $t \in [0, 1]$ because $\operatorname{dom} f$ is convex
• if (1) holds, then

$$ g'(t) - g'(0) = (\nabla f(x + t(y - x)) - \nabla f(x))^T (y - x) \le t L \|x - y\|^2 $$

integrating from $t = 0$ to $t = 1$ gives (2):

$$
\begin{aligned}
f(y) = g(1) = g(0) + \int_0^1 g'(t)\,dt &\le g(0) + g'(0) + \frac{L}{2}\|x - y\|^2 \\
&= f(x) + \nabla f(x)^T (y - x) + \frac{L}{2}\|x - y\|^2
\end{aligned}
$$

• conversely, if (2) holds, then (2) and the same inequality with $x$, $y$ switched, i.e.,

$$ f(x) \le f(y) + \nabla f(y)^T (x - y) + \frac{L}{2}\|x - y\|^2, $$

can be combined to give $(\nabla f(x) - \nabla f(y))^T (x - y) \le L \|x - y\|^2$

Gradient method 1.13


Consequence of quadratic upper bound

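if $\operatorname{dom} f = \mathbf{R}^n$ and $f$ has a minimizer $x^\star$, then

$$ \frac{1}{2L} \|\nabla f(z)\|_*^2 \le f(z) - f(x^\star) \le \frac{L}{2} \|z - x^\star\|^2 \quad \text{for all } z $$

• right-hand inequality follows from upper bound property (2) at $x = x^\star$, $y = z$

• left-hand inequality follows by minimizing quadratic upper bound for $x = z$

$$
\begin{aligned}
\inf_y f(y) &\le \inf_y \left( f(z) + \nabla f(z)^T (y - z) + \frac{L}{2}\|y - z\|^2 \right) \\
&= \inf_{\|v\| = 1} \inf_t \left( f(z) + t \nabla f(z)^T v + \frac{L t^2}{2} \right) \\
&= \inf_{\|v\| = 1} \left( f(z) - \frac{1}{2L} (\nabla f(z)^T v)^2 \right) \\
&= f(z) - \frac{1}{2L} \|\nabla f(z)\|_*^2
\end{aligned}
$$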

Gradient method 1.14


Co-coercivity of gradient

if $f$ is convex with $\operatorname{dom} f = \mathbf{R}^n$ and $\nabla f$ is $L$-Lipschitz continuous, then

$$ (\nabla f(x) - \nabla f(y))^T (x - y) \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_*^2 \quad \text{for all } x, y $$

• this property is known as co-coercivity of $\nabla f$ (with parameter $1/L$)

• co-coercivity in turn implies Lipschitz continuity of $\nabla f$ (by Cauchy–Schwarz)

• hence, for differentiable convex $f$ with $\operatorname{dom} f = \mathbf{R}^n$

Lipschitz continuity of $\nabla f$ ⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of $\nabla f$
⇒ Lipschitz continuity of $\nabla f$

therefore the three properties are equivalent

Gradient method 1.15


Proof of co-coercivity: define two convex functions $f_x$, $f_y$ with domain $\mathbf{R}^n$

$$ f_x(z) = f(z) - \nabla f(x)^T z, \qquad f_y(z) = f(z) - \nabla f(y)^T z $$

• the two functions have $L$-Lipschitz continuous gradients

• $z = x$ minimizes $f_x(z)$; from the left-hand inequality on page 1.14,

$$
\begin{aligned}
f(y) - f(x) - \nabla f(x)^T (y - x) &= f_x(y) - f_x(x) \\
&\ge \frac{1}{2L} \|\nabla f_x(y)\|_*^2 \\
&= \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_*^2
\end{aligned}
$$

• similarly, $z = y$ minimizes $f_y(z)$; therefore

$$ f(x) - f(y) - \nabla f(y)^T (x - y) \ge \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_*^2 $$

combining the two inequalities shows co-coercivity


Gradient method 1.16
Lipschitz continuity with respect to Euclidean norm

suppose $f$ is convex with $\operatorname{dom} f = \mathbf{R}^n$, and $L$-smooth for the Euclidean norm:

$$ \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y $$

• the equivalent property (1) states that

$$ (\nabla f(x) - \nabla f(y))^T (x - y) \le L (x - y)^T (x - y) \quad \text{for all } x, y $$

• this is monotonicity of $Lx - \nabla f(x)$, i.e., equivalent to the property that

$$ \frac{L}{2}\|x\|_2^2 - f(x) \quad \text{is a convex function} $$

• if $f$ is twice differentiable, the Hessian of this function is $LI - \nabla^2 f(x)$:

$$ \lambda_{\max}(\nabla^2 f(x)) \le L \quad \text{for all } x $$

is an equivalent characterization of L -smoothness
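As a small numerical illustration, the sketch below takes the quadratic f(x) = ½(x1² + γ·x2²) from page 1.3, whose Hessian is diag(1, γ), so that L = γ for the Euclidean norm. It then spot-checks the quadratic upper bound (2) and co-coercivity at random points. The value γ = 10, the sampling range, the tolerance, and the helper names are assumptions made for illustration.

```python
import random

GAMMA = 10.0  # illustrative value; the Hessian is diag(1, GAMMA), so L = GAMMA


def f(x):
    return 0.5 * (x[0] ** 2 + GAMMA * x[1] ** 2)


def grad(x):
    return [x[0], GAMMA * x[1]]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def check_smoothness(num_trials=1000, seed=0, tol=1e-9):
    """Spot-check the upper bound (2) and co-coercivity at random point pairs."""
    rng = random.Random(seed)
    L = GAMMA
    for _ in range(num_trials):
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        y = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        d = [yi - xi for xi, yi in zip(x, y)]                # y - x
        gd = [gy - gx for gx, gy in zip(grad(x), grad(y))]   # grad f(y) - grad f(x)
        upper_ok = f(y) <= f(x) + dot(grad(x), d) + 0.5 * L * dot(d, d) + tol
        cocoercive_ok = dot(gd, d) >= dot(gd, gd) / L - tol
        if not (upper_ok and cocoercive_ok):
            return False
    return True


all_hold = check_smoothness()
```

A random check like this does not prove the properties; it only confirms that they hold at the sampled points, up to the floating-point tolerance `tol`.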

Gradient method 1.17



Strongly convex function

$f$ is strongly convex with parameter $m > 0$ if $\operatorname{dom} f$ is convex and

$$ f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) - \frac{m}{2}\,\theta (1 - \theta) \|x - y\|^2 $$

holds for all $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$

• this is a stronger version of Jensen’s inequality

• it holds if and only if it holds for $f$ restricted to arbitrary lines:

$$ f(x + t(y - x)) - \frac{m}{2}\,t^2 \|x - y\|^2 \tag{3} $$

is a convex function of $t$, for all $x, y \in \operatorname{dom} f$

• without loss of generality, we can take $\|\cdot\| = \|\cdot\|_2$


• however, the strong convexity parameter m depends on the norm used

Gradient method 1.18


Quadratic lower bound
if $f$ is differentiable and $m$-strongly convex, then

$$ f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|^2 \quad \text{for all } x, y \in \operatorname{dom} f \tag{4} $$

[Figure: graph of f(y) and the quadratic lower bound at the point (x, f(x))]

• follows from the first-order condition for convexity of (3)

• this implies that the sublevel sets of $f$ are bounded
• if $f$ is closed (has closed sublevel sets), it has a unique minimizer $x^\star$ and

$$ \frac{m}{2} \|z - x^\star\|^2 \le f(z) - f(x^\star) \le \frac{1}{2m} \|\nabla f(z)\|_*^2 \quad \text{for all } z \in \operatorname{dom} f $$

(proof as on page 1.14)
Gradient method 1.19
Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

$$ (\nabla f(x) - \nabla f(y))^T (x - y) \ge m \|x - y\|^2 \quad \text{for all } x, y \in \operatorname{dom} f $$

this is called strong monotonicity (coercivity) of $\nabla f$

Proof

• one direction follows from (4) and the same inequality with $x$ and $y$ switched
• for the other direction, assume $\nabla f$ is strongly monotone and define

$$ g(t) = f(x + t(y - x)) - \frac{m}{2}\,t^2 \|x - y\|^2 $$

then $g'(t)$ is nondecreasing, so $g$ is convex

Gradient method 1.20


Strong convexity with respect to Euclidean norm

suppose $f$ is $m$-strongly convex for the Euclidean norm:

$$ f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) - \frac{m}{2}\,\theta (1 - \theta) \|x - y\|_2^2 $$

for $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$

• this is Jensen’s inequality for the function

$$ h(x) = f(x) - \frac{m}{2} \|x\|_2^2 $$

• therefore $f$ is strongly convex if and only if $h$ is convex

• if $f$ is twice differentiable, $h$ is convex if and only if $\nabla^2 f(x) - mI \succeq 0$, or

$$ \lambda_{\min}(\nabla^2 f(x)) \ge m \quad \text{for all } x \in \operatorname{dom} f $$

Gradient method 1.21


Extension of co-coercivity

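suppose $f$ is $m$-strongly convex and $L$-smooth for $\|\cdot\|_2$, and $\operatorname{dom} f = \mathbf{R}^n$

• then the function

$$ h(x) = f(x) - \frac{m}{2} \|x\|_2^2 $$

is convex and $(L - m)$-smooth:

$$
\begin{aligned}
0 &\le (\nabla h(x) - \nabla h(y))^T (x - y) \\
&= (\nabla f(x) - \nabla f(y))^T (x - y) - m \|x - y\|_2^2 \\
&\le (L - m) \|x - y\|_2^2
\end{aligned}
$$

• co-coercivity of $\nabla h$ can be written as

$$ (\nabla f(x) - \nabla f(y))^T (x - y) \ge \frac{mL}{m + L} \|x - y\|_2^2 + \frac{1}{m + L} \|\nabla f(x) - \nabla f(y)\|_2^2 $$

for all $x, y \in \operatorname{dom} f$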

Gradient method 1.22



Analysis of gradient method

$$ x_{k+1} = x_k - t_k \nabla f(x_k), \qquad k = 0, 1, \ldots $$

with fixed step size or backtracking line search

Assumptions

1. $f$ is convex and differentiable with $\operatorname{dom} f = \mathbf{R}^n$

2. $\nabla f(x)$ is $L$-Lipschitz continuous with respect to the Euclidean norm, with $L > 0$

3. optimal value $f^\star = \inf_x f(x)$ is finite and attained at $x^\star$

Gradient method 1.23


Basic gradient step

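• from quadratic upper bound (page 1.12) with $y = x - t \nabla f(x)$:

$$ f(x - t \nabla f(x)) \le f(x) - t \left( 1 - \frac{Lt}{2} \right) \|\nabla f(x)\|_2^2 $$

• therefore, if $x^+ = x - t \nabla f(x)$ and $0 < t \le 1/L$,

$$ f(x^+) \le f(x) - \frac{t}{2} \|\nabla f(x)\|_2^2 \tag{5} $$

• from (5) and convexity of $f$,

$$
\begin{aligned}
f(x^+) - f^\star &\le \nabla f(x)^T (x - x^\star) - \frac{t}{2} \|\nabla f(x)\|_2^2 \\
&= \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \|x - x^\star - t \nabla f(x)\|_2^2 \right) \\
&= \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 \right)
\end{aligned}
\tag{6}
$$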

Gradient method 1.24


Descent properties

assume $\nabla f(x) \ne 0$

• the inequality (5) shows that

$$ f(x^+) < f(x) $$

• the inequality (6) shows that

$$ \|x^+ - x^\star\|_2 < \|x - x^\star\|_2 $$

in the gradient method, function value and distance to the optimal set decrease

Gradient method 1.25


Gradient method with constant step size

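$$ x_{k+1} = x_k - t \nabla f(x_k), \qquad k = 0, 1, \ldots $$

• take $x = x_{i-1}$, $x^+ = x_i$ in (6) and add the bounds for $i = 1, \ldots, k$:

$$
\begin{aligned}
\sum_{i=1}^k (f(x_i) - f^\star) &\le \frac{1}{2t} \sum_{i=1}^k \left( \|x_{i-1} - x^\star\|_2^2 - \|x_i - x^\star\|_2^2 \right) \\
&= \frac{1}{2t} \left( \|x_0 - x^\star\|_2^2 - \|x_k - x^\star\|_2^2 \right) \\
&\le \frac{1}{2t} \|x_0 - x^\star\|_2^2
\end{aligned}
$$

• since $f(x_i)$ is non-increasing (see (5))

$$ f(x_k) - f^\star \le \frac{1}{k} \sum_{i=1}^k (f(x_i) - f^\star) \le \frac{1}{2kt} \|x_0 - x^\star\|_2^2 $$

Conclusion: number of iterations to reach $f(x_k) - f^\star \le \epsilon$ is $O(1/\epsilon)$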
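The 1/k bound can be observed on a concrete problem. The sketch below assumes the quadratic f(x) = ½(x1² + γ·x2²) with γ = 10 (so L = γ, x⋆ = 0, f⋆ = 0), the step size t = 1/L, and the starting point (γ, 1). The helper name `gap_and_bound` is hypothetical.

```python
GAMMA = 10.0  # illustrative value; L = GAMMA, x* = 0 and f* = 0 for this f


def f(x):
    return 0.5 * (x[0] ** 2 + GAMMA * x[1] ** 2)


def grad(x):
    return [x[0], GAMMA * x[1]]


def gap_and_bound(num_iters=50):
    """Return pairs (f(x_k) - f*, ||x_0 - x*||^2 / (2*k*t)) for k = 1..num_iters."""
    t = 1.0 / GAMMA                      # constant step size t = 1/L
    x = [GAMMA, 1.0]
    r0_sq = x[0] ** 2 + x[1] ** 2        # ||x_0 - x*||_2^2 with x* = 0
    pairs = []
    for k in range(1, num_iters + 1):
        g = grad(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
        pairs.append((f(x), r0_sq / (2 * k * t)))
    return pairs


pairs = gap_and_bound()
```

On this example the actual gap decreases much faster than the bound, which is consistent with the better rate for strongly convex functions.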


Gradient method 1.26
Backtracking line search

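initialize $t_k$ at $\hat t > 0$ (for example, $\hat t = 1$) and take $t_k := \beta t_k$ until

$$ f(x_k - t_k \nabla f(x_k)) < f(x_k) - \alpha t_k \|\nabla f(x_k)\|_2^2 $$

[Figure: as functions of $t$, the curve $f(x_k - t \nabla f(x_k))$ and the lines $f(x_k) - \alpha t \|\nabla f(x_k)\|_2^2$ and $f(x_k) - t \|\nabla f(x_k)\|_2^2$]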

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
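A minimal sketch of one gradient step with this backtracking rule, in plain Python. The function name `backtracking_step`, the `max_backtracks` safeguard, the early return for a zero gradient, and the quadratic test problem are assumptions added for illustration; only the exit condition is taken from the slide.

```python
def backtracking_step(f, grad, x, t_hat=1.0, alpha=0.5, beta=0.5, max_backtracks=60):
    """One gradient step; t starts at t_hat and is multiplied by beta until
    f(x - t*g) < f(x) - alpha * t * ||g||_2^2 holds."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    if g_sq == 0.0:
        return list(x), t_hat            # stationary point: the strict test can never pass
    fx = f(x)
    t = t_hat
    for _ in range(max_backtracks):
        x_new = [xi - t * gi for xi, gi in zip(x, g)]
        if f(x_new) < fx - alpha * t * g_sq:
            return x_new, t
        t *= beta
    raise RuntimeError("backtracking did not terminate")  # safeguard, not in the slides


# Hypothetical test problem: f(x) = 0.5 * (x1^2 + GAMMA * x2^2), so L = GAMMA.
GAMMA = 10.0


def f(x):
    return 0.5 * (x[0] ** 2 + GAMMA * x[1] ** 2)


def grad(x):
    return [x[0], GAMMA * x[1]]


x_new, t = backtracking_step(f, grad, [GAMMA, 1.0])
```

From the starting point (10, 1) the candidates t = 1, 0.5, 0.25 are rejected and t = 0.125 is accepted.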

Gradient method 1.27


Analysis for backtracking line search

line search with α = 1/2, if f has a Lipschitz continuous gradient

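[Figure: as functions of $t$, the curves $f(x_k - t \nabla f(x_k))$, $f(x_k) - t (1 - \tfrac{tL}{2}) \|\nabla f(x_k)\|_2^2$, and $f(x_k) - \tfrac{t}{2} \|\nabla f(x_k)\|_2^2$, with $t = 1/L$ marked on the axis]

selected step size satisfies $t_k \ge t_{\min} = \min\{\hat t, \beta / L\}$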

Gradient method 1.28


Gradient method with backtracking line search

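• from line search condition and convexity of $f$,

$$
\begin{aligned}
f(x_{i+1}) &\le f(x_i) - \frac{t_i}{2} \|\nabla f(x_i)\|_2^2 \\
&\le f^\star + \nabla f(x_i)^T (x_i - x^\star) - \frac{t_i}{2} \|\nabla f(x_i)\|_2^2 \\
&= f^\star + \frac{1}{2 t_i} \left( \|x_i - x^\star\|_2^2 - \|x_{i+1} - x^\star\|_2^2 \right)
\end{aligned}
$$

• this implies $\|x_{i+1} - x^\star\|_2 \le \|x_i - x^\star\|_2$, so we can replace $t_i$ with $t_{\min} \le t_i$:

$$ f(x_{i+1}) - f^\star \le \frac{1}{2 t_{\min}} \left( \|x_i - x^\star\|_2^2 - \|x_{i+1} - x^\star\|_2^2 \right) $$

• adding the upper bounds gives same $1/k$ bound as with constant step size

$$ f(x_k) - f^\star \le \frac{1}{k} \sum_{i=1}^k (f(x_i) - f^\star) \le \frac{1}{2 k t_{\min}} \|x_0 - x^\star\|_2^2 $$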

Gradient method 1.29


Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on p. 1.23

Analysis for constant step size

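if $x^+ = x - t \nabla f(x)$ and $0 < t \le 2/(m + L)$:

$$
\begin{aligned}
\|x^+ - x^\star\|_2^2 &= \|x - t \nabla f(x) - x^\star\|_2^2 \\
&= \|x - x^\star\|_2^2 - 2t \nabla f(x)^T (x - x^\star) + t^2 \|\nabla f(x)\|_2^2 \\
&\le \left( 1 - t \frac{2mL}{m + L} \right) \|x - x^\star\|_2^2 + t \left( t - \frac{2}{m + L} \right) \|\nabla f(x)\|_2^2 \\
&\le \left( 1 - t \frac{2mL}{m + L} \right) \|x - x^\star\|_2^2
\end{aligned}
$$

(step 3 follows from the result on page 1.22 with $y = x^\star$ and $\nabla f(x^\star) = 0$; step 4 uses $t \le 2/(m + L)$)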

Gradient method 1.30


Distance to optimum

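$$ \|x_k - x^\star\|_2^2 \le c^k \|x_0 - x^\star\|_2^2, \qquad c = 1 - t \frac{2mL}{m + L} $$

• implies (linear) convergence

• for $t = 2/(m + L)$, get $c = \left( \dfrac{\gamma - 1}{\gamma + 1} \right)^2$ with $\gamma = L/m$

Bound on function value (from page 1.14)

$$ f(x_k) - f^\star \le \frac{L}{2} \|x_k - x^\star\|_2^2 \le \frac{c^k L}{2} \|x_0 - x^\star\|_2^2 $$

Conclusion: number of iterations to reach $f(x_k) - f^\star \le \epsilon$ is $O(\log(1/\epsilon))$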
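The linear rate can be illustrated numerically. The sketch below assumes the quadratic f(x) = ½(m·x1² + L·x2²) with m = 1 and L = 10, which is m-strongly convex and L-smooth with x⋆ = 0, and uses the step size t = 2/(m + L). The helper name `distance_and_bound` and the relative tolerance used to absorb floating-point rounding are assumptions.

```python
def distance_and_bound(m=1.0, L=10.0, num_iters=30):
    """Return c and pairs (||x_k - x*||^2, c^k * ||x_0 - x*||^2) for k = 1..num_iters,
    for f(x) = 0.5 * (m*x1^2 + L*x2^2) and step size t = 2/(m + L)."""
    t = 2.0 / (m + L)
    c = 1.0 - t * 2.0 * m * L / (m + L)
    x = [L, 1.0]
    r0_sq = x[0] ** 2 + x[1] ** 2        # ||x_0 - x*||_2^2 with x* = 0
    pairs = []
    for k in range(1, num_iters + 1):
        # gradient of f at x is (m*x1, L*x2)
        x = [x[0] - t * m * x[0], x[1] - t * L * x[1]]
        pairs.append((x[0] ** 2 + x[1] ** 2, c ** k * r0_sq))
    return c, pairs


c, pairs = distance_and_bound()
```

For this particular quadratic the bound holds with equality in exact arithmetic, so the comparison needs a small tolerance.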

Gradient method 1.31


Limits on convergence rate of first-order methods

First-order method: any iterative algorithm that selects $x_{k+1}$ in the set

$$ x_0 + \operatorname{span}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\} $$

Problem class: any function that satisfies the assumptions on page 1.23

Theorem (Nesterov): for every integer $k \le (n - 1)/2$ and every $x_0$, there exist functions in the problem class such that for any first-order method

$$ f(x_k) - f^\star \ge \frac{3}{32}\,\frac{L \|x_0 - x^\star\|_2^2}{(k + 1)^2} $$

• suggests $1/k$ rate for gradient method is not optimal

• more recent accelerated gradient methods have $1/k^2$ convergence (see later)

Gradient method 1.32


References

• A. Beck, First-Order Methods in Optimization (2017), chapter 5.

• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result on page 1.32 is Theorem 2.1.7 in the book.)

• B. T. Polyak, Introduction to Optimization (1987), section 1.4.

• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37.

Gradient method 1.33
