L. Vandenberghe, ECE236C (Spring 2019)
1. Gradient method
• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method
1.1
Gradient method
to minimize a convex differentiable function $f$: choose an initial point $x_0$ and repeat
$$x_{k+1} = x_k - t_k \nabla f(x_k), \qquad k = 0, 1, \ldots$$
the step size $t_k$ is constant or determined by line search
Advantages
• every iteration is inexpensive
• does not require second derivatives
Notation
• $x_k$ can refer to the $k$th element of a sequence, or to the $k$th component of the vector $x$
• to avoid confusion, we sometimes use $x^{(k)}$ to denote elements of a sequence
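The iteration above can be sketched in a few lines of Python. This is a minimal illustration with a constant step size, not a reference implementation; the helper name `gradient_method`, the test function, the step size 0.2, and the iteration count are all assumptions chosen for the example.

```python
def gradient_method(grad, x0, step, num_iters):
    """Constant-step gradient method: x_{k+1} = x_k - step * grad(x_k)."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical test problem: f(x) = 0.5 * (x1^2 + 4 * x2^2), minimized at 0.
def grad_f(x):
    return [x[0], 4.0 * x[1]]


x_final = gradient_method(grad_f, x0=[4.0, 1.0], step=0.2, num_iters=200)
print(x_final)  # both components are very close to 0
```

Each iteration needs only one gradient evaluation and no second derivatives, which is the point of the two advantages listed above.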
Gradient method 1.2
Quadratic example
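$$f(x) = \frac{1}{2}\left(x_1^2 + \gamma x_2^2\right) \qquad (\text{with } \gamma > 1)$$
with exact line search and starting point $x^{(0)} = (\gamma, 1)$
$$\frac{\|x^{(k)} - x^\star\|_2}{\|x^{(0)} - x^\star\|_2} = \left(\frac{\gamma - 1}{\gamma + 1}\right)^k$$
where $x^\star = 0$
[Figure: plot in the $(x_1, x_2)$ plane]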
gradient method is often slow; convergence very dependent on scaling
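The convergence ratio of the quadratic example can be checked numerically. The sketch below assumes the closed-form exact line search step for a quadratic, $t = g^T g / (g^T A g)$ with $A = \mathrm{diag}(1, \gamma)$ and $g = \nabla f(x)$; the value $\gamma = 10$ and the function name are chosen only for illustration.

```python
import math


def exact_line_search_iterates(gamma, num_iters):
    """Gradient method with exact line search on f(x) = 0.5*(x1^2 + gamma*x2^2)."""
    x = [gamma, 1.0]                          # starting point (gamma, 1)
    iterates = [tuple(x)]
    for _ in range(num_iters):
        g = [x[0], gamma * x[1]]              # gradient of f at x
        gg = g[0] ** 2 + g[1] ** 2            # g^T g
        gAg = g[0] ** 2 + gamma * g[1] ** 2   # g^T A g with A = diag(1, gamma)
        t = gg / gAg                          # minimizer of f(x - t*g) over t
        x = [x[0] - t * g[0], x[1] - t * g[1]]
        iterates.append(tuple(x))
    return iterates


gamma = 10.0
its = exact_line_search_iterates(gamma, 5)
norm0 = math.hypot(*its[0])
ratios = [math.hypot(*p) / norm0 for p in its]
predicted = [((gamma - 1) / (gamma + 1)) ** k for k in range(len(its))]
for r, p in zip(ratios, predicted):
    print(f"{r:.6f}  {p:.6f}")
```

The two printed columns should agree to rounding error. As $\gamma$ grows, the ratio $(\gamma - 1)/(\gamma + 1)$ approaches 1, which is the dependence on scaling mentioned above.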
Gradient method 1.3
Nondifferentiable example
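$$f(x) = \sqrt{x_1^2 + \gamma x_2^2} \ \text{ if } |x_2| \le x_1, \qquad f(x) = \frac{x_1 + \gamma |x_2|}{\sqrt{1 + \gamma}} \ \text{ if } |x_2| > x_1$$
with exact line search and starting point $x^{(0)} = (\gamma, 1)$, the method converges to a non-optimal point
[Figure: plot in the $(x_1, x_2)$ plane]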
gradient method does not handle nondifferentiable problems
Gradient method 1.4
First-order methods
address one or both shortcomings of the gradient method
Methods for nondifferentiable or constrained problems
• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods
Methods with improved convergence
• conjugate gradient method
• accelerated gradient method
• quasi-Newton methods
Gradient method 1.5
Convex function
a function $f$ is convex if $\operatorname{dom} f$ is a convex set and Jensen’s inequality holds:
$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) \quad \text{for all } x, y \in \operatorname{dom} f,\ \theta \in [0, 1]$$
First-order condition
for (continuously) differentiable $f$, Jensen’s inequality can be replaced with
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in \operatorname{dom} f$$
Second-order condition
for twice differentiable $f$, Jensen’s inequality can be replaced with
$$\nabla^2 f(x) \succeq 0 \quad \text{for all } x \in \operatorname{dom} f$$
Gradient method 1.6
Strictly convex function
$f$ is strictly convex if $\operatorname{dom} f$ is a convex set and
$$f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y) \quad \text{for all } x, y \in \operatorname{dom} f,\ x \ne y,\ \theta \in (0, 1)$$
strict convexity implies that if a minimizer of $f$ exists, it is unique
First-order condition
for differentiable $f$, strict Jensen’s inequality can be replaced with
$$f(y) > f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in \operatorname{dom} f,\ x \ne y$$
Second-order condition
note that $\nabla^2 f(x) \succ 0$ is not necessary for strict convexity (cf. $f(x) = x^4$)
Gradient method 1.7
Monotonicity of gradient
a differentiable function $f$ is convex if and only if $\operatorname{dom} f$ is convex and
$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge 0 \quad \text{for all } x, y \in \operatorname{dom} f$$
i.e., the gradient $\nabla f : \mathbf{R}^n \to \mathbf{R}^n$ is a monotone mapping
a differentiable function $f$ is strictly convex if and only if $\operatorname{dom} f$ is convex and
$$(\nabla f(x) - \nabla f(y))^T (x - y) > 0 \quad \text{for all } x, y \in \operatorname{dom} f,\ x \ne y$$
i.e., the gradient $\nabla f : \mathbf{R}^n \to \mathbf{R}^n$ is a strictly monotone mapping
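As a small numerical illustration of strict monotonicity, the sketch below uses the scalar function $f(x) = x^4$ from page 1.7. The grid of test points is an arbitrary choice, and a check on finitely many points illustrates the property rather than proving it.

```python
import itertools


def grad_quartic(x):
    """Derivative of f(x) = x**4."""
    return 4.0 * x ** 3


# Grid on [-2, 2] with spacing 0.25; it contains x = 0, where f''(x) = 0.
points = [i / 4.0 for i in range(-8, 9)]

# (f'(x) - f'(y)) * (x - y) for every pair of distinct grid points.
gaps = [
    (grad_quartic(x) - grad_quartic(y)) * (x - y)
    for x, y in itertools.combinations(points, 2)
]

strictly_monotone_on_grid = all(gap > 0 for gap in gaps)
second_derivative_at_zero = 12.0 * 0.0 ** 2   # f''(x) = 12 x^2 evaluated at 0
print(strictly_monotone_on_grid, second_derivative_at_zero)
```

The gradient is strictly monotone on the grid even though the second derivative vanishes at the origin.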
Gradient method 1.8
Proof
• if $f$ is differentiable and convex, then
$$f(y) \ge f(x) + \nabla f(x)^T (y - x), \qquad f(x) \ge f(y) + \nabla f(y)^T (x - y)$$
combining the inequalities gives $(\nabla f(x) - \nabla f(y))^T (x - y) \ge 0$
• if $\nabla f$ is monotone, then $g'(t) \ge g'(0)$ for $t \ge 0$ and $t \in \operatorname{dom} g$, where
$$g(t) = f(x + t(y - x)), \qquad g'(t) = \nabla f(x + t(y - x))^T (y - x)$$
hence
$$f(y) = g(1) = g(0) + \int_0^1 g'(t)\, dt \ge g(0) + g'(0) = f(x) + \nabla f(x)^T (y - x)$$
this is the first-order condition for convexity
Gradient method 1.9
Lipschitz continuous gradient
the gradient of $f$ is Lipschitz continuous with parameter $L > 0$ if
$$\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\| \quad \text{for all } x, y \in \operatorname{dom} f$$
• functions $f$ with this property are also called $L$-smooth
• the definition does not assume convexity of $f$ (and holds for $-f$ if it holds for $f$)
• in the definition, $\|\cdot\|$ and $\|\cdot\|_*$ are a pair of dual norms:
$$\|u\|_* = \sup_{v \ne 0} \frac{u^T v}{\|v\|} = \sup_{\|v\| = 1} u^T v$$
this implies a generalized Cauchy–Schwarz inequality
$$|u^T v| \le \|u\|_* \|v\| \quad \text{for all } u, v$$
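A concrete instance of a dual pair may help. The sketch below assumes $\|\cdot\| = \|\cdot\|_1$, whose dual norm is the standard $\|\cdot\|_\infty$, and checks the generalized Cauchy–Schwarz inequality on random vectors; the dimension, seed, and sample count are arbitrary.

```python
import random


def norm1(v):
    return sum(abs(vi) for vi in v)


def norm_inf(v):
    return max(abs(vi) for vi in v)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


random.seed(0)
n = 5
u = [random.uniform(-1, 1) for _ in range(n)]

# Generalized Cauchy-Schwarz: |u^T v| <= ||u||_inf * ||v||_1.
holds = True
for _ in range(1000):
    v = [random.uniform(-1, 1) for _ in range(n)]
    if abs(dot(u, v)) > norm_inf(u) * norm1(v) + 1e-12:
        holds = False

# The supremum of u^T v over ||v||_1 = 1 is attained at a signed unit vector
# placed on the largest-magnitude entry of u.
i_max = max(range(n), key=lambda i: abs(u[i]))
v_star = [0.0] * n
v_star[i_max] = 1.0 if u[i_max] >= 0 else -1.0
attained = dot(u, v_star)
print(holds, attained, norm_inf(u))
```

The value `attained` equals `norm_inf(u)`, matching the supremum in the definition of the dual norm for this particular pair.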
Gradient method 1.10
Choice of norm
Equivalence of norms
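• for any two norms $\|\cdot\|_a$, $\|\cdot\|_b$, there exist positive constants $c_1$, $c_2$ such that
$$c_1 \|x\|_b \le \|x\|_a \le c_2 \|x\|_b \quad \text{for all } x$$
• constants depend on dimension; for example, for $x \in \mathbf{R}^n$,
$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\, \|x\|_2, \qquad \frac{1}{\sqrt{n}} \|x\|_2 \le \|x\|_\infty \le \|x\|_2$$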
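The two chains of inequalities above are easy to spot-check. The following sketch samples random Gaussian vectors (dimension, seed, and tolerance are arbitrary assumptions) and verifies each inequality up to rounding error.

```python
import math
import random

random.seed(1)
n = 8
tol = 1e-12
all_hold = True
for _ in range(1000):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    n1 = sum(abs(v) for v in x)                 # l1 norm
    n2 = math.sqrt(sum(v * v for v in x))       # l2 norm
    ninf = max(abs(v) for v in x)               # l-infinity norm
    # ||x||_2 <= ||x||_1 <= sqrt(n) * ||x||_2
    all_hold = all_hold and n2 <= n1 + tol and n1 <= math.sqrt(n) * n2 + tol
    # ||x||_2 / sqrt(n) <= ||x||_inf <= ||x||_2
    all_hold = all_hold and n2 / math.sqrt(n) <= ninf + tol and ninf <= n2 + tol
print(all_hold)
```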
Norm in definition of Lipschitz continuity
• without loss of generality we can use the Euclidean norm $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$
• the parameter L depends on choice of norm
• in complexity bounds, choice of norm can simplify dependence on dimensions
Gradient method 1.11
Quadratic upper bound
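suppose $\nabla f$ is Lipschitz continuous with parameter $L$
• this implies (from the generalized Cauchy–Schwarz inequality) that
$$(\nabla f(x) - \nabla f(y))^T (x - y) \le L \|x - y\|^2 \quad \text{for all } x, y \in \operatorname{dom} f \tag{1}$$
• if $\operatorname{dom} f$ is convex, (1) is equivalent to
$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2 \quad \text{for all } x, y \in \operatorname{dom} f \tag{2}$$
[Figure: graph of $f(y)$ with the point $(x, f(x))$ marked]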
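The upper bound (2) can be illustrated on a scalar example. The sketch below assumes the softplus function $f(x) = \log(1 + e^x)$, whose second derivative is at most $1/4$, so $L = 1/4$ is a valid Lipschitz constant for $f'$; the function and the grid are illustrative choices.

```python
import math


def f(x):
    return math.log1p(math.exp(x))        # softplus, log(1 + e^x)


def df(x):
    return 1.0 / (1.0 + math.exp(-x))     # derivative of softplus (sigmoid)


L = 0.25                                  # sup of f''(x) = sigmoid * (1 - sigmoid)
grid = [i / 2.0 for i in range(-10, 11)]  # points in [-5, 5]

# Collect every pair (x, y) that violates the quadratic upper bound (2).
violations = [
    (x, y)
    for x in grid
    for y in grid
    if f(y) > f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2 + 1e-12
]
print(len(violations))
```

No pair on the grid violates the bound, as (2) predicts.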
Gradient method 1.12
Proof (of the equivalence of (1) and (2) if $\operatorname{dom} f$ is convex)
• consider arbitrary $x, y \in \operatorname{dom} f$ and define $g(t) = f(x + t(y - x))$
• $g(t)$ is defined for $t \in [0, 1]$ because $\operatorname{dom} f$ is convex
• if (1) holds, then
$$g'(t) - g'(0) = (\nabla f(x + t(y - x)) - \nabla f(x))^T (y - x) \le t L \|x - y\|^2$$
integrating from $t = 0$ to $t = 1$ gives (2):
$$\begin{aligned}
f(y) = g(1) &= g(0) + \int_0^1 g'(t)\, dt \le g(0) + g'(0) + \frac{L}{2} \|x - y\|^2 \\
&= f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|x - y\|^2
\end{aligned}$$
• conversely, if (2) holds, then (2) and the same inequality with $x$, $y$ switched, i.e.,
$$f(x) \le f(y) + \nabla f(y)^T (x - y) + \frac{L}{2} \|x - y\|^2,$$
can be combined to give $(\nabla f(x) - \nabla f(y))^T (x - y) \le L \|x - y\|^2$
Gradient method 1.13
Consequence of quadratic upper bound
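if $\operatorname{dom} f = \mathbf{R}^n$ and $f$ has a minimizer $x^\star$, then
$$\frac{1}{2L} \|\nabla f(z)\|_*^2 \le f(z) - f(x^\star) \le \frac{L}{2} \|z - x^\star\|^2 \quad \text{for all } z$$
• right-hand inequality follows from upper bound property (2) at $x = x^\star$, $y = z$
• left-hand inequality follows by minimizing quadratic upper bound for $x = z$
$$\begin{aligned}
\inf_y f(y) &\le \inf_y \left( f(z) + \nabla f(z)^T (y - z) + \frac{L}{2} \|y - z\|^2 \right) \\
&= \inf_{\|v\| = 1} \inf_t \left( f(z) + t \nabla f(z)^T v + \frac{L t^2}{2} \right) \\
&= \inf_{\|v\| = 1} \left( f(z) - \frac{1}{2L} (\nabla f(z)^T v)^2 \right) \\
&= f(z) - \frac{1}{2L} \|\nabla f(z)\|_*^2
\end{aligned}$$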
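The two-sided bound can be spot-checked on a scalar function that attains its minimum. The sketch below assumes $f(z) = \log(e^z + e^{-z}) = \log(2\cosh z)$, for which $f'(z) = \tanh z$, $f''(z) \le 1$ (so $L = 1$), and the minimizer is $x^\star = 0$; the function and the grid are illustrative choices.

```python
import math


def f(z):
    return math.log(2.0 * math.cosh(z))


def df(z):
    return math.tanh(z)


L = 1.0                 # f''(z) = 1 - tanh(z)^2 <= 1
x_star = 0.0            # minimizer, since f'(0) = 0 and f is convex
f_star = f(x_star)

grid = [i / 4.0 for i in range(-20, 21)]   # points in [-5, 5]
sandwich_holds = all(
    df(z) ** 2 / (2.0 * L) <= f(z) - f_star + 1e-12
    and f(z) - f_star <= 0.5 * L * (z - x_star) ** 2 + 1e-12
    for z in grid
)
print(sandwich_holds)
```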
Gradient method 1.14
Co-coercivity of gradient
if $f$ is convex with $\operatorname{dom} f = \mathbf{R}^n$ and $\nabla f$ is $L$-Lipschitz continuous, then
$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_*^2 \quad \text{for all } x, y$$
• this property is known as co-coercivity of $\nabla f$ (with parameter $1/L$)
• co-coercivity in turn implies Lipschitz continuity of $\nabla f$ (by Cauchy–Schwarz)
• hence, for differentiable convex $f$ with $\operatorname{dom} f = \mathbf{R}^n$
Lipschitz continuity of $\nabla f$ ⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of $\nabla f$
⇒ Lipschitz continuity of $\nabla f$
therefore the three properties are equivalent
Gradient method 1.15
Proof of co-coercivity: define two convex functions $f_x$, $f_y$ with domain $\mathbf{R}^n$
$$f_x(z) = f(z) - \nabla f(x)^T z, \qquad f_y(z) = f(z) - \nabla f(y)^T z$$
• the two functions have $L$-Lipschitz continuous gradients
• $z = x$ minimizes $f_x(z)$; from the left-hand inequality on page 1.14,
$$\begin{aligned}
f(y) - f(x) - \nabla f(x)^T (y - x) &= f_x(y) - f_x(x) \\
&\ge \frac{1}{2L} \|\nabla f_x(y)\|_*^2 \\
&= \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_*^2
\end{aligned}$$
• similarly, $z = y$ minimizes $f_y(z)$; therefore
$$f(x) - f(y) - \nabla f(y)^T (x - y) \ge \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_*^2$$
combining the two inequalities shows co-coercivity
Gradient method 1.16
Lipschitz continuity with respect to Euclidean norm
suppose $f$ is convex with $\operatorname{dom} f = \mathbf{R}^n$, and $L$-smooth for the Euclidean norm:
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y$$
• the equivalent property (1) states that
$$(\nabla f(x) - \nabla f(y))^T (x - y) \le L (x - y)^T (x - y) \quad \text{for all } x, y$$
• this is monotonicity of $Lx - \nabla f(x)$, i.e., equivalent to the property that
$$\frac{L}{2} \|x\|_2^2 - f(x) \quad \text{is a convex function}$$
• if $f$ is twice differentiable, the Hessian of this function is $LI - \nabla^2 f(x)$:
$$\lambda_{\max}(\nabla^2 f(x)) \le L \quad \text{for all } x$$
is an equivalent characterization of $L$-smoothness
Gradient method 1.17
Strongly convex function
$f$ is strongly convex with parameter $m > 0$ if $\operatorname{dom} f$ is convex and
$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) - \frac{m}{2} \theta (1 - \theta) \|x - y\|^2$$
holds for all $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$
• this is a stronger version of Jensen’s inequality
• it holds if and only if it holds for $f$ restricted to arbitrary lines:
$$f(x + t(y - x)) - \frac{m}{2} t^2 \|x - y\|^2 \tag{3}$$
is a convex function of $t$, for all $x, y \in \operatorname{dom} f$
• without loss of generality, we can take $\|\cdot\| = \|\cdot\|_2$
• however, the strong convexity parameter $m$ depends on the norm used
Gradient method 1.18
Quadratic lower bound
if $f$ is differentiable and $m$-strongly convex, then
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|^2 \quad \text{for all } x, y \in \operatorname{dom} f \tag{4}$$
[Figure: graph of $f(y)$ with the point $(x, f(x))$ marked]
• follows from the first-order condition for convexity of (3)
• this implies that the sublevel sets of $f$ are bounded
• if $f$ is closed (has closed sublevel sets), it has a unique minimizer $x^\star$ and
$$\frac{m}{2} \|z - x^\star\|^2 \le f(z) - f(x^\star) \le \frac{1}{2m} \|\nabla f(z)\|_*^2 \quad \text{for all } z \in \operatorname{dom} f$$
(proof as on page 1.14)
Gradient method 1.19
Strong monotonicity
differentiable $f$ is strongly convex if and only if $\operatorname{dom} f$ is convex and
$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge m \|x - y\|^2 \quad \text{for all } x, y \in \operatorname{dom} f$$
this is called strong monotonicity (coercivity) of $\nabla f$
Proof
• one direction follows from (4) and the same inequality with $x$ and $y$ switched
• for the other direction, assume $\nabla f$ is strongly monotone and define
$$g(t) = f(x + t(y - x)) - \frac{m}{2} t^2 \|x - y\|^2$$
then $g'(t)$ is nondecreasing, so $g$ is convex
Gradient method 1.20
Strong convexity with respect to Euclidean norm
suppose $f$ is $m$-strongly convex for the Euclidean norm:
$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y) - \frac{m}{2} \theta (1 - \theta) \|x - y\|_2^2$$
for $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$
• this is Jensen’s inequality for the function
$$h(x) = f(x) - \frac{m}{2} \|x\|_2^2$$
• therefore $f$ is strongly convex if and only if $h$ is convex
• if $f$ is twice differentiable, $h$ is convex if and only if $\nabla^2 f(x) - mI \succeq 0$, or
$$\lambda_{\min}(\nabla^2 f(x)) \ge m \quad \text{for all } x \in \operatorname{dom} f$$
Gradient method 1.21
Extension of co-coercivity
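suppose $f$ is $m$-strongly convex and $L$-smooth for $\|\cdot\|_2$, and $\operatorname{dom} f = \mathbf{R}^n$
• then the function
$$h(x) = f(x) - \frac{m}{2} \|x\|_2^2$$
is convex and $(L - m)$-smooth:
$$\begin{aligned}
0 &\le (\nabla h(x) - \nabla h(y))^T (x - y) \\
&= (\nabla f(x) - \nabla f(y))^T (x - y) - m \|x - y\|_2^2 \\
&\le (L - m) \|x - y\|_2^2
\end{aligned}$$
• co-coercivity of $\nabla h$ can be written as
$$(\nabla f(x) - \nabla f(y))^T (x - y) \ge \frac{mL}{m + L} \|x - y\|_2^2 + \frac{1}{m + L} \|\nabla f(x) - \nabla f(y)\|_2^2$$
for all $x, y \in \operatorname{dom} f$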
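The extended co-coercivity inequality can be checked on a quadratic. The sketch below assumes $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 3, 10)$, so that $m = 1$ and $L = 10$ by the eigenvalue characterizations above; the matrix, seed, and sampling range are illustrative choices.

```python
import random

eigs = [1.0, 3.0, 10.0]            # Hessian is diag(eigs)
m, L = min(eigs), max(eigs)        # strong convexity and smoothness parameters


def grad(x):
    return [a * xi for a, xi in zip(eigs, x)]


random.seed(2)
min_slack = float("inf")           # smallest observed value of lhs - rhs
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in eigs]
    y = [random.uniform(-5, 5) for _ in eigs]
    d = [xi - yi for xi, yi in zip(x, y)]
    dg = [gx - gy for gx, gy in zip(grad(x), grad(y))]
    lhs = sum(a * b for a, b in zip(dg, d))
    rhs = (m * L * sum(v * v for v in d) + sum(v * v for v in dg)) / (m + L)
    min_slack = min(min_slack, lhs - rhs)
print(min_slack)
```

For this diagonal example the slack comes entirely from the middle eigenvalue: components along the eigenvalues $m$ and $L$ satisfy the inequality with equality.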
Gradient method 1.22
Analysis of gradient method
$$x_{k+1} = x_k - t_k \nabla f(x_k), \qquad k = 0, 1, \ldots$$
with fixed step size or backtracking line search
Assumptions
1. $f$ is convex and differentiable with $\operatorname{dom} f = \mathbf{R}^n$
2. $\nabla f(x)$ is $L$-Lipschitz continuous with respect to the Euclidean norm, with $L > 0$
3. optimal value $f^\star = \inf_x f(x)$ is finite and attained at $x^\star$
Gradient method 1.23
Basic gradient step
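• from quadratic upper bound (page 1.12) with $y = x - t \nabla f(x)$:
$$f(x - t \nabla f(x)) \le f(x) - t \left(1 - \frac{Lt}{2}\right) \|\nabla f(x)\|_2^2$$
• therefore, if $x^+ = x - t \nabla f(x)$ and $0 < t \le 1/L$,
$$f(x^+) \le f(x) - \frac{t}{2} \|\nabla f(x)\|_2^2 \tag{5}$$
• from (5) and convexity of $f$,
$$\begin{aligned}
f(x^+) - f^\star &\le \nabla f(x)^T (x - x^\star) - \frac{t}{2} \|\nabla f(x)\|_2^2 \\
&= \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \|x - x^\star - t \nabla f(x)\|_2^2 \right) \\
&= \frac{1}{2t} \left( \|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2 \right)
\end{aligned} \tag{6}$$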
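Inequalities (5) and (6) can be spot-checked for a single gradient step. The sketch below assumes the test function $f(x) = \log(2\cosh x_1) + \frac{1}{2} x_2^2$, which is convex with $L = 1$ for the Euclidean norm and is minimized at $x^\star = 0$; the function, seed, and sampling ranges are illustrative choices.

```python
import math
import random


def f(x):
    return math.log(2.0 * math.cosh(x[0])) + 0.5 * x[1] ** 2


def grad(x):
    return [math.tanh(x[0]), x[1]]


def sqnorm(v):
    return sum(vi * vi for vi in v)


L = 1.0                            # Hessian is diag(1 - tanh(x1)^2, 1), at most I
x_star = [0.0, 0.0]
f_star = f(x_star)

random.seed(3)
ok5 = ok6 = True
for _ in range(1000):
    x = [random.uniform(-3, 3), random.uniform(-3, 3)]
    t = random.uniform(0.01, 1.0 / L)          # any step with 0 < t <= 1/L
    g = grad(x)
    x_plus = [xi - t * gi for xi, gi in zip(x, g)]
    # (5): f(x+) <= f(x) - (t/2) * ||grad f(x)||^2
    ok5 = ok5 and f(x_plus) <= f(x) - 0.5 * t * sqnorm(g) + 1e-12
    # (6): f(x+) - f* <= (||x - x*||^2 - ||x+ - x*||^2) / (2t)
    d = [xi - si for xi, si in zip(x, x_star)]
    d_plus = [xi - si for xi, si in zip(x_plus, x_star)]
    ok6 = ok6 and f(x_plus) - f_star <= (sqnorm(d) - sqnorm(d_plus)) / (2 * t) + 1e-12
print(ok5, ok6)
```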
Gradient method 1.24
Descent properties
assume $\nabla f(x) \ne 0$
• the inequality (5) shows that
$$f(x^+) < f(x)$$
• the inequality (6) shows that
$$\|x^+ - x^\star\|_2 < \|x - x^\star\|_2$$
in the gradient method, function value and distance to the optimal set decrease
Gradient method 1.25
Gradient method with constant step size
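$$x_{k+1} = x_k - t \nabla f(x_k), \qquad k = 0, 1, \ldots$$
• take $x = x_{i-1}$, $x^+ = x_i$ in (6) and add the bounds for $i = 1, \ldots, k$:
$$\begin{aligned}
\sum_{i=1}^k (f(x_i) - f^\star) &\le \frac{1}{2t} \sum_{i=1}^k \left( \|x_{i-1} - x^\star\|_2^2 - \|x_i - x^\star\|_2^2 \right) \\
&= \frac{1}{2t} \left( \|x_0 - x^\star\|_2^2 - \|x_k - x^\star\|_2^2 \right) \\
&\le \frac{1}{2t} \|x_0 - x^\star\|_2^2
\end{aligned}$$
• since $f(x_i)$ is non-increasing (see (5))
$$f(x_k) - f^\star \le \frac{1}{k} \sum_{i=1}^k (f(x_i) - f^\star) \le \frac{1}{2kt} \|x_0 - x^\star\|_2^2$$
Conclusion: number of iterations to reach $f(x_k) - f^\star \le \epsilon$ is $O(1/\epsilon)$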
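The $1/k$ bound can be observed on a small example. The sketch below runs the constant-step method on the assumed test function $f(x) = \log(2\cosh x_1) + \frac{1}{2} x_2^2$ (convex, $L = 1$, $x^\star = 0$) with step $t = 0.5 \le 1/L$; the function, starting point, and iteration count are illustrative choices.

```python
import math


def f(x):
    return math.log(2.0 * math.cosh(x[0])) + 0.5 * x[1] ** 2


def grad(x):
    return [math.tanh(x[0]), x[1]]


L = 1.0
t = 0.5 / L                        # constant step size, satisfies 0 < t <= 1/L
x_star = [0.0, 0.0]
f_star = f(x_star)

x = [3.0, -2.0]                    # starting point x_0
r0_sq = sum((xi - si) ** 2 for xi, si in zip(x, x_star))

gaps, bounds = [], []
for k in range(1, 51):
    g = grad(x)
    x = [xi - t * gi for xi, gi in zip(x, g)]
    gaps.append(f(x) - f_star)                 # f(x_k) - f*
    bounds.append(r0_sq / (2.0 * k * t))       # ||x_0 - x*||^2 / (2 k t)

bound_holds = all(gap <= b + 1e-12 for gap, b in zip(gaps, bounds))
monotone = all(gaps[i + 1] <= gaps[i] + 1e-12 for i in range(len(gaps) - 1))
print(bound_holds, monotone)
```

On this example the observed gap decreases much faster than the bound, which is a worst-case guarantee over the whole problem class.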
Gradient method 1.26
Backtracking line search
initialize $t_k$ at $\hat{t} > 0$ (for example, $\hat{t} = 1$) and take $t_k := \beta t_k$ until
$$f(x_k - t_k \nabla f(x_k)) < f(x_k) - \alpha t_k \|\nabla f(x_k)\|_2^2$$
[Figure: $f(x_k - t \nabla f(x_k))$, $f(x_k) - \alpha t \|\nabla f(x_k)\|_2^2$, and $f(x_k) - t \|\nabla f(x_k)\|_2^2$ plotted as functions of $t$]
$0 < \beta < 1$; we will take $\alpha = 1/2$ (mostly to simplify proofs)
Gradient method 1.27
Analysis for backtracking line search
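line search with $\alpha = 1/2$, if $f$ has a Lipschitz continuous gradient
[Figure: $f(x_k) - t \left(1 - \frac{tL}{2}\right) \|\nabla f(x_k)\|_2^2$, $f(x_k) - \frac{t}{2} \|\nabla f(x_k)\|_2^2$, and $f(x_k - t \nabla f(x_k))$ plotted as functions of $t$, with $t = 1/L$ marked]
selected step size satisfies $t_k \ge t_{\min} = \min\{\hat{t}, \beta / L\}$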
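The backtracking rule and the lower bound on the selected step size can be sketched as follows. The function name `backtracking_step`, the `max_backtracks` safeguard, and the quadratic test problem $f(x) = \frac{1}{2}(x_1^2 + 10 x_2^2)$ with $L = 10$ are assumptions added for illustration; they are not part of the method as stated above.

```python
def backtracking_step(f, grad, x, t_hat=1.0, alpha=0.5, beta=0.5, max_backtracks=100):
    """Return a step t with f(x - t*g) < f(x) - alpha * t * ||g||_2^2."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    if g_sq == 0.0:
        return t_hat               # stationary point: the step leaves x unchanged
    fx = f(x)
    t = t_hat
    for _ in range(max_backtracks):            # safeguard against rounding issues
        trial = [xi - t * gi for xi, gi in zip(x, g)]
        if f(trial) < fx - alpha * t * g_sq:
            break
        t *= beta                              # shrink the step and try again
    return t


# Hypothetical test problem with L = 10.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [x[0], 10.0 * x[1]]


L, t_hat, beta = 10.0, 1.0, 0.5
t_min = min(t_hat, beta / L)

x = [10.0, 1.0]
steps = []
for _ in range(20):
    t = backtracking_step(f, grad, x, t_hat=t_hat, alpha=0.5, beta=beta)
    steps.append(t)
    x = [xi - t * gi for xi, gi in zip(x, grad(x))]
print(steps[0], min(steps), t_min)
```

At the starting point the candidates 1, 0.5, and 0.25 are rejected and 0.125 is accepted; every selected step stays at or above $t_{\min} = 0.05$.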
Gradient method 1.28
Gradient method with backtracking line search
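• from line search condition and convexity of $f$,
$$\begin{aligned}
f(x_{i+1}) &\le f(x_i) - \frac{t_i}{2} \|\nabla f(x_i)\|_2^2 \\
&\le f^\star + \nabla f(x_i)^T (x_i - x^\star) - \frac{t_i}{2} \|\nabla f(x_i)\|_2^2 \\
&= f^\star + \frac{1}{2 t_i} \left( \|x_i - x^\star\|_2^2 - \|x_{i+1} - x^\star\|_2^2 \right)
\end{aligned}$$
• this implies $\|x_{i+1} - x^\star\|_2 \le \|x_i - x^\star\|_2$, so we can replace $t_i$ with $t_{\min} \le t_i$:
$$f(x_{i+1}) - f^\star \le \frac{1}{2 t_{\min}} \left( \|x_i - x^\star\|_2^2 - \|x_{i+1} - x^\star\|_2^2 \right)$$
• adding the upper bounds gives the same $1/k$ bound as with constant step size
$$f(x_k) - f^\star \le \frac{1}{k} \sum_{i=1}^k (f(x_i) - f^\star) \le \frac{1}{2 k t_{\min}} \|x_0 - x^\star\|_2^2$$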
Gradient method 1.29
Gradient method for strongly convex functions
better results exist if we add strong convexity to the assumptions on p. 1.23
Analysis for constant step size
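if $x^+ = x - t \nabla f(x)$ and $0 < t \le 2/(m + L)$:
$$\begin{aligned}
\|x^+ - x^\star\|_2^2 &= \|x - t \nabla f(x) - x^\star\|_2^2 \\
&= \|x - x^\star\|_2^2 - 2t \nabla f(x)^T (x - x^\star) + t^2 \|\nabla f(x)\|_2^2 \\
&\le \left(1 - t \frac{2mL}{m + L}\right) \|x - x^\star\|_2^2 + t \left(t - \frac{2}{m + L}\right) \|\nabla f(x)\|_2^2 \\
&\le \left(1 - t \frac{2mL}{m + L}\right) \|x - x^\star\|_2^2
\end{aligned}$$
(step 3 follows from the result on page 1.22 with $y = x^\star$, using $\nabla f(x^\star) = 0$)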
Gradient method 1.30
Distance to optimum
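$$\|x_k - x^\star\|_2^2 \le c^k \|x_0 - x^\star\|_2^2, \qquad c = 1 - t \frac{2mL}{m + L}$$
• implies (linear) convergence
• for $t = 2/(m + L)$, get $c = \left(\frac{\gamma - 1}{\gamma + 1}\right)^2$ with $\gamma = L/m$
Bound on function value (from page 1.14)
$$f(x_k) - f^\star \le \frac{L}{2} \|x_k - x^\star\|_2^2 \le \frac{c^k L}{2} \|x_0 - x^\star\|_2^2$$
Conclusion: number of iterations to reach $f(x_k) - f^\star \le \epsilon$ is $O(\log(1/\epsilon))$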
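The linear rate can be observed numerically. The sketch below assumes the quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 3, 10)$, so $m = 1$, $L = 10$, $x^\star = 0$, $f^\star = 0$, and uses the step $t = 2/(m + L)$; the matrix, starting point, and the small relative tolerance for rounding are illustrative choices.

```python
eigs = [1.0, 3.0, 10.0]            # Hessian is diag(eigs)
m, L = min(eigs), max(eigs)
gamma = L / m
t = 2.0 / (m + L)                  # step size from the analysis above
c = ((gamma - 1.0) / (gamma + 1.0)) ** 2


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))


def grad(x):
    return [a * xi for a, xi in zip(eigs, x)]


x0 = [1.0, -2.0, 0.5]
r0_sq = sum(v * v for v in x0)     # ||x_0 - x*||^2 with x* = 0

x = list(x0)
dist_ok = value_ok = True
for k in range(1, 41):
    x = [xi - t * gi for xi, gi in zip(x, grad(x))]
    dist_sq = sum(v * v for v in x)
    # ||x_k - x*||^2 <= c^k ||x_0 - x*||^2
    dist_ok = dist_ok and dist_sq <= c ** k * r0_sq * (1 + 1e-9)
    # f(x_k) - f* <= (c^k L / 2) ||x_0 - x*||^2
    value_ok = value_ok and f(x) <= 0.5 * L * c ** k * r0_sq * (1 + 1e-9)
print(dist_ok, value_ok, c)
```

With $\gamma = 10$ the contraction factor is $c = (9/11)^2 \approx 0.67$ per iteration, so the error bound shrinks geometrically rather than like $1/k$.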
Gradient method 1.31
Limits on convergence rate of first-order methods
First-order method: any iterative algorithm that selects $x_{k+1}$ in the set
$$x_0 + \operatorname{span}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\}$$
Problem class: any function that satisfies the assumptions on page 1.23
Theorem (Nesterov): for every integer $k \le (n - 1)/2$ and every $x_0$, there exist
functions in the problem class such that for any first-order method
$$f(x_k) - f^\star \ge \frac{3}{32} \frac{L \|x_0 - x^\star\|_2^2}{(k + 1)^2}$$
• suggests $1/k$ rate for gradient method is not optimal
• more recent accelerated gradient methods have $1/k^2$ convergence (see later)
Gradient method 1.32
References
• A. Beck, First-Order Methods in Optimization (2017), chapter 5.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result
on page 1.32 is Theorem 2.1.7 in the book.)
• B. T. Polyak, Introduction to Optimization (1987), section 1.4.
• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization
and Polynomial Problems (1998), page 37.
Gradient method 1.33