L. Vandenberghe ECE236C (Spring 2022)
2. Subgradients
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
2.1
Basic inequality
recall the basic inequality for differentiable convex functions:
𝑓 (𝑦) ≥ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 (𝑦 − 𝑥) for all 𝑦 ∈ dom 𝑓
[figure: graph of 𝑓 with its first-order approximation at 𝑥; the vector (∇𝑓(𝑥), −1) is normal to the supporting hyperplane at (𝑥, 𝑓(𝑥))]
• the first-order approximation of 𝑓 at 𝑥 is a global lower bound
• ∇𝑓(𝑥) defines a non-vertical supporting hyperplane to the epigraph of 𝑓 at (𝑥, 𝑓(𝑥)):

  (∇𝑓(𝑥), −1)𝑇 ((𝑦, 𝑡) − (𝑥, 𝑓(𝑥))) ≤ 0 for all (𝑦, 𝑡) ∈ epi 𝑓
Subgradients 2.2
Subgradient
𝑔 is a subgradient of a convex function 𝑓 at 𝑥 ∈ dom 𝑓 if
𝑓 (𝑦) ≥ 𝑓 (𝑥) + 𝑔𝑇 (𝑦 − 𝑥) for all 𝑦 ∈ dom 𝑓
[figure: graph of 𝑓(𝑦) with the affine lower bounds 𝑓(𝑥1) + 𝑔1𝑇(𝑦 − 𝑥1), 𝑓(𝑥1) + 𝑔2𝑇(𝑦 − 𝑥1), and 𝑓(𝑥2) + 𝑔3𝑇(𝑦 − 𝑥2)]
𝑔1, 𝑔2 are subgradients at 𝑥1; 𝑔3 is a subgradient at 𝑥2
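A quick numerical sanity check of the definition (a minimal sketch in Python; the choice 𝑓(𝑥) = |𝑥|, 𝑥 = 0, 𝑔 = 0.3 is ours):

```python
import numpy as np

# check the subgradient inequality f(y) >= f(x) + g*(y - x) on a grid,
# for the concrete instance f(x) = |x|, x = 0, g = 0.3 (any g in [-1, 1] works)
f = np.abs
x, g = 0.0, 0.3
y = np.linspace(-2.0, 2.0, 401)
assert np.all(f(y) >= f(x) + g * (y - x))
```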
Subgradients 2.3
Subdifferential
the subdifferential 𝜕 𝑓 (𝑥) of 𝑓 at 𝑥 is the set of all subgradients:
𝜕 𝑓 (𝑥) = {𝑔 | 𝑔𝑇 (𝑦 − 𝑥) ≤ 𝑓 (𝑦) − 𝑓 (𝑥), ∀𝑦 ∈ dom 𝑓 }
Properties
• 𝜕 𝑓 (𝑥) is a closed convex set (possibly empty)
this follows from the definition: 𝜕 𝑓 (𝑥) is an intersection of halfspaces
• if 𝑥 ∈ int dom 𝑓 then 𝜕 𝑓 (𝑥) is nonempty and bounded
proof on next two pages
Subgradients 2.4
Proof: we show that 𝜕 𝑓 (𝑥) is nonempty when 𝑥 ∈ int dom 𝑓
• (𝑥, 𝑓 (𝑥)) is in the boundary of the convex set epi 𝑓
• therefore there exists a supporting hyperplane to epi 𝑓 at (𝑥, 𝑓 (𝑥)) :
  ∃(𝑎, 𝑏) ≠ 0:   (𝑎, 𝑏)𝑇 ((𝑦, 𝑡) − (𝑥, 𝑓(𝑥))) ≤ 0 for all (𝑦, 𝑡) ∈ epi 𝑓
• 𝑏 > 0 gives a contradiction as 𝑡 → ∞
• 𝑏 = 0 gives a contradiction for 𝑦 = 𝑥 + 𝜖 𝑎 with small 𝜖 > 0
• therefore 𝑏 < 0, and 𝑔 = (1/|𝑏|)𝑎 is a subgradient of 𝑓 at 𝑥
Subgradients 2.5
Proof: 𝜕 𝑓 (𝑥) is bounded when 𝑥 ∈ int dom 𝑓
• for small 𝑟 > 0, define a set of 2𝑛 points
𝐵 = {𝑥 ± 𝑟𝑒 𝑘 | 𝑘 = 1, . . . , 𝑛} ⊂ dom 𝑓
and define 𝑀 = max_{𝑦∈𝐵} 𝑓(𝑦) < ∞
• for every 𝑔 ∈ 𝜕 𝑓 (𝑥) , there is a point 𝑦 ∈ 𝐵 with
  𝑟 ‖𝑔‖∞ = 𝑔𝑇 (𝑦 − 𝑥)
(choose an index 𝑘 with |𝑔𝑘| = ‖𝑔‖∞, and take 𝑦 = 𝑥 + 𝑟 sign(𝑔𝑘)𝑒𝑘)
• since 𝑔 is a subgradient, this implies that
  𝑓(𝑥) + 𝑟 ‖𝑔‖∞ = 𝑓(𝑥) + 𝑔𝑇 (𝑦 − 𝑥) ≤ 𝑓(𝑦) ≤ 𝑀
• we conclude that 𝜕 𝑓 (𝑥) is bounded:
  ‖𝑔‖∞ ≤ (𝑀 − 𝑓(𝑥))/𝑟 for all 𝑔 ∈ 𝜕𝑓(𝑥)
Subgradients 2.6
Example
𝑓 (𝑥) = max { 𝑓1 (𝑥), 𝑓2 (𝑥)} with 𝑓1, 𝑓2 convex and differentiable
[figure: graphs of 𝑓1(𝑦), 𝑓2(𝑦), and their pointwise maximum 𝑓(𝑦)]
• if 𝑓1(𝑥̂) = 𝑓2(𝑥̂), the subdifferential at 𝑥̂ is the line segment [∇𝑓1(𝑥̂), ∇𝑓2(𝑥̂)]
• if 𝑓1(𝑥̂) > 𝑓2(𝑥̂), the subdifferential at 𝑥̂ is {∇𝑓1(𝑥̂)}
• if 𝑓1(𝑥̂) < 𝑓2(𝑥̂), the subdifferential at 𝑥̂ is {∇𝑓2(𝑥̂)}
Subgradients 2.7
Examples
Absolute value: 𝑓(𝑥) = |𝑥|
[figure: 𝑓(𝑥) = |𝑥| and its subdifferential 𝜕𝑓(𝑥) as a function of 𝑥]
Euclidean norm: 𝑓(𝑥) = ‖𝑥‖2
  𝜕𝑓(𝑥) = {𝑥/‖𝑥‖2} if 𝑥 ≠ 0,   𝜕𝑓(𝑥) = {𝑔 | ‖𝑔‖2 ≤ 1} if 𝑥 = 0
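In code, the weak rule for the Euclidean norm is one line per case (a sketch; the name subgrad_l2norm is ours):

```python
import numpy as np

# one subgradient of f(x) = ||x||_2: x / ||x||_2 if x != 0;
# at x = 0 any g with ||g||_2 <= 1 is valid, and we return g = 0
def subgrad_l2norm(x):
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else np.zeros_like(x)
```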
Subgradients 2.8
Monotonicity
the subdifferential of a convex function is a monotone operator:
(𝑢 − 𝑣)𝑇 (𝑥 − 𝑦) ≥ 0 for all 𝑥 , 𝑦 , 𝑢 ∈ 𝜕 𝑓 (𝑥) , 𝑣 ∈ 𝜕 𝑓 (𝑦)
Proof: by definition
𝑓 (𝑦) ≥ 𝑓 (𝑥) + 𝑢𝑇 (𝑦 − 𝑥), 𝑓 (𝑥) ≥ 𝑓 (𝑦) + 𝑣𝑇 (𝑥 − 𝑦)
combining the two inequalities shows monotonicity
Subgradients 2.9
Examples of non-subdifferentiable functions
the following functions are not subdifferentiable at 𝑥 = 0
• 𝑓 : R → R, dom 𝑓 = R+
𝑓 (𝑥) = 1 if 𝑥 = 0, 𝑓 (𝑥) = 0 if 𝑥 > 0
• 𝑓 : R → R, dom 𝑓 = R+
𝑓(𝑥) = −√𝑥
the only supporting hyperplane to epi 𝑓 at (0, 𝑓 (0)) is vertical
Subgradients 2.10
Subgradients and sublevel sets
if 𝑔 is a subgradient of 𝑓 at 𝑥 , then
𝑓 (𝑦) ≤ 𝑓 (𝑥) =⇒ 𝑔𝑇 (𝑦 − 𝑥) ≤ 0
[figure: the sublevel set {𝑦 | 𝑓(𝑦) ≤ 𝑓(𝑥)} and a supporting hyperplane at 𝑥]
the nonzero subgradients at 𝑥 define supporting hyperplanes to the sublevel set
{𝑦 | 𝑓 (𝑦) ≤ 𝑓 (𝑥)}
Subgradients 2.11
Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Subgradient calculus
Weak subgradient calculus: rules for finding one subgradient
• sufficient for most nondifferentiable convex optimization algorithms
• if you can evaluate 𝑓 (𝑥) , you can usually compute a subgradient
Strong subgradient calculus: rules for finding 𝜕 𝑓 (𝑥) (all subgradients)
• some algorithms, optimality conditions, etc., need entire subdifferential
• can be quite complicated
we will assume that 𝑥 ∈ int dom 𝑓
Subgradients 2.12
Basic rules
Differentiable functions: 𝜕 𝑓 (𝑥) = {∇ 𝑓 (𝑥)} if 𝑓 is differentiable at 𝑥
Nonnegative linear combination
if 𝑓 (𝑥) = 𝛼1 𝑓1 (𝑥) + 𝛼2 𝑓2 (𝑥) with 𝛼1, 𝛼2 ≥ 0, then
𝜕 𝑓 (𝑥) = 𝛼1 𝜕 𝑓1 (𝑥) + 𝛼2 𝜕 𝑓2 (𝑥)
(right-hand side is addition of sets)
Affine transformation of variables: if 𝑓 (𝑥) = ℎ( 𝐴𝑥 + 𝑏) , then
𝜕 𝑓 (𝑥) = 𝐴𝑇 𝜕ℎ( 𝐴𝑥 + 𝑏)
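A minimal sketch of the affine rule, with the concrete choice ℎ(𝑧) = ‖𝑧‖2 (the function name is ours):

```python
import numpy as np

# if f(x) = h(Ax + b) with h(z) = ||z||_2, then A^T g_h is a subgradient of f
# at x whenever g_h is a subgradient of h at z = Ax + b
def subgrad_affine_l2(A, b, x):
    z = A @ x + b
    nrm = np.linalg.norm(z)
    g_h = z / nrm if nrm > 0 else np.zeros_like(z)   # a subgradient of h at z
    return A.T @ g_h                                 # a subgradient of f at x
```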
Subgradients 2.13
Pointwise maximum
𝑓 (𝑥) = max { 𝑓1 (𝑥), . . . , 𝑓𝑚 (𝑥)}
define 𝐼 (𝑥) = {𝑖 | 𝑓𝑖 (𝑥) = 𝑓 (𝑥)}, the ‘active’ functions at 𝑥
Weak result
to compute a subgradient at 𝑥 , choose any 𝑘 ∈ 𝐼 (𝑥) , any subgradient of 𝑓 𝑘 at 𝑥
Strong result
  𝜕𝑓(𝑥) = conv ⋃_{𝑖∈𝐼(𝑥)} 𝜕𝑓𝑖(𝑥)
• the convex hull of the union of subdifferentials of ‘active’ functions at 𝑥
• if 𝑓𝑖 ’s are differentiable, 𝜕 𝑓 (𝑥) = conv {∇ 𝑓𝑖 (𝑥) | 𝑖 ∈ 𝐼 (𝑥)}
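The weak result translates directly into code (a sketch; fs and subgrads are hypothetical lists of callables returning 𝑓𝑖(𝑥) and a subgradient of 𝑓𝑖 at 𝑥):

```python
import numpy as np

# weak rule for f(x) = max{f_1(x), ..., f_m(x)}: pick any index attaining the
# maximum and return a subgradient of that active function at x
def subgrad_pointwise_max(fs, subgrads, x):
    vals = np.array([fi(x) for fi in fs])
    k = int(np.argmax(vals))          # some k in I(x)
    return subgrads[k](x)
```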
Subgradients 2.14
Example: piecewise-linear function
  𝑓(𝑥) = max_{𝑖=1,...,𝑚} (𝑎𝑖𝑇 𝑥 + 𝑏𝑖)
[figure: a piecewise-linear 𝑓(𝑥) as the maximum of affine functions 𝑎𝑖𝑇 𝑥 + 𝑏𝑖]
the subdifferential at 𝑥 is a polyhedron
𝜕 𝑓 (𝑥) = conv {𝑎𝑖 | 𝑖 ∈ 𝐼 (𝑥)}
with 𝐼(𝑥) = {𝑖 | 𝑎𝑖𝑇 𝑥 + 𝑏𝑖 = 𝑓(𝑥)}
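For the piecewise-linear case the weak rule is especially simple (a sketch; the function name is ours):

```python
import numpy as np

# for f(x) = max_i (a_i^T x + b_i), any row a_i attaining the maximum
# is a subgradient of f at x (a vertex of conv{a_i | i in I(x)})
def subgrad_piecewise_linear(A, b, x):
    return A[int(np.argmax(A @ x + b))]
```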
Subgradients 2.15
Example: ℓ1-norm
  𝑓(𝑥) = ‖𝑥‖1 = max_{𝑠∈{−1,1}𝑛} 𝑠𝑇 𝑥
the subdifferential is a product of intervals 𝜕𝑓(𝑥) = 𝐽1 × · · · × 𝐽𝑛, with
  𝐽𝑘 = [−1, 1] if 𝑥𝑘 = 0,   𝐽𝑘 = {1} if 𝑥𝑘 > 0,   𝐽𝑘 = {−1} if 𝑥𝑘 < 0
[figure: three examples in R2]
  𝜕𝑓(0, 0) = [−1, 1] × [−1, 1],   𝜕𝑓(1, 0) = {1} × [−1, 1],   𝜕𝑓(1, 1) = {(1, 1)}
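A one-line weak rule follows from the product structure (a sketch; np.sign picks the valid choice 𝑔𝑘 = 0 whenever 𝑥𝑘 = 0):

```python
import numpy as np

# one subgradient of f(x) = ||x||_1: g_k = sign(x_k) if x_k != 0,
# and any value in [-1, 1] if x_k = 0 (np.sign returns 0 there)
def subgrad_l1norm(x):
    return np.sign(x)
```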
Subgradients 2.16
Pointwise supremum
  𝑓(𝑥) = sup_{𝛼∈A} 𝑓𝛼(𝑥),   𝑓𝛼(𝑥) convex in 𝑥 for every 𝛼
Weak result: to find a subgradient at 𝑥̂,
• find any 𝛽 for which 𝑓(𝑥̂) = 𝑓𝛽(𝑥̂) (assuming the maximum is attained)
• choose any 𝑔 ∈ 𝜕𝑓𝛽(𝑥̂)
(Partial) strong result: define 𝐼 (𝑥) = {𝛼 ∈ A | 𝑓𝛼 (𝑥) = 𝑓 (𝑥)}
  conv ⋃_{𝛼∈𝐼(𝑥)} 𝜕𝑓𝛼(𝑥) ⊆ 𝜕𝑓(𝑥)
equality requires extra conditions (for example, A compact, 𝑓𝛼 continuous in 𝛼)
Subgradients 2.17
Exercise: maximum eigenvalue
Problem: explain how to find a subgradient of
  𝑓(𝑥) = 𝜆max(𝐴(𝑥)) = sup_{‖𝑦‖2=1} 𝑦𝑇 𝐴(𝑥)𝑦
where 𝐴(𝑥) = 𝐴0 + 𝑥 1 𝐴1 + · · · + 𝑥 𝑛 𝐴𝑛 with symmetric coefficients 𝐴𝑖
Solution: to find a subgradient at 𝑥̂,
• choose any unit eigenvector 𝑦 with eigenvalue 𝜆max(𝐴(𝑥̂))
• the gradient of 𝑦𝑇 𝐴(𝑥)𝑦 at 𝑥̂ is a subgradient of 𝑓:
  (𝑦𝑇 𝐴1𝑦, . . . , 𝑦𝑇 𝐴𝑛𝑦) ∈ 𝜕𝑓(𝑥̂)
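A sketch of this recipe using numpy.linalg.eigh (the function name and argument layout are ours):

```python
import numpy as np

# subgradient of f(x) = lambda_max(A0 + x_1 A_1 + ... + x_n A_n):
# take a unit eigenvector y for the largest eigenvalue of A(x),
# then (y^T A_1 y, ..., y^T A_n y) is a subgradient at x
def subgrad_lambda_max(A0, As, x):
    Ax = A0 + sum(xi * Ai for xi, Ai in zip(x, As))
    eigvals, eigvecs = np.linalg.eigh(Ax)   # eigenvalues in ascending order
    y = eigvecs[:, -1]                      # unit eigenvector for lambda_max
    return np.array([y @ Ai @ y for Ai in As])
```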
Subgradients 2.18
Minimization
  𝑓(𝑥) = inf_𝑦 ℎ(𝑥, 𝑦),   ℎ jointly convex in (𝑥, 𝑦)
Weak result: to find a subgradient at 𝑥̂,
• find 𝑦̂ that minimizes ℎ(𝑥̂, 𝑦) (assuming the minimum is attained)
• find a subgradient (𝑔, 0) ∈ 𝜕ℎ(𝑥̂, 𝑦̂)
Proof: for all 𝑥, 𝑦,
  ℎ(𝑥, 𝑦) ≥ ℎ(𝑥̂, 𝑦̂) + 𝑔𝑇 (𝑥 − 𝑥̂) + 0𝑇 (𝑦 − 𝑦̂) = 𝑓(𝑥̂) + 𝑔𝑇 (𝑥 − 𝑥̂)
therefore
  𝑓(𝑥) = inf_𝑦 ℎ(𝑥, 𝑦) ≥ 𝑓(𝑥̂) + 𝑔𝑇 (𝑥 − 𝑥̂)
Subgradients 2.19
Exercise: Euclidean distance to convex set
Problem: explain how to find a subgradient of
  𝑓(𝑥) = inf_{𝑦∈𝐶} ‖𝑥 − 𝑦‖2
where 𝐶 is a closed convex set
Solution: to find a subgradient at 𝑥̂,
• if 𝑓(𝑥̂) = 0 (that is, 𝑥̂ ∈ 𝐶), take 𝑔 = 0
• if 𝑓(𝑥̂) > 0, find the projection 𝑦̂ = 𝑃(𝑥̂) of 𝑥̂ on 𝐶 and take
  𝑔 = (1/‖𝑦̂ − 𝑥̂‖2) (𝑥̂ − 𝑦̂) = (1/‖𝑥̂ − 𝑃(𝑥̂)‖2) (𝑥̂ − 𝑃(𝑥̂))
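A sketch of this solution for a set with a closed-form projection; the choice of 𝐶 as the unit Euclidean ball is ours, purely to make 𝑃 explicit:

```python
import numpy as np

# f(x) = dist(x, C) with C the unit Euclidean ball (hypothetical choice)
def proj_unit_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1 else x / nrm

def subgrad_dist_to_ball(x):
    p = proj_unit_ball(x)                 # y_hat = P(x)
    d = np.linalg.norm(x - p)             # f(x)
    return np.zeros_like(x) if d == 0 else (x - p) / d
```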
Subgradients 2.20
Composition
𝑓 (𝑥) = ℎ( 𝑓1 (𝑥), . . . , 𝑓 𝑘 (𝑥)), ℎ convex and nondecreasing, 𝑓𝑖 convex
Weak result: to find a subgradient at 𝑥̂,
• find 𝑧 ∈ 𝜕ℎ(𝑓1(𝑥̂), . . . , 𝑓𝑘(𝑥̂)) and 𝑔𝑖 ∈ 𝜕𝑓𝑖(𝑥̂)
• then 𝑔 = 𝑧1𝑔1 + · · · + 𝑧𝑘𝑔𝑘 ∈ 𝜕𝑓(𝑥̂)
reduces to the standard formula for differentiable ℎ, 𝑓𝑖
Proof: using first the monotonicity of ℎ and the subgradient inequality for each 𝑓𝑖, then the subgradient inequality for ℎ at (𝑓1(𝑥̂), . . . , 𝑓𝑘(𝑥̂)),
  𝑓(𝑥) ≥ ℎ(𝑓1(𝑥̂) + 𝑔1𝑇 (𝑥 − 𝑥̂), . . . , 𝑓𝑘(𝑥̂) + 𝑔𝑘𝑇 (𝑥 − 𝑥̂))
       ≥ ℎ(𝑓1(𝑥̂), . . . , 𝑓𝑘(𝑥̂)) + 𝑧𝑇 (𝑔1𝑇 (𝑥 − 𝑥̂), . . . , 𝑔𝑘𝑇 (𝑥 − 𝑥̂))
       = ℎ(𝑓1(𝑥̂), . . . , 𝑓𝑘(𝑥̂)) + (𝑧1𝑔1 + · · · + 𝑧𝑘𝑔𝑘)𝑇 (𝑥 − 𝑥̂)
       = 𝑓(𝑥̂) + 𝑔𝑇 (𝑥 − 𝑥̂)
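A sketch of the composition rule for one concrete monotone outer function; the choice ℎ(𝑧) = log Σ𝑗 exp 𝑧𝑗 (convex, nondecreasing, with softmax gradient) and the function name are ours:

```python
import numpy as np

# weak composition rule with h(z) = log(sum_j exp(z_j));
# fs and subgrads are lists of callables giving f_i(x) and a subgradient of f_i at x
def subgrad_composition_logsumexp(fs, subgrads, x):
    z = np.array([fi(x) for fi in fs])
    w = np.exp(z - z.max())
    w /= w.sum()                          # w = grad h at (f_1(x), ..., f_k(x))
    return sum(wi * gi(x) for wi, gi in zip(w, subgrads))
```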
Subgradients 2.21
Optimal value function
define 𝑓 (𝑢, 𝑣) as the optimal value of convex problem
minimize 𝑓0 (𝑥)
subject to 𝑓𝑖 (𝑥) ≤ 𝑢𝑖 , 𝑖 = 1, . . . , 𝑚
𝐴𝑥 = 𝑏 + 𝑣
(functions 𝑓𝑖 are convex; optimization variable is 𝑥 )
Weak result: suppose 𝑓(𝑢̂, 𝑣̂) is finite and strong duality holds with the dual

  maximize   inf_𝑥 ( 𝑓0(𝑥) + Σ𝑖 𝜆𝑖 (𝑓𝑖(𝑥) − 𝑢̂𝑖) + 𝜈𝑇 (𝐴𝑥 − 𝑏 − 𝑣̂) )
  subject to 𝜆 ⪰ 0

if 𝜆̂, 𝜈̂ are dual optimal (for right-hand sides 𝑢̂, 𝑣̂) then (−𝜆̂, −𝜈̂) ∈ 𝜕𝑓(𝑢̂, 𝑣̂)
Subgradients 2.22
Proof: by weak duality for the problem with right-hand sides 𝑢, 𝑣,
  𝑓(𝑢, 𝑣) ≥ inf_𝑥 ( 𝑓0(𝑥) + Σ𝑖 𝜆̂𝑖 (𝑓𝑖(𝑥) − 𝑢𝑖) + 𝜈̂𝑇 (𝐴𝑥 − 𝑏 − 𝑣) )
          = inf_𝑥 ( 𝑓0(𝑥) + Σ𝑖 𝜆̂𝑖 (𝑓𝑖(𝑥) − 𝑢̂𝑖) + 𝜈̂𝑇 (𝐴𝑥 − 𝑏 − 𝑣̂) ) − 𝜆̂𝑇 (𝑢 − 𝑢̂) − 𝜈̂𝑇 (𝑣 − 𝑣̂)
          = 𝑓(𝑢̂, 𝑣̂) − 𝜆̂𝑇 (𝑢 − 𝑢̂) − 𝜈̂𝑇 (𝑣 − 𝑣̂)
Subgradients 2.23
Expectation
  𝑓(𝑥) = E ℎ(𝑥, 𝑢),   𝑢 random, ℎ convex in 𝑥 for every 𝑢
Weak result: to find a subgradient at 𝑥̂,
• choose a function 𝑢 ↦→ 𝑔(𝑢) with 𝑔(𝑢) ∈ 𝜕𝑥 ℎ(𝑥̂, 𝑢)
• then 𝑔 = E𝑢 𝑔(𝑢) ∈ 𝜕𝑓(𝑥̂)
Proof: by convexity of ℎ and the definition of 𝑔(𝑢),
  𝑓(𝑥) = E ℎ(𝑥, 𝑢) ≥ E [ ℎ(𝑥̂, 𝑢) + 𝑔(𝑢)𝑇 (𝑥 − 𝑥̂) ] = 𝑓(𝑥̂) + 𝑔𝑇 (𝑥 − 𝑥̂)
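In practice the expectation is often replaced by a sample average; a sketch with the hypothetical instance ℎ(𝑥, 𝑢) = |𝑥 − 𝑢|, 𝑢 uniform on [0, 1] (the resulting 𝑔 only approximates E𝑢 𝑔(𝑢)):

```python
import numpy as np

# sign(x - u) is a subgradient of h(., u) = |. - u| at x;
# averaging over samples estimates g = E_u g(u), a subgradient of f(x) = E h(x, u)
rng = np.random.default_rng(0)

def approx_subgrad_expectation(x, n_samples=10_000):
    u = rng.uniform(0.0, 1.0, size=n_samples)
    return np.mean(np.sign(x - u))
```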
Subgradients 2.24
Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Optimality conditions — unconstrained
𝑥★ minimizes 𝑓(𝑥) if and only if
0 ∈ 𝜕 𝑓 (𝑥★)
this follows directly from the definition of subgradient:
𝑓 (𝑦) ≥ 𝑓 (𝑥★) + 0𝑇 (𝑦 − 𝑥★) for all 𝑦 ⇐⇒ 0 ∈ 𝜕 𝑓 (𝑥★)
Subgradients 2.25
Example: piecewise-linear minimization
  𝑓(𝑥) = max_{𝑖=1,...,𝑚} (𝑎𝑖𝑇 𝑥 + 𝑏𝑖)
Optimality condition
  0 ∈ conv {𝑎𝑖 | 𝑖 ∈ 𝐼(𝑥★)} where 𝐼(𝑥) = {𝑖 | 𝑎𝑖𝑇 𝑥 + 𝑏𝑖 = 𝑓(𝑥)}
• in other words, 𝑥★ is optimal if and only if there is a 𝜆 with
  𝜆 ⪰ 0,   1𝑇 𝜆 = 1,   Σ_{𝑖=1}^{𝑚} 𝜆𝑖 𝑎𝑖 = 0,   𝜆𝑖 = 0 for 𝑖 ∉ 𝐼(𝑥★)
• these are the optimality conditions for the equivalent linear program and its dual
  (primal)  minimize 𝑡   subject to 𝐴𝑥 + 𝑏 ⪯ 𝑡1
  (dual)    maximize 𝑏𝑇 𝜆   subject to 𝐴𝑇 𝜆 = 0, 𝜆 ⪰ 0, 1𝑇 𝜆 = 1
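The primal LP can be handed to any LP solver; a sketch using scipy.optimize.linprog (the function name is ours):

```python
import numpy as np
from scipy.optimize import linprog

# piecewise-linear minimization via the LP in variables (x, t):
#   minimize t   subject to   A x + b <= t 1
def minimize_piecewise_linear(A, b):
    m, n = A.shape
    c = np.r_[np.zeros(n), 1.0]                      # objective: t
    A_ub = np.c_[A, -np.ones(m)]                     # A x - t 1 <= -b
    res = linprog(c, A_ub=A_ub, b_ub=-b,
                  bounds=[(None, None)] * (n + 1))   # x and t are free
    return res.x[:n], res.x[n]
```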
Subgradients 2.26
Optimality conditions — constrained
minimize 𝑓0 (𝑥)
subject to 𝑓𝑖 (𝑥) ≤ 0, 𝑖 = 1, . . . , 𝑚
assume dom 𝑓𝑖 = R𝑛 , so functions 𝑓𝑖 are subdifferentiable everywhere
Karush–Kuhn–Tucker conditions
if strong duality holds, then 𝑥★, 𝜆★ are primal, dual optimal if and only if
1. 𝑥★ is primal feasible
2. 𝜆★ ⪰ 0
3. 𝜆𝑖★ 𝑓𝑖(𝑥★) = 0 for 𝑖 = 1, . . . , 𝑚
4. 𝑥★ is a minimizer of 𝐿(𝑥, 𝜆★) = 𝑓0(𝑥) + Σ_{𝑖=1}^{𝑚} 𝜆𝑖★ 𝑓𝑖(𝑥):
  0 ∈ 𝜕𝑓0(𝑥★) + Σ_{𝑖=1}^{𝑚} 𝜆𝑖★ 𝜕𝑓𝑖(𝑥★)
Subgradients 2.27
Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Directional derivative
Definition (for general 𝑓 ): the directional derivative of 𝑓 at 𝑥 in the direction 𝑦 is
  𝑓′(𝑥; 𝑦) = lim_{𝛼↘0} ( 𝑓(𝑥 + 𝛼𝑦) − 𝑓(𝑥) ) / 𝛼
            = lim_{𝑡→∞} ( 𝑡 𝑓(𝑥 + (1/𝑡)𝑦) − 𝑡 𝑓(𝑥) )
(if the limit exists)
• 𝑓′(𝑥; 𝑦) is the right derivative of 𝑔(𝛼) = 𝑓(𝑥 + 𝛼𝑦) at 𝛼 = 0
• 𝑓′(𝑥; 𝑦) is homogeneous in 𝑦:
  𝑓′(𝑥; 𝜆𝑦) = 𝜆 𝑓′(𝑥; 𝑦) for 𝜆 ≥ 0
Subgradients 2.28
Directional derivative of a convex function
Equivalent definition (for convex 𝑓 ): replace lim with inf
  𝑓′(𝑥; 𝑦) = inf_{𝛼>0} ( 𝑓(𝑥 + 𝛼𝑦) − 𝑓(𝑥) ) / 𝛼
            = inf_{𝑡>0} ( 𝑡 𝑓(𝑥 + (1/𝑡)𝑦) − 𝑡 𝑓(𝑥) )
Proof
• the function ℎ(𝑦) = 𝑓 (𝑥 + 𝑦) − 𝑓 (𝑥) is convex in 𝑦 , with ℎ(0) = 0
• its perspective 𝑡ℎ(𝑦/𝑡) is nonincreasing in 𝑡 (ECE236B ex. A3.5); hence
  𝑓′(𝑥; 𝑦) = lim_{𝑡→∞} 𝑡ℎ(𝑦/𝑡) = inf_{𝑡>0} 𝑡ℎ(𝑦/𝑡)
Subgradients 2.29
Properties
consequences of the expressions (for convex 𝑓 )
  𝑓′(𝑥; 𝑦) = inf_{𝛼>0} ( 𝑓(𝑥 + 𝛼𝑦) − 𝑓(𝑥) ) / 𝛼
            = inf_{𝑡>0} ( 𝑡 𝑓(𝑥 + (1/𝑡)𝑦) − 𝑡 𝑓(𝑥) )
• 𝑓′(𝑥; 𝑦) is convex in 𝑦 (partial minimization of a convex function in 𝑦, 𝑡)
• 𝑓′(𝑥; 𝑦) defines a lower bound on 𝑓 in the direction 𝑦:
  𝑓(𝑥 + 𝛼𝑦) ≥ 𝑓(𝑥) + 𝛼 𝑓′(𝑥; 𝑦) for all 𝛼 ≥ 0
Subgradients 2.30
Directional derivative and subgradients
for convex 𝑓 and 𝑥 ∈ int dom 𝑓
  𝑓′(𝑥; 𝑦) = sup_{𝑔∈𝜕𝑓(𝑥)} 𝑔𝑇 𝑦
[figure: 𝜕𝑓(𝑥), the direction 𝑦, and the maximizer 𝑔̂ of 𝑔𝑇 𝑦 over 𝜕𝑓(𝑥)]
𝑓′(𝑥; 𝑦) is the support function of 𝜕𝑓(𝑥)
• generalizes 𝑓′(𝑥; 𝑦) = ∇𝑓(𝑥)𝑇 𝑦 for differentiable functions
• implies that 𝑓′(𝑥; 𝑦) exists for all 𝑥 ∈ int dom 𝑓, all 𝑦 (see page 2.4)
Subgradients 2.31
Proof: if 𝑔 ∈ 𝜕𝑓(𝑥) then from page 2.29
  𝑓′(𝑥; 𝑦) ≥ inf_{𝛼>0} ( 𝑓(𝑥) + 𝛼𝑔𝑇 𝑦 − 𝑓(𝑥) ) / 𝛼 = 𝑔𝑇 𝑦
it remains to show that 𝑓′(𝑥; 𝑦) = 𝑔̂𝑇 𝑦 for at least one 𝑔̂ ∈ 𝜕𝑓(𝑥)
• 𝑓′(𝑥; 𝑦) is convex in 𝑦 with domain R𝑛, hence subdifferentiable at all 𝑦
• let 𝑔̂ be a subgradient of 𝑓′(𝑥; 𝑦) at 𝑦: then for all 𝑣, 𝜆 ≥ 0,
  𝜆 𝑓′(𝑥; 𝑣) = 𝑓′(𝑥; 𝜆𝑣) ≥ 𝑓′(𝑥; 𝑦) + 𝑔̂𝑇 (𝜆𝑣 − 𝑦)
• taking 𝜆 → ∞ shows that 𝑓′(𝑥; 𝑣) ≥ 𝑔̂𝑇 𝑣; from the lower bound on page 2.30,
  𝑓(𝑥 + 𝑣) ≥ 𝑓(𝑥) + 𝑓′(𝑥; 𝑣) ≥ 𝑓(𝑥) + 𝑔̂𝑇 𝑣 for all 𝑣
  hence 𝑔̂ ∈ 𝜕𝑓(𝑥)
• taking 𝜆 = 0 we see that 𝑓′(𝑥; 𝑦) ≤ 𝑔̂𝑇 𝑦
Subgradients 2.32
Descent directions and subgradients
𝑦 is a descent direction of 𝑓 at 𝑥 if 𝑓′(𝑥; 𝑦) < 0
• the negative gradient of a differentiable 𝑓 is a descent direction (if ∇ 𝑓 (𝑥) ≠ 0)
• negative subgradient is not always a descent direction
Example: 𝑓 (𝑥 1, 𝑥 2) = |𝑥 1 | + 2|𝑥 2 |
[figure: contour lines of 𝑓, the point (1, 0), and the subgradient 𝑔 = (1, 2)]
𝑔 = (1, 2) ∈ 𝜕 𝑓 (1, 0) , but 𝑦 = (−1, −2) is not a descent direction at (1, 0)
Subgradients 2.33
Steepest descent direction
Definition: (normalized) steepest descent direction at 𝑥 ∈ int dom 𝑓 is
  Δ𝑥nsd = argmin_{‖𝑦‖2≤1} 𝑓′(𝑥; 𝑦)
Δ𝑥nsd is the primal solution 𝑦 of the pair of dual problems (BV §8.1.3)
  (primal)  minimize (over 𝑦) 𝑓′(𝑥; 𝑦)   subject to ‖𝑦‖2 ≤ 1
  (dual)    maximize (over 𝑔) −‖𝑔‖2   subject to 𝑔 ∈ 𝜕𝑓(𝑥)
• the dual optimal 𝑔★ is the subgradient with least norm
• 𝑓′(𝑥; Δ𝑥nsd) = −‖𝑔★‖2
• if 0 ∉ 𝜕𝑓(𝑥), Δ𝑥nsd = −𝑔★/‖𝑔★‖2
• Δ𝑥nsd can be expensive to compute
[figure: 𝜕𝑓(𝑥), the least-norm subgradient 𝑔★, the direction Δ𝑥nsd, and the hyperplane 𝑔𝑇 Δ𝑥nsd = 𝑓′(𝑥; Δ𝑥nsd)]
Subgradients 2.34
Subgradients and distance to sublevel sets
if 𝑓 is convex, 𝑓 (𝑦) < 𝑓 (𝑥) , 𝑔 ∈ 𝜕 𝑓 (𝑥) , then for small 𝑡 > 0,
  ‖𝑥 − 𝑡𝑔 − 𝑦‖2² = ‖𝑥 − 𝑦‖2² − 2𝑡 𝑔𝑇 (𝑥 − 𝑦) + 𝑡² ‖𝑔‖2²
                 ≤ ‖𝑥 − 𝑦‖2² − 2𝑡 ( 𝑓(𝑥) − 𝑓(𝑦)) + 𝑡² ‖𝑔‖2²
                 < ‖𝑥 − 𝑦‖2²
• −𝑔 is a descent direction for ‖𝑥 − 𝑦‖2, for any 𝑦 with 𝑓(𝑦) < 𝑓(𝑥)
• in particular, −𝑔 is descent direction for distance to any minimizer of 𝑓
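A tiny numerical illustration of this slide (our choice of instance: 𝑓(𝑥) = ‖𝑥‖1, 𝑦 = 0):

```python
import numpy as np

# a small step along -g from x decreases the distance to a point y with f(y) < f(x)
x = np.array([1.0, -2.0])
g = np.sign(x)                    # a subgradient of ||.||_1 at x
y = np.zeros(2)                   # f(y) = 0 < f(x) = 3
t = 0.1
assert np.linalg.norm(x - t * g - y) < np.linalg.norm(x - y)
```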
Subgradients 2.35
References
• A. Beck, First-Order Methods in Optimization (2017), chapter 3.
• D. P. Bertsekas, A. Nedić, A. E. Ozdaglar, Convex Analysis and Optimization
(2003), chapter 4.
• J.-B. Hiriart-Urruty, C. Lemaréchal, Convex Analysis and Minimization Algorithms
(1993), chapter VI.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 3.1.
• B. T. Polyak, Introduction to Optimization (1987), section 5.1.
Subgradients 2.36