Vandenberghe ECE236C (Spring 2022)
3. Subgradient method
• subgradient method
• convergence analysis
• optimal step size when 𝑓 ★ is known
• alternating projections
• projected subgradient method
• optimality of subgradient method
3.1
Subgradient method
to minimize a nondifferentiable convex function 𝑓 : choose 𝑥 0 and repeat
𝑥 𝑘+1 = 𝑥 𝑘 − 𝑡 𝑘 𝑔 𝑘 , 𝑘 = 0, 1, . . .
𝑔 𝑘 is any subgradient of 𝑓 at 𝑥 𝑘
Step size rules
• fixed step: 𝑡 𝑘 constant
• fixed length: 𝑡𝑘 ‖𝑔𝑘‖₂ = ‖𝑥𝑘+1 − 𝑥𝑘‖₂ is constant
• diminishing: 𝑡𝑘 → 0 and ∑_{𝑘=0}^{∞} 𝑡𝑘 = ∞
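As a concrete sketch, the iteration and the three step size rules can be written in a few lines of NumPy; the subgradient oracle and the numeric constants below are placeholders chosen for illustration, not values from these notes.

```python
import numpy as np

def subgradient_method(oracle, x0, step_size, num_iters=1000):
    # oracle(x) returns (f(x), g) with g any subgradient of f at x
    # step_size(k, g) returns t_k (fixed, fixed-length, or diminishing)
    x = np.asarray(x0, dtype=float)
    f_best, x_best = np.inf, x.copy()
    for k in range(num_iters):
        fx, g = oracle(x)
        if fx < f_best:                    # track the best point found so far:
            f_best, x_best = fx, x.copy()  # the method is not a descent method
        x = x - step_size(k, g) * g
    return x_best, f_best

# the three step size rules above, as callables (constants are placeholders)
fixed_step   = lambda k, g: 1e-2                      # t_k = t
fixed_length = lambda k, g: 1e-2 / np.linalg.norm(g)  # t_k * ||g_k||_2 = s
diminishing  = lambda k, g: 1e-1 / np.sqrt(k + 1)     # t_k -> 0, sum t_k = inf
```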
Subgradient method 3.2
Assumptions
• problem has finite optimal value 𝑓 ★, optimal solution 𝑥★
• 𝑓 is convex with dom 𝑓 = R𝑛
• 𝑓 is Lipschitz continuous with constant 𝐺 > 0:
|𝑓(𝑥) − 𝑓(𝑦)| ≤ 𝐺 ‖𝑥 − 𝑦‖₂ for all 𝑥, 𝑦
this is equivalent to ‖𝑔‖₂ ≤ 𝐺 for all 𝑥 and 𝑔 ∈ 𝜕𝑓(𝑥) (see next page)
Subgradient method 3.3
Proof.
• assume ‖𝑔‖₂ ≤ 𝐺 for all subgradients; choose 𝑔𝑦 ∈ 𝜕𝑓(𝑦), 𝑔𝑥 ∈ 𝜕𝑓(𝑥):
𝑔𝑥𝑇(𝑥 − 𝑦) ≥ 𝑓(𝑥) − 𝑓(𝑦) ≥ 𝑔𝑦𝑇(𝑥 − 𝑦)
by the Cauchy–Schwarz inequality,
𝐺 ‖𝑥 − 𝑦‖₂ ≥ 𝑓(𝑥) − 𝑓(𝑦) ≥ −𝐺 ‖𝑥 − 𝑦‖₂
• assume ‖𝑔‖₂ > 𝐺 for some 𝑔 ∈ 𝜕𝑓(𝑥); take 𝑦 = 𝑥 + 𝑔/‖𝑔‖₂:
𝑓(𝑦) ≥ 𝑓(𝑥) + 𝑔𝑇(𝑦 − 𝑥) = 𝑓(𝑥) + ‖𝑔‖₂ > 𝑓(𝑥) + 𝐺
Subgradient method 3.4
Analysis
• the subgradient method is not a descent method
• therefore we track the best value found so far, 𝑓best,𝑘 = min_{𝑖=0,...,𝑘} 𝑓(𝑥𝑖), which can be smaller than 𝑓(𝑥𝑘)
• the key quantity in the analysis is the distance to the optimal set
Progress in one iteration
• distance to 𝑥★:
‖𝑥𝑖+1 − 𝑥★‖₂² = ‖𝑥𝑖 − 𝑡𝑖𝑔𝑖 − 𝑥★‖₂²
= ‖𝑥𝑖 − 𝑥★‖₂² − 2𝑡𝑖 𝑔𝑖𝑇(𝑥𝑖 − 𝑥★) + 𝑡𝑖² ‖𝑔𝑖‖₂²
≤ ‖𝑥𝑖 − 𝑥★‖₂² − 2𝑡𝑖 (𝑓(𝑥𝑖) − 𝑓★) + 𝑡𝑖² ‖𝑔𝑖‖₂²
• best function value: combine the inequalities for 𝑖 = 0, . . . , 𝑘:
2 (∑_{𝑖=0}^{𝑘} 𝑡𝑖)(𝑓best,𝑘 − 𝑓★) ≤ ‖𝑥0 − 𝑥★‖₂² − ‖𝑥𝑘+1 − 𝑥★‖₂² + ∑_{𝑖=0}^{𝑘} 𝑡𝑖² ‖𝑔𝑖‖₂²
≤ ‖𝑥0 − 𝑥★‖₂² + ∑_{𝑖=0}^{𝑘} 𝑡𝑖² ‖𝑔𝑖‖₂²
Subgradient method 3.5
Fixed step size and fixed step length
Fixed step size: 𝑡𝑖 = 𝑡 with 𝑡 constant
𝑓best,𝑘 − 𝑓★ ≤ ‖𝑥0 − 𝑥★‖₂² / (2(𝑘 + 1)𝑡) + 𝐺²𝑡/2
• does not guarantee convergence of 𝑓best,𝑘
• for large 𝑘, 𝑓best,𝑘 is approximately 𝐺²𝑡/2-suboptimal
Fixed step length: 𝑡𝑖 = 𝑠/‖𝑔𝑖‖₂ with 𝑠 constant
𝑓best,𝑘 − 𝑓★ ≤ 𝐺 ‖𝑥0 − 𝑥★‖₂² / (2(𝑘 + 1)𝑠) + 𝐺𝑠/2
• does not guarantee convergence of 𝑓best,𝑘
• for large 𝑘 , 𝑓best,𝑘 is approximately 𝐺𝑠/2-suboptimal
Subgradient method 3.6
Diminishing step size
𝑡𝑖 → 0, ∑_{𝑖=0}^{∞} 𝑡𝑖 = ∞
• bound on function value:
𝑓best,𝑘 − 𝑓★ ≤ ‖𝑥0 − 𝑥★‖₂² / (2 ∑_{𝑖=0}^{𝑘} 𝑡𝑖) + 𝐺² (∑_{𝑖=0}^{𝑘} 𝑡𝑖²) / (2 ∑_{𝑖=0}^{𝑘} 𝑡𝑖)
• can show that (∑_{𝑖=0}^{𝑘} 𝑡𝑖²)/(∑_{𝑖=0}^{𝑘} 𝑡𝑖) → 0; hence, 𝑓best,𝑘 converges to 𝑓★
• examples of diminishing step size rules:
𝑡𝑖 = 𝜏/(𝑖 + 1), 𝑡𝑖 = 𝜏/√(𝑖 + 1)
Subgradient method 3.7
Example: 1-norm minimization
minimize k 𝐴𝑥 − 𝑏k1
• subgradient is given by 𝐴𝑇 sign( 𝐴𝑥 − 𝑏)
• example with 𝐴 ∈ R500×100, 𝑏 ∈ R500
Fixed step length 𝑡𝑘 = 𝑠/‖𝑔𝑘‖₂ for 𝑠 = 0.1, 0.01, 0.001
[figure: (𝑓(𝑥𝑘) − 𝑓★)/𝑓★ and (𝑓best,𝑘 − 𝑓★)/𝑓★ versus 𝑘, for 𝑠 = 0.1, 0.01, 0.001]
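A minimal NumPy sketch of this experiment; the random 𝐴 and 𝑏 below are stand-ins for the (unspecified) problem data behind the plots, and only the subgradient formula 𝐴𝑇 sign(𝐴𝑥 − 𝑏) and the fixed step length rule are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 100))   # dimensions as in the example
b = rng.standard_normal(500)

def f(x):
    return np.linalg.norm(A @ x - b, 1)

def subgrad(x):
    # A^T sign(Ax - b) is a subgradient of ||Ax - b||_1
    return A.T @ np.sign(A @ x - b)

x = np.zeros(100)
s = 0.01                               # fixed step length
f_best = f(x)
for k in range(3000):
    g = subgrad(x)
    x = x - (s / np.linalg.norm(g)) * g
    f_best = min(f_best, f(x))
```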
Subgradient method 3.8
Diminishing step size: 𝑡𝑘 = 0.01/√(𝑘 + 1) and 𝑡𝑘 = 0.01/(𝑘 + 1)
[figure: (𝑓best,𝑘 − 𝑓★)/𝑓★ versus 𝑘 for the two step size rules]
Subgradient method 3.9
Optimal step size for fixed number of iterations
from page 3.5: if 𝑠𝑖 = 𝑡𝑖 ‖𝑔𝑖‖₂ and ‖𝑥0 − 𝑥★‖₂ ≤ 𝑅, then
𝑓best,𝑘 − 𝑓★ ≤ (𝑅² + ∑_{𝑖=0}^{𝑘} 𝑠𝑖²) / (2 ∑_{𝑖=0}^{𝑘} 𝑠𝑖/𝐺)
• for given 𝑘 , the right-hand side is minimized by the fixed step length
𝑠𝑖 = 𝑠 = 𝑅/√(𝑘 + 1)
• the resulting bound after 𝑘 steps is
𝑓best,𝑘 − 𝑓★ ≤ 𝐺𝑅/√(𝑘 + 1)
• this guarantees an accuracy 𝑓best,𝑘 − 𝑓 ★ ≤ 𝜖 in 𝑘 = 𝑂 (1/𝜖 2) iterations
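A small helper restating this count in code; it is only a rearrangement of the 𝐺𝑅/√(𝑘 + 1) bound, with 𝐺, 𝑅, 𝜖 supplied by the user.

```python
import numpy as np

def iterations_for_accuracy(G, R, eps):
    # smallest k with G * R / sqrt(k + 1) <= eps, i.e. k + 1 >= (G * R / eps)**2
    return int(np.ceil((G * R / eps) ** 2)) - 1

def optimal_fixed_length(R, k):
    # step length s = R / sqrt(k + 1) minimizing the bound for a k-iteration budget
    return R / np.sqrt(k + 1)
```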
Subgradient method 3.10
Optimal step size when 𝑓 ★ is known
• the right-hand side in the first inequality of page 3.5 is minimized by
𝑡𝑖 = (𝑓(𝑥𝑖) − 𝑓★) / ‖𝑔𝑖‖₂²
• the optimized bound is
(𝑓(𝑥𝑖) − 𝑓★)² / ‖𝑔𝑖‖₂² ≤ ‖𝑥𝑖 − 𝑥★‖₂² − ‖𝑥𝑖+1 − 𝑥★‖₂²
• applying this recursively from 𝑖 = 0 to 𝑖 = 𝑘 (and using k𝑔𝑖 k2 ≤ 𝐺 ) gives
𝑓best,𝑘 − 𝑓★ ≤ 𝐺 ‖𝑥0 − 𝑥★‖₂ / √(𝑘 + 1)
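A sketch of the method with this optimal (Polyak) step size, assuming 𝑓★ is known and an oracle that returns (𝑓(𝑥), 𝑔) as in the earlier sketch.

```python
import numpy as np

def polyak_subgradient(oracle, x0, f_star, num_iters=1000):
    # subgradient method with t_i = (f(x_i) - f_star) / ||g_i||_2^2
    x = np.asarray(x0, dtype=float)
    f_best = np.inf
    for _ in range(num_iters):
        fx, g = oracle(x)
        f_best = min(f_best, fx)
        x = x - ((fx - f_star) / np.dot(g, g)) * g
    return x, f_best
```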
Subgradient method 3.11
Example: find point in intersection of convex sets
find a point in the intersection of 𝑚 closed convex sets 𝐶1, . . . , 𝐶𝑚 :
minimize 𝑓 (𝑥) = max { 𝑓1 (𝑥), . . . , 𝑓𝑚 (𝑥)}
where 𝑓𝑗(𝑥) = inf_{𝑦∈𝐶𝑗} ‖𝑥 − 𝑦‖₂ is the Euclidean distance of 𝑥 to 𝐶𝑗
• 𝑓 ★ = 0 if the intersection is nonempty
• (from page 2.14) 𝑔 ∈ 𝜕𝑓(𝑥̂) if 𝑔 ∈ 𝜕𝑓𝑗(𝑥̂) and 𝐶𝑗 is the farthest set from 𝑥̂
• (from page 2.20) a subgradient 𝑔 ∈ 𝜕𝑓𝑗(𝑥̂) follows from the projection 𝑃𝑗(𝑥̂) on 𝐶𝑗:
𝑔 = 0 if 𝑥̂ ∈ 𝐶𝑗, 𝑔 = (𝑥̂ − 𝑃𝑗(𝑥̂)) / ‖𝑥̂ − 𝑃𝑗(𝑥̂)‖₂ if 𝑥̂ ∉ 𝐶𝑗
note that ‖𝑔‖₂ = 1 if 𝑥̂ ∉ 𝐶𝑗
Subgradient method 3.12
Subgradient method for point in intersection of convex sets
• the optimal step size (page 3.11) for 𝑓★ = 0 and ‖𝑔𝑖‖₂ = 1 is 𝑡𝑖 = 𝑓(𝑥𝑖)
• at iteration 𝑘, find the farthest set 𝐶𝑗 (with 𝑓(𝑥𝑘) = 𝑓𝑗(𝑥𝑘)), and take
𝑥𝑘+1 = 𝑥𝑘 − (𝑓(𝑥𝑘)/𝑓𝑗(𝑥𝑘)) (𝑥𝑘 − 𝑃𝑗(𝑥𝑘)) = 𝑃𝑗(𝑥𝑘)
at each step, we project the current point onto the farthest set
• a version of the alternating projections algorithm
• for 𝑚 = 2, projections alternate onto one set, then the other
• later, we will see faster sequential projection methods that are almost as simple
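A sketch of this farthest-set projection scheme; `projections[j]` is assumed to compute 𝑃𝑗, and the stopping tolerance is an added convenience, not part of the method as stated above.

```python
import numpy as np

def farthest_set_projection(x0, projections, num_iters=100, tol=1e-8):
    # projections is a list of callables, projections[j](x) = P_j(x)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        projs = [P(x) for P in projections]
        dists = [np.linalg.norm(x - p) for p in projs]   # f_j(x) for each set
        j = int(np.argmax(dists))                        # farthest set
        if dists[j] <= tol:                              # x is (numerically) in every set
            break
        x = projs[j]      # subgradient step with t_i = f(x_i) lands at P_j(x)
    return x
```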
Subgradient method 3.13
Projected subgradient method
the subgradient method is easily extended to handle constrained problems
minimize 𝑓 (𝑥)
subject to 𝑥∈𝐶
where 𝐶 is a closed convex set
Projected subgradient method: choose 𝑥 0 ∈ 𝐶 and repeat
𝑥 𝑘+1 = 𝑃𝐶 (𝑥 𝑘 − 𝑡 𝑘 𝑔 𝑘 ), 𝑘 = 0, 1, . . .
• 𝑃𝐶 (𝑦) denotes the Euclidean projection of 𝑦 on 𝐶
• 𝑔 𝑘 is any subgradient of 𝑓 at 𝑥 𝑘
• 𝑡𝑘 is chosen by the same step size rules as for the unconstrained problem (page 3.2)
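A minimal sketch, reusing the oracle and step size interface of the sketch on page 3.2 and assuming a user-supplied projection `project_C`; projecting 𝑥0 first is a convenience to enforce 𝑥0 ∈ 𝐶.

```python
import numpy as np

def projected_subgradient(oracle, project_C, x0, step_size, num_iters=1000):
    # subgradient step followed by Euclidean projection onto C
    x = project_C(np.asarray(x0, dtype=float))
    f_best, x_best = np.inf, x.copy()
    for k in range(num_iters):
        fx, g = oracle(x)
        if fx < f_best:
            f_best, x_best = fx, x.copy()
        x = project_C(x - step_size(k, g) * g)
    return x_best, f_best
```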
Subgradient method 3.14
Examples of simple convex sets
the projected subgradient method is practical only if the projection on 𝐶 is easy to compute
Halfspace: 𝐶 = {𝑥 | 𝑎𝑇 𝑥 ≤ 𝑏} (with 𝑎 ≠ 0)
𝑃𝐶(𝑥) = 𝑥 + ((𝑏 − 𝑎𝑇𝑥)/‖𝑎‖₂²) 𝑎 if 𝑎𝑇𝑥 > 𝑏, 𝑃𝐶(𝑥) = 𝑥 if 𝑎𝑇𝑥 ≤ 𝑏
Rectangle: 𝐶 = {𝑥 ∈ R𝑛 | 𝑙 ⪯ 𝑥 ⪯ 𝑢} where 𝑙 ⪯ 𝑢
𝑃𝐶(𝑥)𝑘 = 𝑙𝑘 if 𝑥𝑘 ≤ 𝑙𝑘, 𝑥𝑘 if 𝑙𝑘 ≤ 𝑥𝑘 ≤ 𝑢𝑘, 𝑢𝑘 if 𝑥𝑘 ≥ 𝑢𝑘
Norm balls: 𝐶 = {𝑥 | ‖𝑥‖ ≤ 𝑅} for many common norms (e.g., 236B page 5.26)
we’ll encounter many other examples later in the course
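As a sketch, these projections take only a line or two of NumPy each; for the norm ball, only the Euclidean case is shown.

```python
import numpy as np

def project_halfspace(x, a, b):
    # projection onto {x | a^T x <= b} (a != 0)
    viol = a @ x - b
    return x - (viol / (a @ a)) * a if viol > 0 else x

def project_box(x, l, u):
    # componentwise projection onto the rectangle {x | l <= x <= u}
    return np.clip(x, l, u)

def project_l2_ball(x, R):
    # projection onto the Euclidean ball {x | ||x||_2 <= R}
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x
```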
Subgradient method 3.15
Projection on closed convex set
𝑃𝐶(𝑥) = argmin_{𝑢∈𝐶} ‖𝑢 − 𝑥‖₂²
𝑢 = 𝑃𝐶(𝑥)
⟺ (𝑥 − 𝑢)𝑇(𝑧 − 𝑢) ≤ 0 for all 𝑧 ∈ 𝐶
⟺ ‖𝑥 − 𝑧‖₂² ≥ ‖𝑥 − 𝑢‖₂² + ‖𝑧 − 𝑢‖₂² for all 𝑧 ∈ 𝐶
this follows from general optimality conditions in 236B page 4.9
Subgradient method 3.16
Analysis
minimize 𝑓 (𝑥)
subject to 𝑥∈𝐶
• 𝐶 is a closed convex set; other assumptions are the same as on page 3.3
• first inequality on page 3.5 still holds:
‖𝑥𝑖+1 − 𝑥★‖₂² = ‖𝑃𝐶(𝑥𝑖 − 𝑡𝑖𝑔𝑖) − 𝑥★‖₂²
≤ ‖𝑥𝑖 − 𝑡𝑖𝑔𝑖 − 𝑥★‖₂²
= ‖𝑥𝑖 − 𝑥★‖₂² − 2𝑡𝑖 𝑔𝑖𝑇(𝑥𝑖 − 𝑥★) + 𝑡𝑖² ‖𝑔𝑖‖₂²
≤ ‖𝑥𝑖 − 𝑥★‖₂² − 2𝑡𝑖 (𝑓(𝑥𝑖) − 𝑓★) + 𝑡𝑖² ‖𝑔𝑖‖₂²
second line follows from page 3.16 (with 𝑧 = 𝑥★, 𝑥 = 𝑥𝑖 − 𝑡𝑖 𝑔𝑖 )
• hence, the earlier analysis also applies to the projected subgradient method
Subgradient method 3.17
Optimality of the subgradient method
can the 𝑓best,𝑘 − 𝑓★ ≤ 𝐺𝑅/√(𝑘 + 1) bound on page 3.10 be improved?
Problem class
minimize 𝑓 (𝑥)
• assumptions on page 3.3 are satisfied
• we are given a starting point 𝑥(0) with ‖𝑥(0) − 𝑥★‖₂ ≤ 𝑅
• we are given the Lipschitz constant 𝐺 of 𝑓 on {𝑥 | ‖𝑥 − 𝑥★‖₂ ≤ 𝑅}
• 𝑓 is defined by an oracle: given 𝑥 , the oracle returns 𝑓 (𝑥) and a 𝑔 ∈ 𝜕 𝑓 (𝑥)
Algorithm class
• algorithm can choose any 𝑥 (𝑖+1) from the set 𝑥 (0) + span{𝑔 (0) , 𝑔 (1) , . . . , 𝑔 (𝑖) }
• we stop after a fixed number 𝑘 of iterations
Subgradient method 3.18
Test problem and oracle
𝑓(𝑥) = max_{𝑖=1,...,𝑘+1} 𝑥𝑖 + (1/2)‖𝑥‖₂² (with 𝑘 < 𝑛), 𝑥(0) = 0
• subdifferential 𝜕𝑓(𝑥) = conv{𝑒𝑗 + 𝑥 | 1 ≤ 𝑗 ≤ 𝑘 + 1, 𝑥𝑗 = max_{𝑖=1,...,𝑘+1} 𝑥𝑖}
• solution and optimal value
𝑥★ = −(1/(𝑘 + 1), . . . , 1/(𝑘 + 1), 0, . . . , 0) (first 𝑘 + 1 entries equal to −1/(𝑘 + 1)), 𝑓★ = −1/(2(𝑘 + 1))
• distance of starting point to solution is 𝑅 = ‖𝑥(0) − 𝑥★‖₂ = 1/√(𝑘 + 1)
• Lipschitz constant on {𝑥 | ‖𝑥 − 𝑥★‖₂ ≤ 𝑅}:
𝐺 = sup_{𝑔∈𝜕𝑓(𝑥), ‖𝑥−𝑥★‖₂≤𝑅} ‖𝑔‖₂ ≤ 2/√(𝑘 + 1) + 1
• the oracle returns the subgradient 𝑒𝚥̂ + 𝑥 where 𝚥̂ = min{𝑗 | 𝑥𝑗 = max_{𝑖=1,...,𝑘+1} 𝑥𝑖}
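A sketch of this oracle in NumPy (0-based indexing, so components 1, . . . , 𝑘 + 1 become `x[:k+1]`); `np.argmax` returns the first maximizer, which matches the min-index tie-breaking rule above.

```python
import numpy as np

def worst_case_oracle(x, k):
    # f(x) = max over the first k+1 components of x, plus 0.5 * ||x||_2^2
    m = x[:k + 1].max()
    jhat = int(np.argmax(x[:k + 1]))   # first (smallest-index) maximizer
    fx = m + 0.5 * np.dot(x, x)
    g = x.copy()
    g[jhat] += 1.0                     # subgradient e_jhat + x
    return fx, g
```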
Subgradient method 3.19
Iteration
• after 𝑖 ≤ 𝑘 iterations of any algorithm in the algorithm class,
𝑥(𝑖) = (𝑥1(𝑖), . . . , 𝑥𝑖(𝑖), 0, . . . , 0), 𝑓(𝑥(𝑖)) ≥ (1/2)‖𝑥(𝑖)‖₂² ≥ 0, 𝑓best,𝑖 = 0
(the max term in 𝑓(𝑥(𝑖)) is nonnegative because component 𝑘 + 1 of 𝑥(𝑖) is zero)
• suboptimality after 𝑘 iterations
𝑓best,𝑘 − 𝑓★ = −𝑓★ = 1/(2(𝑘 + 1)) = 𝐺𝑅/(2(2 + √(𝑘 + 1)))
Conclusion
• example shows that the 𝑂(𝐺𝑅/√𝑘) bound cannot be improved
• subgradient method is “optimal” (for this problem and algorithm class)
Subgradient method 3.20
Summary: subgradient method
• handles general nondifferentiable convex problems
• often leads to very simple algorithms
• convergence can be very slow
• no good stopping criterion
• theoretical complexity: 𝑂 (1/𝜖 2) iterations to find 𝜖 -suboptimal point
• an “optimal” first-order method: 𝑂 (1/𝜖 2) bound cannot be improved
Subgradient method 3.21
References
• S. Boyd, Lecture slides and notes for EE364b, Convex Optimization II.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 3.2.3. The
example on page 3.19 is in §3.2.1.
• B. T. Polyak, Introduction to Optimization (1987), section 5.3.
Subgradient method 3.22