CHAPTER 3. PROJECTED GRADIENT DESCENT
Hao Yuan, Kun Yuan
October 10, 2023
1 Problem formulation
This chapter considers the following constrained problem
$$\min_{x \in \mathbb{R}^d} \; f(x), \quad \text{subject to } x \in \mathcal{X}, \tag{1}$$
where $f(x)$ is a differentiable objective function and $\mathcal{X}$ is a closed convex subset of $\mathbb{R}^d$.
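For example, a box-constrained least-squares problem
$$\min_{x \in \mathbb{R}^d} \; \frac{1}{2}\|Ax - b\|^2, \quad \text{subject to } x \in \mathcal{X} = [0, 1]^d,$$
with $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, is an instance of (1): the objective is differentiable and the box $[0,1]^d$ is closed and convex. This particular instance is only an illustration (it does not come from the original text) and is reused in the numerical sketches later in the chapter.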
Notation. We introduce the following notation:
• Let $x^\star := \arg\min_{x \in \mathcal{X}} f(x)$ be the optimal solution to problem (1).
• Let $f^\star := \min_{x \in \mathcal{X}} f(x)$ be the optimal function value.
2 Projection onto closed convex sets
Lemma 2.1. Given a closed convex set $C \subseteq \mathbb{R}^d$, for any $x \in \mathbb{R}^d$ there exists a unique $z^* \in C$ such that $\|x - z^*\| \le \|x - z\|$ for any $z \in C$. The point $z^*$ will be called the projection of $x$ onto $C$, and will be denoted by $P_C[x]$.
Proof. First, we show the existence of $z^*$. Fix $x \in \mathbb{R}^d$, and denote $\delta := \inf\{\|z - x\| : z \in C\}$. It is evident that $\delta \ge 0$. Now let $\{z_k\}_{k \ge 1}$ be a sequence of points in $C$ such that
$$\|z_k - x\|^2 \le \delta^2 + \frac{1}{k}$$
for any $k$. We first show that $\{z_k\}_{k \ge 1}$ is a Cauchy sequence. Let $k, l > 0$ be arbitrary, and notice that, since $C$ is convex, we have $\frac{1}{2}(z_k + z_l) \in C$, which implies
$$\Big\|\frac{1}{2}(z_k + z_l) - x\Big\|^2 \ge \delta^2. \tag{2}$$
Expanding this inequality leads to
$$\frac{1}{2}\langle z_k - x, z_l - x\rangle \ge \delta^2 - \frac{1}{4}\|z_k - x\|^2 - \frac{1}{4}\|z_l - x\|^2.$$
We now calculate $\|z_k - z_l\|^2$ and get
$$\|z_k - z_l\|^2 = \|z_k - x\|^2 + \|z_l - x\|^2 - 2\langle z_k - x, z_l - x\rangle \le 2\big(\|z_k - x\|^2 + \|z_l - x\|^2\big) - 4\delta^2 \le \frac{2}{k} + \frac{2}{l},$$
where we used inequality (2) in the second step. Thus for any $\epsilon > 0$, as long as $k, l \ge \lceil 4/\epsilon^2 \rceil$, we have $\|z_k - z_l\| \le \epsilon$, showing that $\{z_k\}$ is a Cauchy sequence. By the completeness of $\mathbb{R}^d$, $z^* = \lim_{k \to \infty} z_k$ exists. Since $C$ is closed, we have $z^* \in C$, and by the continuity of the norm function, we have
$$\|z^* - x\| = \lim_{k \to \infty} \|z_k - x\| = \delta,$$
which is less than or equal to $\|z - x\|$ for any $z \in C$, by the definition of $\delta$.
Next, we show the uniqueness of $z^*$. Suppose both $z_1^*$ and $z_2^*$ satisfy $\|z_1^* - x\| = \|z_2^* - x\| = \delta$. Denote $\bar{z} = \frac{1}{2}(z_1^* + z_2^*)$; since $\bar{z} \in C$ by convexity, we have
$$\delta^2 \le \|\bar{z} - x\|^2 = \Big\|\frac{1}{2}(z_1^* - x) + \frac{1}{2}(z_2^* - x)\Big\|^2 = \frac{1}{2}\delta^2 + \frac{1}{2}\langle z_1^* - x, z_2^* - x\rangle,$$
which leads to
$$\langle z_1^* - x, z_2^* - x\rangle \ge \delta^2.$$
Consequently,
$$\|z_1^* - z_2^*\|^2 = \|z_1^* - x\|^2 + \|z_2^* - x\|^2 - 2\langle z_1^* - x, z_2^* - x\rangle \le \delta^2 + \delta^2 - 2\delta^2 = 0,$$
implying that $z_1^* = z_2^*$. The proof is now complete.
Lemma 2.2. Let $C \subseteq \mathbb{R}^d$ be a closed convex set. Then for any $x \in \mathbb{R}^d$ and $y \in C$, we have $y = P_C[x]$ if and only if $\langle z - y, x - y\rangle \le 0$ for any $z \in C$.
Proof. We first prove that if $y = P_C[x]$ then $\langle z - y, x - y\rangle \le 0$ for any $z \in C$. To this end, we fix $x$ and $y$ and suppose there exists $z_0 \in C$ such that $\langle z_0 - y, x - y\rangle > 0$; it then follows that $z_0 \ne y$. Set $z = y + t(z_0 - y)$ with $t > 0$. It holds that
$$\|x - z\|^2 - \|x - y\|^2 = \|x - y - t(z_0 - y)\|^2 - \|x - y\|^2 = \|z_0 - y\|^2 t^2 - 2t\langle x - y, z_0 - y\rangle = \|z_0 - y\|^2\, t\Big(t - \frac{2\langle x - y, z_0 - y\rangle}{\|z_0 - y\|^2}\Big). \tag{3}$$
Defining $t^* := \min\Big\{1, \frac{\langle x - y, z_0 - y\rangle}{\|z_0 - y\|^2}\Big\}$, we have $0 < t^* \le 1$ and $z^* := y + t^*(z_0 - y) = (1 - t^*)y + t^* z_0 \in C$. Substituting $0 < t^* \le \frac{\langle x - y, z_0 - y\rangle}{\|z_0 - y\|^2}$ into (3), we have $\|x - z^*\| < \|x - y\|$, which contradicts $y = P_C[x]$.
Next we prove that if $\langle z - y, x - y\rangle \le 0$ for any $z \in C$, then $y = P_C[x]$. We notice that $\|x - z\|^2 = \|(x - y) - (z - y)\|^2 = \|x - y\|^2 + \|z - y\|^2 - 2\langle x - y, z - y\rangle$ for any $z \in C$. With $\langle x - y, z - y\rangle \le 0$, we have $\|x - z\| \ge \|x - y\|$. Since $z$ is arbitrary, we conclude that $y = P_C[x]$.
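As a one-dimensional sanity check of Lemma 2.2 (an illustration of ours, not part of the original argument), take $C = [a, b] \subset \mathbb{R}$ and $x > b$, so that $P_C[x] = b$. For any $z \in C$ we have $z - b \le 0$ and $x - b > 0$, hence $\langle z - b, x - b\rangle = (z - b)(x - b) \le 0$, as the lemma asserts. Conversely, a point $y \in C$ with $y < b$ cannot be the projection: choosing $z = b \in C$ gives $\langle z - y, x - y\rangle = (b - y)(x - y) > 0$.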
Lemma 2.3. Let $C \subseteq \mathbb{R}^d$ be a closed convex set. Then $\|P_C[x] - P_C[y]\| \le \|x - y\|$ for any $x, y \in \mathbb{R}^d$.
Proof. We first notice that
$$\begin{aligned}
\|x - y\|^2 &= \|(x - P_C[x] + P_C[y] - y) + (P_C[x] - P_C[y])\|^2 \\
&= \|x - P_C[x] + P_C[y] - y\|^2 + \|P_C[x] - P_C[y]\|^2 + 2\langle x - P_C[x] + P_C[y] - y,\; P_C[x] - P_C[y]\rangle \\
&= \|x - P_C[x] + P_C[y] - y\|^2 + \|P_C[x] - P_C[y]\|^2 - 2\langle y - P_C[y],\; P_C[x] - P_C[y]\rangle - 2\langle x - P_C[x],\; P_C[y] - P_C[x]\rangle.
\end{aligned} \tag{4}$$
From Lemma 2.2, we know
$$\langle y - P_C[y],\; P_C[x] - P_C[y]\rangle \le 0, \qquad \langle x - P_C[x],\; P_C[y] - P_C[x]\rangle \le 0.$$
Substituting the above inequalities into (4), we reach $\|x - y\|^2 \ge \|P_C[x] - P_C[y]\|^2$.
3 Examples of projections
• Box: $C = [\eta_1, \eta_2]^N$. For any $x \in \mathbb{R}^N$, $(P_C[x])_i = \max\{\eta_1, \min\{x_i, \eta_2\}\}$, $i = 1, \cdots, N$.
• Hyperplane: $C = \{x \mid u^\top x = \eta\}$ with $u \in \mathbb{R}^N$, $\eta \in \mathbb{R}$. For any $x \in \mathbb{R}^N$, $P_C[x] = x + \frac{\eta - u^\top x}{\|u\|_2^2}\, u$.
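Both formulas translate directly into code. The following numpy sketch (the function names and random test data are our own choices, not from the text) implements the two projections and spot-checks the optimality condition of Lemma 2.2 and the nonexpansiveness of Lemma 2.3 for the box case.

```python
import numpy as np

def project_box(x, eta1, eta2):
    # (P_C[x])_i = max{eta1, min{x_i, eta2}}: componentwise clipping.
    return np.clip(x, eta1, eta2)

def project_hyperplane(x, u, eta):
    # P_C[x] = x + (eta - u^T x) / ||u||_2^2 * u.
    return x + (eta - u @ x) / (u @ u) * u

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
u, eta = rng.normal(size=5), 1.0

px, py = project_box(x, -0.5, 0.5), project_box(y, -0.5, 0.5)
z = rng.uniform(-0.5, 0.5, size=5)                        # an arbitrary point of the box
print((z - px) @ (x - px) <= 1e-12)                       # Lemma 2.2: <z - P[x], x - P[x]> <= 0
print(np.linalg.norm(px - py) <= np.linalg.norm(x - y))   # Lemma 2.3: projections are nonexpansive

ph = project_hyperplane(x, u, eta)
print(np.isclose(u @ ph, eta))                            # projected point lies on the hyperplane
```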
4 Projected gradient descent
For optimization problem (1), given an arbitrary initialization $x_0 \in \mathcal{X}$, projected gradient descent iterates as follows:
$$y_{k+1} = x_k - \gamma \nabla f(x_k), \tag{5a}$$
$$x_{k+1} = P_{\mathcal{X}}[y_{k+1}], \qquad \forall\, k = 0, 1, 2, \cdots \tag{5b}$$
where $\gamma$ is the learning rate.
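A minimal implementation sketch of (5a)-(5b) follows; the test problem, names, and parameters are our own choices, not from the text. It runs projected gradient descent on a box-constrained least-squares instance with $\gamma = 1/L$, where $L$ is taken as the largest eigenvalue of $A^\top A$.

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, gamma, num_iters=200):
    """Iterates y_{k+1} = x_k - gamma * grad(x_k), x_{k+1} = P_X[y_{k+1}]."""
    x = x0.copy()
    for _ in range(num_iters):
        y = x - gamma * grad(x)      # (5a) gradient step
        x = project(y)               # (5b) projection back onto X
    return x

# Example: f(x) = 0.5 * ||Ax - b||^2 over the box X = [0, 1]^d.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
L = np.linalg.eigvalsh(A.T @ A).max()        # smoothness constant of f
grad = lambda x: A.T @ (A @ x - b)
project = lambda y: np.clip(y, 0.0, 1.0)

x_hat = projected_gradient_descent(grad, project, np.zeros(10), gamma=1.0 / L)
print(0.5 * np.linalg.norm(A @ x_hat - b) ** 2)
```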
5 Convergence analysis
5.1 Smooth and generally convex problem
Lemma 5.1. Suppose $f(x)$ is $L$-smooth. If $\gamma = \frac{1}{L}$, then the sequence generated by projected gradient descent (5) with arbitrary $x_0 \in \mathcal{X}$ satisfies
$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 + \frac{L}{2}\|y_{k+1} - x_{k+1}\|^2, \qquad k = 0, 1, 2, \cdots$$
Proof. Since $f(x)$ is $L$-smooth, it holds that
$$\begin{aligned}
f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\
&\overset{(5a)}{=} f(x_k) - L\langle y_{k+1} - x_k, x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\
&= f(x_k) - \frac{L}{2}\big(\|y_{k+1} - x_k\|^2 + \|x_{k+1} - x_k\|^2 - \|y_{k+1} - x_{k+1}\|^2\big) + \frac{L}{2}\|x_{k+1} - x_k\|^2 \\
&= f(x_k) - \frac{L}{2}\|y_{k+1} - x_k\|^2 + \frac{L}{2}\|y_{k+1} - x_{k+1}\|^2 \\
&= f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 + \frac{L}{2}\|y_{k+1} - x_{k+1}\|^2,
\end{aligned}$$
where the second step uses $\nabla f(x_k) = L(x_k - y_{k+1})$ from (5a) with $\gamma = \frac{1}{L}$, the third step uses the identity $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$ with $a = y_{k+1} - x_k$ and $b = x_{k+1} - x_k$, and the last step uses $\|y_{k+1} - x_k\| = \frac{1}{L}\|\nabla f(x_k)\|$.
Lemma 5.2. Suppose $f(x)$ is $L$-smooth. If $\gamma = \frac{1}{L}$, then the sequence generated by projected gradient descent (5) with arbitrary $x_0 \in \mathcal{X}$ satisfies
$$f(x_{k+1}) \le f(x_k) - \frac{L}{2}\|x_{k+1} - x_k\|^2, \qquad k = 0, 1, 2, \cdots$$
Proof. From Lemma 2.2 (applied with $x = x_k - \gamma\nabla f(x_k)$, $y = x_{k+1}$, and $z = x_k \in \mathcal{X}$), we have
$$\begin{aligned}
P_{\mathcal{X}}[x_k - \gamma \nabla f(x_k)] = x_{k+1}
&\;\Rightarrow\; \langle (x_k - \gamma \nabla f(x_k)) - x_{k+1},\; x_k - x_{k+1}\rangle \le 0 \\
&\;\Rightarrow\; \|x_{k+1} - x_k\|^2 + \gamma\langle \nabla f(x_k), x_{k+1} - x_k\rangle \le 0 \\
&\;\Rightarrow\; \langle \nabla f(x_k), x_{k+1} - x_k\rangle \le -\frac{1}{\gamma}\|x_{k+1} - x_k\|^2 = -L\|x_{k+1} - x_k\|^2.
\end{aligned} \tag{6}$$
Since $f(x)$ is $L$-smooth, we have
$$f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 \overset{(6)}{\le} f(x_k) - \frac{L}{2}\|x_{k+1} - x_k\|^2. \tag{7}$$
Theorem 5.3. Suppose $f(x)$ is convex and $L$-smooth. If $\gamma = \frac{1}{L}$, then the sequence generated by projected gradient descent (5) with arbitrary $x_0 \in \mathcal{X}$ satisfies
$$f(x_K) - f(x^\star) \le \frac{L}{2K}\|x_0 - x^\star\|^2, \qquad K > 0.$$
Proof. First, since $y_{k+1} = x_k - \gamma\nabla f(x_k)$, we have
$$\langle \nabla f(x_k), x_k - x^\star\rangle = \frac{1}{2\gamma}\big(\gamma^2\|\nabla f(x_k)\|^2 + \|x_k - x^\star\|^2 - \|y_{k+1} - x^\star\|^2\big). \tag{8}$$
From Lemma 2.2, it holds that
$$\langle y_{k+1} - x_{k+1}, x^\star - x_{k+1}\rangle \le 0,$$
which leads to
$$\|x_{k+1} - x^\star\|^2 + \|y_{k+1} - x_{k+1}\|^2 \le \|y_{k+1} - x^\star\|^2. \tag{9}$$
Substituting (9) into (8), we have
$$\langle \nabla f(x_k), x_k - x^\star\rangle \le \frac{1}{2\gamma}\big(\gamma^2\|\nabla f(x_k)\|^2 + \|x_k - x^\star\|^2 - \|x_{k+1} - x^\star\|^2 - \|y_{k+1} - x_{k+1}\|^2\big). \tag{10}$$
Then, using convexity of $f$ in the first inequality and (10) with $\gamma = \frac{1}{L}$ in the second,
$$\begin{aligned}
\sum_{k=0}^{K-1}\big(f(x_k) - f(x^\star)\big)
&\le \sum_{k=0}^{K-1}\langle \nabla f(x_k), x_k - x^\star\rangle \\
&\le \frac{1}{2L}\sum_{k=0}^{K-1}\|\nabla f(x_k)\|^2 + \frac{L}{2}\|x_0 - x^\star\|^2 - \frac{L}{2}\sum_{k=0}^{K-1}\|y_{k+1} - x_{k+1}\|^2.
\end{aligned} \tag{11}$$
From Lemma 5.1, we have
$$\frac{1}{2L}\sum_{k=0}^{K-1}\|\nabla f(x_k)\|^2 \le \sum_{k=0}^{K-1}\Big(f(x_k) - f(x_{k+1}) + \frac{L}{2}\|y_{k+1} - x_{k+1}\|^2\Big) = f(x_0) - f(x_K) + \frac{L}{2}\sum_{k=0}^{K-1}\|y_{k+1} - x_{k+1}\|^2.$$
Plugging this into (11), we have
$$\sum_{k=1}^{K}\big(f(x_k) - f(x^\star)\big) \le \frac{L}{2}\|x_0 - x^\star\|^2.$$
By Lemma 5.2, $f(x_K) \le f(x_k)$ for every $k \le K$, so the left-hand side is at least $K\big(f(x_K) - f(x^\star)\big)$, which completes the proof.
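For intuition, the bound of Theorem 5.3 (and the per-step decrease of Lemma 5.2) can be checked numerically. The sketch below uses a test problem of our own choosing, not from the text; since the constrained minimizer $x^\star$ has no closed form here, a long projected gradient run is used as a stand-in for it.

```python
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.normal(size=(20, 8)), rng.normal(size=20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2       # convex, L-smooth objective
grad = lambda x: A.T @ (A @ x - b)
project = lambda y: np.clip(y, 0.0, 1.0)                 # projection onto the box [0, 1]^d
L = np.linalg.eigvalsh(A.T @ A).max()
gamma = 1.0 / L

# Approximate x* and f* with a long projected gradient run.
x_star = np.zeros(8)
for _ in range(20000):
    x_star = project(x_star - gamma * grad(x_star))
f_star = f(x_star)

x0 = np.ones(8)
x, K = x0.copy(), 100
for k in range(K):
    x_next = project(x - gamma * grad(x))
    # Lemma 5.2: sufficient decrease at every step (small tolerance for rounding).
    assert f(x_next) <= f(x) - 0.5 * L * np.linalg.norm(x_next - x) ** 2 + 1e-10
    x = x_next

bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * K)
print(f(x) - f_star, "<=", bound)                        # Theorem 5.3
```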
5.2 Smooth and strongly convex problem
Theorem 5.4. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a closed convex set, and let $f: \mathcal{X} \to \mathbb{R}$ be differentiable, $L$-smooth and $\mu$-strongly convex. If $\gamma = \frac{1}{L}$, projected gradient descent (5) with arbitrary $x_0 \in \mathcal{X}$ satisfies
$$\|x_K - x^\star\|^2 \le \Big(1 - \frac{\mu}{L}\Big)^K \|x_0 - x^\star\|^2, \qquad K > 0.$$
Proof. We have
$$\begin{aligned}
f(x_k) - f(x^\star)
&\le \langle \nabla f(x_k), x_k - x^\star\rangle - \frac{\mu}{2}\|x_k - x^\star\|^2 \\
&\le \frac{1}{2\gamma}\big(\gamma^2\|\nabla f(x_k)\|^2 + \|x_k - x^\star\|^2 - \|x_{k+1} - x^\star\|^2 - \|y_{k+1} - x_{k+1}\|^2\big) - \frac{\mu}{2}\|x_k - x^\star\|^2,
\end{aligned} \tag{12}$$
where the first inequality follows from strong convexity and the second from (10). Rearranging (12) leads to
$$\|x_{k+1} - x^\star\|^2 \le 2\gamma\big(f(x^\star) - f(x_k)\big) + \gamma^2\|\nabla f(x_k)\|^2 - \|y_{k+1} - x_{k+1}\|^2 + (1 - \gamma\mu)\|x_k - x^\star\|^2. \tag{13}$$
Using Lemma 5.1 and the optimality of $x^\star$ (so that $f(x^\star) \le f(x_{k+1})$), we have
$$f(x^\star) - f(x_k) \le f(x_{k+1}) - f(x_k) \le -\frac{1}{2L}\|\nabla f(x_k)\|^2 + \frac{L}{2}\|y_{k+1} - x_{k+1}\|^2. \tag{14}$$
Substituting (14) into (13) and using $\gamma = \frac{1}{L}$, we have
$$\|x_{k+1} - x^\star\|^2 \le \Big(1 - \frac{\mu}{L}\Big)\|x_k - x^\star\|^2.$$
Applying this inequality recursively completes the proof.
Remark: In Theorem 5.4, if we instead suppose that $f$ is differentiable, $L$-smooth and $\mu$-strongly convex on all of $\mathbb{R}^d$, then the result can be strengthened to
$$\|x_K - x^\star\| \le \Big(1 - \frac{\mu}{L}\Big)^K \|x_0 - x^\star\|, \qquad K > 0.$$
The proof can be found in Chapter 4, Theorem 5.7.
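The linear rate of Theorem 5.4 can be checked numerically in the same spirit. The sketch below uses our own test problem (not from the text), with $\mu$ and $L$ taken as the smallest and largest eigenvalues of $A^\top A$, so that $f(x) = \frac{1}{2}\|Ax - b\|^2$ is $\mu$-strongly convex whenever $A$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(3)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)     # full column rank with probability 1
grad = lambda x: A.T @ (A @ x - b)
project = lambda y: np.clip(y, -1.0, 1.0)                # projection onto the box [-1, 1]^d
eigs = np.linalg.eigvalsh(A.T @ A)
mu, L = eigs.min(), eigs.max()                           # strong convexity / smoothness constants
gamma = 1.0 / L

x_star = np.zeros(5)                                     # approximate x* by a long run
for _ in range(20000):
    x_star = project(x_star - gamma * grad(x_star))

x0 = np.ones(5)
x, K = x0.copy(), 50
for _ in range(K):
    x = project(x - gamma * grad(x))

rate = (1 - mu / L) ** K                                 # contraction factor of Theorem 5.4
print(np.linalg.norm(x - x_star) ** 2, "<=", rate * np.linalg.norm(x0 - x_star) ** 2)
```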