MAT3007 Optimization
Optimality Conditions
Junfeng WU
School of Data Science
The Chinese University of Hong Kong, Shenzhen
1 / 33
Recap: Nonlinear Optimization
Some terminologies:
▶ Global vs local optimizer (minimizer)
▶ Gradient, Hessian, Taylor expansions
Then we studied the optimality conditions for unconstrained problems.
Theorem (First-Order Necessary Condition)
If x∗ is a local minimizer of f (·) for an unconstrained problem, then we
must have ∇f (x∗ ) = 0.
▶ The FONC can be used to find candidates for local minimizers
▶ However, FONC is not sufficient
2 / 33
Optimality Conditions for Unconstrained Problems (Continued)
3 / 33
Second-Order Necessary Conditions
4 / 33
Second-Order Necessary Condition
Consider the Taylor expansion again but to the 2nd order (assuming f is
twice continuously differentiable):
f (x + td) = f (x) + t∇f (x)⊤ d + (1/2) t2 d⊤ ∇2 f (x)d + o(t2 ).
When the first-order necessary condition holds, we have:
f (x + td) = f (x) + (1/2) t2 d⊤ ∇2 f (x)d + o(t2 ).
In order for x to be a local minimizer, we also need d⊤ ∇2 f (x)d to be
nonnegative for every d ∈ Rn .
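A small numerical sanity check (not from the slides) of this expansion: the sketch below evaluates the quadratic model for an arbitrary smooth test function f and confirms that the remainder behaves like o(t2). The function, the point x, and the direction d are all illustrative choices.

```python
import numpy as np

# Illustrative test function: f(x) = x1^4 + x1*x2 + (1 + x2)^2.
def f(x):
    return x[0]**4 + x[0]*x[1] + (1 + x[1])**2

def grad(x):
    return np.array([4*x[0]**3 + x[1], x[0] + 2*(1 + x[1])])

def hess(x):
    return np.array([[12*x[0]**2, 1.0],
                     [1.0,        2.0]])

x = np.array([1.0, -0.5])
d = np.array([0.3, 0.7])

for t in [1e-1, 1e-2, 1e-3]:
    quad = f(x) + t * grad(x) @ d + 0.5 * t**2 * d @ hess(x) @ d
    err = abs(f(x + t*d) - quad)
    print(f"t={t:.0e}  remainder={err:.2e}  remainder/t^2={err/t**2:.2e}")
# remainder/t^2 tends to 0 as t shrinks, which is exactly the o(t^2) behavior.
```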
5 / 33
Second-Order Necessary Condition (SONC)
Theorem: Second-Order Necessary Conditions
If x∗ is a local minimizer of f , then it holds that:
1. ∇f (x∗ ) = 0;
2. For all d ∈ Rn : d⊤ ∇2 f (x∗ )d ≥ 0.
Definition: Semidefiniteness
We call a (symmetric) matrix A positive (negative) semidefinite (PSD/NSD)
if and only if for all x we have x⊤ Ax ≥ 0 (≤ 0).
Remark:
▶ Therefore, the second-order necessary condition requires the Hessian
matrix at x∗ to be PSD. In the one-dimensional case, this is
equivalent to f ′′ (x∗ ) ≥ 0.
6 / 33
Positive Semidefinite Matrices
Here are some useful facts about PSD matrices:
▶ We usually only talk about PSD properties for symmetric matrices.
▶ If a matrix A is not symmetric, we use (1/2)(A + A⊤ ) to define the PSD
properties (because x⊤ Ax = (1/2) x⊤ (A + A⊤ )x).
▶ A symmetric matrix is PSD if and only if all the eigenvalues are
nonnegative.
▶ For any matrix A, A⊤ A is a (symmetric) PSD matrix.
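A minimal sketch (not part of the slides) showing how these facts can be checked numerically with NumPy: symmetrize the matrix, inspect its eigenvalues, and verify that A⊤ A is PSD. The example matrices are arbitrary.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Test positive semidefiniteness via the symmetric part and its eigenvalues."""
    S = 0.5 * (A + A.T)               # x^T A x = x^T S x, so definiteness is judged via S
    return np.linalg.eigvalsh(S).min() >= -tol

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])           # eigenvalues 1 and 3, so PSD (in fact PD)
B = np.array([[1.0, 3.0],
              [0.0, 1.0]])            # not symmetric; judged via (B + B^T)/2

print(is_psd(A))          # True
print(is_psd(B))          # False: (B + B^T)/2 has eigenvalue -0.5
print(is_psd(B.T @ B))    # True: A^T A is always PSD
```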
7 / 33
Example Continued
For f (x) := x4 − 9x2 + 4x − 1, the second-order condition is:
f ′′ (x) = 12x2 − 18 ≥ 0
Only x1 = −1 − √6/2 and x3 = 2 satisfy the condition. But for the point
x2 = −1 + √6/2, we obtain f ′′ (x2 ) = 12(1 − √6) < 0 (thus, x2 is not a
local minimizer).
In the example of the least squares problem, we use the following fact:
▶ If f (x) = x⊤ M x (M is symmetric), then ∇2 f (x) = 2M .
Hence the Hessian matrix in that problem is 2X ⊤ X, which is always a
PSD matrix, so the SONC always holds!
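As a quick numerical companion to this example (assuming f (x) = x4 − 9x2 + 4x − 1 as above), the sketch below finds the stationary points from f ′ and evaluates f ′′ at each, reproducing the signs used on this slide.

```python
import numpy as np

# f(x) = x^4 - 9x^2 + 4x - 1, so f'(x) = 4x^3 - 18x + 4 and f''(x) = 12x^2 - 18.
stationary = np.sort(np.roots([4.0, 0.0, -18.0, 4.0]).real)   # roots of f'
for x in stationary:
    print(f"x = {x: .4f},   f''(x) = {12*x**2 - 18: .4f}")
# The smallest and largest stationary points have f'' > 0 (SONC holds there);
# the middle one has f'' < 0, so it cannot be a local minimizer.
```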
8 / 33
SONC is Not Sufficient
However, even if both the first- and second-order necessary conditions
hold, we still cannot guarantee that the candidate is a local minimum!
Example: Consider f (x) = x3 at 0.
▶ f ′ (0) = f ′′ (0) = 0, thus FONC and SONC hold.
▶ But 0 is not a local minimum.
▶ A point x satisfying ∇f (x) = 0 is called a critical point or stationary
point.
▶ The SONC can be used to verify that a stationary point is not a local
minimizer.
⇝ By modifying the SONC, we can get a sufficient condition.
9 / 33
Second-Order Sufficient Conditions
10 / 33
Second-Order Sufficient Condition (SOSC)
Theorem: Second-Order Sufficient Conditions
Let f be twice continuously differentiable. If x∗ satisfies:
1. ∇f (x∗ ) = 0;
2. For all d ∈ Rn \{0}: d⊤ ∇2 f (x∗ )d > 0;
then x∗ is a strict local minimum/minimizer of f .
Definition: Definite Matrices
We call a (symmetric) matrix A positive (negative) definite (PD/ND) if and
only if for all x ̸= 0: x⊤ Ax > 0 (< 0).
▶ A PD matrix must be PSD (thus PD is a stronger notion).
▶ A symmetric matrix is PD ⇐⇒ all its eigenvalues are positive.
11 / 33
Proof
We need the following lemma
Lemma: Bounds and Eigenvalues
Let A ∈ Rn×n be a symmetric matrix. Then
λmin (A)∥x∥2 ≤ x⊤ Ax ≤ λmax (A)∥x∥2 ∀ x ∈ Rn ,
where λmin (A) and λmax (A) are the smallest and largest eigenvalues of A.
The proof is by another variant of the Taylor expansion (using ∇f (x∗ ) = 0), i.e.,
f (x∗ + d) = f (x∗ ) + (1/2) d⊤ ∇2 f (x∗ )d + o(∥d∥2 ),
as d tends to 0.
When ∇2 f (x∗ ) is positive definite, the lemma gives d⊤ ∇2 f (x∗ )d ≥ µ∥d∥2 ,
where µ > 0 is the smallest eigenvalue of ∇2 f (x∗ ).
Thus, we have
f (x∗ + d) ≥ f (x∗ ) + (µ/2)∥d∥2 + o(∥d∥2 ) = f (x∗ ) + ∥d∥2 ( µ/2 + o(∥d∥2 )/∥d∥2 ).
Since o(∥d∥2 )/∥d∥2 → 0 as ∥d∥ → 0, for all sufficiently small d we have
o(∥d∥2 )/∥d∥2 ≥ −µ/4, which shows
f (x∗ + d) > f (x∗ ).
12 / 33
For Maximization Problems
Our conditions are derived for minimization problems. For maximization
problems, we just change the inequalities. Let f ∈ C 2 (twice continuously
differentiable).
Theorem: FONC for Maximization
If x∗ is a local (unconstrained) maximizer of f , then we must have ∇f (x∗ ) =
0.
Theorem: SONC for Maximization
If x∗ is a local maximizer of f , then we must have 1.) ∇f (x∗ ) = 0; 2.)
∇2 f (x∗ ) is negative semidefinite.
Theorem: SOSC for Maximization
If x∗ satisfies 1.) ∇f (x∗ ) = 0; 2.) ∇2 f (x∗ ) is negative definite, then x∗ is
a strict local maximizer.
13 / 33
Optimality Conditions
Optimality Conditions for Unconstrained Problems:
▶ First-order necessary condition.
▶ Second-order necessary condition.
▶ Second-order sufficient condition.
In many cases, we can utilize these conditions to identify local and global
optimal solutions.
General Strategy:
▶ Use FONC and SONC to identify all possible candidates. Then, use
the sufficient conditions to verify.
▶ If a problem has only one stationary point and one can argue that an
optimal solution must exist (be attained), then this point must be the
(global) optimum.
14 / 33
Examples–I
In the example f (x) = x4 − 9x2 + 4x − 1, the points x1 and x3 satisfy the
second-order sufficient condition (f ′′ (x) > 0 there) and are local minimizers.
In the least squares problem, if X ⊤ X is positive definite (equivalently,
since it is always PSD, if it is invertible), then the solution β of the FONC
X ⊤ Xβ = X ⊤ y
is unique and it satisfies the second-order sufficient conditions.
⇝ It must be the unique global minimizer of the problem.
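A brief illustration (with randomly generated data, not from the slides) that solving the FONC system X ⊤ Xβ = X ⊤ y recovers the least squares minimizer, and that the Hessian 2X ⊤ X is positive definite here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))            # X^T X is PD with probability 1 here
y = rng.standard_normal(50)

beta = np.linalg.solve(X.T @ X, X.T @ y)    # unique solution of the FONC

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # NumPy's own least squares solver
print(np.allclose(beta, beta_ls))                  # True
print(np.linalg.eigvalsh(X.T @ X).min() > 0)       # Hessian 2 X^T X is PD, so SOSC holds
```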
15 / 33
Optimality Conditions for Unconstrained Problems (Continued)
16 / 33
Constrained Problems
We have derived necessary and sufficient conditions for local minimizers
of unconstrained problems.
▶ What is the difference between constrained and unconstrained
problems?
Consider the example f (x) = 100x2 (1 − x)2 − x with constraint
−0.2 ≤ x ≤ 0.8.
In addition to the original local minimizer (x1 = 0.013), there is one more
local minimizer on the boundary (x = 0.8).
17 / 33
Constrained Problems
At the boundary point (x∗ = 0.8), the FONC is not satisfied:
f ′ (0.8) < 0.
However, at this point, in order to stay feasible, we can only move leftward.
That is, in the Taylor expansion
f (x∗ + d) = f (x∗ ) + df ′ (x∗ ) + o(d)
we can only take d to be negative (otherwise it won’t be feasible).
Thus f (x∗ + d) > f (x∗ ) for every sufficiently small feasible step d, i.e., in a
small neighborhood of x∗ within the feasible region. Hence x∗ is a local minimizer.
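A quick numerical check of this argument (using f (x) = 100x2 (1 − x)2 − x from the previous slide, an assumption recalled here): f ′ (0.8) is negative, and every feasible (leftward) step increases f .

```python
import numpy as np

f = lambda x: 100 * x**2 * (1 - x)**2 - x
fprime = lambda x: 200 * x * (1 - x) * (1 - 2*x) - 1

x_star = 0.8
print(fprime(x_star))                    # about -20.2 < 0, so the unconstrained FONC fails
for d in [-1e-1, -1e-2, -1e-3]:          # only d < 0 keeps x_star + d feasible
    print(f(x_star + d) > f(x_star))     # True each time: feasible moves increase f
```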
18 / 33
Feasible Directions
Now we formalize the above arguments.
Definition (Feasible Direction)
Given x ∈ F , we call d to be a feasible direction at x if there exists ᾱ > 0
such that x + αd ∈ F for all 0 ≤ α ≤ ᾱ.
For example,
▶ If F = {x | Ax = b}, then the set of feasible directions at x is {d | Ad = 0}.
▶ If F = {x | Ax ≥ b}, then the set of feasible directions at x is
{d | aTi d ≥ 0 for all i with aTi x = bi }.
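A small sketch (toy data, not from the slides) of the feasible-direction test for F = {x | Ax ≥ b}: a direction d is feasible at x exactly when aTi d ≥ 0 for every active constraint.

```python
import numpy as np

def is_feasible_direction(A, b, x, d, tol=1e-9):
    """Feasible-direction test at x for the set {x : A x >= b}."""
    active = np.isclose(A @ x, b, atol=tol)       # constraints with a_i^T x = b_i
    return bool(np.all(A[active] @ d >= -tol))

A = np.eye(2)                   # constraints x1 >= 0 and x2 >= 0
b = np.zeros(2)
x = np.array([0.0, 2.0])        # first constraint active, second inactive

print(is_feasible_direction(A, b, x, np.array([1.0, -1.0])))   # True
print(is_feasible_direction(A, b, x, np.array([-1.0, 0.0])))   # False: violates x1 >= 0
```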
19 / 33
FONC for Constrained Problems
Theorem (FONC for Constrained Problems)
If x∗ is a local minimum, then for any feasible direction d at x∗ , we must
have ∇f (x∗ )T d ≥ 0.
In unconstrained problems, all directions are feasible (both d and −d for
every d), so we must have ∇f (x∗ ) = 0.
20 / 33
An Alternative View
Definition (Descent Direction)
Let f be continuously differentiable. Then d is called a descent direction
at x if and only if ∇f (x)T d < 0.
⇝ If d is a descent direction at x, then there exists γ̄ > 0 such that
f (x + γd) < f (x) for all 0 < γ ≤ γ̄.
Denote the set of feasible directions at x by SF (x) and the set of
descent directions at x by SD (x). Then the first-order necessary condition
can be written as:
SF (x∗ ) ∩ SD (x∗ ) = ∅
In other words, there cannot be any feasible descent direction at x∗ .
21 / 33
Nonlinear Optimization with Equality Constraints
Consider
minimize_x  f (x)
s.t.  Ax = b
▶ The feasible direction set is {d|Ad = 0}.
▶ The descent direction set is {d|∇f (x)T d < 0}.
The FONC says that, at a local minimum, no d can solve both systems at
once, i.e., there is no feasible descent direction.
Theorem (Alternative System)
The system Ad = 0 and ∇f (x)T d < 0 does not have a solution if and
only if there exists y such that
AT y = ∇f (x)
22 / 33
Nonlinear Optimization with Equality Constraints
Therefore, the first-order necessary condition for
minimize_x  f (x)                    (1)
s.t.  Ax = b
is that there exists y such that
AT y = ∇f (x)
Theorem
If x∗ is a local minimum for (1), then there must exist y such that
AT y = ∇f (x∗ )
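One hedged way (not from the slides) to test this condition numerically: solve AT y = ∇f (x) in the least squares sense and check whether the residual vanishes, i.e., whether ∇f (x) lies in the row space of A. The toy instance below is the example on the next slide.

```python
import numpy as np

def fonc_equality_holds(A, grad_x, tol=1e-8):
    """Check whether A^T y = grad_x is solvable (grad_x lies in the row space of A)."""
    y, *_ = np.linalg.lstsq(A.T, grad_x, rcond=None)
    return np.linalg.norm(A.T @ y - grad_x) <= tol

# Toy instance: minimize (x1-1)^2 + (x2-1)^2  s.t.  x1 + x2 = 1.
A = np.array([[1.0, 1.0]])
grad = lambda x: np.array([2*(x[0] - 1), 2*(x[1] - 1)])

print(fonc_equality_holds(A, grad(np.array([0.5, 0.5]))))   # True: FONC candidate
print(fonc_equality_holds(A, grad(np.array([1.0, 0.0]))))   # False: FONC fails here
```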
23 / 33
Proof
First, it is easy to see that if there exists y such that AT y = ∇f (x), then
there cannot be a d with Ad = 0 and ∇f (x)T d < 0 (multiplying both sides
of the equation by dT gives ∇f (x)T d = (Ad)T y = 0, a contradiction).
To prove the reverse, consider the LP:
minimize_d  ∇f (x)T d
s.t.  Ad = 0
If there does not exist d satisfying Ad = 0 and ∇f (x)T d < 0, then the
optimal value of this LP must be 0 (d = 0 is feasible with objective value 0).
Therefore, by the strong duality theorem, its dual problem must also be
feasible (with optimal value 0). The dual constraint is exactly
AT y = ∇f (x), so such a y exists. Thus the theorem is proved. □
24 / 33
Example
Consider the problem:
minimize (x1 − 1)2 + (x2 − 1)2
s.t. x1 + x2 = 1
▶ This problem finds the nearest point on the line x1 + x2 = 1 to the
point (1, 1)
Figure: Finding the nearest point on the line to (1,1)
25 / 33
Example Continued
By the FONC, x = (x1 , x2 ) is a local minimizer if there exists y such that
AT y = ∇f (x)
Here A = (1, 1) and ∇f (x) = (2x1 − 2, 2x2 − 2)T .
Thus the FONC requires that there exists y such that
2x1 − 2 = y,   2x2 − 2 = y.
Combined with the constraint x1 + x2 = 1, we find that
x1 = x2 = 1/2
is the only candidate for a local minimizer, and it is indeed a local
minimizer (also a global minimizer).
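A short check (a sketch, not the slides' own computation) that solves the FONC together with the constraint as one linear system in (x1 , x2 , y) and recovers x1 = x2 = 1/2.

```python
import numpy as np

# Equations: 2x1 - y = 2,  2x2 - y = 2,  x1 + x2 = 1, unknowns (x1, x2, y).
M = np.array([[2.0, 0.0, -1.0],
              [0.0, 2.0, -1.0],
              [1.0, 1.0,  0.0]])
rhs = np.array([2.0, 2.0, 1.0])

x1, x2, y = np.linalg.solve(M, rhs)
print(x1, x2, y)    # 0.5 0.5 -1.0
```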
26 / 33
Another Example
Consider a constrained version of the least squares problem:
minimize_β  ∥Xβ − y∥²
s.t.  W β = ξ
The gradient is 2(X T Xβ − X T y).
Therefore, the FONC is that there exists z such that
W T z = 2(X T Xβ − X T y)
Therefore, an optimal β must satisfy (together with some z):
W β = ξ,    X T Xβ = (1/2) W T z + X T y
27 / 33
Another Example Continued
W β = ξ,    X T Xβ = (1/2) W T z + X T y
We can write this as:
[ W       0          ] [ β ]   [ ξ     ]
[ X T X   −(1/2) W T ] [ z ] = [ X T y ]
Let the size of X be m × n and the size of W be d × n. Then this is a
system of n + d linear equations in n + d unknowns (β ∈ Rn and z ∈ Rd ).
Solving this system yields the unique candidate for a local minimizer
(provided the coefficient matrix on the left-hand side has full rank).
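A minimal sketch (random data; it assumes X has full column rank and W full row rank so that the block system is nonsingular) of assembling and solving this system with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 30, 5, 2
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
W = rng.standard_normal((d, n))
xi = rng.standard_normal(d)

# Block system:  [ W        0       ] [beta]   [ xi    ]
#                [ X^T X  -(1/2)W^T ] [ z  ] = [ X^T y ]
K = np.block([[W,        np.zeros((d, d))],
              [X.T @ X,  -0.5 * W.T      ]])
rhs = np.concatenate([xi, X.T @ y])

sol = np.linalg.solve(K, rhs)
beta, z = sol[:n], sol[n:]
print(np.allclose(W @ beta, xi))    # True: the candidate satisfies the constraint
```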
28 / 33
Inequality Constraints
Now we consider an inequality constrained problem:
minimize_x  f (x)
s.t.  Ax ≥ b                    (2)
What should be the necessary optimality conditions?
Theorem
If x∗ is a local minimum of (2), then there exists some y ≥ 0 satisfying
∇f (x∗ ) = AT y
yi · (aTi x∗ − bi ) = 0, ∀i
where aTi is the ith row of A.
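A hedged illustration on a made-up instance (minimize (x1 + 1)2 + (x2 + 1)2 subject to x ≥ 0, i.e., A = I and b = 0): at x∗ = (0, 0) both constraints are active, and y = (2, 2) verifies all three requirements of the theorem.

```python
import numpy as np

# Made-up instance: f(x) = (x1+1)^2 + (x2+1)^2, constraints A x >= b with A = I, b = 0.
A = np.eye(2)
b = np.zeros(2)
grad = lambda x: np.array([2*(x[0] + 1), 2*(x[1] + 1)])

x_star = np.zeros(2)            # the constrained minimizer sits on the boundary
y = np.array([2.0, 2.0])        # candidate multipliers

print(np.allclose(A.T @ y, grad(x_star)))       # True: gradient condition
print(bool(np.all(y >= 0)))                     # True: nonnegative multipliers
print(np.allclose(y * (A @ x_star - b), 0.0))   # True: complementary slackness
```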
29 / 33
Proof
We consider the descent directions and the feasible directions at x∗ .
First it is easy to see that the descent directions are:
SD (x∗ ) = {d : ∇f (x∗ )T d < 0}
For the feasible directions, it is
SF (x∗ ) = {d : aTi d ≥ 0, if aTi x∗ = bi }
Local optimality requires that SD (x∗ ) ∩ SF (x∗ ) = ∅. Define
A(x) = {i : aTi x = bi } to be the set of active constraints at x; then the
necessary condition is:
There does not exist d such that
1. ∇f (x∗ )T d < 0
2. aTi d ≥ 0 for i ∈ A(x∗ )
30 / 33
Proof Continued
The nonexistence of d such that
1. ∇f (x)T d < 0
2. aTi d ≥ 0 for i ∈ A(x)
is equivalent (again by LP duality; this is Farkas’ lemma) to the existence
of y ≥ 0 such that
∇f (x) = Σ_{i ∈ A(x)} ai yi
This can be further written as the following conditions:
▶ There exists y ≥ 0 such that
∇f (x) = AT y
yi · (aTi x − bi ) = 0, ∀i
31 / 33
More General Cases — KKT Conditions
We have discussed cases with linear equality constraints or linear inequality
constraints and derived the (necessary) optimality conditions
▶ We want to extend them to more general cases — KKT conditions
▶ We call the first-order necessary conditions for a general optimization
problem the KKT conditions
▶ Solutions that satisfy the KKT conditions are called KKT points.
▶ KKT points are candidate points for local optimal solutions.
▶ The KKT conditions were originally named after H. Kuhn and A.
Tucker, who first published the conditions in 1951. Later scholars
discovered that the conditions had been stated by W. Karush in his
master’s thesis in 1939.
32 / 33
Find KKT Conditions
We consider the general nonlinear optimization problem:
minimize_x  f (x)
s.t.  gi (x) ≥ 0,   i = 1, ..., m
      hi (x) = 0,   i = 1, ..., p
      ℓi (x) ≤ 0,   i = 1, ..., r
      xi ≥ 0,   i ∈ M
      xi ≤ 0,   i ∈ N
      xi free,   i ∉ M ∪ N
One can use the feasible/descent direction arguments to derive the KKT
conditions, but it is not very convenient.
▶ In the next lecture, we present a direct approach
33 / 33