UW-Madison CS/ISyE/Math/Stat 726 Spring 2024
Lecture 1–2: Optimization Background
Yudong Chen
1 Introduction
Our standard optimization problem:

min_{x ∈ X} f(x)    (P)
• x: a vector, the optimization/decision variable
• X: feasible set
• f(x): objective function, real-valued
• max_{x} f(x) ⇐⇒ min_{x} −f(x)
The (optimal) value of (P):

val(P) = inf_{x ∈ X} f(x).
To fully specify (P), we need to specify
• vector space, feasible set, objective function;
• what it means to solve (P).
1.1 Can we even hope to solve an arbitrary optimization problem?
Example 1. Suppose we want to find positive integers x, y, z satisfying

x^3 + y^3 = z^3.

Can be formulated as a (continuous) optimization problem (PF):

min_{x,y,z,n} (x^n + y^n − z^n)^2
s.t. x ≥ 1, y ≥ 1, z ≥ 1, n ≥ 3    (PF)
sin^2(πn) + sin^2(πx) + sin^2(πy) + sin^2(πz) = 0.
If we could certify whether val(PF) ≠ 0, we would have found a proof of Fermat’s Last Theorem (1637):

For any n ≥ 3, x^n + y^n = z^n has no solutions over the positive integers.

Proved by Andrew Wiles in 1994.
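The penalty formulation above can be sanity-checked numerically: at integer points the sin^2 terms vanish, leaving only (x^n + y^n − z^n)^2. A minimal sketch (the function name pf_objective is ours, not from the notes):

```python
import math

# Evaluate the two pieces of the objective of (PF) at a candidate point.
# At integer x, y, z, n the sin^2 penalty is (numerically) zero, so the
# objective reduces to (x^n + y^n - z^n)^2.
def pf_objective(x, y, z, n):
    fermat_term = (x ** n + y ** n - z ** n) ** 2
    penalty = sum(math.sin(math.pi * t) ** 2 for t in (n, x, y, z))
    return fermat_term, penalty

val, pen = pf_objective(3, 4, 5, 3)  # 3^3 + 4^3 - 5^3 = -34, so val = 1156
```

Fermat’s Last Theorem says the first term is strictly positive at every feasible integer point with n ≥ 3, i.e., val(PF) > 0.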
Example 2. Unconstrained optimization, many local minima.¹
We cannot hope for solving an arbitrary optimization problem.
We need some structure.
2 Specifying the optimization problem
2.1 Vector space
This is where the optimization variable and the feasible set live.
(Rd , ∥·∥): normed vector space, “primal space”.
• The variable x is a (column) vector in R^d:

x = (x_1, x_2, . . . , x_d)^⊤.
• The norm tells us how to measure distances in R^d.
Most often, we will take ∥x∥ = ∥x∥_2 = (∑_{i=1}^d x_i^2)^{1/2} (Euclidean norm).
We sometimes also consider the ℓ_p norm ∥x∥_p = (∑_{i=1}^d |x_i|^p)^{1/p}, p ≥ 1:
• ∥x∥_1 = ∑_i |x_i|,
• ∥x∥_∞ = max_{1≤i≤d} |x_i|.
(Plots of unit balls of ℓ2 , ℓ1 , ℓ∞ norms.)
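As a quick illustration of the definitions, a small sketch computing the ℓ_1, ℓ_2, and ℓ_∞ norms directly from the formulas above (pure Python, no libraries assumed):

```python
# l_p norm from the definition: (sum_i |x_i|^p)^(1/p), with the
# l-infinity norm as the limiting case max_i |x_i|.
def lp_norm(x, p):
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
n1 = lp_norm(x, 1)               # |3| + |-4| = 7
n2 = lp_norm(x, 2)               # sqrt(9 + 16) = 5
ninf = lp_norm(x, float("inf"))  # max(3, 4) = 4
```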
¹ Left: plot by Jelena Diakonikolas. Right: loss surfaces of ResNet-56 without skip connections (https://arxiv.org/pdf/1712.09913.pdf).
We will use ⟨·, ·⟩ to denote inner products. Standard inner product:

⟨x, y⟩ = x^⊤ y = ∑_{i=1}^d x_i y_i.
When we work with (R^d, ∥·∥_p), view ⟨y, x⟩ as the value of a linear function y at x. So, if we are measuring the length of x using ∥·∥_p, we should measure the length of y using ∥·∥_q, where 1/p + 1/q = 1.
Definition 1 (Dual norm). The dual norm of ∥·∥ is given by

∥z∥_* := sup_{∥x∥ ≤ 1} ⟨z, x⟩.
From the definition we immediately have the following.

Proposition 1 (Hölder’s inequality). For all z, x ∈ R^d:

|⟨z, x⟩| ≤ ∥z∥_* · ∥x∥.
Proof. Fix any two vectors x, z. Assume x ≠ 0 and z ≠ 0 (otherwise the inequality is trivial). Define x̂ = x/∥x∥. Then

∥z∥_* ≥ ⟨z, x̂⟩ = ⟨z, x⟩ / ∥x∥,

and hence ⟨z, x⟩ ≤ ∥z∥_* · ∥x∥. Applying the same argument with x replaced by −x proves −⟨z, x⟩ ≤ ∥z∥_* · ∥x∥.
Example 3. ∥·∥_p and ∥·∥_q are duals when 1/p + 1/q = 1. In particular, ∥·∥_2 is its own dual; ∥·∥_1 and ∥·∥_∞ are dual to each other.
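Hölder’s inequality and the dual pairs in Example 3 can be spot-checked numerically; a small sketch over random vectors:

```python
import random

def lp_norm(x, p):
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def inner(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

# Check |<z, x>| <= ||z||_q * ||x||_p for the dual pairs (p, q).
random.seed(0)
holder_ok = True
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(5)]
    z = [random.uniform(-1, 1) for _ in range(5)]
    for p, q in [(2, 2), (1, float("inf")), (float("inf"), 1)]:
        if abs(inner(x, z)) > lp_norm(z, q) * lp_norm(x, p) + 1e-12:
            holder_ok = False
```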
In R^d, all ℓ_p norms are equivalent. In particular,

∀ x ∈ R^d, p ≥ 1, r > p : ∥x∥_r ≤ ∥x∥_p ≤ d^{1/p − 1/r} ∥x∥_r.

However, the choice of norm affects how an algorithm’s performance depends on the dimension d.
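The equivalence bounds can likewise be tested on random vectors; a sketch verifying ∥x∥_r ≤ ∥x∥_p ≤ d^{1/p − 1/r} ∥x∥_r for a few pairs r > p ≥ 1:

```python
import random

def lp_norm(x, p):
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

# Verify ||x||_r <= ||x||_p <= d^(1/p - 1/r) ||x||_r for r > p >= 1.
random.seed(1)
d = 6
equiv_ok = True
for _ in range(100):
    x = [random.uniform(-1, 1) for _ in range(d)]
    for p, r in [(1, 2), (2, 4), (1, 3)]:
        lo, hi = lp_norm(x, r), lp_norm(x, p)
        if not (lo <= hi + 1e-12 and hi <= d ** (1.0 / p - 1.0 / r) * lo + 1e-12):
            equiv_ok = False
```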
2.2 Feasible set
The feasible set
X ⊆ Rd
specifies what solution points we are allowed to output.
If X = Rd , we say that (P) is unconstrained. Otherwise we say that (P) is constrained.
X can be specified:
• as an abstract geometric body (a ball, a box, a polyhedron, a convex set)
• via functional constraints:
g_i(x) ≤ 0, i = 1, 2, . . . , m,
h_i(x) = 0, i = 1, . . . , p.

Note that a constraint f_i(x) ≥ C is equivalent to taking g_i(x) = C − f_i(x).
Example 4.
X = B_2(0, 1) = { x ∈ R^d : ∥x∥_2 ≤ 1 } (the unit Euclidean ball).
In this class, we will always assume that X is closed.
Heine–Borel Theorem: X ⊆ R^d is closed and bounded if and only if it is compact (if X ⊆ ∪_{α∈A} U_α for some family of open sets {U_α}, then there exists a finite subfamily {U_{α_i}}_{i=1}^n such that X ⊆ ∪_{1≤i≤n} U_{α_i}).
Weierstrass Extreme Value Theorem: If X is compact and f is a function that is defined and
continuous on X , then f attains its extreme values on X .
What if X is not bounded? Consider f(x) = e^x. Then inf_{x∈R} f(x) = 0, but the infimum is not attained.
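A quick numerical illustration of the contrast: on the compact set [−1, 1] the minimum of e^x is attained (at x = −1), while on all of R the infimum 0 is only approached. A sketch:

```python
import math

# On the compact interval [-1, 1], e^x attains its minimum at the endpoint -1,
# as Weierstrass guarantees (checked on a fine grid).
grid = [-1 + 2 * i / 1000 for i in range(1001)]
min_on_compact = min(math.exp(x) for x in grid)  # = e^{-1}

# On R, taking x -> -infinity drives e^x toward 0, but no point achieves 0.
tail_values = [math.exp(-t) for t in (10.0, 100.0)]  # positive, shrinking
```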
When we work with unconstrained problems, we will normally assume that f is bounded
below.
Convex sets: Except for some special cases, we often assume that the feasible set is convex, so
that we will be able to guarantee tractability.
Definition 2 (Convex set). A set X ⊆ Rd is convex if
∀ x, y ∈ X , ∀α ∈ (0, 1) : (1 − α) x + αy ∈ X
A picture.
We cannot hope to deal with arbitrary nonconvex constraints. E.g., x_i(1 − x_i) = 0 ⇐⇒ x_i ∈ {0, 1}, which yields integer programs.
2.3 Objective function
“cost”, “loss”
Extended real-valued functions:

f : D → R ∪ {−∞, +∞} ≡ R̄.

Here f is defined on D ⊆ R^d. We can extend the definition of f to all of R^d by assigning the value +∞ at each point x ∈ R^d \ D.

Effective domain:

dom(f) = { x ∈ R^d : f(x) < ∞ }.
In the sequel, domain means effective domain.
“Linear and nonlinear optimization” ≈ “continuous optimization” (as contrasted with discrete/combinatorial optimization).
2.3.1 Lower semicontinuous functions
We mostly assume f to be continuous, which can be relaxed slightly.
Definition 3. A function f : R^d → R̄ is said to be lower semicontinuous (l.s.c.) at x ∈ R^d if

f(x) ≤ lim inf_{y→x} f(y).

We say f is l.s.c. on R^d if it is l.s.c. at every point x ∈ R^d.
This definition is mainly useful for allowing indicator functions.
Example 5. Verify yourself: Indicator of a closed set is l.s.c.
I_X(x) = 0 if x ∈ X, and I_X(x) = ∞ if x ∉ X.

Using I_X we can write

min_{x ∈ X} f(x) ≡ min_{x ∈ R^d} { f(x) + I_X(x) },
thereby unifying constrained and unconstrained optimization.
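This unification can be illustrated concretely; a sketch with a hypothetical f(x) = (x − 2)^2 and X = [−1, 1], comparing the constrained minimum with the penalized one:

```python
import math

def f(x):
    return (x - 2.0) ** 2

# Indicator of the closed set X = [-1, 1]: zero inside, +infinity outside.
def indicator(x):
    return 0.0 if -1.0 <= x <= 1.0 else math.inf

grid = [-3 + 6 * i / 6000 for i in range(6001)]
constrained = min(f(x) for x in grid if -1.0 <= x <= 1.0)  # min of f over X
penalized = min(f(x) + indicator(x) for x in grid)         # min of f + I_X over R
# Both equal f(1) = 1: the indicator makes infeasible points infinitely costly.
```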
2.3.2 Continuous and smooth functions
Unless we are abstracting away constraints, the least we will assume about f is that it is continuous.
Sometimes we consider stronger assumptions.
Definition 4. f : Rd → R̄ is said to be
1. Lipschitz-continuous on X ⊆ Rd (w.r.t. the norm ∥·∥) if there exists M < ∞ such that
∀ x, y ∈ X : | f ( x ) − f (y)| ≤ M ∥ x − y∥ .
2. Smooth on X ⊆ R^d (w.r.t. the norm ∥·∥) if f’s gradient is Lipschitz-continuous, i.e., there exists L < ∞ such that²

∀ x, y ∈ X : ∥∇f(x) − ∇f(y)∥_* ≤ L ∥x − y∥.
(Gradient: ∇f(x) = (∂f/∂x_1, . . . , ∂f/∂x_d)^⊤.)
² This definition can be viewed as a quantitative version of C¹-smoothness.
• Picture:
In Rd , Lipschitz-continuity in some norm implies the same for every other norm, but M may differ.
Example 6. f(x) = (1/2)∥x∥_2^2 is 1-smooth on R^d w.r.t. ∥·∥_2. The log-sum-exp (or softmax) function f(x) = log ∑_{i=1}^d exp(x_i) is 1-smooth on R^d w.r.t. ∥·∥_∞.
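The 1-smoothness claims can be tested numerically. For log-sum-exp, smoothness w.r.t. ∥·∥_∞ means ∥∇f(x) − ∇f(y)∥_1 ≤ ∥x − y∥_∞, since ∥·∥_1 is dual to ∥·∥_∞, and the gradient of log-sum-exp is the softmax map. A sketch:

```python
import math
import random

# Gradient of log-sum-exp: the softmax map (shifted by max for stability).
def softmax(x):
    m = max(x)
    e = [math.exp(xi - m) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]

# Check ||softmax(x) - softmax(y)||_1 <= ||x - y||_inf on random pairs.
random.seed(2)
smooth_ok = True
for _ in range(200):
    x = [random.uniform(-3, 3) for _ in range(4)]
    y = [random.uniform(-3, 3) for _ in range(4)]
    lhs = sum(abs(a - b) for a, b in zip(softmax(x), softmax(y)))
    rhs = max(abs(a - b) for a, b in zip(x, y))
    if lhs > rhs + 1e-9:
        smooth_ok = False
```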
Example 7. A function that is continuously differentiable on its domain but not smooth:

f(x) = 1/x,  dom(f) = R_{++}.
2.3.3 Convex functions
Definition 5. f : Rd → R̄ is convex if ∀ x, y ∈ Rd , ∀α ∈ (0, 1) :
f ((1 − α) x + αy) ≤ (1 − α) f ( x ) + α f (y).
A picture.
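The defining inequality is easy to test numerically for a specific function; a sketch checking it for the convex function f(x) = x^2 at random points:

```python
import random

# Check f((1-a)x + ay) <= (1-a)f(x) + a f(y) for f(x) = x^2.
def f(x):
    return x ** 2

random.seed(3)
convex_ok = True
for _ in range(200):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    a = random.uniform(0.0, 1.0)
    lhs = f((1 - a) * x + a * y)
    rhs = (1 - a) * f(x) + a * f(y)
    if lhs > rhs + 1e-9:
        convex_ok = False
```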
Lemma 1. f : R^d → R is convex if and only if its epigraph

epi(f) := { (x, a) : x ∈ R^d, a ∈ R, f(x) ≤ a }

is convex.
Proof. Follows from definitions. Left as exercise.
Definition 6. We say that a function f : Rd → R̄ is proper if ∃ x ∈ Rd s.t. f ( x ) ∈ R.
Lemma 2. If f : Rd → R̄ is proper and convex, then dom( f ) is convex.