Optimization in Machine Learning
Lecture 8: Subgradient and its calculus, Necessary and sufficient conditions for
optimization with and without Convexity, Lipschitz Continuity
Ganesh Ramakrishnan
Department of Computer Science and Engineering (CSE), IIT Bombay
https://www.cse.iitb.ac.in/~ganesh
January, 2025
Outline
Understanding the Convexity of Machine Learning Loss Functions [Done]
First Order Conditions for Convexity [Done]
▶ Direction Vector, Directional derivative
▶ Quasi convexity & Sub-level sets of convex functions
▶ Convex Functions & their Epigraphs
▶ First-Order Convexity Conditions
Second Order Conditions for Convexity [Almost Done]
Basic Subgradient Calculus: Subgradients for non-differentiable convex functions
Convex Optimization Problems and Basic Optimality Conditions
Lipschitz Properties of functions
First-Order Convexity Conditions: The complete statement
A differentiable f is convex if and only if dom(f) is convex and f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom(f).
The geometrical interpretation of this theorem is that at any point, the linear approximation based on the local derivative gives a lower estimate of the function, i.e., the convex function always lies above the supporting hyperplane at that point. This is depicted pictorially in the accompanying figure.
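As a quick numerical illustration (not from the slides; the choice of softplus(x) = log(1 + e^x) as the test function is mine), the following sketch checks the first-order condition at randomly sampled pairs of points:

```python
import numpy as np

# First-order condition check for the convex function f(x) = log(1 + exp(x))
# (softplus), whose derivative is the sigmoid.  For a differentiable convex f,
# the tangent at any point x must under-estimate f everywhere:
#     f(y) >= f(x) + f'(x) * (y - x)   for all x, y.

def softplus(x):
    return np.logaddexp(0.0, x)           # log(1 + e^x), numerically stable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # f'(x) for f = softplus

rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=1000)
y = rng.normal(scale=3.0, size=1000)

lhs = softplus(y)                          # f(y)
rhs = softplus(x) + sigmoid(x) * (y - x)   # tangent at x, evaluated at y
print("violations of f(y) >= f(x) + f'(x)(y - x):", int(np.sum(lhs < rhs - 1e-12)))
```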
Second Order Conditions of Convexity
Can we use the Hessian to prove that the logSumExp function is convex? The answer is YES.
Boyd's book uses the fact that the Hessian being positive semi-definite is a necessary and sufficient condition for convexity (of a twice-differentiable function on a convex domain).
Recall the Hessian of a twice-differentiable function: the n × n matrix of second partial derivatives,
∇²f(w) = [ ∂²f / (∂w_i ∂w_j) ]_{i,j = 1,...,n}, i.e., the (i, j) entry is ∂²f / (∂w_i ∂w_j).
INTUITION:
1) First-order condition: the directional derivative is non-decreasing in every direction.
2) Second-order condition: the curvature is non-negative in every direction!
A twice-differentiable f is convex if and only if (a) dom(f) is convex, and (b) for all x ∈ dom(f), ∇²f(x) ≥ 0 (i.e., ∇²f(x) is positive semi-definite).
To show that logSumExp is convex, can we prove that the quadratic form vᵀ∇²f(x)v is always non-negative?
EXPAND AS HOMEWORK!
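One possible numerical companion to this homework (a sketch, not the requested derivation): it is a standard fact that the Hessian of f(w) = log Σᵢ e^{wᵢ} is diag(p) − p pᵀ with p = softmax(w), and the script below samples random w and v to check that vᵀ∇²f(w)v ≥ 0.

```python
import numpy as np

# Numerical sanity check (a homework hint, not a proof): the Hessian of
# f(w) = log(sum_i exp(w_i)) is  H = diag(p) - p p^T  with p = softmax(w).
# Positive semi-definiteness means v^T H v >= 0 for every direction v.

def softmax(w):
    z = np.exp(w - w.max())            # shift for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
for _ in range(1000):
    w = rng.normal(size=5)
    p = softmax(w)
    H = np.diag(p) - np.outer(p, p)    # Hessian of logSumExp at w
    v = rng.normal(size=5)
    quad = v @ H @ v                   # = sum_i p_i v_i^2 - (sum_i p_i v_i)^2
    assert quad >= -1e-12, quad        # non-negative up to round-off
print("v^T H v was non-negative for all sampled (w, v)")
```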
Table of Hessians for some Convex Optimization Problems
On page 67 of the Convex Optimization Notes: https://moodle.iitb.ac.in/mod/resource/view.php?id=18791 (also at https://moodle.iitb.ac.in/mod/resource/view.php?id=67925)
[Annotations on the table:] The v can be pushed into the multiplication with the highlighted matrices (the remaining factors are scalars). We will show that the quadratic expression on the RHS is ≥ 0 for all v, by the Cauchy-Schwarz inequality. Thus we have proved that logSumExp is convex (though not strictly convex, since it is affine along the all-ones direction) using a different methodology.
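For completeness, the Cauchy-Schwarz step can be written out as follows (a worked sketch; p denotes the softmax vector appearing in the Hessian of logSumExp):

```latex
\[
\nabla^2 f(w) = \mathrm{diag}(p) - p p^\top, \qquad
p_i = \frac{e^{w_i}}{\sum_j e^{w_j}}, \qquad \sum_i p_i = 1 .
\]
For any direction $v$,
\[
v^\top \nabla^2 f(w)\, v \;=\; \sum_i p_i v_i^2 \;-\; \Big(\sum_i p_i v_i\Big)^2 .
\]
By Cauchy--Schwarz applied to the vectors $(\sqrt{p_i})_i$ and $(\sqrt{p_i}\, v_i)_i$,
\[
\Big(\sum_i p_i v_i\Big)^2 \;\le\; \Big(\sum_i p_i\Big)\Big(\sum_i p_i v_i^2\Big) \;=\; \sum_i p_i v_i^2 ,
\]
so $v^\top \nabla^2 f(w)\, v \ge 0$: the Hessian is positive semi-definite and logSumExp is convex.
(Equality holds when $v$ is a constant vector, e.g.\ $v = \mathbf{1}$, so the function is convex but not strictly convex.)
```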
(Sub)Gradients and Convexity (contd)
To say that a convex function f : ℜ^n → ℜ is differentiable at x is to say that there is a single, unique linear tangent at x that underestimates the function:
f(y) ≥ f(x) + ∇f(x)^T (y − x), ∀x, y
[Figure annotation: At a point of non-differentiability of this convex function, there appear to exist several supporting hyperplanes (and their corresponding normals), i.e., several subgradients.]
(Sub)Gradients and Convexity (contd) [Homework discussion]
Homework 1: Is the subgradient guaranteed to exist at each point of the domain for a
convex function even if the function is non-differentiable?
Optional Homework 2: How do we show that for a differentiable convex function, the
only subgradient will be its gradient?
Outline of next few topics
Subgradients, Subgradient Calculus and Convexity
Local and Global Minimum
Sufficient Subgradient condition for Global Minimum
Convexity and Local & Global Minimum
Rates of Convergence, Lipschitz Continuity and Smoothness
Algorithms for Optimization: First Order and thereafter
What if the function is not differentiable everywhere, yet is convex? The supporting hyperplane theorem (there is a supporting hyperplane to epi(f) at every boundary point) holds even if the convex f is not differentiable ⇒ there is a generalization of the gradient, called the subgradient.
(Sub)Gradients and Convexity (contd)
[Figure annotation: In the regions where only one piece is active, the subgradient is unique. At the point where the two function values (of f_1 and f_2) coincide, the subgradient is not unique; in fact the subgradients form a set, which is the convex hull of the gradients of f_1 and f_2 at that point.]
In this figure we see that the function f at x has many possible linear tangents that fit appropriately (i.e., that underestimate f). A subgradient is then any h ∈ ℜ^n (of the same dimension as x) such that:
f(y) ≥ f(x) + h^T (y − x), ∀y
Thus, intuitively, if a convex function is differentiable at a point x then it has a unique subgradient at that point, namely ∇f(x). Formal proof?
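A standard one-dimensional example (not on the slide) that makes this concrete, for f(x) = |x|:

```latex
\[
\partial\,|x| \;=\;
\begin{cases}
\{\operatorname{sign}(x)\}, & x \neq 0 \quad\text{(unique subgradient, equal to the derivative)},\\[2pt]
[-1,\,1], & x = 0 \quad\text{(every } h \in [-1,1] \text{ satisfies } |y| \ge h\,y \text{ for all } y\text{)}.
\end{cases}
\]
```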
ADVANCED AND OPTIONAL:
The proof uses the limit definition of derivatives; see pages 6-9 (slides 63-64) of
https://www.cse.iitb.ac.in/~ganesh/cs769/notes/enotes/12-28-08-2018-firstordergradient-why-what-how-subgradient-annotated.pdf
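For convenience, here is a compressed paraphrase of that argument (see the linked notes for the full version): if h is any subgradient of a differentiable convex f at x, then for every direction d and step t > 0,

```latex
\[
f(x + t d) \;\ge\; f(x) + t\, h^\top d
\;\;\Longrightarrow\;\;
\frac{f(x + t d) - f(x)}{t} \;\ge\; h^\top d
\;\;\xrightarrow{\;t \to 0^+\;}\;\;
\nabla f(x)^\top d \;\ge\; h^\top d .
\]
Applying this to both $d$ and $-d$ gives $\nabla f(x)^\top d = h^\top d$ for all $d$, hence $h = \nabla f(x)$:
the gradient is the only subgradient at a point of differentiability.
```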
Detour: Convexity and Continuity
Let f be a convex function and suppose dom(f) is open. Then f is continuous on dom(f).
How wild can non-differentiable convex functions be?
While there are continuous functions which are nowhere differentiable (see
https://en.wikipedia.org/wiki/Weierstrass_function), convex functions cannot be that pathological!
In fact, a convex function is differentiable almost everywhere. In other words, the set of points where f is non-differentiable has measure 0.
However, we cannot ignore the non-differentiability, since (a) the global minimum could easily be a point of non-differentiability (e.g., f(x) = |x| attains its minimum at x = 0, its only kink), and (b) with any optimization algorithm, you can stumble upon these "kinks".
(Sub)Gradients and Convexity (contd)
The subdifferential is the closed convex set of all subgradients of the convex function f:
∂f(x) = {h ∈ ℜ^n : h is a subgradient of f at x}
Note that this set is guaranteed to be nonempty when f is convex (at every point in the interior of dom(f)); it can be empty if f is not convex.
Pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then
∂f(x) = conv( ⋃_{i : f_i(x) = f(x)} ∂f_i(x) ),
which is the convex hull of the union of the subdifferentials of all the active functions at x. (E.g., a vector-induced matrix norm is such a pointwise maximum.)
General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then, under some regularity conditions (on S and f_s),
∂f(x) = cl conv( ⋃_{s : f_s(x) = f(x)} ∂f_s(x) ).
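As an illustrative sketch of the pointwise-maximum rule (my own example, restricted to affine pieces f_i(x) = a_iᵀx + b_i, where each ∂f_i(x) = {a_i}): any convex combination of the active slopes is a valid subgradient.

```python
import numpy as np

# Subgradient of f(x) = max_i (a_i^T x + b_i): each affine piece has gradient a_i,
# so by the pointwise-maximum rule, partial f(x) = conv{ a_i : piece i is active at x }.

def max_affine_subgradient(A, b, x, tol=1e-9):
    """Return one subgradient of f(x) = max_i (A[i] @ x + b[i]).

    A: (m, n) array of slopes, b: (m,) array of offsets.
    Any convex combination of the active rows of A is a valid subgradient."""
    values = A @ x + b
    active = np.flatnonzero(values >= values.max() - tol)  # indices of active pieces
    weights = np.ones(len(active)) / len(active)           # one arbitrary convex combination
    return weights @ A[active]

# Example: f(x) = max(x1, -x1, x2, -x2) = ||x||_inf at a "corner" point.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)
x = np.array([1.0, 1.0])                 # both x1 and x2 attain the max
print(max_affine_subgradient(A, b, x))   # [0.5, 0.5], a point of conv{(1,0), (0,1)}
```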
Subgradient of ∥x∥_1
Assume x ∈ ℜ^n. Then
∥x∥_1 = max_{s ∈ {−1,+1}^n} x^T s, which is a pointwise maximum of 2^n functions (one per configuration of s).
Let S* ⊆ {−1,+1}^n be the set of s such that for each s ∈ S*, the value of x^T s is the same maximum value. That is, at a point x there can be multiple configurations s that yield the maximum value (exactly when some coordinate x_i = 0).
Thus, by invoking the calculus of subgradients (the pointwise-maximum rule),
∂∥x∥_1 = conv( ⋃_{s ∈ S*} {s} ).
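A small numerical companion to this derivation (a sketch; the helper name l1_subgradient is mine): any vector that equals sign(x_i) on the nonzero coordinates and takes a value in [−1, 1] on the zero coordinates lies in conv(⋃_{s∈S*} {s}), and the script below checks the subgradient inequality for one such choice.

```python
import numpy as np

# One subgradient of ||x||_1: sign(x_i) on nonzero coordinates, and an
# arbitrary value in [-1, 1] on zero coordinates.  Every such vector lies in
# conv( union of the maximizing sign configurations s in S* ).

def l1_subgradient(x, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.sign(x).astype(float)
    zeros = (x == 0)
    g[zeros] = rng.uniform(-1.0, 1.0, size=int(zeros.sum()))  # free coordinates
    return g

# Sanity check of the subgradient inequality  ||y||_1 >= ||x||_1 + g^T (y - x):
rng = np.random.default_rng(1)
x = np.array([1.5, 0.0, -2.0, 0.0])
g = l1_subgradient(x)
for _ in range(1000):
    y = rng.normal(size=x.size)
    assert np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12
print("subgradient inequality held for all sampled y; g =", g)
```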
More of Basic Subgradient Calculus
Scaling: ∂(af) = a · ∂f, provided a > 0. The condition a > 0 keeps the function af convex.
Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
Affine composition: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
Norms: important special case, f(x) = ∥x∥_p = max_{∥z∥_q ≤ 1} z^T x, where q is such that 1/p + 1/q = 1. Then
∂f(x) = { y : ∥y∥_q ≤ 1 and y^T x = max_{∥z∥_q ≤ 1} z^T x }
Can we derive the sub-differential of ∥x∥_1 from this?
HOMEWORK: Try and derive the subgradients above. For the norm case, you can use Hölder's inequality, recalled below.
RECALL CAUCHY-SCHWARZ: x^T z ≤ ∥x∥_2 ∥z∥_2, with equality iff one of x, z is a non-negative multiple of the other.
HÖLDER'S INEQUALITY (a generalization, and our first exposure to duality): x^T z ≤ ∥x∥_p ∥z∥_q whenever 1/p + 1/q = 1.
Two ways of making a sculpture (or, in this case, of defining a norm):
1) PRIMAL: Casting: fill up a mould.
2) DUAL: Chiselling: carve away unwanted material from the base object (by discarding).
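A sketch combining two of the rules above, affine composition applied to the ℓ1 norm (the helper name is mine, and this does not by itself answer the homework question): if g(x) = ∥Ax − b∥_1, then A^T s is a subgradient of g at x whenever s is a subgradient of ∥·∥_1 at Ax − b.

```python
import numpy as np

# Combining the rules: for g(x) = ||A x - b||_1,
#   affine composition  =>  partial g(x) = A^T partial||.||_1 (A x - b),
# so A^T sign(A x - b) (with any value in [-1, 1] on zero residuals) is a subgradient.

def l1_affine_subgradient(A, b, x):
    r = A @ x - b
    s = np.sign(r)                 # a subgradient of ||.||_1 at the residual r
    return A.T @ s                 # push through the affine map: A^T s

# Quick check of g(y) >= g(x) + h^T (y - x) for this choice of h:
rng = np.random.default_rng(0)
A, b = rng.normal(size=(6, 3)), rng.normal(size=6)
x = rng.normal(size=3)
h = l1_affine_subgradient(A, b, x)
g = lambda z: np.abs(A @ z - b).sum()
for _ in range(1000):
    y = rng.normal(size=3)
    assert g(y) >= g(x) + h @ (y - x) - 1e-10
print("subgradient inequality held for all sampled y")
```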