Optimization in Machine Learning
Lecture 8: Subgradient and its calculus, Necessary and sufficient conditions for
optimization with and without Convexity, Lipschitz Continuity
Ganesh Ramakrishnan
Department of Computer Science and Engineering (CSE), IIT Bombay
https://www.cse.iitb.ac.in/~ganesh
January, 2025
Outline
Understanding the Convexity of Machine Learning Loss Functions [Done]
First Order Conditions for Convexity [Done]
▶ Direction Vector, Directional derivative
▶ Quasi convexity & Sub-level sets of convex functions
▶ Convex Functions & their Epigraphs
▶ First-Order Convexity Conditions
Second Order Conditions for Convexity [Almost Done]
Basic Subgradient Calculus: Subgradients for non-differentiable convex functions
Convex Optimization Problems and Basic Optimality Conditions
Lipschitz Properties of functions
First-Order Convexity Conditions: The complete statement
A differentiable f is convex if and only if dom(f) is convex and f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom(f).
The geometrical interpretation of this theorem is that at any point, the linear approximation based on the local derivative gives a lower estimate of the function, i.e., the convex function always lies above the supporting hyperplane at that point. This is depicted pictorially in the accompanying figure.
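As a quick numerical illustration (not from the slides; the choice of softplus(x) = log(1 + e^x) as the test function is mine), the following sketch checks the first-order condition at randomly sampled pairs of points:

```python
import numpy as np

# First-order condition check for the convex function f(x) = log(1 + exp(x))
# (softplus), whose derivative is the sigmoid.  For a differentiable convex f,
# the tangent at any point x must under-estimate f everywhere:
#     f(y) >= f(x) + f'(x) * (y - x)   for all x, y.

def softplus(x):
    return np.logaddexp(0.0, x)           # log(1 + e^x), numerically stable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # f'(x) for f = softplus

rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=1000)
y = rng.normal(scale=3.0, size=1000)

lhs = softplus(y)                          # f(y)
rhs = softplus(x) + sigmoid(x) * (y - x)   # tangent at x, evaluated at y
print("violations of f(y) >= f(x) + f'(x)(y - x):", int(np.sum(lhs < rhs - 1e-12)))
```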
Second Order Conditions of Convexity
Can we use the Hessian to prove that the logSumExp function is convex? The answer is YES.
Boyd's book uses the fact that the Hessian being positive semi-definite is a necessary and sufficient condition for convexity (of a twice-differentiable function on a convex domain).
Recall the Hessian of a twice-differentiable function: the n × n matrix of second partial derivatives,
∇²f(w) = [ ∂²f / (∂w_i ∂w_j) ]_{i,j = 1,...,n}, i.e., the (i, j) entry is ∂²f / (∂w_i ∂w_j).
INTUITION:
1) First-order condition: the directional derivative is non-decreasing in every direction.
2) Second-order condition: the curvature is non-negative in every direction!
A twice-differentiable f is convex if and only if (a) dom(f) is convex, and (b) for all x ∈ dom(f), ∇²f(x) ≥ 0 (i.e., ∇²f(x) is positive semi-definite).
To show that logSumExp is convex, can we prove that the quadratic form vᵀ∇²f(x)v is always non-negative?
EXPAND AS HOMEWORK!
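One possible numerical companion to this homework (a sketch, not the requested derivation): it is a standard fact that the Hessian of f(w) = log Σᵢ e^{wᵢ} is diag(p) − p pᵀ with p = softmax(w), and the script below samples random w and v to check that vᵀ∇²f(w)v ≥ 0.

```python
import numpy as np

# Numerical sanity check (a homework hint, not a proof): the Hessian of
# f(w) = log(sum_i exp(w_i)) is  H = diag(p) - p p^T  with p = softmax(w).
# Positive semi-definiteness means v^T H v >= 0 for every direction v.

def softmax(w):
    z = np.exp(w - w.max())            # shift for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
for _ in range(1000):
    w = rng.normal(size=5)
    p = softmax(w)
    H = np.diag(p) - np.outer(p, p)    # Hessian of logSumExp at w
    v = rng.normal(size=5)
    quad = v @ H @ v                   # = sum_i p_i v_i^2 - (sum_i p_i v_i)^2
    assert quad >= -1e-12, quad        # non-negative up to round-off
print("v^T H v was non-negative for all sampled (w, v)")
```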
Table of Hessians for some Convex Optimization Problems
On page 67 of the Convex Optimization Notes: https://moodle.iitb.ac.in/mod/resource/view.php?id=18791 (also at https://moodle.iitb.ac.in/mod/resource/view.php?id=67925)
[Annotations on the table:] The v can be pushed into the multiplication with the highlighted matrices (the remaining factors are scalars). We will show that the quadratic expression on the RHS is ≥ 0 for all v, by the Cauchy-Schwarz inequality. Thus we have proved that logSumExp is convex (though not strictly convex, since it is affine along the all-ones direction) using a different methodology.
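For completeness, the Cauchy-Schwarz step can be written out as follows (a worked sketch; p denotes the softmax vector appearing in the Hessian of logSumExp):

```latex
\[
\nabla^2 f(w) = \mathrm{diag}(p) - p p^\top, \qquad
p_i = \frac{e^{w_i}}{\sum_j e^{w_j}}, \qquad \sum_i p_i = 1 .
\]
For any direction $v$,
\[
v^\top \nabla^2 f(w)\, v \;=\; \sum_i p_i v_i^2 \;-\; \Big(\sum_i p_i v_i\Big)^2 .
\]
By Cauchy--Schwarz applied to the vectors $(\sqrt{p_i})_i$ and $(\sqrt{p_i}\, v_i)_i$,
\[
\Big(\sum_i p_i v_i\Big)^2 \;\le\; \Big(\sum_i p_i\Big)\Big(\sum_i p_i v_i^2\Big) \;=\; \sum_i p_i v_i^2 ,
\]
so $v^\top \nabla^2 f(w)\, v \ge 0$: the Hessian is positive semi-definite and logSumExp is convex.
(Equality holds when $v$ is a constant vector, e.g.\ $v = \mathbf{1}$, so the function is convex but not strictly convex.)
```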
(Sub)Gradients and Convexity (contd)
To say that a convex function f : ℜ^n → ℜ is differentiable at x is to say that there is a single, unique linear tangent at x that underestimates the function:
f(y) ≥ f(x) + ∇f(x)^T (y − x), ∀x, y
[Figure annotation: At a point of non-differentiability of this convex function, there appear to exist several supporting hyperplanes (and their corresponding normals), i.e., several subgradients.]
(Sub)Gradients and Convexity (contd) [Homework discussion]
Homework 1: Is the subgradient guaranteed to exist at each point of the domain for a
convex function even if the function is non-differentiable?
Optional Homework 2: How do we show that for a differentiable convex function, the
only subgradient will be its gradient?
Outline of next few topics
Subgradients, Subgradient Calculus and Convexity
Local and Global Minimum
Sufficient Subgradient condition for Global Minimum
Convexity and Local & Global Minimum
Rates of Convergence, Lipschitz Continuity and Smoothness
Algorithms for Optimization: First Order and thereafter
What if the function is not differentiable everywhere, yet is convex? The supporting hyperplane theorem (there is a supporting hyperplane to epi(f) at every boundary point) holds even if the convex f is not differentiable ⇒ there is a generalization of the gradient, called the subgradient.
(Sub)Gradients and Convexity (contd)
[Figure annotation: In the regions where only one piece is active, the subgradient is unique. At the point where the two function values (of f_1 and f_2) coincide, the subgradient is not unique; in fact the subgradients form a set, which is the convex hull of the gradients of f_1 and f_2 at that point.]
In this figure we see that the function f at x has many possible linear tangents that fit appropriately (i.e., that underestimate f). A subgradient is then any h ∈ ℜ^n (of the same dimension as x) such that:
f(y) ≥ f(x) + h^T (y − x), ∀y
Thus, intuitively, if a convex function is differentiable at a point x then it has a unique subgradient at that point, namely ∇f(x). Formal proof?
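A standard one-dimensional example (not on the slide) that makes this concrete, for f(x) = |x|:

```latex
\[
\partial\,|x| \;=\;
\begin{cases}
\{\operatorname{sign}(x)\}, & x \neq 0 \quad\text{(unique subgradient, equal to the derivative)},\\[2pt]
[-1,\,1], & x = 0 \quad\text{(every } h \in [-1,1] \text{ satisfies } |y| \ge h\,y \text{ for all } y\text{)}.
\end{cases}
\]
```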
ADVANCED AND OPTIONAL:
The proof uses the limit definition of derivatives; see pages 6-9 (slides 63-64) of
https://www.cse.iitb.ac.in/~ganesh/cs769/notes/enotes/12-28-08-2018-firstordergradient-why-what-how-subgradient-annotated.pdf
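For convenience, here is a compressed paraphrase of that argument (see the linked notes for the full version): if h is any subgradient of a differentiable convex f at x, then for every direction d and step t > 0,

```latex
\[
f(x + t d) \;\ge\; f(x) + t\, h^\top d
\;\;\Longrightarrow\;\;
\frac{f(x + t d) - f(x)}{t} \;\ge\; h^\top d
\;\;\xrightarrow{\;t \to 0^+\;}\;\;
\nabla f(x)^\top d \;\ge\; h^\top d .
\]
Applying this to both $d$ and $-d$ gives $\nabla f(x)^\top d = h^\top d$ for all $d$, hence $h = \nabla f(x)$:
the gradient is the only subgradient at a point of differentiability.
```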
Detour: Convexity and Continuity
Let f be a convex function and suppose dom(f) is open. Then f is continuous on dom(f).
How wild can non-differentiable convex functions be?
While there are continuous functions which are nowhere differentiable (see
https://en.wikipedia.org/wiki/Weierstrass_function), convex functions cannot be that pathological!
In fact, a convex function is differentiable almost everywhere. In other words, the set of points where f is non-differentiable has measure 0.
However, we cannot ignore the non-differentiability, since (a) the global minimum could easily be a point of non-differentiability (e.g., f(x) = |x| attains its minimum at x = 0, its only kink), and (b) with any optimization algorithm, you can stumble upon these "kinks".
(Sub)Gradients and Convexity (contd)
The subdifferential is the closed convex set of all subgradients of the convex function f:
∂f(x) = {h ∈ ℜ^n : h is a subgradient of f at x}
Note that this set is guaranteed to be nonempty when f is convex (at every point in the interior of dom(f)); it can be empty if f is not convex.
Pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then
∂f(x) = conv( ⋃_{i : f_i(x) = f(x)} ∂f_i(x) ),
which is the convex hull of the union of the subdifferentials of all the active functions at x. (E.g., a vector-induced matrix norm is such a pointwise maximum.)
General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then, under some regularity conditions (on S and f_s),
∂f(x) = cl conv( ⋃_{s : f_s(x) = f(x)} ∂f_s(x) ).
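As an illustrative sketch of the pointwise-maximum rule (my own example, restricted to affine pieces f_i(x) = a_iᵀx + b_i, where each ∂f_i(x) = {a_i}): any convex combination of the active slopes is a valid subgradient.

```python
import numpy as np

# Subgradient of f(x) = max_i (a_i^T x + b_i): each affine piece has gradient a_i,
# so by the pointwise-maximum rule, partial f(x) = conv{ a_i : piece i is active at x }.

def max_affine_subgradient(A, b, x, tol=1e-9):
    """Return one subgradient of f(x) = max_i (A[i] @ x + b[i]).

    A: (m, n) array of slopes, b: (m,) array of offsets.
    Any convex combination of the active rows of A is a valid subgradient."""
    values = A @ x + b
    active = np.flatnonzero(values >= values.max() - tol)  # indices of active pieces
    weights = np.ones(len(active)) / len(active)           # one arbitrary convex combination
    return weights @ A[active]

# Example: f(x) = max(x1, -x1, x2, -x2) = ||x||_inf at a "corner" point.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)
x = np.array([1.0, 1.0])                 # both x1 and x2 attain the max
print(max_affine_subgradient(A, b, x))   # [0.5, 0.5], a point of conv{(1,0), (0,1)}
```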
Subgradient of ∥x∥_1
Assume x ∈ ℜ^n. Then
∥x∥_1 = max_{s ∈ {−1,+1}^n} x^T s, which is a pointwise maximum of 2^n functions (one per configuration of s).
Let S* ⊆ {−1,+1}^n be the set of s such that for each s ∈ S*, the value of x^T s is the same maximum value. That is, at a point x there can be multiple configurations s that yield the maximum value (exactly when some coordinate x_i = 0).
Thus, by invoking the calculus of subgradients (the pointwise-maximum rule),
∂∥x∥_1 = conv( ⋃_{s ∈ S*} {s} ).
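A small numerical companion to this derivation (a sketch; the helper name l1_subgradient is mine): any vector that equals sign(x_i) on the nonzero coordinates and takes a value in [−1, 1] on the zero coordinates lies in conv(⋃_{s∈S*} {s}), and the script below checks the subgradient inequality for one such choice.

```python
import numpy as np

# One subgradient of ||x||_1: sign(x_i) on nonzero coordinates, and an
# arbitrary value in [-1, 1] on zero coordinates.  Every such vector lies in
# conv( union of the maximizing sign configurations s in S* ).

def l1_subgradient(x, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.sign(x).astype(float)
    zeros = (x == 0)
    g[zeros] = rng.uniform(-1.0, 1.0, size=int(zeros.sum()))  # free coordinates
    return g

# Sanity check of the subgradient inequality  ||y||_1 >= ||x||_1 + g^T (y - x):
rng = np.random.default_rng(1)
x = np.array([1.5, 0.0, -2.0, 0.0])
g = l1_subgradient(x)
for _ in range(1000):
    y = rng.normal(size=x.size)
    assert np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12
print("subgradient inequality held for all sampled y; g =", g)
```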
More of Basic Subgradient Calculus
Scaling: ∂(af) = a · ∂f, provided a > 0. The condition a > 0 keeps the function af convex.
Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
Affine composition: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
Norms: important special case, f(x) = ∥x∥_p = max_{∥z∥_q ≤ 1} z^T x, where q is such that 1/p + 1/q = 1. Then
∂f(x) = { y : ∥y∥_q ≤ 1 and y^T x = max_{∥z∥_q ≤ 1} z^T x }
Can we derive the sub-differential of ∥x∥_1 from this?
HOMEWORK: Try and derive the subgradients above. For the norm case, you can use Hölder's inequality, recalled below.
RECALL CAUCHY-SCHWARZ: x^T z ≤ ∥x∥_2 ∥z∥_2, with equality iff one of x, z is a non-negative multiple of the other.
HÖLDER'S INEQUALITY (a generalization, and our first exposure to duality): x^T z ≤ ∥x∥_p ∥z∥_q whenever 1/p + 1/q = 1.
Two ways of making a sculpture (or, in this case, of defining a norm):
1) PRIMAL: Casting: fill up a mould.
2) DUAL: Chiselling: carve away unwanted material from the base object (by discarding).
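A sketch combining two of the rules above, affine composition applied to the ℓ1 norm (the helper name is mine, and this does not by itself answer the homework question): if g(x) = ∥Ax − b∥_1, then A^T s is a subgradient of g at x whenever s is a subgradient of ∥·∥_1 at Ax − b.

```python
import numpy as np

# Combining the rules: for g(x) = ||A x - b||_1,
#   affine composition  =>  partial g(x) = A^T partial||.||_1 (A x - b),
# so A^T sign(A x - b) (with any value in [-1, 1] on zero residuals) is a subgradient.

def l1_affine_subgradient(A, b, x):
    r = A @ x - b
    s = np.sign(r)                 # a subgradient of ||.||_1 at the residual r
    return A.T @ s                 # push through the affine map: A^T s

# Quick check of g(y) >= g(x) + h^T (y - x) for this choice of h:
rng = np.random.default_rng(0)
A, b = rng.normal(size=(6, 3)), rng.normal(size=6)
x = rng.normal(size=3)
h = l1_affine_subgradient(A, b, x)
g = lambda z: np.abs(A @ z - b).sum()
for _ in range(1000):
    y = rng.normal(size=3)
    assert g(y) >= g(x) + h @ (y - x) - 1e-10
print("subgradient inequality held for all sampled y")
```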