Optimization for ML (2)
CS771: Introduction to Machine Learning
Piyush Rai
The Plan
Some basic techniques for solving optimization problems
First-order optimality
Gradient descent
Dealing with non-differentiable functions
Sub-gradients and sub-differential
Optimization Problems in ML
The general form of an optimization problem in ML will usually be

    \hat{\mathbf{w}} = \arg\min_{\mathbf{w} \in \mathcal{C}} L(\mathbf{w})

Here L denotes the loss function to be optimized, usually a sum of the training error and a regularizer
\mathcal{C} is the constraint set that the solution must belong to, e.g.,
  Non-negativity constraint: all entries in \mathbf{w} must be non-negative
  Sparsity constraint: \mathbf{w} is a sparse vector with at most K non-zeros
(It is possible to have constraints of both kinds at once. The linear and ridge regression problems we saw earlier were unconstrained: \mathbf{w} was a real-valued vector.)
If no \mathcal{C} is specified, it is an unconstrained optimization problem
Constrained optimization problems can be converted into unconstrained ones (will see later)
For now, assume we have an unconstrained optimization problem
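For concreteness, the "training error + regularizer" objective for ridge regression can be written in a few lines of Python (a minimal sketch; the function name and toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """L(w) = training error (squared loss) + L2 regularizer."""
    train_error = np.sum((y - X @ w) ** 2)  # sum of per-example squared errors
    regularizer = lam * np.sum(w ** 2)      # penalizes large weights
    return train_error + regularizer

# Tiny example: 3 training examples, 2 features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_objective(np.zeros(2), X, y, lam=0.1))  # → 14.0 (all-zeros model)
```

Minimizing this objective over all real-valued w is an unconstrained problem; adding, say, a non-negativity requirement on w would make it constrained.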
Methods for Solving Optimization Problems
Method 1: Using First-Order Optimality
Very simple. We already used this approach for linear and ridge regression
Called "first-order" since only the gradient is used, and the gradient provides the first-order information about the function being optimized
First-order optimality: the gradient must be equal to zero at the optima

    \nabla L(\mathbf{w}) = 0

This approach works only for very simple problems where the objective is convex and there are no constraints on the values \mathbf{w} can take
Sometimes, setting \nabla L(\mathbf{w}) = 0 and solving for \mathbf{w} gives a closed-form solution
If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algorithms, such as gradient descent
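For example, applying first-order optimality to the ridge objective \sum_n (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \lambda \|\mathbf{w}\|^2 gives the closed-form solution \mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}. A minimal sketch (the function name and synthetic data are my own):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve the first-order optimality condition grad L(w) = 0,
    i.e. (X^T X + lam * I) w = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless targets
print(ridge_closed_form(X, y, lam=1e-8))  # ≈ w_true since lam is tiny and data is noiseless
```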
Method 2: Iterative Optimiz. via Gradient Descent
Can I used this approach For max. problems we Iterative since it requires
to solve maximization several steps/iterations to find
can use gradient ascent
problems? the optimal solution
Fact: Gradient gives the For convex functions, Good
direction of steepest Will move in the
GD will converge to initialization
change in function’s direction of the gradient
the global minima needed for non-
value Gradient Descent convex functions
The learning rate
very imp. Should be
Initialize as set carefully (fixed
or chosen
adaptively). Will
For iteration (or until convergence) discuss some
Calculate the gradient using the current iterates strategies later
Set the learning rate Will see the Sometimes may be
justification shortly tricky to to assess
Move in the opposite direction of gradient convergence? Will
(𝑡 +1) (𝑡 ) (𝑡 ) see some methods
𝒘 =𝒘 −𝜂 𝒈
𝑡 later
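The loop above can be sketched in a few lines of Python (a minimal sketch; the function names and the toy objective are illustrative assumptions, and the learning rate is kept fixed for simplicity):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, iters=200):
    """Gradient descent with a fixed learning rate:
    w^(t+1) = w^(t) - eta * g^(t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad(w)  # move opposite to the gradient
    return w

# Minimize the convex function L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0])
print(w_star)  # → close to [3.0], the global minimum
```

With a fixed eta = 0.1 each update here is w ← 0.8 w + 0.6, a contraction toward the minimizer 3; too large an eta would make the iterates diverge, which is why the learning rate matters so much.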
Gradient Descent: An Illustration
[Figure: two plots of a loss L(\mathbf{w}). Left: a convex loss; where the gradient is positive we move in the negative direction, where it is negative we move in the positive direction, and the iterates \mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \mathbf{w}^{(3)} reach the global minimum \mathbf{w}^*. Right: a non-convex loss; a poor initialization gets stuck at a local minimum, while a good initialization finds the global minimum. Both the learning rate and the initialization are very important.]
GD: An Example
Let's apply GD for least squares linear regression

    L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2

The gradient:

    \mathbf{g} = \nabla L(\mathbf{w}) = -2 \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)\, \mathbf{x}_n

Each GD update will be of the form

    \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + 2 \eta_t \sum_{n=1}^{N} (y_n - \mathbf{w}^{(t)\top} \mathbf{x}_n)\, \mathbf{x}_n

Here y_n - \mathbf{w}^{(t)\top} \mathbf{x}_n is the prediction error of the current model on the n-th training example; training examples on which the current model's error is large contribute more to the update
Exercise: Assume a single training input (\mathbf{x}_n, y_n), and show that the GD update improves the prediction on it, i.e., \mathbf{w}^{(t+1)\top} \mathbf{x}_n is closer to y_n than \mathbf{w}^{(t)\top} \mathbf{x}_n is
This is sort of a proof that GD updates are "corrective" in nature (and it actually is true not just for linear regression but can also be shown for various other ML models)
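These updates can be sketched on synthetic data as follows (the names and data are my own; a small fixed learning rate keeps the sum-of-squares gradient stable):

```python
import numpy as np

def gd_least_squares(X, y, eta=0.001, iters=2000):
    """GD for L(w) = sum_n (y_n - w^T x_n)^2.
    Gradient: g = -2 * sum_n (y_n - w^T x_n) x_n, so examples with
    larger prediction error contribute more to each update."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        err = y - X @ w       # per-example prediction errors
        g = -2 * X.T @ err    # gradient of the squared-error loss
        w = w - eta * g       # GD update
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                # noiseless targets, so GD can recover w_true
print(gd_least_squares(X, y))
```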
Dealing with Non-differentiable Functions
In many ML problems, the objective function will be non-differentiable
Some examples that we have already seen: linear regression with absolute loss, or Huber loss, or \epsilon-insensitive loss; even the \ell_1 norm regularizer is non-differentiable
[Figure: plots of the absolute loss |y_n - f(\mathbf{x}_n)|, the Huber loss (quadratic on [-\delta, \delta], linear outside), and the \epsilon-insensitive loss |y_n - f(\mathbf{x}_n)| - \epsilon (zero on [-\epsilon, \epsilon]); each has points where it is not differentiable]
Basically, any function whose plot has a kink is non-differentiable at the kink
At such points, gradients are not defined. Reason: we can't define a unique tangent at such points
Sub-gradients
For a convex non-differentiable function, we can define sub-gradients at the point(s) of non-differentiability
[Figure: a convex function f(x); being convex, it lies above all its tangents. At a differentiable point x_1 there is a unique tangent; at a non-differentiable point x_2 the two extreme tangents bound a region containing all the sub-gradients]
For a convex, non-differentiable function f, a sub-gradient at \mathbf{x}^* is any vector \mathbf{g} such that

    f(\mathbf{x}) \geq f(\mathbf{x}^*) + \mathbf{g}^\top (\mathbf{x} - \mathbf{x}^*) \quad \forall \mathbf{x}
Sub-gradients, Sub-differential, and Some Rules
The set of all sub-gradients at a non-differentiable point \mathbf{x}^* is called the sub-differential

    \partial f(\mathbf{x}^*) \triangleq \{ \mathbf{g} : f(\mathbf{x}) \geq f(\mathbf{x}^*) + \mathbf{g}^\top (\mathbf{x} - \mathbf{x}^*) \;\; \forall \mathbf{x} \}

Some basic rules of sub-differential calculus to keep in mind
  Scaling rule: \partial(\alpha f) = \alpha \, \partial f for \alpha > 0
  Sum rule: \partial(f_1 + f_2) = \partial f_1 + \partial f_2
  Affine transform rule: if g(\mathbf{x}) = f(\mathbf{A}\mathbf{x} + \mathbf{b}), then \partial g(\mathbf{x}) = \mathbf{A}^\top \partial f(\mathbf{A}\mathbf{x} + \mathbf{b}) (a special case of the more general chain rule)
  Max rule: if f = \max(f_1, f_2), then we calculate \partial f at \mathbf{x} as follows: \partial f_1 if f_1(\mathbf{x}) > f_2(\mathbf{x}); \partial f_2 if f_2(\mathbf{x}) > f_1(\mathbf{x}); any convex combination of sub-gradients of f_1 and f_2 if f_1(\mathbf{x}) = f_2(\mathbf{x})
\mathbf{x}^* is a stationary point for a non-differentiable function f if the zero vector belongs to the sub-differential at \mathbf{x}^*, i.e., \mathbf{0} \in \partial f(\mathbf{x}^*)
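As a small worked instance of the max rule, note that |x| = \max(x, -x); a sketch (the helper name and the tie-breaking argument, which must lie in [-1, 1], are my own):

```python
def abs_subgradient(x, tie_value=0.0):
    """Sub-gradient of |x| = max(f1, f2) with f1(x) = x and f2(x) = -x.
    Max rule: use the derivative of the active function away from the tie;
    at the tie x = 0, any value in [-1, 1] is a valid sub-gradient."""
    if x > 0:
        return 1.0       # f1 = x is active, f1'(x) = 1
    if x < 0:
        return -1.0      # f2 = -x is active, f2'(x) = -1
    return tie_value     # default 0 = convex combination 0.5*(1) + 0.5*(-1)

print(abs_subgradient(2.5))   # → 1.0
print(abs_subgradient(-0.1))  # → -1.0
print(abs_subgradient(0.0))   # → 0.0; since 0 is in the sub-differential, x = 0 is a stationary point
```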
Sub-Gradient For Absolute Loss Regression
The loss function for linear regression with absolute loss:

    L(\mathbf{w}) = \sum_{n=1}^{N} | y_n - \mathbf{w}^\top \mathbf{x}_n |

Non-differentiable at any \mathbf{w} where y_n - \mathbf{w}^\top \mathbf{x}_n = 0 for some n
Can use the affine transform rule of sub-differential calculus
Assume t_n = y_n - \mathbf{w}^\top \mathbf{x}_n and f(t) = |t|. Then

    \partial f(t_n) = 1 if t_n > 0, \quad -1 if t_n < 0, \quad any g \in [-1, 1] if t_n = 0

and a sub-gradient of L at \mathbf{w} is -\sum_{n=1}^{N} \partial f(t_n) \, \mathbf{x}_n (choosing one element from each \partial f(t_n))
Sub-Gradient Descent
Suppose we have a non-differentiable function L
Sub-gradient descent is almost identical to GD, except we use sub-gradients

Sub-Gradient Descent
Initialize \mathbf{w} as \mathbf{w}^{(0)}
For iteration t = 0, 1, 2, \ldots (or until convergence)
  Calculate a sub-gradient \mathbf{g}^{(t)} at the current iterate
  Set the learning rate \eta_t
  Move in the opposite direction of the sub-gradient:

    \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \mathbf{g}^{(t)}
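Putting the pieces together for absolute-loss regression (a sketch with assumed names and synthetic data; the loss is averaged over examples and the step size decays as 1/\sqrt{t}, a common choice for sub-gradient methods):

```python
import numpy as np

def subgradient_descent_abs(X, y, iters=5000):
    """Sub-gradient descent for L(w) = (1/N) sum_n |y_n - w^T x_n|."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(1, iters + 1):
        s = np.sign(y - X @ w)     # sign(t_n); np.sign gives 0 at the kink, a valid sub-gradient
        g = -(X.T @ s) / N         # sub-gradient of the averaged loss
        eta_t = 0.5 / np.sqrt(t)   # decaying learning rate
        w = w - eta_t * g          # same update as GD, with a sub-gradient
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true                     # noiseless targets
print(subgradient_descent_abs(X, y))  # approaches w_true
```

The decaying step size matters here: unlike the gradient, the sub-gradient's magnitude does not shrink near the optimum, so a fixed learning rate would leave the iterates oscillating around the minimizer.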
Coming up next
Making GD faster: Stochastic gradient descent
Constrained optimization
Co-ordinate descent
Alternating optimization
Practical issues in optimization for ML