
Class 5 - Machine Learning Concepts

Part II

Prof. Pedram Jahangiry


Motivation

Machine learning fundamental concepts:


• Inference and prediction
• Part I: The Model
• Part II: Evaluation metrics

• Part III: Bias-Variance tradeoff


• Part IV: Resampling methods
• Part V: Solvers/learners (GD, SGD, Adagrad, Adam, …)
• Part VI: How do machines learn?
• Part VII: Scaling the features



Part V
Solvers (GD, SGD, Adagrad, Adam, …)



Solvers (learners)!
A loss function tells us how good our model's predictions are for a given set of parameters. The loss (cost)
function has its own curve and its own gradients, and the slope of this curve tells us how to update the parameters to
make the model more accurate.
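As a concrete illustration (not from the slides), mean squared error is a common differentiable loss; here it is sketched in Python for a simple linear model y ≈ Xθ, together with its gradient:

import numpy as np

# Illustrative example: mean squared error loss for a linear model y ~ X @ theta,
# together with its gradient with respect to the parameters theta.

def mse_loss(theta, X, y):
    residuals = X @ theta - y
    return np.mean(residuals**2)

def mse_gradient(theta, X, y):
    residuals = X @ theta - y
    return 2.0 * X.T @ residuals / len(y)   # slope of the loss curve at theta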

The two most frequently used optimization algorithms when the loss function is differentiable are:
1) Gradient Descent (GD)
2) Stochastic Gradient Descent (SGD)

Gradient Descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a
function using gradient descent, one starts at some random point and takes steps proportional to the negative of the gradient of
the function at the current point.
$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)$

• $\theta_j$ is the model's $j$-th parameter
• $\alpha$ is the learning rate
• $J(\theta)$ is the loss function (which is differentiable)
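To make the update rule concrete, here is a minimal Python sketch (an illustration, not from the slides) that applies it to a simple one-parameter loss J(θ) = (θ − 3)²:

# Minimal gradient descent sketch for J(theta) = (theta - 3)**2,
# whose gradient is dJ/dtheta = 2 * (theta - 3).

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 10.0   # start at some arbitrary point
alpha = 0.1    # learning rate

for step in range(100):
    theta = theta - alpha * grad_J(theta)   # theta := theta - alpha * dJ/dtheta

print(theta)   # approaches the minimizer theta = 3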



Gradient Descent Visualization

$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)$

Gradient descent proceeds in epochs. An epoch consists of one full pass over the entire training set to update
each parameter. The learning rate $\alpha$ controls the size of each update.

[Figure: loss $J(\theta_j)$ plotted against the parameter $\theta_j$, illustrating the downhill steps of gradient descent]



Learning rate schedules

$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)$

• If $\alpha$ is too small, gradient descent can be slow.
• If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
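One common response to this trade-off is a schedule that starts with a larger α and decays it over epochs. A minimal Python sketch (an illustrative exponential decay, not a schedule prescribed by the slides):

# Illustrative exponential-decay schedule: alpha_t = alpha_0 * decay**epoch.
# The values of alpha_0 and decay are assumptions for the example.

alpha_0 = 0.5
decay = 0.9

for epoch in range(10):
    alpha = alpha_0 * decay**epoch
    print(f"epoch {epoch}: alpha = {alpha:.4f}")
    # ... run the gradient descent updates for this epoch using this alpha ...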



Beyond Gradient Descent?
Disadvantages of gradient descent:
• Full batch: the entire training set is used for every parameter update
• Sensitive to the choice of the learning rate
• Slow for large datasets

(Minibatch) Stochastic Gradient Descent is a version of the algorithm that speeds up the computation by approximating the gradient using smaller batches (subsets) of the training data. SGD itself has various “upgrades”, such as the two listed below (a minibatch sketch follows the list).
1) Adagrad
2) Adam
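A minimal sketch of the minibatch idea in Python; the names grad_J, X, y and the hyperparameter values are illustrative assumptions, with grad_J(theta, Xb, yb) returning the gradient of the loss on one batch:

import numpy as np

# Minibatch SGD sketch: shuffle the data each epoch, then update the parameters
# using the gradient computed on one small batch at a time.

def minibatch_sgd(theta, X, y, grad_J, alpha=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for epoch in range(epochs):
        idx = rng.permutation(n)                    # reshuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad_J(theta, X[batch], y[batch])   # gradient from a small subset
            theta = theta - alpha * g               # same update rule, cheaper gradient
    return theta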



Why SGD?



SGD vs GD



Why upgrade SGD?



Beyond Stochastic Gradient Descent?
Disadvantages of stochastic gradient descent:
• It can get trapped in suboptimal local minima (for non-convex loss functions)
• The same learning rate applies to all parameter updates

SGD upgrades:
1) Momentum
2) Adagrad
3) Adam



Momentum
• Momentum is a method that helps accelerate SGD in the relevant direction and dampens
oscillations.
• Essentially, when using momentum, we push a ball down a hill. The ball accumulates
momentum as it rolls downhill, becoming faster and faster on the way.
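A minimal Python sketch of one common formulation of the momentum update, assuming the gradient for the current batch is already computed; the coefficient 0.9 is a typical default, used here as an assumption:

# Classical momentum sketch: the velocity v accumulates past gradients,
# like a ball picking up speed as it rolls downhill.

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    v = beta * v + grad          # exponentially decaying sum of past gradients
    theta = theta - alpha * v    # step in the accumulated direction
    return theta, v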



Adagrad
• Adaptive Gradient Algorithm is a version of SGD that scales the learning rate for each
parameter according to the history of gradients. As a result, the learning rate is
reduced for very large gradients and vice-versa.

• It adapts the learning rate to the parameters, performing smaller updates (low learning
rates) for parameters associated with frequently occurring features, and larger updates
(high learning rates) for parameters associated with infrequent features. For this
reason, it is well-suited for dealing with sparse data.
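A minimal Python sketch of the Adagrad update for a NumPy parameter vector; the names and the learning-rate/epsilon values are illustrative assumptions:

import numpy as np

# Adagrad sketch: each parameter keeps a running sum of its squared gradients,
# and its effective step size shrinks as that sum grows.

def adagrad_step(theta, G, grad, alpha=0.01, eps=1e-8):
    G = G + grad**2                                    # per-parameter gradient history
    theta = theta - alpha * grad / (np.sqrt(G) + eps)  # bigger history => smaller step
    return theta, G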



Adam

• Adaptive Moment Estimation takes both momentum and an adaptive learning rate (RMSprop) and puts them together.

• Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
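A minimal Python sketch of the Adam update, combining a momentum-style first moment with an RMSprop-style second moment; the default hyperparameter values shown are commonly used ones, included here as assumptions:

import numpy as np

# Adam sketch: m is a momentum-like average of gradients, v is an RMSprop-like
# average of squared gradients; both are bias-corrected before the update.

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                 # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2              # second moment (adaptive scale)
    m_hat = m / (1 - beta1**t)                         # bias correction, t = 1, 2, ...
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v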



Final message!

Notice that gradient descent and its variants are not machine
learning algorithms. They are solvers of minimization problems
in which the function to minimize has a gradient (at most points
of its domain).



Question of the day!
