Class 5 - Machine Learning concepts
Part II
Motivation
Machine learning fundamental concepts:
• Inference and prediction
• Part I: The Model
• Part II: Evaluation metrics
• Part III: Bias-Variance tradeoff
• Part IV: Resampling methods
• Part V: Solvers/learners (GD, SGD, Adagrad, Adam, …)
• Part VI: How do machines learn?
• Part VII: Scaling the features
Part V
Solvers (GD, SGD, Adagrad, Adam, …)
Solvers (learners)!
A loss function tells us "how good" our model is at making predictions for a given set of parameters. The loss
function has its own curve and its own gradients, and the slope of this curve tells us how to update our parameters to
make the model more accurate.
The two most frequently used optimization algorithms when the loss function is differentiable are:
1) Gradient Descent (GD)
2) Stochastic Gradient Descent (SGD)
Gradient Descent: an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a
function using gradient descent, one starts at some random point and takes steps proportional to the negative of the gradient of
the function at the current point.
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

• θ_j is the model's j-th parameter
• α is the learning rate
• J(θ) is the loss function (which is differentiable)
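
A minimal sketch of this update rule, assuming a linear model with a mean-squared-error loss; the function name, data shapes, and default values are illustrative assumptions, not taken from the slides.

import numpy as np

def gradient_descent(X, y, alpha=0.1, epochs=100):
    # Fit linear-model parameters theta by repeatedly applying
    # theta_j := theta_j - alpha * dJ/dtheta_j, where J is the MSE loss.
    n, p = X.shape
    theta = np.zeros(p)                      # start from some initial point
    for _ in range(epochs):                  # one epoch = one pass over the full training set
        residual = X @ theta - y             # prediction errors
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the MSE loss w.r.t. theta
        theta -= alpha * grad                # step in the negative-gradient direction
    return theta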
Gradient Descent Visualization

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
Gradient descent proceeds in epochs. An epoch consists of one full pass over the training set, using all of it to update
each parameter. The learning rate α controls the size of an update.
(Figure: the loss J(θ_j) plotted against θ_j.)
Learning rate schedules
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
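
One way to manage this trade-off is a learning rate schedule that starts with a larger α and shrinks it over epochs. A minimal sketch, assuming a simple exponential decay; the schedule form and the constants are illustrative, not from the slides.

def decayed_alpha(alpha0, decay_rate, epoch):
    # Shrink the learning rate geometrically so early steps are large
    # and later steps are small enough not to overshoot the minimum.
    return alpha0 * decay_rate ** epoch

# Example: alpha0 = 0.5, decay_rate = 0.9 gives 0.5, 0.45, 0.405, ... over epochs 0, 1, 2, ...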
Beyond Gradient Descent?
Disadvantages of gradient descent:
• Single batch: the entire training set is used for every parameter update
• Sensitive to the choice of the learning rate
• Slow for large datasets
(Minibatch) Stochastic Gradient Descent: a version of the algorithm that speeds
up the computation by approximating the gradient using smaller batches (subsets)
of the training data (a minimal sketch follows the list below). SGD itself has various "upgrades":
1) Adagrad
2) Adam
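
A minimal minibatch SGD sketch, reusing the linear-model / MSE setup assumed in the gradient descent example above; the batch size and other defaults are illustrative assumptions.

import numpy as np

def sgd(X, y, alpha=0.01, epochs=50, batch_size=32, seed=0):
    # Approximate the gradient on small random batches instead of the full training set.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]   # a small subset of the training data
            residual = X[batch] @ theta - y[batch]
            grad = (2.0 / len(batch)) * (X[batch].T @ residual)  # noisy gradient estimate
            theta -= alpha * grad
    return theta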
Why SGD?
SGD vs GD
Why upgrade SGD?
Beyond Stochastic Gradient Descent?
Disadvantages of Stochastic Gradient Descent:
• It can get trapped in suboptimal local minima (for non-convex loss functions)
• The same learning rate applies to all parameter updates
SGD upgrades:
1) Momentum
2) Adagrad
3) Adam
Momentum
• Momentum is a method that helps accelerate SGD in the relevant direction and dampens
oscillations.
• Essentially, when using momentum, we push a ball down a hill. The ball accumulates
momentum as it rolls downhill, becoming faster and faster on the way.
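
A minimal sketch of one momentum update, written in the common form v := β·v + grad, θ := θ - α·v; the exact formulation and the constants vary across implementations and are an assumption here, not taken from the slides.

def momentum_step(theta, grad, velocity, alpha=0.01, beta=0.9):
    # The velocity accumulates past gradients, so steps speed up in a
    # consistent downhill direction and oscillations partly cancel out.
    velocity = beta * velocity + grad
    theta = theta - alpha * velocity
    return theta, velocity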
Adagrad
• The Adaptive Gradient Algorithm (Adagrad) is a version of SGD that scales the learning rate for each
parameter according to the history of its gradients. As a result, the learning rate is
reduced for parameters with very large gradients and vice versa.
• It adapts the learning rate to the parameters, performing smaller updates (low learning
rates) for parameters associated with frequently occurring features, and larger updates
(high learning rates) for parameters associated with infrequent features. For this
reason, it is well-suited for dealing with sparse data.
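
A minimal sketch of one Adagrad update, assuming NumPy arrays; the epsilon term is the usual small constant added for numerical stability, and the default values are illustrative.

import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, alpha=0.01, eps=1e-8):
    # Each parameter accumulates its own history of squared gradients, so
    # parameters with large past gradients get a smaller effective learning rate.
    grad_sq_sum = grad_sq_sum + grad ** 2
    theta = theta - alpha * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum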
Adam
• Adaptive Moment Estimation (Adam) takes both momentum and an adaptive learning rate (RMSprop) and puts them together.
• Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
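
A minimal sketch of one Adam update combining the two ideas; the default constants below are the commonly used ones and are included as an assumption rather than taken from the slides.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m is the momentum-like first moment, v the RMSprop-like second moment; t starts at 1.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moment estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v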
Final message!
Notice that gradient descent and its variants are not machine
learning algorithms. They are solvers of minimization problems
in which the function to minimize has a gradient (at most points
of its domain).
Question of the day!