Class 5 - Machine Learning concepts
Part II
Motivation
Machine learning fundamental concepts:
• Inference and prediction
• Part I: The Model
• Part II: Evaluation metrics
• Part III: Bias-Variance tradeoff
• Part IV: Resampling methods
• Part V: Solvers/learners (GD, SGD, Adagrad, Adam, …)
• Part VI: How do machines learn?
• Part VII: Scaling the features
Part V
Solvers (GD, SGD, Adagrad, Adam, …)
Solvers (learners)!
A loss function tells us "how good" our model is at making predictions for a given set of parameters. The loss
function has its own curve and its own gradients, and the slope of this curve tells us how to update our parameters to
make the model more accurate.
The two most frequently used optimization algorithms when the loss function is differentiable are:
1) Gradient Descent (GD)
2) Stochastic Gradient Descent (SGD)
Gradient Descent: an iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a
function using gradient descent, one starts at some random point and takes steps proportional to the negative of the gradient of
the function at the current point.
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

• θ_j is the model's j-th parameter
• α is the learning rate
• J(θ) is the loss function (which is differentiable)
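
A minimal sketch of this update rule, assuming a linear model with a mean-squared-error loss; the function name, data shapes, and default values are illustrative assumptions, not taken from the slides.

import numpy as np

def gradient_descent(X, y, alpha=0.1, epochs=100):
    # Fit linear-model parameters theta by repeatedly applying
    # theta_j := theta_j - alpha * dJ/dtheta_j, where J is the MSE loss.
    n, p = X.shape
    theta = np.zeros(p)                      # start from some initial point
    for _ in range(epochs):                  # one epoch = one pass over the full training set
        residual = X @ theta - y             # prediction errors
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the MSE loss w.r.t. theta
        theta -= alpha * grad                # step in the negative-gradient direction
    return theta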
Gradient Descent Visualization

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
Gradient descent proceeds in epochs. An epoch consists of one full pass over the training set, using all of it to update
each parameter. The learning rate α controls the size of an update.
(Figure: the loss J(θ_j) plotted against θ_j.)
Learning rate schedules
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
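
One way to manage this trade-off is a learning rate schedule that starts with a larger α and shrinks it over epochs. A minimal sketch, assuming a simple exponential decay; the schedule form and the constants are illustrative, not from the slides.

def decayed_alpha(alpha0, decay_rate, epoch):
    # Shrink the learning rate geometrically so early steps are large
    # and later steps are small enough not to overshoot the minimum.
    return alpha0 * decay_rate ** epoch

# Example: alpha0 = 0.5, decay_rate = 0.9 gives 0.5, 0.45, 0.405, ... over epochs 0, 1, 2, ...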
Beyond Gradient Descent?
Disadvantages of gradient descent:
• Single batch: the entire training set is used for every parameter update
• Sensitive to the choice of the learning rate
• Slow for large datasets
(Minibatch) Stochastic Gradient Descent: a version of the algorithm that speeds
up the computation by approximating the gradient using smaller batches (subsets)
of the training data (a minimal sketch follows the list below). SGD itself has various "upgrades":
1) Adagrad
2) Adam
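
A minimal minibatch SGD sketch, reusing the linear-model / MSE setup assumed in the gradient descent example above; the batch size and other defaults are illustrative assumptions.

import numpy as np

def sgd(X, y, alpha=0.01, epochs=50, batch_size=32, seed=0):
    # Approximate the gradient on small random batches instead of the full training set.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]   # a small subset of the training data
            residual = X[batch] @ theta - y[batch]
            grad = (2.0 / len(batch)) * (X[batch].T @ residual)  # noisy gradient estimate
            theta -= alpha * grad
    return theta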
Why SGD?
SGD vs GD
Why upgrade SGD?
Beyond Stochastic Gradient Descent?
Disadvantages of Stochastic Gradient Descent:
• It can get trapped in suboptimal local minima (for non-convex loss functions)
• The same learning rate applies to all parameter updates
SGD upgrades:
1) Momentum
2) Adagrad
3) Adam
Momentum
• Momentum is a method that helps accelerate SGD in the relevant direction and dampens
oscillations.
• Essentially, when using momentum, we push a ball down a hill. The ball accumulates
momentum as it rolls downhill, becoming faster and faster on the way.
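
A minimal sketch of one momentum update, written in the common form v := β·v + grad, θ := θ - α·v; the exact formulation and the constants vary across implementations and are an assumption here, not taken from the slides.

def momentum_step(theta, grad, velocity, alpha=0.01, beta=0.9):
    # The velocity accumulates past gradients, so steps speed up in a
    # consistent downhill direction and oscillations partly cancel out.
    velocity = beta * velocity + grad
    theta = theta - alpha * velocity
    return theta, velocity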
Adagrad
• The Adaptive Gradient Algorithm (Adagrad) is a version of SGD that scales the learning rate for each
parameter according to the history of its gradients. As a result, the learning rate is
reduced for parameters with very large gradients and vice versa.
• It adapts the learning rate to the parameters, performing smaller updates (low learning
rates) for parameters associated with frequently occurring features, and larger updates
(high learning rates) for parameters associated with infrequent features. For this
reason, it is well-suited for dealing with sparse data.
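
A minimal sketch of one Adagrad update, assuming NumPy arrays; the epsilon term is the usual small constant added for numerical stability, and the default values are illustrative.

import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, alpha=0.01, eps=1e-8):
    # Each parameter accumulates its own history of squared gradients, so
    # parameters with large past gradients get a smaller effective learning rate.
    grad_sq_sum = grad_sq_sum + grad ** 2
    theta = theta - alpha * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum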
Adam
• Adaptive Moment Estimation (Adam) takes both momentum and an adaptive learning rate (RMSprop) and puts them together.
• Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
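
A minimal sketch of one Adam update combining the two ideas; the default constants below are the commonly used ones and are included as an assumption rather than taken from the slides.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m is the momentum-like first moment, v the RMSprop-like second moment; t starts at 1.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moment estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v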
Final message!
Notice that gradient descent and its variants are not machine
learning algorithms. They are solvers of minimization problems
in which the function to minimize has a gradient (at most points
of its domain).
Question of the day!