5 - Gradient Descent Methods
UCLA Math156: Machine Learning
Instructor: Lara Kassab
Gradient Descent Algorithm
Gradient Descent (GD) is a widely used first-order iterative
optimization algorithm for finding a local minimizer of a function
f : R^n → R.
Algorithm 1 Gradient Descent (GD)
1: Input: initial point w(0); learning rate ηk ≥ 0; maximum number of iterations K
2: for k = 0, . . . , K − 1 do
3:   w(k+1) = w(k) − ηk ∇f(w(k))
4: end for
5: Return w(K)
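As a concrete illustration, here is a minimal Python sketch of Algorithm 1 with a fixed learning rate. The objective f(w) = ∥w∥² (gradient 2w, unique minimizer w* = 0) is an illustrative choice, not from the lecture.

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, K=100):
    """Run K iterations of GD with a fixed step size eta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(K):
        w = w - eta * grad_f(w)   # w(k+1) = w(k) - eta_k * grad f(w(k))
    return w

# Illustrative objective: f(w) = ||w||^2, so grad f(w) = 2w.
w_final = gradient_descent(lambda w: 2 * w, w0=[5.0, -3.0], eta=0.1, K=100)
print(w_final)  # approaches the minimizer [0, 0]
```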
Gradient Descent Algorithm
Intuition: walking downhill using only the slope you “feel” nearby.
Figure 1: Illustration of gradient descent on level sets of a function from
R2 → R. Image source: Wikipedia.
Gradient Descent Algorithm
Note that GD in its most general form is not guaranteed to
converge, be a descent algorithm, or reach a local minimizer.
Figure 2: Functions from R2 → R with local and global minimizers.
Image source: Charu C. Aggarwal textbook.
Gradient Descent Algorithm
Many variants of GD exist, either to address challenges in
standard GD or to tailor it to specific problems.
GD and its variants are heavily used for large and/or
high-dimensional datasets, when a direct solution (e.g., solving
∇f (w∗ ) = 0) cannot be computed, takes too long, or is
numerically unstable.
Further, since GD is an iterative method, it gives us more
control over the learning process than direct solutions
(when those are computable).
Learning Rate
How do we choose the learning rate (or step size) ηk ≥ 0?
Too small learning rate → slow convergence. Too large
learning rate → possible divergence.
Classic techniques: use a fixed step size; set a decaying
learning rate; line search for an optimal step size; low-accuracy
line search; etc.
There are many advanced ways to control learning rates. We
will see later in the course adaptive learning rates (highly used
for training neural networks).
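To see the effect of the step size concretely, consider GD on the toy function f(w) = w² (gradient 2w), where the iteration becomes w ← (1 − 2η)w and converges iff |1 − 2η| < 1. The thresholds below are specific to this illustrative function.

```python
def run_gd(eta, w0=1.0, K=50):
    """GD on f(w) = w^2; the update is w <- (1 - 2*eta) * w."""
    w = w0
    for _ in range(K):
        w = w - eta * 2 * w
    return w

print(abs(run_gd(eta=0.01)))  # small eta: still far from 0 after 50 steps (slow)
print(abs(run_gd(eta=0.4)))   # moderate eta: essentially at the minimizer 0
print(abs(run_gd(eta=1.1)))   # too large: |w| grows each step (divergence)
```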
What Are Some Stopping Criteria?
Stopping Criterion                                      Description
∥∇f(w(k))∥ < ε                                          Gradient norm is small
|f(w(k+1)) − f(w(k))| < ε                               Small decrease in objective value
∥w(k+1) − w(k)∥ < ε                                     Small change in approximation
|f(w(k+1)) − f(w(k))| / max{1, |f(w(k))|} < ε           Relative change & numerical stability
∥w(k+1) − w(k)∥ / max{1, ∥w(k)∥} < ε                    Relative change & numerical stability
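A sketch of how two of these criteria might be wired into the GD loop; the test function, tolerance, and step size are illustrative assumptions.

```python
import numpy as np

def gd_with_stopping(grad_f, w0, eta=0.1, K=10_000, eps=1e-8):
    """GD that stops on small gradient norm or small relative change."""
    w = np.asarray(w0, dtype=float)
    for k in range(K):
        g = grad_f(w)
        if np.linalg.norm(g) < eps:                        # ||grad f(w(k))|| < eps
            return w, k
        w_new = w - eta * g
        rel = np.linalg.norm(w_new - w) / max(1.0, np.linalg.norm(w))
        if rel < eps:                                      # relative change < eps
            return w_new, k + 1
        w = w_new
    return w, K

# Illustrative objective f(w) = ||w||^2 with grad f(w) = 2w.
w, iters = gd_with_stopping(lambda w: 2 * w, w0=[4.0, 2.0])
print(iters)  # well below the maximum K: a criterion triggered early
```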
What Are Some Stopping Criteria?
These are some stopping criteria that can be used.
→ Note that ∇f(w(k)) = 0 means the first-order necessary condition
(FONC) is satisfied, so we can stop, but it does not guarantee that a
local minimizer has been found.
→ We will discuss later in the course useful approaches such as
momentum (highly used for training neural networks).
Stochastic Gradient Descent
Many loss functions in machine learning problems are additively
separable:

minimize over w:  f(w) := Σ_{n=1}^{N} fn(w)

Additively separable, data-centric objective functions are common
in machine learning. For example, each fn(w) is the loss term
associated with an individual sample point n.
Stochastic Gradient Descent
Many loss functions in machine learning problems are additively
separable:

minimize over w:  f(w) := Σ_{n=1}^{N} fn(w)

For example, the least squares problem with N sample points:

minimize over w ∈ R^M:  ∥Aw − b∥₂² := Σ_{n=1}^{N} (an⊤w − bn)²

where the left-hand side is f(w) and each summand (an⊤w − bn)²
plays the role of fn(w).
Stochastic Gradient Descent
Algorithm 2 Stochastic Gradient Descent (SGD)
1: Input: initial point w(0); learning rate ηk ≥ 0; maximum number of iterations K
2: for k = 0, · · · , K − 1 do
3: Randomly select n ∈ {1, . . . , N }
4: w(k+1) = w(k) − ηk ∇fn (w(k) )
5: end for
6: Return w(K)
Idea: use a randomly selected sub-function fn and its gradient ∇fn
instead of computing the full gradient ∇f (costly!), thereby
approximating the full gradient ∇f(w).
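A minimal sketch of Algorithm 2 on a toy additively separable objective, f(w) = Σn (w − cn)², whose minimizer is the mean of the cn. The synthetic data, decaying step-size schedule, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: f_n(w) = (w - c_n)^2, so grad f_n(w) = 2 (w - c_n).
c = rng.normal(loc=3.0, scale=1.0, size=1000)

def sgd(c, eta0=0.5, K=20_000):
    """SGD with one randomly sampled f_n per iteration and decaying eta_k."""
    w = 0.0
    for k in range(K):
        n = rng.integers(len(c))          # randomly select n in {1, ..., N}
        eta_k = eta0 / (1 + k / 100)      # decaying learning rate
        w = w - eta_k * 2 * (w - c[n])    # w(k+1) = w(k) - eta_k * grad f_n
    return w

print(sgd(c))  # close to c.mean(), the minimizer of the full objective
```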
Example: Least-Mean-Squares Algorithm
Consider the least squares problem as an example,

minimize over w:  ∥t − Φw∥₂² = Σ_{n=1}^{N} (tn − w⊤ϕ(xn))²

The SGD update with step size η ≥ 0 is:

w(k+1) = w(k) + 2η (tn − (w(k))⊤ϕ(xn)) ϕ(xn)

(the factor 2 from the gradient is often absorbed into η).
This is known as the least-mean-squares algorithm.
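A sketch of the least-mean-squares update on synthetic data. Here ϕ(x) = x (identity features), and the data, step size, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data: t_n = w_true^T x_n + small noise.
X = rng.normal(size=(200, 2))
w_true = np.array([0.7, -1.3])
t = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(2)
eta = 0.05
for _ in range(3000):
    n = rng.integers(len(X))
    # LMS update: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)
    w = w + eta * (t[n] - X[n] @ w) * X[n]

print(w)  # close to w_true = [0.7, -1.3]
```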
Stochastic Gradient Descent
Figure 3: Illustration of noisy gradient approximations in SGD.
Gradient Descent Methods Comparisons
Suppose each fn(w) in f(w) = Σ_{n=1}^{N} fn(w) is associated with the
loss term of an individual sample point n in an ML setting.
1 GD uses the entire N data points to compute the gradient.
SGD uses one randomly sampled point at each iteration to
approximate the full gradient.
2 For large datasets, every iteration of GD is very slow, while
SGD needs a significant number of iterations to roughly cover
all the data; however, each SGD iteration is far less
computationally intensive (and SGD is usually faster overall).
3 Mini-batch SGD serves as a middle case and a “smoother”
approach.
Mini-Batch SGD
Algorithm 3 Mini-Batch SGD
1: Input: initial point w(0); learning rate ηk ≥ 0; maximum number of iterations K; batch size |Bk| ≪ N
2: for k = 0, . . . , K − 1 do
3:   Randomly select a batch Bk ⊆ {1, . . . , N}
4:   w(k+1) = w(k) − ηk Σ_{n∈Bk} ∇fn(w(k))
5: end for
6: Return w(K)
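A sketch of Algorithm 3 for the least squares objective from earlier; the synthetic data, batch size, and step size are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic least squares data with a known solution.
A = rng.normal(size=(500, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
b = A @ w_true

def minibatch_sgd(A, b, batch_size=32, eta=0.005, K=2000):
    """Mini-batch SGD for f(w) = sum_n (a_n^T w - b_n)^2."""
    N, M = A.shape
    w = np.zeros(M)
    for _ in range(K):
        batch = rng.choice(N, size=batch_size, replace=False)  # B_k
        # Sum of per-sample gradients 2 (a_n^T w - b_n) a_n over the batch.
        grad = 2 * A[batch].T @ (A[batch] @ w - b[batch])
        w = w - eta * grad
    return w

print(minibatch_sgd(A, b))  # approaches w_true = [2, -1, 0.5, 3]
```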
Gradient Descent Methods Comparisons
While SGD methods have lower accuracy than gradient-descent
methods on training data, they often perform comparably or even
better on the test data than GD.
→ This is because the random sampling of training instances
during optimization reduces overfitting.
Gradient Descent Methods Comparisons
There are some subtle differences between how optimization is used
in machine learning and how it is used in traditional optimization.
An important difference is that traditional optimization
focuses on learning the parameters so as to optimize the
objective function as much as possible.
However, in machine learning, we seek to learn a model that
can generalize well to unseen data.
Feature Preprocessing
Now is a good time to discuss feature preprocessing, particularly,
feature scaling.
→ Vastly varying sensitivities of the loss function to different
parameters tend to hurt the learning. This issue can be controlled
by the scale of the features.
Feature Scaling
Consider a linear regression model that uses coefficient w1 for Feature
1 and coefficient w2 for Feature 2 to predict the target:
t = w1 x1 + w2 x2
Feature 1 (x1 ) Feature 2 (x2 ) Target (t)
0.1 25 7
0.8 10 1
0.4 10 4
Table 1: Toy dataset for illustration.
Feature Scaling
One can write the least-squares objective function E(w) as:

E(w) = Σ_{n=1}^{3} (w⊤xn − tn)² = 0.81w1² + 825w2² + 29w1w2 − 6.2w1 − 450w2 + 66
→ The objective function is far more sensitive to w2 as compared
to w1 .
→ This is caused by the fact that Feature 2 is on a much larger
scale (resulting in a “dramatic” variance) than Feature 1. This
shows up in the coefficients of the objective function.
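The expanded coefficients above can be checked numerically against the raw objective on the dataset of Table 1; this small verification script is an illustration, not part of the lecture.

```python
import numpy as np

# Toy dataset from Table 1: columns are (x1, x2), targets t.
X = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])
t = np.array([7.0, 1.0, 4.0])

def E(w1, w2):
    """Raw least-squares objective: sum_n (w^T x_n - t_n)^2."""
    w = np.array([w1, w2])
    return np.sum((X @ w - t) ** 2)

def E_expanded(w1, w2):
    """Expanded polynomial form stated in the slides."""
    return 0.81*w1**2 + 825*w2**2 + 29*w1*w2 - 6.2*w1 - 450*w2 + 66

print(np.isclose(E(1.0, 2.0), E_expanded(1.0, 2.0)))  # True
```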
Feature Scaling
As a result, the gradient will often bounce along the w2 direction,
while making tiny progress along the w1 direction.
→ However, if we standardize each feature to zero mean and unit
variance, the coefficients of w12 and w22 will become much more
similar. Recommended Exercise.
Feature Scaling
As a result, the bouncing behavior of gradient descent is typically
reduced and convergence becomes faster.
→ In this particular example, the interaction term of the form
w1 w2 causes some additional issues that can be addressed by a
procedure called whitening.
Feature Scaling: Standardization
Feature Standardization rescales features to have zero mean and
unit variance:
Let µj denote the mean and σj the standard deviation of
feature j across all training data points. Replace the jth
feature of each sample xn with:

(xn)j ← ((xn)j − µj) / σj
In feature scaling, or preprocessing in general, we should not
use the test set to learn the transformation. Further, we must
apply the same transformation learned on the train set to
the test set.
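A sketch of standardization with the train/test discipline described above; the small train and test arrays (reusing the Table 1 features) are illustrative.

```python
import numpy as np

X_train = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])
X_test = np.array([[0.5, 20.0]])

mu = X_train.mean(axis=0)      # mu_j, learned on the train set only
sigma = X_train.std(axis=0)    # sigma_j, learned on the train set only

X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma   # reuse the train statistics

print(X_train_std.mean(axis=0))  # ~[0, 0]: zero mean per feature
print(X_train_std.std(axis=0))   # [1, 1]: unit variance per feature
```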
Feature Scaling: Normalization
Feature Normalization (also known as the min-max normalization),
rescales the range of features to [0, 1]:
Let Mj and mj denote the maximum and minimum values of
feature j, respectively, across all training data points. Replace
the jth feature of each sample xn with:

(xn)j ← ((xn)j − mj) / (Mj − mj)
We can also normalize to a more general interval [a, b].
Note that outliers can cause problems with feature
normalization.
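A sketch of min-max normalization on the same toy features; as with standardization, Mj and mj would be learned on the train set only.

```python
import numpy as np

X_train = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])

m = X_train.min(axis=0)            # m_j per feature
M = X_train.max(axis=0)            # M_j per feature
X_norm = (X_train - m) / (M - m)   # rescale each feature to [0, 1]

print(X_norm.min(axis=0))  # [0, 0]
print(X_norm.max(axis=0))  # [1, 1]
```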
Feature Scaling: Scaling to unit length
Scaling to unit length rescales the feature vector to have norm one.
If vj ∈ R^D denotes the jth feature vector over the train set,

vj ← vj / ∥vj∥

We can use any desired vector norm ∥·∥, e.g., the Euclidean norm.
Which Feature Scaling Technique to Use?
There are many feature scaling techniques.
→ Whether or not one is required and which one to choose,
depends on the model, problem domain, data types, etc. Always
check what a model requires and understand how the
transformation affects the learning.