
5 - Gradient Descent Methods

UCLA Math156: Machine Learning


Instructor: Lara Kassab
Gradient Descent Algorithm

Gradient Descent (GD) is a very widely used first-order iterative
optimization algorithm for finding a local minimizer of a function
f : R^n → R.

Algorithm 1 Gradient Descent (GD)


1: Input: initial point w(0) ; learning rate ηk ≥ 0; maximum num-
ber of iterations K
2: for k = 0, · · · , K − 1 do
3: w(k+1) = w(k) − ηk ∇f (w(k) )
4: end for
5: Return w(K)
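Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative implementation with a constant step size; the quadratic objective, step size, and iteration count are chosen only for the example.

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, K=100):
    """Run K iterations of gradient descent with constant step size eta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(K):
        w = w - eta * grad_f(w)  # w(k+1) = w(k) - eta * grad f(w(k))
    return w

# Example: f(w) = (w1 - 3)^2 + (w2 + 1)^2 has its minimizer at (3, -1).
def grad(w):
    return 2 * (w - np.array([3.0, -1.0]))

w_star = gradient_descent(grad, w0=[0.0, 0.0], eta=0.1, K=200)
print(w_star)  # close to [3, -1]
```

On this convex quadratic, each iteration shrinks the distance to the minimizer by a constant factor, so convergence is fast; in general (see the note above) no such guarantee holds.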
Gradient Descent Algorithm

Intuition: walking downhill using only the slope you “feel” nearby.

Figure 1: Illustration of gradient descent on level sets of a function from


R2 → R. Image source: Wikipedia.
Gradient Descent Algorithm

Note that GD in its most general form is not guaranteed to


converge, be a descent algorithm, or reach a local minimizer.

Figure 2: Functions from R2 → R with local and global minimizers.


Image source: Charu C. Aggarwal textbook.
Gradient Descent Algorithm

Many variants of GD exist, whether to address challenges in
standard GD or to tailor it to specific problems.
GD and its variants are widely used for large and/or
high-dimensional datasets when a direct solution cannot be
computed (e.g., by solving ∇f (w∗ ) = 0), takes too long, or is
numerically unstable.
Further, given the nature of GD as an iterative method, we
gain more control over the learning process when using GD
over direct solutions (if they are computable).
Learning Rate

How do we choose the learning rate (or step size) ηk ≥ 0?

Too small a learning rate → slow convergence. Too large a
learning rate → possible divergence.
Classic techniques: use a fixed step size; set a decaying
learning rate; line search for the optimal step size; low-accuracy
line search; etc.
There are many advanced ways to control learning rates. Later
in the course we will see adaptive learning rates (widely used
for training neural networks).
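The slow-convergence/divergence trade-off can be seen on the toy function f(w) = w^2, whose gradient is 2w, so the GD update has the closed form w ← (1 − 2η)w. The step sizes below are chosen only to illustrate the three regimes.

```python
# Effect of the learning rate on f(w) = w^2 (gradient 2w).
# The GD update is w <- (1 - 2*eta) * w, so |1 - 2*eta| > 1 diverges.
def run_gd(eta, w0=1.0, K=50):
    w = w0
    for _ in range(K):
        w = w - eta * 2 * w
    return abs(w)

print(run_gd(0.4))   # fast convergence: contraction factor |1 - 0.8| = 0.2
print(run_gd(0.01))  # slow convergence: contraction factor |1 - 0.02| = 0.98
print(run_gd(1.5))   # divergence:       amplification factor |1 - 3.0| = 2.0
```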
What Are Some Stopping Criteria?

Stopping Criterion                                      Description

∥∇f (w(k) )∥ < ε                                        Gradient norm is small

|f (w(k+1) ) − f (w(k) )| < ε                           Small decrease in objective value

∥w(k+1) − w(k) ∥ < ε                                    Small change in approximation

|f (w(k+1) ) − f (w(k) )| / max{1, |f (w(k) )|} < ε     Relative change & numerical stability

∥w(k+1) − w(k) ∥ / max{1, ∥w(k) ∥} < ε                  Relative change & numerical stability
What Are Some Stopping Criteria?

These are some stopping criteria that can be used.

→ Note that having ∇f (w(k) ) = 0 means the first-order necessary
condition (FONC) is satisfied, so we can stop, but this does not
guarantee that a local minimizer has been found.

→ Later in the course we will discuss useful approaches such as
momentum (widely used for training neural networks).
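As a sketch, GD with the relative-change stopping criterion from the table might look as follows; the tolerance, step size, and test function are illustrative choices, not values from the slides.

```python
import numpy as np

def gd_with_stopping(grad_f, w0, eta=0.1, K=10000, eps=1e-8):
    """GD that stops early when the relative change in the iterate is small."""
    w = np.asarray(w0, dtype=float)
    for k in range(K):
        w_new = w - eta * grad_f(w)
        # Relative-change criterion: ||w(k+1) - w(k)|| / max{1, ||w(k)||} < eps
        if np.linalg.norm(w_new - w) / max(1.0, np.linalg.norm(w)) < eps:
            return w_new, k + 1
        w = w_new
    return w, K

def grad(w):
    return 2 * w  # f(w) = ||w||^2, minimized at the origin

w_star, iters = gd_with_stopping(grad, w0=[5.0, -2.0])
print(iters)  # stops well before the cap K = 10000
```

The `max{1, ·}` denominator keeps the test meaningful both near and far from the origin, which is the "numerical stability" point in the table.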
Stochastic Gradient Descent

Many loss functions in machine learning problems are additively
separable:

minimize_w f (w) := ∑_{n=1}^N f_n (w)

Additively separable, data-centric objective functions are common
in machine learning. For example, each f_n (w) is the loss term
associated with an individual sample point n.
Stochastic Gradient Descent

Many loss functions in machine learning problems are additively
separable:

minimize_w f (w) := ∑_{n=1}^N f_n (w)

For example, the least squares problem with N sample points:

minimize_{w ∈ R^M} ∥Aw − b∥₂² := ∑_{n=1}^N (a_n^⊤ w − b_n)²

where ∥Aw − b∥₂² is f (w) and each term (a_n^⊤ w − b_n)² is f_n (w).
Stochastic Gradient Descent

Algorithm 2 Stochastic Gradient Descent (SGD)


1: Input: initial point w(0) ; learning rate ηk ≥ 0; maximum num-
ber of iterations K
2: for k = 0, · · · , K − 1 do
3: Randomly select n ∈ {1, . . . , N }
4: w(k+1) = w(k) − ηk ∇fn (w(k) )
5: end for
6: Return w(K)

Idea. Use a random sub-function f_n to compute the gradient ∇f_n
instead of computing the full gradient ∇f (costly!), thus
approximating the full gradient ∇f (w).
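A minimal sketch of Algorithm 2 on a toy least-squares problem; the data, random seed, and step size are invented for illustration, and since the targets are noiseless the iterates approach the true weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(w) = sum_n (a_n^T w - b_n)^2
A = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true  # noiseless targets, so w_true is the exact minimizer

def sgd(A, b, eta=0.01, K=5000):
    w = np.zeros(A.shape[1])
    for _ in range(K):
        n = rng.integers(len(b))               # randomly select n in {1,...,N}
        grad_n = 2 * (A[n] @ w - b[n]) * A[n]  # gradient of f_n only
        w = w - eta * grad_n
    return w

w_hat = sgd(A, b)
print(w_hat)  # approximately w_true
```

Each iteration touches a single row of A, which is the whole point: the per-step cost is independent of N.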
Example: Least-Mean-Squares Algorithm

Consider the least squares problem as an example,

minimize_w ∥t − Φw∥₂² = ∑_{n=1}^N (t_n − w^⊤ ϕ(x_n))²

The SGD update with step size η ≥ 0 is:

w(k+1) = w(k) + 2η (t_n − (w(k) )^⊤ ϕ(x_n)) ϕ(x_n)

since ∇f_n (w) = −2 (t_n − w^⊤ ϕ(x_n)) ϕ(x_n). This is known as the
least-mean-squares (LMS) algorithm.


Stochastic Gradient Descent

Figure 3: Illustration of noisy gradient approximations in SGD.


Gradient Descent Methods Comparisons

Suppose each f_n (w) in f (w) = ∑_{n=1}^N f_n (w) is the loss term
associated with an individual sample point n in an ML setting.
1 GD uses the entire N data points to compute the gradient.
SGD uses one randomly sampled point at each iteration to
approximate the full gradient.
2 For large datasets, every GD iteration is very slow, while
SGD needs a significant number of iterations to roughly
cover all the data but each iteration is far less computationally
intensive (and SGD is usually faster overall).
3 Mini-batch SGD serves as a middle case and a “smoother”
approach.
Mini-Batch SGD

Algorithm 3 Mini-Batch SGD


1: Input: initial point w(0) ; learning rate ηk ≥ 0; maximum num-
ber of iterations K; batch size |Bk | ≪ N
2: for k = 0, · · · , K − 1 do
3: Randomly select a batch Bk ⊆ {1, . . . , N }
4: w(k+1) = w(k) − ηk ∑_{n∈Bk} ∇fn (w(k) )
5: end for
6: Return w(K)
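Algorithm 3 differs from SGD only in step 3–4: a batch of indices is drawn and the per-sample gradients are summed. A sketch on the same kind of toy problem (data, seed, batch size, and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
w_true = np.array([2.0, 0.0, -1.0])
b = A @ w_true  # noiseless targets

def minibatch_sgd(A, b, batch_size=16, eta=0.005, K=3000):
    w = np.zeros(A.shape[1])
    for _ in range(K):
        batch = rng.choice(len(b), size=batch_size, replace=False)
        # Sum of per-sample gradients over the batch, as in Algorithm 3
        grad = 2 * A[batch].T @ (A[batch] @ w - b[batch])
        w = w - eta * grad
    return w

w_hat = minibatch_sgd(A, b)
print(w_hat)  # approximately w_true
```

Larger batches give less noisy gradient estimates at a higher per-iteration cost, which is the "middle case" trade-off described above.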
Gradient Descent Methods Comparisons

While SGD methods have lower accuracy than gradient-descent
methods on the training data, they often perform comparably to, or
even better than, GD on the test data.

→ This is because the random sampling of training instances
during optimization reduces overfitting.
Gradient Descent Methods Comparisons

There are some subtle differences between how optimization is used
in machine learning and how it is used in traditional optimization.

An important difference is that traditional optimization


focuses on learning the parameters so as to optimize the
objective function as much as possible.
However, in machine learning, we seek to learn a model that
can generalize well to unseen data.
Feature Preprocessing

Now is a good time to discuss feature preprocessing, particularly,


feature scaling.

→ Vastly varying sensitivities of the loss function to different


parameters tend to hurt the learning. This issue can be controlled
by the scale of the features.
Feature Scaling

Consider a linear regression model that uses the coefficient w1 for
Feature 1 and the coefficient w2 for Feature 2 in order to predict
the target:

t = w1 x 1 + w2 x 2

Feature 1 (x1 ) Feature 2 (x2 ) Target (t)

0.1 25 7
0.8 10 1
0.4 10 4
Table 1: Toy dataset for illustration.
Feature Scaling

One can model the least-squares objective function E(w) as:

E(w) = ∑_{n=1}^3 (w^⊤ x_n − t_n)² = 0.81w1² + 825w2² + 29w1 w2 − 6.2w1 − 450w2 + 66

→ The objective function is far more sensitive to w2 than to w1.

→ This is caused by the fact that Feature 2 is on a much larger
scale (resulting in a “dramatic” variance) than Feature 1. This
shows up in the coefficients of the objective function.
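These coefficients can be checked numerically from Table 1, since E(w) = ∑ (w^⊤x_n − t_n)² expands with coefficient ∑ x_j² on w_j², coefficient 2∑ x1 x2 on w1 w2, coefficient −2∑ t x_j on w_j, and constant ∑ t²:

```python
import numpy as np

# Toy dataset from Table 1
X = np.array([[0.1, 25.0],
              [0.8, 10.0],
              [0.4, 10.0]])
t = np.array([7.0, 1.0, 4.0])

coef_w1sq = np.sum(X[:, 0] ** 2)            # coefficient of w1^2 -> 0.81
coef_w2sq = np.sum(X[:, 1] ** 2)            # coefficient of w2^2 -> 825
coef_cross = 2 * np.sum(X[:, 0] * X[:, 1])  # coefficient of w1*w2 -> 29
coef_w1 = -2 * np.sum(t * X[:, 0])          # coefficient of w1 -> -6.2
coef_w2 = -2 * np.sum(t * X[:, 1])          # coefficient of w2 -> -450
const = np.sum(t ** 2)                      # constant term -> 66
print(coef_w1sq, coef_w2sq, coef_cross, coef_w1, coef_w2, const)
```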
Feature Scaling

As a result, the gradient will often bounce along the w2 direction,


while making tiny progress along the w1 direction.

→ However, if we standardize each feature to zero mean and unit


variance, the coefficients of w12 and w22 will become much more
similar. Recommended Exercise.
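The recommended exercise can be sketched numerically: after standardizing each column of the toy dataset, the coefficient of w_j² becomes ∑_n x̃_j², which equals N = 3 for both features, so the two sensitivities match exactly.

```python
import numpy as np

X = np.array([[0.1, 25.0],
              [0.8, 10.0],
              [0.4, 10.0]])

# Standardize each feature to zero mean and unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# The quadratic coefficient on w_j^2 is sum_n x_j^2; after standardization
# sum_n ((x - mu)/sigma)^2 = N * sigma^2 / sigma^2 = N = 3 for both features.
print(np.sum(Xs[:, 0] ** 2), np.sum(Xs[:, 1] ** 2))  # 3.0 and 3.0
```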
Feature Scaling

As a result, the bouncing behavior of gradient descent is typically


reduced and convergence becomes faster.

→ In this particular example, the interaction term of the form


w1 w2 causes some additional issues that can be addressed by a
procedure called whitening.
Feature Scaling: Standardization

Feature Standardization rescales features to have zero mean and
unit variance:

Let µj denote the mean and σj the standard deviation of
feature j across all training data points. Replace the jth
feature of each sample xn with:

(xn )j ← ((xn )j − µj ) / σj

In feature scaling, or preprocessing in general, we should not
use the test set to learn the transformation. Further, we must
apply the same learned transformation from the training set to
the test set.
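The train/test discipline described above can be sketched as follows; the test point is made up for illustration. µ_j and σ_j are computed from the training data only and then reused on the test data.

```python
import numpy as np

X_train = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])
X_test = np.array([[0.5, 20.0]])  # hypothetical unseen point

# Learn mu_j and sigma_j from the TRAINING data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_s = (X_train - mu) / sigma
# Apply the SAME learned transformation to the test set
X_test_s = (X_test - mu) / sigma

print(X_train_s.mean(axis=0))  # approximately [0, 0]
print(X_train_s.std(axis=0))   # approximately [1, 1]
```

Note that the standardized test features generally do not have zero mean or unit variance; only the training set does, by construction.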
Feature Scaling: Normalization

Feature Normalization (also known as min-max normalization)
rescales the range of features to [0, 1]:

Let Mj and mj denote the maximum and minimum values of
feature j, respectively, across all training data points.

(xn )j ← ((xn )j − mj ) / (Mj − mj )

We can also normalize to a more general interval [a, b].


Note that outliers can cause problems with feature
normalization.
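A sketch of min-max normalization on the toy dataset, including rescaling to a general interval [a, b] (the interval endpoints are illustrative):

```python
import numpy as np

X_train = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])

# Min-max normalization to [0, 1], learned from the training data
M = X_train.max(axis=0)
m = X_train.min(axis=0)
X_norm = (X_train - m) / (M - m)
print(X_norm.min(axis=0))  # [0, 0]
print(X_norm.max(axis=0))  # [1, 1]

# General interval [a, b]: a + (b - a) * (x - m) / (M - m)
a, b = -1.0, 1.0
X_ab = a + (b - a) * X_norm
print(X_ab.min(axis=0), X_ab.max(axis=0))  # [-1, -1] and [1, 1]
```

A single extreme outlier would set M or m and squeeze all other values into a narrow sub-interval, which is the outlier sensitivity noted above.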
Feature Scaling: Scaling to unit length

Scaling to unit length rescales each feature vector to have norm one.
If vj ∈ RD denotes the jth feature vector over the training set,

vj ← vj / ∥vj ∥

We can use any desired vector norm ∥ · ∥, e.g., the Euclidean norm.


Which Feature Scaling Technique to Use?

There are many feature scaling techniques.

→ Whether or not one is required and which one to choose,


depends on the model, problem domain, data types, etc. Always
check what a model requires and understand how the
transformation affects the learning.
