5 - Gradient Descent Methods
UCLA Math156: Machine Learning
Instructor: Lara Kassab
Gradient Descent Algorithm
Gradient Descent (GD) is a widely used first-order iterative
optimization algorithm for finding a local minimizer of a function
f : R^n → R.
Algorithm 1 Gradient Descent (GD)
1: Input: initial point w(0); learning rate ηk ≥ 0; maximum number of iterations K
2: for k = 0, . . . , K − 1 do
3:   w(k+1) = w(k) − ηk ∇f(w(k))
4: end for
5: Return w(K)
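As a concrete illustration, here is a minimal Python sketch of Algorithm 1 with a fixed learning rate. The objective f(w) = ∥w∥² (gradient 2w, unique minimizer w* = 0) is an illustrative choice, not from the lecture.

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, K=100):
    """Run K iterations of GD with a fixed step size eta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(K):
        w = w - eta * grad_f(w)   # w(k+1) = w(k) - eta_k * grad f(w(k))
    return w

# Illustrative objective: f(w) = ||w||^2, so grad f(w) = 2w.
w_final = gradient_descent(lambda w: 2 * w, w0=[5.0, -3.0], eta=0.1, K=100)
print(w_final)  # approaches the minimizer [0, 0]
```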
Gradient Descent Algorithm
Intuition: walking downhill using only the slope you “feel” nearby.
Figure 1: Illustration of gradient descent on level sets of a function from
R2 → R. Image source: Wikipedia.
Gradient Descent Algorithm
Note that GD in its most general form is not guaranteed to
converge, be a descent algorithm, or reach a local minimizer.
Figure 2: Functions from R2 → R with local and global minimizers.
Image source: Charu C. Aggarwal textbook.
Gradient Descent Algorithm
Many variants of GD exist, either to address challenges in
standard GD or to tailor it to specific problems.
GD and its variants are heavily used for large and/or
high-dimensional datasets, when a direct solution (e.g., solving
∇f (w∗ ) = 0) cannot be computed, takes too long, or is
numerically unstable.
Further, since GD is an iterative method, it gives us more
control over the learning process than direct solutions
(when those are computable).
Learning Rate
How do we choose the learning rate (or step size) ηk ≥ 0?
Too small learning rate → slow convergence. Too large
learning rate → possible divergence.
Classic techniques: use a fixed step size; set a decaying
learning rate; line search for an optimal step size; low-accuracy
line search; etc.
There are many advanced ways to control learning rates. We
will see later in the course adaptive learning rates (highly used
for training neural networks).
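To see the effect of the step size concretely, consider GD on the toy function f(w) = w² (gradient 2w), where the iteration becomes w ← (1 − 2η)w and converges iff |1 − 2η| < 1. The thresholds below are specific to this illustrative function.

```python
def run_gd(eta, w0=1.0, K=50):
    """GD on f(w) = w^2; the update is w <- (1 - 2*eta) * w."""
    w = w0
    for _ in range(K):
        w = w - eta * 2 * w
    return w

print(abs(run_gd(eta=0.01)))  # small eta: still far from 0 after 50 steps (slow)
print(abs(run_gd(eta=0.4)))   # moderate eta: essentially at the minimizer 0
print(abs(run_gd(eta=1.1)))   # too large: |w| grows each step (divergence)
```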
What Are Some Stopping Criteria?
Stopping Criterion                                      Description
∥∇f(w(k))∥ < ε                                          Gradient norm is small
|f(w(k+1)) − f(w(k))| < ε                               Small decrease in objective value
∥w(k+1) − w(k)∥ < ε                                     Small change in approximation
|f(w(k+1)) − f(w(k))| / max{1, |f(w(k))|} < ε           Relative change & numerical stability
∥w(k+1) − w(k)∥ / max{1, ∥w(k)∥} < ε                    Relative change & numerical stability
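A sketch of how two of these criteria might be wired into the GD loop; the test function, tolerance, and step size are illustrative assumptions.

```python
import numpy as np

def gd_with_stopping(grad_f, w0, eta=0.1, K=10_000, eps=1e-8):
    """GD that stops on small gradient norm or small relative change."""
    w = np.asarray(w0, dtype=float)
    for k in range(K):
        g = grad_f(w)
        if np.linalg.norm(g) < eps:                        # ||grad f(w(k))|| < eps
            return w, k
        w_new = w - eta * g
        rel = np.linalg.norm(w_new - w) / max(1.0, np.linalg.norm(w))
        if rel < eps:                                      # relative change < eps
            return w_new, k + 1
        w = w_new
    return w, K

# Illustrative objective f(w) = ||w||^2 with grad f(w) = 2w.
w, iters = gd_with_stopping(lambda w: 2 * w, w0=[4.0, 2.0])
print(iters)  # well below the maximum K: a criterion triggered early
```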
What Are Some Stopping Criteria?
These are some stopping criteria that can be used.
→ Note that ∇f(w(k)) = 0 means the first-order necessary condition
(FONC) is satisfied, so we can stop, but it does not guarantee that a
local minimizer has been found.
→ We will discuss later in the course useful approaches such as
momentum (highly used for training neural networks).
Stochastic Gradient Descent
Many loss functions in machine learning problems are additively
separable:

minimize over w:  f(w) := Σ_{n=1}^{N} fn(w)

Additively separable, data-centric objective functions are common
in machine learning. For example, each fn(w) is the loss term
associated with an individual sample point n.
Stochastic Gradient Descent
Many loss functions in machine learning problems are additively
separable:

minimize over w:  f(w) := Σ_{n=1}^{N} fn(w)

For example, the least squares problem with N sample points:

minimize over w ∈ R^M:  ∥Aw − b∥₂² := Σ_{n=1}^{N} (an⊤w − bn)²

where the left-hand side is f(w) and each summand (an⊤w − bn)²
plays the role of fn(w).
Stochastic Gradient Descent
Algorithm 2 Stochastic Gradient Descent (SGD)
1: Input: initial point w(0); learning rate ηk ≥ 0; maximum number of iterations K
2: for k = 0, · · · , K − 1 do
3: Randomly select n ∈ {1, . . . , N }
4: w(k+1) = w(k) − ηk ∇fn (w(k) )
5: end for
6: Return w(K)
Idea: use a randomly selected sub-function fn and its gradient ∇fn
instead of computing the full gradient ∇f (costly!), thereby
approximating the full gradient ∇f(w).
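A minimal sketch of Algorithm 2 on a toy additively separable objective, f(w) = Σn (w − cn)², whose minimizer is the mean of the cn. The synthetic data, decaying step-size schedule, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: f_n(w) = (w - c_n)^2, so grad f_n(w) = 2 (w - c_n).
c = rng.normal(loc=3.0, scale=1.0, size=1000)

def sgd(c, eta0=0.5, K=20_000):
    """SGD with one randomly sampled f_n per iteration and decaying eta_k."""
    w = 0.0
    for k in range(K):
        n = rng.integers(len(c))          # randomly select n in {1, ..., N}
        eta_k = eta0 / (1 + k / 100)      # decaying learning rate
        w = w - eta_k * 2 * (w - c[n])    # w(k+1) = w(k) - eta_k * grad f_n
    return w

print(sgd(c))  # close to c.mean(), the minimizer of the full objective
```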
Example: Least-Mean-Squares Algorithm
Consider the least squares problem as an example,

minimize over w:  ∥t − Φw∥₂² = Σ_{n=1}^{N} (tn − w⊤ϕ(xn))²

The SGD update with step size η ≥ 0 is:

w(k+1) = w(k) + 2η (tn − (w(k))⊤ϕ(xn)) ϕ(xn)

(the factor 2 from the gradient is often absorbed into η).
This is known as the least-mean-squares algorithm.
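A sketch of the least-mean-squares update on synthetic data. Here ϕ(x) = x (identity features), and the data, step size, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data: t_n = w_true^T x_n + small noise.
X = rng.normal(size=(200, 2))
w_true = np.array([0.7, -1.3])
t = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(2)
eta = 0.05
for _ in range(3000):
    n = rng.integers(len(X))
    # LMS update: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)
    w = w + eta * (t[n] - X[n] @ w) * X[n]

print(w)  # close to w_true = [0.7, -1.3]
```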
Stochastic Gradient Descent
Figure 3: Illustration of noisy gradient approximations in SGD.
Gradient Descent Methods Comparisons
Suppose each fn(w) in f(w) = Σ_{n=1}^{N} fn(w) is associated with the
loss term of an individual sample point n in an ML setting.
1 GD uses the entire N data points to compute the gradient.
SGD uses one randomly sampled point at each iteration to
approximate the full gradient.
2 For large datasets, every iteration of GD is very slow, while
SGD needs a significant number of iterations to roughly cover
all the data; however, each SGD iteration is far less
computationally intensive (and SGD is usually faster overall).
3 Mini-batch SGD serves as a middle case and a “smoother”
approach.
Mini-Batch SGD
Algorithm 3 Mini-Batch SGD
1: Input: initial point w(0); learning rate ηk ≥ 0; maximum number of iterations K; batch size |Bk| ≪ N
2: for k = 0, . . . , K − 1 do
3:   Randomly select a batch Bk ⊆ {1, . . . , N}
4:   w(k+1) = w(k) − ηk Σ_{n∈Bk} ∇fn(w(k))
5: end for
6: Return w(K)
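A sketch of Algorithm 3 for the least squares objective from earlier; the synthetic data, batch size, and step size are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic least squares data with a known solution.
A = rng.normal(size=(500, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
b = A @ w_true

def minibatch_sgd(A, b, batch_size=32, eta=0.005, K=2000):
    """Mini-batch SGD for f(w) = sum_n (a_n^T w - b_n)^2."""
    N, M = A.shape
    w = np.zeros(M)
    for _ in range(K):
        batch = rng.choice(N, size=batch_size, replace=False)  # B_k
        # Sum of per-sample gradients 2 (a_n^T w - b_n) a_n over the batch.
        grad = 2 * A[batch].T @ (A[batch] @ w - b[batch])
        w = w - eta * grad
    return w

print(minibatch_sgd(A, b))  # approaches w_true = [2, -1, 0.5, 3]
```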
Gradient Descent Methods Comparisons
While SGD methods have lower accuracy than gradient-descent
methods on training data, they often perform comparably or even
better on the test data than GD.
→ This is because the random sampling of training instances
during optimization reduces overfitting.
Gradient Descent Methods Comparisons
There are some subtle differences between how optimization is used
in machine learning and how it is used in traditional optimization.
An important difference is that traditional optimization
focuses on learning the parameters so as to optimize the
objective function as much as possible.
However, in machine learning, we seek to learn a model that
can generalize well to unseen data.
Feature Preprocessing
Now is a good time to discuss feature preprocessing, particularly,
feature scaling.
→ Vastly varying sensitivities of the loss function to different
parameters tend to hurt the learning. This issue can be controlled
by the scale of the features.
Feature Scaling
Consider a linear regression model that uses coefficient w1 for Feature
1 and coefficient w2 for Feature 2 to predict the target:
t = w1 x1 + w2 x2
Feature 1 (x1 ) Feature 2 (x2 ) Target (t)
0.1 25 7
0.8 10 1
0.4 10 4
Table 1: Toy dataset for illustration.
Feature Scaling
One can write the least-squares objective function E(w) as:

E(w) = Σ_{n=1}^{3} (w⊤xn − tn)² = 0.81w1² + 825w2² + 29w1w2 − 6.2w1 − 450w2 + 66
→ The objective function is far more sensitive to w2 as compared
to w1 .
→ This is caused by the fact that Feature 2 is on a much larger
scale (resulting in a “dramatic” variance) than Feature 1. This
shows up in the coefficients of the objective function.
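The expanded coefficients above can be checked numerically against the raw objective on the dataset of Table 1; this small verification script is an illustration, not part of the lecture.

```python
import numpy as np

# Toy dataset from Table 1: columns are (x1, x2), targets t.
X = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])
t = np.array([7.0, 1.0, 4.0])

def E(w1, w2):
    """Raw least-squares objective: sum_n (w^T x_n - t_n)^2."""
    w = np.array([w1, w2])
    return np.sum((X @ w - t) ** 2)

def E_expanded(w1, w2):
    """Expanded polynomial form stated in the slides."""
    return 0.81*w1**2 + 825*w2**2 + 29*w1*w2 - 6.2*w1 - 450*w2 + 66

print(np.isclose(E(1.0, 2.0), E_expanded(1.0, 2.0)))  # True
```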
Feature Scaling
As a result, the gradient will often bounce along the w2 direction,
while making tiny progress along the w1 direction.
→ However, if we standardize each feature to zero mean and unit
variance, the coefficients of w12 and w22 will become much more
similar. Recommended Exercise.
Feature Scaling
As a result, the bouncing behavior of gradient descent is typically
reduced and convergence becomes faster.
→ In this particular example, the interaction term of the form
w1 w2 causes some additional issues that can be addressed by a
procedure called whitening.
Feature Scaling: Standardization
Feature Standardization rescales features to have zero mean and
unit variance:
Let µj denote the mean and σj the standard deviation of
feature j across all training data points. Replace the jth
feature of each sample xn with:

(xn)j ← ((xn)j − µj) / σj
In feature scaling, or preprocessing in general, we should not
use the test set to learn the transformation. Further, we must
apply the same transformation learned on the train set to
the test set.
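A sketch of standardization with the train/test discipline described above; the small train and test arrays (reusing the Table 1 features) are illustrative.

```python
import numpy as np

X_train = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])
X_test = np.array([[0.5, 20.0]])

mu = X_train.mean(axis=0)      # mu_j, learned on the train set only
sigma = X_train.std(axis=0)    # sigma_j, learned on the train set only

X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma   # reuse the train statistics

print(X_train_std.mean(axis=0))  # ~[0, 0]: zero mean per feature
print(X_train_std.std(axis=0))   # [1, 1]: unit variance per feature
```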
Feature Scaling: Normalization
Feature Normalization (also known as the min-max normalization),
rescales the range of features to [0, 1]:
Let Mj and mj denote the maximum and minimum values of
feature j, respectively, across all training data points. Replace
the jth feature of each sample xn with:

(xn)j ← ((xn)j − mj) / (Mj − mj)
We can also normalize to a more general interval [a, b].
Note that outliers can cause problems with feature
normalization.
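A sketch of min-max normalization on the same toy features; as with standardization, Mj and mj would be learned on the train set only.

```python
import numpy as np

X_train = np.array([[0.1, 25.0], [0.8, 10.0], [0.4, 10.0]])

m = X_train.min(axis=0)            # m_j per feature
M = X_train.max(axis=0)            # M_j per feature
X_norm = (X_train - m) / (M - m)   # rescale each feature to [0, 1]

print(X_norm.min(axis=0))  # [0, 0]
print(X_norm.max(axis=0))  # [1, 1]
```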
Feature Scaling: Scaling to unit length
Scaling to unit length rescales the feature vector to have norm one.
If vj ∈ R^D denotes the jth feature vector over the train set,

vj ← vj / ∥vj∥

We can use any desired vector norm ∥·∥, e.g., the Euclidean norm.
Which Feature Scaling Technique to Use?
There are many feature scaling techniques.
→ Whether or not one is required and which one to choose,
depends on the model, problem domain, data types, etc. Always
check what a model requires and understand how the
transformation affects the learning.