Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 5: Optimizing Neural Networks (Basics)


Fall 2024/5
Lecturer: Nir Shlezinger

In Lecture 4 we applied a simple two-layer perceptron to learn to implement the XOR function.
In this example, we simply guessed the setting of the model parameters which minimize the loss,
and witnessed that it indeed achieves the minimal loss value of L(θ) = 0. In practice, one is
rarely capable of guessing the proper setting of the weights, and must resort to some optimization
mechanisms. In this lecture we will review the two key optimization tools upon which training of
neural networks relies – stochastic gradient descent and backpropagation. This lecture is mostly
based on [1, Ch. 4-6].

1 Preliminaries
Our goal throughout the following two lectures is to review methods for obtaining the weights
vector θ based on a given empirical loss measure. This lecture focuses on the basics of neural
network optimization, while the next lecture will discuss the 'tricks of the trade' which are essential
in training deep networks. Mathematically speaking, for a loss function l(·), a parametric model
fθ (·), and labeled data set D, we wish to find θ which minimizes:
LD (θ) ≜ (1/|D|) ∑_{(xt,st)∈D} l(fθ, xt, st). (1)

The optimization procedure accounts for the following properties of deep learning:

1. The training set D is comprised of a large number of samples, e.g., |D| is on the order of tens
of thousands of labeled pairs, and possibly more.

2. The model fθ (·) is comprised of a massive number of trainable parameters, where θ can possibly
include millions of parameters.

As an example, consider training a simple one-layer neural classifier using the established CIFAR-10
dataset. This data set includes a training set of |D| = 50000 images, each of size 32×32×3 = 3072
pixels, divided into 10 different labels. Consequently, a simple one-layer classifier of the form

fθ (x) = Softmax(W x + b),

would be comprised of 3072 × 10 + 1 × 10 = 30730 trainable parameters.
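
To make the count concrete, the following is a minimal PyTorch sketch of such a one-layer classifier; the use of nn.Sequential and nn.Flatten is just one possible way to instantiate fθ (x) = Softmax(W x + b), and the printed count can be checked against the figure above.

import torch.nn as nn

# One-layer classifier for CIFAR-10: a 3072-dimensional input mapped to 10 logits,
# followed by Softmax (which itself has no trainable parameters).
model = nn.Sequential(
    nn.Flatten(),          # 32x32x3 image -> 3072-dimensional vector
    nn.Linear(3072, 10),   # W is 10x3072, b is 10-dimensional
    nn.Softmax(dim=-1),
)

num_params = sum(p.numel() for p in model.parameters())
print(num_params)  # 3072*10 + 10 = 30730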

2 Gradient-Based Optimization
As discussed earlier, guessing a good configuration of θ is typically not feasible. One can also
carry out random search, i.e., guess several different settings and select the one which achieves the
minimal loss. However, due to the high dimensionality of the problem, the probability of randomly
guessing a vector θ which lies in proximity of the vector which minimizes the empirical loss LD (·)
is negligible. Nonetheless, while just guessing configurations blindly rarely works, one can refine
a given setting of θ using gradient-based optimization.

2.1 Gradient Descent


One of the most basic optimization algorithms for doing so is the gradient-descent method, which
dates back to Cauchy in the 19th century. The gradient descent method iteratively refines the
parameters by following the direction of steepest descent, i.e., the negative gradient of the loss function.
The update equation, which is repeated over the iteration index j = 0, 1, 2, . . ., is given by

θj+1 ← θj − µj ∇θ LD (θj ), (2)

where µj is referred to as the learning rate. In (2), the term ∇θ LD (θj ) denotes the gradient of the
loss function with respect to the parameters vector, evaluated at θj . This term is thus a vector of
the same dimensions as θ, whose ith entry is given by [∇θ LD (θj )]i = ∂LD (θj )/∂θi . This formulation
stems from the first-order multivariate Taylor series expansion of LD (·) around θj , which implies
that
LD (θj + ∆θ) ≈ LD (θj ) + ∆θᵀ ∇θ LD (θj ), (3)
which holds when θj + ∆θ is in the proximity of θj . Since the right hand side of (3) is minimized
when ∆θ = −µ∇θ LD (θj ) by the Cauchy-Schwarz inequality, the gradient step rule (2) is obtained,
though one must take caution to guarantee that we are still in the proximity of θj where the
approximation (3) holds.
The learning rate is most commonly set to a positive scalar, though in the broader family of
first-order gradient methods, it can be a positive-definite matrix (since such a setting also results
in the right hand side of (3) being smaller than LD (θj )). If the loss surface is convex, then there
exists a range of learning rate values for which iterating over (2) approaches the optimal θ for any
starting point θ0 . The selection of the learning rate affects both the convergence rate, as well as the
ability to converge – when the learning rate is small convergence is slow, while a large learning rate
can result in overshoot, as illustrated in Fig. 1.
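
As a minimal illustration of the update rule (2) and of the behavior sketched in Fig. 1, the following Python snippet runs gradient descent on a one-dimensional quadratic loss; the loss, learning rates, and iteration count are arbitrary choices made only for demonstration.

def loss(theta):   # toy quadratic loss with minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):   # its gradient
    return 2.0 * (theta - 3.0)

for mu in (0.01, 0.45, 1.1):       # small, moderate, and too-large learning rates
    theta = 0.0                    # starting point theta_0
    for _ in range(20):            # iterate the update rule (2)
        theta = theta - mu * grad(theta)
    print(f"mu={mu}: theta={theta:.3f}, loss={loss(theta):.3e}")
# mu=0.01 converges slowly, mu=0.45 converges quickly, mu=1.1 overshoots and diverges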
Training models using gradient-based optimization relies on the ability to compute the gradi-
ents. Such computations can be carried out either numerically or analytically.

Figure 1: Effect of different learning rates: (a) convergence vs. (b) overshoot due to a large step size.

Numerical Gradient Computation The basic element of the gradient term is the partial derivative
∂LD (θ)/∂θi . This term can be numerically approximated using the definition of the derivative, i.e.,
∂LD (θ)/∂θi = lim_{ϵ→0} [LD (θ + [0, . . . , 0, ϵ, 0, . . . , 0]) − LD (θ)] / ϵ
            ≈ [LD (θ + [0, . . . , 0, ϵ, 0, . . . , 0]) − LD (θ)] / ϵ, (4)
where in (4) ϵ is fixed to a small positive constant, and [0, . . . , 0, ϵ, 0, . . . 0] is a vector whose entries
are zero, except for its ith entry which is set to ϵ.
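
The following is a minimal sketch of the one-sided finite-difference approximation (4); the helper name numerical_gradient and the toy loss ∥θ∥² are illustrative choices rather than part of the lecture.

import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    # Finite-difference approximation (4) of the gradient of loss_fn at theta.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        perturbed = theta.copy()
        perturbed[i] += eps        # theta + [0, ..., 0, eps, 0, ..., 0]
        grad[i] = (loss_fn(perturbed) - loss_fn(theta)) / eps
    return grad

# Example: loss(theta) = ||theta||^2, whose exact gradient is 2*theta
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))   # approximately [2., -4., 1.]

Note that each entry of the approximated gradient requires an additional evaluation of the full empirical loss, which is precisely the computational burden discussed next.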

Analytical Gradient Computation The numerical gradient is very simple to compute using the
finite difference approximation. Its downsides are that it is approximate (since we have to pick a
small value of ϵ, while the true gradient is defined as the limit as ϵ goes to zero), and that it can be
very computationally expensive, since the loss must be re-evaluated for each individual parameter
on each iteration. Alternatively, a major portion of the loss measures applied to neural networks
facilitate analytical computation of the gradient; this is in fact what is carried out under the hood by
the Autograd engine of the PyTorch package, invoked upon calling the backward() function (which
will serve us greatly throughout the course). An analytical gradient is derived in the following example:
Example 2.1. Consider a single neuron with a ReLU activation, namely, fθ (x) = ReLU(wᵀx + b),
trained to minimize the squared-error loss, i.e., lMSE (fθ , x, s) = (s − fθ (x))². Here, the
empirical risk is given by
LD (θ) = (1/|D|) ∑_{(xt,st)∈D} (st − fθ (xt))²
       = (1/|D|) ∑_{(xt,st)∈D} (st − max{0, wᵀxt + b})². (5)

In this case, taking the derivative with respect to a given entry of w, denoted wi , results in
∂LD (θ)/∂wi = (1/|D|) ∑_{(xt,st)∈D} ∂/∂wi (st − max{0, wᵀxt + b})²
            = (1/|D|) ∑_{(xt,st)∈D} 2(wᵀxt + b − st) · [xt]i · 1_{wᵀxt+b>0}. (6)

The derived expression can now be used to implement gradient-based optimization in a straightforward manner.
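
For instance, the following is a minimal NumPy sketch of the expression (6) (together with the corresponding derivative with respect to b), followed by a single gradient step (2); the randomly generated data and step size are purely illustrative.

import numpy as np

def relu_neuron_grads(w, b, X, s):
    # Analytical gradients (6) of the empirical risk (5) for a single ReLU neuron,
    # where the rows of X are the samples x_t and s holds the labels s_t.
    z = X @ w + b                          # w^T x_t + b for every sample
    active = (z > 0).astype(float)         # indicator 1_{w^T x_t + b > 0}
    err = 2.0 * (np.maximum(z, 0.0) - s) * active
    grad_w = (err[:, None] * X).mean(axis=0)
    grad_b = err.mean()
    return grad_w, grad_b

# One gradient-descent step (2) on randomly generated toy data
rng = np.random.default_rng(0)
X, s = rng.normal(size=(100, 5)), rng.normal(size=100)
w, b, mu = rng.normal(size=5), 0.0, 0.1
grad_w, grad_b = relu_neuron_grads(w, b, X, s)
w, b = w - mu * grad_w, b - mu * grad_b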

Challenges The gradient descent update rule applied to the empirical risk can be written as:
θj+1 ← θj − µj (1/|D|) ∑_{(xt,st)∈D} ∇θ l(fθ , xt , st ). (7)

This optimization procedure gives rise to three main challenges in the context of training complex
highly-parameterized models. The first two are computationally-oriented, while the third follows
from the non-convex nature of the loss measures characterizing such models:

1. Gradient Computation - As noted above, numerically computing the gradients, i.e., evaluating
∇θ l(fθ , xt , st ) in (7), can be quite computationally intensive, particularly when the
number of parameters is large. This follows since the full empirical risk has to be computed
anew for each parameter. While analytical expressions are desirable, they can typically be derived only for
relatively simple models (such as the single neuron case considered in Example 2.1).

2. Empirical Risk Computation - The empirical risk in (1) involves evaluating the individual
loss measure for each training sample, due to the summation over D in (7). Since data sets
are typically large in deep learning, each such evaluation, which is repeated many times in
the learning process, can involve a significant computational burden.

3. Sensitivity to Local Minima - The gradient descent step (2) stops when it reaches a point in
which the gradient is zero. For convex objectives, there is only a single such point, which is
the global optimum. However, objectives characterizing complex abstract models are rarely
convex, and typically include local minima, saddle points, and plateaus. This implies that
gradient descent is highly likely to get stuck in some point which is not the global minimum.

Deep learning employs two key mechanisms to overcome these challenges: the difficulty in
computing the gradients is mitigated by exploiting the sequential operation of neural networks via
backpropagation, detailed in Section 3; the challenges associated with computing the empirical risk
and the sensitivity to local minima are handled by replacing the gradient descent algorithm with
its stochastic version, detailed in the following.

2.2 Stochastic Gradient Descent


Stochastic gradient descent is the leading workhorse for optimization in deep learning. The algorithm
implements gradient descent as in (2), only with the gradient computed over a random subset
of D, rather than the complete data set. In particular, on each iteration index j, a mini-batch
comprised of B samples, denoted Dj , is randomly drawn from D, and is used to compute the gradient
in the update equation. The resulting algorithm is summarized as Algorithm 1 below.

Algorithm 1: (Mini-batch) stochastic gradient descent


Initialization: Fix number of iterations n.
1 Initialize θ0 randomly;
2 for j = 0, 1, . . . , n − 1 do
3 Sample B different samples uniformly from D as Dj ;
4 Update parameters via
θj+1 ← θj − µj ∇θ LDj (θj ). (8)
5 end
Output: Trained parameters θn .

The term stochastic gradient descent is often used to refer to Algorithm 1 with mini-batch size
B = 1, while for B > 1 it is referred to as mini-batch stochastic gradient descent. Each pass of the
training procedure over the entire data set, i.e., every ⌈|D|/B⌉ iterations, is referred to as an
epoch.
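
A minimal PyTorch sketch of Algorithm 1 is given below; the linear model, cross-entropy loss, random data, and the particular values of B, n, and the learning rate are hypothetical placeholders standing in for a real training setup.

import torch
from torch import nn

# Toy data standing in for the labeled data set D
X = torch.randn(1000, 20)
s = torch.randint(0, 10, (1000,))

model = nn.Linear(20, 10)                                  # parametric model f_theta
loss_fn = nn.CrossEntropyLoss()                            # loss measure l
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
B, n = 32, 500                                             # mini-batch size and number of iterations

for j in range(n):
    idx = torch.randperm(len(X))[:B]                       # draw a mini-batch D_j uniformly from D
    loss = loss_fn(model(X[idx]), s[idx])                  # L_{D_j}(theta_j)
    optimizer.zero_grad()
    loss.backward()                                        # gradients via backpropagation (Section 3)
    optimizer.step()                                       # update (8): theta <- theta - mu * gradient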

Why Does it Work? The main motivation for using stochastic gradient descent is computational
- it is drastically cheaper to compute compared to full gradient descent as it does not involve going
over the entire data set on each iteration. Nonetheless, it is also capable of learning accurate
models. To see this, we first note that when each mini-batch Dj is uniformly sampled such that it
obeys a uniform distribution over all subsets of D of cardinality B, then it holds that

EDj {∇θ LDj (θj )} = EDj { (1/B) ∑_{(xt,st)∈Dj} ∇θ l(fθ , xt , st ) }
                   = (1/|D|) ∑_{(xt,st)∈D} ∇θ l(fθ , xt , st ) = ∇θ LD (θj ). (9)

Figure 2: Illustration of (a) gradient steps resulting in getting trapped in a local minimum, and (b)
noisy gradients allowing the optimization to escape local minima.

Note that the expectation is taken with respect to the mini-batch selection, while the complete data
set D is assumed to be given. It follows from (9) that the stochastic gradients are in fact an unbiased
estimate of the full gradients, and thus

EDj {θj+1 } = θj − µj EDj {∇θ LDj (θj )} = θj − µj ∇θ LD (θj ). (10)

Consequently, the learned parameters are unbiased estimates of the parameters learned using full
gradient descent. The variance of this estimate, representing the level of noise in the computed
gradients, is reduced by increasing B.
Since the weights learned by stochastic gradient descent can be viewed, in expectation, as those learned by full
gradient descent, their trajectory along the loss surface can be viewed as that of full gradient
descent with some additive noise. As a result, the learned weights follow the general gradient path
of full gradient descent, while the presence of noise allows them to avoid shallow local minima and
escape plateaus, as illustrated in Fig. 2.

Challenges The random nature of stochastic gradient descent implies that its formulation in
Algorithm 1 is not stable. For instance, even if the global optimum is reached, the
fact that noisy gradients are used implies that the weights may diverge from it. Furthermore, when it does
converge, the fact that it does not follow the steepest path implies that its convergence is expected
to be slower compared to full gradient descent. Fortunately, various methods have been proposed
in the literature to improve and stabilize the convergence profile of stochastic gradient descent
based optimization. These will be discussed in the next lecture.

3 Backpropagation
As discussed earlier, one of the main challenges in optimizing complex highly-parameterized models
using gradient-based methods stems from the difficulty in computing the empirical risk gradient
with respect to each parameter. In principle, these parameters may be highly coupled, making the
computation of the gradient a difficult task. Nonetheless, neural networks are not arbitrary complex
models, but have a sequential structure comprised of a concatenation of layers, where each
trainable parameter typically belongs to a single layer. This sequential structure facilitates the
computation of the gradients using the backpropagation process.
The backpropagation method, proposed by Rumelhart, Hinton, and Williams in [2], is based
on the calculus chain rule. Suppose that one is given two multivariate functions such that y = g(x)
and L = f (y) = f (g(x)), where f : Rⁿ → R and g : Rᵐ → Rⁿ. By the chain rule, it holds that

∂L/∂xi = ∑_{j=1}^{n} (∂L/∂yj ) (∂yj /∂xi ), (11)

which can be written in vector form as


∇x L = (∂y/∂x)ᵀ ∇y L, (12)

where ∂y/∂x is the n × m Jacobian matrix of the operator g(·).
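
As a quick numerical sanity check of (12), the sketch below picks an arbitrary linear g(x) = Ax and the quadratic f (y) = ∥y∥², computes ∇x L via the Jacobian formula, and compares it with the gradient derived directly; these particular choices of f and g are illustrative only.

import numpy as np

m, n = 4, 3
rng = np.random.default_rng(1)
A = rng.normal(size=(n, m))
x = rng.normal(size=m)

y = A @ x                       # y = g(x)
grad_y = 2.0 * y                # nabla_y L for L = f(y) = ||y||^2
jacobian = A                    # dy/dx, the n x m Jacobian of g
grad_x = jacobian.T @ grad_y    # vector chain rule (12)

# Direct check: L = ||A x||^2 has gradient 2 A^T A x
print(np.allclose(grad_x, 2.0 * A.T @ A @ x))   # True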
The formulation of the gradient computation via the chain rule in (12) is exploited to compute
the gradients of the empirical risk of a multi-layered neural network with respect to its weights in
a recursive manner. To see this, consider a neural network with K layers, given by

fθ (x) = hK ◦ · · · ◦ h1 (x), (13)

where each layer hk (·) is comprised of a (non-parameterized) activation function σk (·) applied to
an affine transformation with parameters θk = (Wk , bk ). In this case, define ak as the output of
the kth layer, i.e., ak = hk (ak−1 ) with a0 = x, and zk as the affine transformation such that

zk = Wk ak−1 + bk , ak = σk (zk ).

Now, the empirical risk L is a function of fθ (x) = hK ◦ · · · ◦ hk+1 (σk (zk )). Consequently, we use
the matrix version of the chain rule to obtain
∇ak−1 L = Wkᵀ (∇zk L), (14a)
∇Wk L = (∇zk L) ak−1ᵀ, (14b)
∇bk L = ∇zk L. (14c)

Furthermore, since ak = σk (zk ) it holds that

∇zk L = ∇ak L ⊙ σk′ (zk ), (14d)

where ⊙ denotes the element-wise product, and σk′ (·) is the element-wise derivative of the activation
function σk (·). Equation (14) implies that the gradients of the empirical risk L with respect
to (Wk , bk ) can be obtained by first evaluating the outputs of each layer, i.e., the vectors {zk , ak },
also known as the forward path. Then, the gradients of the loss with respect to each layer’s out-
put can be computed recursively from the corresponding gradients of its subsequent layers via
(14a) and (14d), while the desired weights gradients are obtained via (14b)-(14c). The resulting
algorithm is summarized below as Algorithm 2.

Algorithm 2: Backpropagation for neural networks


Initialization: Compute forward path, evaluating {zk , ak }.
1 g ← ∇ aK L ; // Risk gradient w.r.t. network output
2 for k = K, K − 1, . . . , 1 do
3 g ← ∇zk L = g ⊙ σk′ (zk ) ;
4 ∇Wk L = g ak−1ᵀ ; // weights gradient
5 ∇bk L = g ; // bias gradient
6 g ← ∇ak−1 L = Wkᵀ g ; // propagate gradient to previous layer
7 end
Output: Gradients {∇Wk L, ∇bk L}.
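
A direct NumPy transcription of Algorithm 2 for a small fully-connected network is sketched below; the use of a ReLU activation in every layer, the squared-error risk used to initialize g, and the toy dimensions are assumptions made only for illustration.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0.0).astype(float)

def backprop(x, s, Ws, bs):
    # Algorithm 2 for a ReLU network with squared-error risk L = ||a_K - s||^2.
    # Forward path: store z_k and a_k for every layer (acts[0] is a_0 = x).
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = relu(z)
        zs.append(z)
        acts.append(a)
    # Backward path
    g = 2.0 * (a - s)                              # risk gradient w.r.t. the network output a_K
    grads_W, grads_b = [], []
    for k in reversed(range(len(Ws))):
        g = g * relu_prime(zs[k])                  # (14d): gradient w.r.t. z_k
        grads_W.insert(0, np.outer(g, acts[k]))    # (14b): gradient w.r.t. W_k
        grads_b.insert(0, g.copy())                # (14c): gradient w.r.t. b_k
        g = Ws[k].T @ g                            # (14a): propagate to the previous layer
    return grads_W, grads_b

# Tiny two-layer example with random parameters and a single sample
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
grads_W, grads_b = backprop(rng.normal(size=3), rng.normal(size=2), Ws, bs)

The gradients returned by such a routine can be plugged directly into the stochastic gradient descent update (8).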

References
[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.

[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science,
1985.
