Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 5: Optimizing Neural Networks (Basics)


Fall 2024/5
Lecturer: Nir Shlezinger

In Lecture 4 we applied a simple two-layer perceptron to learn to implement the XOR function.
In this example, we simply guessed the setting of the model parameters which minimize the loss,
and witnessed that it indeed achieves the minimal loss value of L(θ) = 0. In practice, one is
rarely capable of guessing the proper setting of the weights, and must resort to some optimization
mechanisms. In this lecture we will review the two key optimization tools upon which training of
neural networks relies – stochastic gradient descent and backpropagation. This lecture is mostly
based on [1, Ch. 4-6].

1 Preliminaries
Our goal throughout the following two lectures is to review methods for obtaining the weights
vector θ based on a given empirical loss measure. This lecture focuses on the basics of neural
network optimization, while the next lecture will discuss the 'tricks of the trade' which are essential
in training deep networks. Mathematically speaking, for a loss function l(·), a parametric model
fθ (·), and labeled data set D, we wish to find θ which minimizes:
LD (θ) ≜ (1/|D|) ∑_{(xt,st)∈D} l(fθ, xt, st). (1)

The optimization procedure accounts for the following properties of deep learning:

1. The training set D is comprised of a large number of samples, e.g., |D| is on the order of tens
of thousands of labeled pairs, and possibly more.

2. The model fθ (·) is comprised of a massive number of trainable parameters, where θ can possibly
include millions of parameters.

As an example, consider training a simple one-layer neural classifier using the established CIFAR-10
dataset. This data set includes a training set of |D| = 50000 images, each of size 32×32×3 = 3072
pixels, divided into 10 different labels. Consequently, a simple one-layer classifier of the form

fθ (x) = Softmax(W x + b),

would be comprised of 3072 × 10 + 1 × 10 = 30730 trainable parameters.
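
To make the count concrete, the following is a minimal PyTorch sketch of such a one-layer classifier; the use of nn.Sequential and nn.Flatten is just one possible way to instantiate fθ (x) = Softmax(W x + b), and the printed count can be checked against the figure above.

import torch.nn as nn

# One-layer classifier for CIFAR-10: a 3072-dimensional input mapped to 10 logits,
# followed by Softmax (which itself has no trainable parameters).
model = nn.Sequential(
    nn.Flatten(),          # 32x32x3 image -> 3072-dimensional vector
    nn.Linear(3072, 10),   # W is 10x3072, b is 10-dimensional
    nn.Softmax(dim=-1),
)

num_params = sum(p.numel() for p in model.parameters())
print(num_params)  # 3072*10 + 10 = 30730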

2 Gradient-Based Optimization
As discussed earlier, guessing a good configuration of θ is typically not feasible. One can also
carry out random search, i.e., guess several different settings and select the one which achieves the
minimal loss. However, due to the high dimensionality of the problem, the probability of randomly
guessing a vector θ which lies in proximity of the vector which minimizes the empirical loss LD (·)
is negligible. Nonetheless, while just guessing configurations blindly rarely works, one can refine
a given setting of θ using gradient-based optimization.

2.1 Gradient Descent


One of the most basic optimization algorithms for doing so is the gradient-descent method, which
dates back to Cauchy in the 19th century. The gradient descent method iteratively refines the
parameters by following the direction of steepest descent, i.e., the negative gradient of the loss function.
The update equation, which is repeated over the iteration index j = 0, 1, 2, . . ., is given by

θj+1 ← θj − µj ∇θ LD (θj ), (2)

where µj is referred to as the learning rate. In (2), the term ∇θ LD (θj ) denotes the gradient of the
loss function with respect to the parameters vector, evaluated at θj . This term is thus a vector of
the same dimensions as θ, whose ith entry is given by [∇θ LD (θj )]i = ∂LD (θj )/∂θi . This formulation
stems from the first-order multivariate Taylor series expansion of LD (·) around θj , which implies
that
LD (θj + ∆θ) ≈ LD (θj ) + ∆θᵀ ∇θ LD (θj ), (3)
which holds when θj + ∆θ is in the proximity of θj . Since the right hand side of (3) is minimized
when ∆θ = −µ∇θ LD (θj ) by the Cauchy-Schwarz inequality, the gradient step rule (2) is obtained,
though one must take caution to guarantee that we are still in the proximity of θj where the
approximation (3) holds.
The learning rate is most commonly set to a positive scalar, though in the broader family of
first-order gradient methods, it can be a positive-definite matrix (since such a setting also results
in the right hand side of (3) being smaller than LD (θj )). If the loss surface is convex, then there
exists a range of learning rate values for which iterating over (2) approaches the optimal θ for any
starting point θ0 . The selection of the learning rate affects both the convergence rate, as well as the
ability to converge – when the learning rate is small convergence is slow, while a large learning rate
can result in overshoot, as illustrated in Fig. 1.
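
As a minimal illustration of the update rule (2) and of the behavior sketched in Fig. 1, the following Python snippet runs gradient descent on a one-dimensional quadratic loss; the loss, learning rates, and iteration count are arbitrary choices made only for demonstration.

def loss(theta):   # toy quadratic loss with minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):   # its gradient
    return 2.0 * (theta - 3.0)

for mu in (0.01, 0.45, 1.1):       # small, moderate, and too-large learning rates
    theta = 0.0                    # starting point theta_0
    for _ in range(20):            # iterate the update rule (2)
        theta = theta - mu * grad(theta)
    print(f"mu={mu}: theta={theta:.3f}, loss={loss(theta):.3e}")
# mu=0.01 converges slowly, mu=0.45 converges quickly, mu=1.1 overshoots and diverges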
Training models using gradient-based optimization relies on the ability to compute the gradi-
ents. Such computations can be carried out either numerically or analytically.

Figure 1: Effect of different learning rates: (a) convergence vs. (b) overshoot due to a large step size.

Numerical Gradient Computation The basic element of the gradient term is the partial derivative
∂LD (θ)/∂θi . This term can be numerically approximated using the definition of the derivative, i.e.,
∂LD (θ)/∂θi = lim_{ϵ→0} [LD (θ + [0, . . . , 0, ϵ, 0, . . . , 0]) − LD (θ)] / ϵ
            ≈ [LD (θ + [0, . . . , 0, ϵ, 0, . . . , 0]) − LD (θ)] / ϵ, (4)
where in (4) ϵ is fixed to a small positive constant, and [0, . . . , 0, ϵ, 0, . . . 0] is a vector whose entries
are zero, except for its ith entry which is set to ϵ.
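
The following is a minimal sketch of the one-sided finite-difference approximation (4); the helper name numerical_gradient and the toy loss ∥θ∥² are illustrative choices rather than part of the lecture.

import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    # Finite-difference approximation (4) of the gradient of loss_fn at theta.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        perturbed = theta.copy()
        perturbed[i] += eps        # theta + [0, ..., 0, eps, 0, ..., 0]
        grad[i] = (loss_fn(perturbed) - loss_fn(theta)) / eps
    return grad

# Example: loss(theta) = ||theta||^2, whose exact gradient is 2*theta
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))   # approximately [2., -4., 1.]

Note that each entry of the approximated gradient requires an additional evaluation of the full empirical loss, which is precisely the computational burden discussed next.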

Analytical Gradient Computation The numerical gradient is very simple to compute using the
finite difference approximation. Its downsides are that it is approximate (since we have to pick a
small value of ϵ, while the true gradient is defined as the limit as ϵ goes to zero), and that it can be
very computationally expensive, since the loss must be re-evaluated for each individual parameter
on each iteration. Alternatively, a major portion of the loss measures applied to neural networks
facilitate analytical computation of the gradient; this is in fact what is carried out under the hood by
the Autograd engine of the PyTorch package, invoked upon calling the backward() function (which
will serve us greatly throughout the course). An analytical gradient is derived in the following example:
Example 2.1. Consider a single neuron with a ReLU activation, namely, fθ (x) = ReLU(wᵀx + b),
trained to minimize the squared-error loss, i.e., lMSE (fθ , x, s) = (s − fθ (x))². Here, the
empirical risk is given by
LD (θ) = (1/|D|) ∑_{(xt,st)∈D} (st − fθ (xt))²
       = (1/|D|) ∑_{(xt,st)∈D} (st − max{0, wᵀxt + b})². (5)

In this case, taking the derivative with respect to a given entry of w, denoted wi , results in
∂LD (θ)/∂wi = (1/|D|) ∑_{(xt,st)∈D} ∂/∂wi (st − max{0, wᵀxt + b})²
            = (1/|D|) ∑_{(xt,st)∈D} 2(wᵀxt + b − st) · [xt]i · 1_{wᵀxt+b>0}. (6)

The derived expression can now be used to implement gradient-based optimization in a straightforward manner.
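
For instance, the following is a minimal NumPy sketch of the expression (6) (together with the corresponding derivative with respect to b), followed by a single gradient step (2); the randomly generated data and step size are purely illustrative.

import numpy as np

def relu_neuron_grads(w, b, X, s):
    # Analytical gradients (6) of the empirical risk (5) for a single ReLU neuron,
    # where the rows of X are the samples x_t and s holds the labels s_t.
    z = X @ w + b                          # w^T x_t + b for every sample
    active = (z > 0).astype(float)         # indicator 1_{w^T x_t + b > 0}
    err = 2.0 * (np.maximum(z, 0.0) - s) * active
    grad_w = (err[:, None] * X).mean(axis=0)
    grad_b = err.mean()
    return grad_w, grad_b

# One gradient-descent step (2) on randomly generated toy data
rng = np.random.default_rng(0)
X, s = rng.normal(size=(100, 5)), rng.normal(size=100)
w, b, mu = rng.normal(size=5), 0.0, 0.1
grad_w, grad_b = relu_neuron_grads(w, b, X, s)
w, b = w - mu * grad_w, b - mu * grad_b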

Challenges The gradient descent update rule applied to the empirical risk can be written as:
θj+1 ← θj − µj (1/|D|) ∑_{(xt,st)∈D} ∇θ l(fθ , xt , st ). (7)

This optimization procedure gives rise to three main challenges in the context of training complex
highly-parameterized models. The first two are computationally-oriented, while the third follows
from the non-convex nature of the loss measures characterizing such models:

1. Gradient Computation - As noted above, numerically computing the gradients, i.e., evaluating
∇θ l(fθ , xt , st ) in (7), can be quite computationally intensive, particularly when the
number of parameters is large. This follows since the full empirical risk has to be computed
anew for each parameter. While analytical expressions are desirable, they can typically be derived only for
relatively simple models (such as the single neuron case considered in Example 2.1).

2. Empirical Risk Computation - The empirical risk in (1) involves evaluating the individual
loss measure for each training sample, due to the summation over D in (7). Since data sets
are typically large in deep learning, each such evaluation, which is repeated many times in
the learning process, can involve a significant computational burden.

3. Sensitivity to Local Minima - The gradient descent step (2) stops when it reaches a point in
which the gradient is zero. For convex objectives, there is only a single such point, which is
the global optimum. However, objectives characterizing complex abstract models are rarely
convex, and typically include local minima, saddle points, and plateaus. This implies that
gradient descent is highly likely to get stuck in some point which is not the global minimum.

Deep learning employs two key mechanisms to overcome these challenges: the difficulty in
computing the gradients is mitigated by exploiting the sequential operation of neural networks via
backpropagation, detailed in Section 3; the challenges associated with computing the empirical risk
and the sensitivity to local minima are handled by replacing the gradient descent algorithm with
its stochastic version, detailed in the following.

2.2 Stochastic Gradient Descent


Stochastic gradient descent is the leading workhorse for optimization in deep learning. The algorithm
implements gradient descent as in (2), only with the gradient computed over a random subset
of D, rather than the complete data set. In particular, on each iteration index j, a mini-batch
comprised of B samples, denoted Dj , is randomly drawn from D, and is used to compute the gradient
in the update equation. The resulting algorithm is summarized as Algorithm 1 below.

Algorithm 1: (Mini-batch) stochastic gradient descent


Initialization: Fix number of iterations n.
1 Initialize θ0 randomly;
2 for j = 0, 1, . . . , n − 1 do
3 Sample B different samples uniformly from D as Dj ;
4 Update parameters via
θj+1 ← θj − µj ∇θ LDj (θj ). (8)
5 end
Output: Trained parameters θn .

The term stochastic gradient descent is often used to refer to Algorithm 1 with mini-batch size
B = 1, while for B > 1 it is referred to as mini-batch stochastic gradient descent. Each pass of the
training procedure over the entire data set, i.e., every ⌈|D|/B⌉ iterations, is referred to as an
epoch.
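
A minimal PyTorch sketch of Algorithm 1 is given below; the linear model, cross-entropy loss, random data, and the particular values of B, n, and the learning rate are hypothetical placeholders standing in for a real training setup.

import torch
from torch import nn

# Toy data standing in for the labeled data set D
X = torch.randn(1000, 20)
s = torch.randint(0, 10, (1000,))

model = nn.Linear(20, 10)                                  # parametric model f_theta
loss_fn = nn.CrossEntropyLoss()                            # loss measure l
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
B, n = 32, 500                                             # mini-batch size and number of iterations

for j in range(n):
    idx = torch.randperm(len(X))[:B]                       # draw a mini-batch D_j uniformly from D
    loss = loss_fn(model(X[idx]), s[idx])                  # L_{D_j}(theta_j)
    optimizer.zero_grad()
    loss.backward()                                        # gradients via backpropagation (Section 3)
    optimizer.step()                                       # update (8): theta <- theta - mu * gradient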

Why Does it Work? The main motivation for using stochastic gradient descent is computational
- it is drastically cheaper to compute compared to full gradient descent as it does not involve going
over the entire data set on each iteration. Nonetheless, it is also capable of learning accurate
models. To see this, we first note that when each mini-batch Dj is uniformly sampled such that it
obeys a uniform distribution over all subsets of D of cardinality B, then it holds that

EDj {∇θ LDj (θj )} = EDj { (1/B) ∑_{(xt,st)∈Dj} ∇θ l(fθ , xt , st ) }
                   = (1/|D|) ∑_{(xt,st)∈D} ∇θ l(fθ , xt , st ) = ∇θ LD (θj ). (9)

Figure 2: Illustration of (a) gradient steps resulting in getting trapped in a local minimum, and (b)
noisy gradients allowing the optimization to escape local minima.

Note that the expectation is taken with respect to the mini-batch selection, while the complete data
set D is assumed to be given. It follows from (9) that the stochastic gradients are in fact an unbiased
estimate of the full gradients, and thus

EDj {θj+1 } = θj − µj EDj {∇θ LDj (θj )} = θj − µj ∇θ LD (θj ). (10)

Consequently, the learned parameters are unbiased estimates of the parameters learned using full
gradient descent. The variance of this estimate, representing the level of noise in the computed
gradients, is reduced by increasing B.
Since the weights learned by stochastic gradient descent can be viewed, in expectation, as those learned by full
gradient descent, their trajectory along the loss surface can be viewed as that of full gradient
descent with some additive noise. As a result, the learned weights follow the general gradient path
of full gradient descent, while the presence of noise allows them to avoid shallow local minima and
escape plateaus, as illustrated in Fig. 2.

Challenges The random nature of stochastic gradient descent implies that its formulation in
Algorithm 1 is not stable. For instance, even if the global optimum is reached, the
fact that noisy gradients are used implies that the weights may diverge from it. Furthermore, when it does
converge, the fact that it does not follow the steepest path implies that its convergence is expected
to be slower compared to full gradient descent. Fortunately, various methods have been proposed
in the literature to improve and stabilize the convergence profile of stochastic gradient descent
based optimization. These will be discussed in the next lecture.

3 Backpropagation
As discussed earlier, one of the main challenges in optimizing complex highly-parameterized models
using gradient-based methods stems from the difficulty in computing the empirical risk gradient
with respect to each parameter. In principle, these parameters may be highly coupled, making the
computation of the gradient a difficult task. Nonetheless, neural networks are not arbitrary complex
models, but have a sequential structure comprised of a concatenation of layers, where each
trainable parameter typically belongs to a single layer. This sequential structure facilitates the
computation of the gradients using the backpropagation process.
The backpropagation method, proposed by Rumelhart, Hinton, and Williams in [2], is based
on the calculus chain rule. Suppose that one is given two multivariate functions such that y = g(x)
and L = f (y) = f (g(x)), where f : Rⁿ → R and g : Rᵐ → Rⁿ. By the chain rule, it holds that

∂L/∂xi = ∑_{j=1}^{n} (∂L/∂yj ) (∂yj /∂xi ), (11)

which can be written in vector form as


∇x L = (∂y/∂x)ᵀ ∇y L, (12)

where ∂y/∂x is the n × m Jacobian matrix of the operator g(·).
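
As a quick numerical sanity check of (12), the sketch below picks an arbitrary linear g(x) = Ax and the quadratic f (y) = ∥y∥², computes ∇x L via the Jacobian formula, and compares it with the gradient derived directly; these particular choices of f and g are illustrative only.

import numpy as np

m, n = 4, 3
rng = np.random.default_rng(1)
A = rng.normal(size=(n, m))
x = rng.normal(size=m)

y = A @ x                       # y = g(x)
grad_y = 2.0 * y                # nabla_y L for L = f(y) = ||y||^2
jacobian = A                    # dy/dx, the n x m Jacobian of g
grad_x = jacobian.T @ grad_y    # vector chain rule (12)

# Direct check: L = ||A x||^2 has gradient 2 A^T A x
print(np.allclose(grad_x, 2.0 * A.T @ A @ x))   # True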
The formulation of the gradient computation via the chain rule in (12) is exploited to compute
the gradients of the empirical risk of a multi-layered neural network with respect to its weights in
a recursive manner. To see this, consider a neural network with K layers, given by

fθ (x) = hK ◦ · · · ◦ h1 (x), (13)

where each layer hk (·) is comprised of a (non-parameterized) activation function σk (·) applied to
an affine transformation with parameters θk = (Wk , bk ). In this case, define ak as the output of
the kth layer, i.e., ak = hk (ak−1 ) with a0 = x, and zk as the affine transformation such that

zk = Wk ak−1 + bk , ak = σk (zk ).

Now, the empirical risk L is a function of fθ (x) = hK ◦ · · · ◦ hk+1 (σk (zk )). Consequently, we use
the matrix version of the chain rule to obtain
∇ak−1 L = Wkᵀ (∇zk L), (14a)
∇Wk L = (∇zk L) ak−1ᵀ, (14b)
∇bk L = ∇zk L. (14c)

Furthermore, since ak = σk (zk ) it holds that

∇zk L = ∇ak L ⊙ σk′ (zk ), (14d)

where ⊙ denotes the element-wise product, and σk′ (·) is the element-wise derivative of the activation
function σk (·). Equation (14) implies that the gradients of the empirical risk L with respect
to (Wk , bk ) can be obtained by first evaluating the outputs of each layer, i.e., the vectors {zk , ak },
also known as the forward path. Then, the gradients of the loss with respect to each layer’s out-
put can be computed recursively from the corresponding gradients of its subsequent layers via
(14a) and (14d), while the desired weights gradients are obtained via (14b)-(14c). The resulting
algorithm is summarized below as Algorithm 2.

Algorithm 2: Backpropagation for neural networks


Initialization: Compute forward path, evaluating {zk , ak }.
1 g ← ∇ aK L ; // Risk gradient w.r.t. network output
2 for k = K, K − 1, . . . , 1 do
3 g ← ∇zk L = g ⊙ σk′ (zk ) ;
4 ∇Wk L = g ak−1ᵀ ; // weights gradient
5 ∇bk L = g ; // bias gradient
6 g ← ∇ak−1 L = Wkᵀ g ; // propagate gradient to previous layer
7 end
Output: Gradients {∇Wk L, ∇bk L}.
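
A direct NumPy transcription of Algorithm 2 for a small fully-connected network is sketched below; the use of a ReLU activation in every layer, the squared-error risk used to initialize g, and the toy dimensions are assumptions made only for illustration.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0.0).astype(float)

def backprop(x, s, Ws, bs):
    # Algorithm 2 for a ReLU network with squared-error risk L = ||a_K - s||^2.
    # Forward path: store z_k and a_k for every layer (acts[0] is a_0 = x).
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = relu(z)
        zs.append(z)
        acts.append(a)
    # Backward path
    g = 2.0 * (a - s)                              # risk gradient w.r.t. the network output a_K
    grads_W, grads_b = [], []
    for k in reversed(range(len(Ws))):
        g = g * relu_prime(zs[k])                  # (14d): gradient w.r.t. z_k
        grads_W.insert(0, np.outer(g, acts[k]))    # (14b): gradient w.r.t. W_k
        grads_b.insert(0, g.copy())                # (14c): gradient w.r.t. b_k
        g = Ws[k].T @ g                            # (14a): propagate to the previous layer
    return grads_W, grads_b

# Tiny two-layer example with random parameters and a single sample
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
grads_W, grads_b = backprop(rng.normal(size=3), rng.normal(size=2), Ws, bs)

The gradients returned by such a routine can be plugged directly into the stochastic gradient descent update (8).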

References
[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.

[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science,
1985.
