
Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 4: Neural Networks


Fall 2024/5
Lecturer: Nir Shlezinger

So far, we have focused on very simple machine learning models – linear architectures and heuristic
models. In practice, we are usually interested in a highly-expressive generic parametric model,
which for some given configuration of θ can approach the true risk minimizer. This model should
also be one which we can actually optimize based on the empirical risk.
This week's lecture introduces the highly-expressive generic parametric model which gave rise
to the deep learning revolution – neural networks. This lecture is mostly based on [1, Ch. 6]. Here
is a nice video on some of the hype around neural networks from the 1960's (which did not last into
the 70's, yet is nothing compared to the hype these days...).
Neural networks refer to a broad class of non-linear models/parametrizations fθ(x) that involve
combinations of matrix multiplications and other entry-wise non-linear operations. We will start
small and slowly build up a neural network, step by step.

1 Perceptron
The basic building block of neural network architectures is the perceptron. The original formulation
of the perceptron stems from a simplified model of a biological neuron, dating back to the
works of Rosenblatt from the 1950's [2].

Neurons The original formulation of the perceptron, also referred to as an artificial neuron or
simply a neuron, is that of a mapping hθ : R^N → {0, 1} which takes the form

hθ(x) = { 1,  w^T x + b > 0;  0,  otherwise },    θ = (w, b).    (1)
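As a quick illustration, the following is a minimal NumPy sketch of the Boolean neuron in (1); the function name and the OR example are our own illustrative choices, not taken from the lecture.

import numpy as np

def perceptron(x, w, b):
    """Boolean perceptron of (1): output 1 if w^T x + b > 0, and 0 otherwise."""
    return int(w @ x + b > 0)

# Example: with these hand-picked parameters the neuron realizes a logical OR
w = np.array([1.0, 1.0])
b = -0.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(np.array(x), w, b))  # prints 0, 1, 1, 1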

Although the perceptron initially seemed promising, it was quickly proved that perceptrons could
not be configured to recognize many classes of patterns, mostly due to their binary output
formulation, which is heavily inspired by the way neurons fire in the human brain. In fact, the
understanding that parameterized models of the form (1) cannot be tuned to carry out simple operations
such as XOR computations, as we will show later in this lecture, is considered to be one of
the main reasons that neural network research was effectively abandoned for over a decade during
the 1970's and the early 1980's.

Figure 1: Neuron illustration

Layers So should we throw perceptrons out the window? Definitely not! We just need to deviate
from the Boolean setting and stack a few more of these neurons to obtain powerful, highly expressive
parameterized mappings. In particular, modern neural networks are comprised of multivariate,
non-binary perceptrons, also referred to as layers. The mapping induced by each layer is given by

hθ (x) = σ(W x + b), θ = {W , b}. (2)

In (2), the function σ is a non-linear mapping applied element-wise, referred to as an activation
function. The activation is applied to an affine transformation of x, parameterized by W and b.

Activations Activation functions are often fixed, i.e., their mapping is not parametric and is thus
not optimized in the learning process. Note that for σ(x) = 1{x > 0}, i.e., the indicator of x > 0,
the formulation in (2) boils down to a stacking of the original Boolean neurons in (1). Some
notable examples of widely-used activation functions include:

Example 1.1 (ReLU). The rectified linear unit (ReLU) is an extremely common type of activation
function, given by:
σ(x) = max{x, 0}. (3)

Example 1.2 (Leaky ReLU). A variation of the ReLU activation includes an additional parameter
α < 1, typically set to 10−2 , and is given by:

σ(x) = max{x, αx}. (4)

Example 1.3 (Sigmoid). Another common activation function is the sigmoid, which is given by:

σ(x) = (1 + exp(−x))−1 . (5)
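As a quick illustration, here is a minimal NumPy sketch of the three activations above (our own code, not part of the lecture notes):

import numpy as np

def relu(x):
    """ReLU of (3): max{x, 0}, applied element-wise."""
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=1e-2):
    """Leaky ReLU of (4): max{x, alpha*x} with a small slope alpha < 1."""
    return np.maximum(x, alpha * x)

def sigmoid(x):
    """Sigmoid of (5): squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 1.5])
print(relu(z))        # [0.  0.  1.5]
print(leaky_relu(z))  # [-0.02  0.    1.5 ]
print(sigmoid(z))     # approximately [0.12 0.5 0.82]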

Figure 2: Multi-layered perceptron illustration

2 Multi-Layered Perceptron
While the layer mapping in (2) may be limited in its ability to capture complex mappings, one
can stack multiple layers to obtain a more flexible family of parameterized mappings. Such
compositions are referred to as multi-layered perceptrons or, more commonly, as deep neural
networks (DNNs) or deep feedforward networks. Specifically, a DNN fθ consisting of k layers
{h1, . . . , hk} maps the input x to the output ŝ = fθ(x) = hk ◦ · · · ◦ h1(x), where ◦ denotes
function composition. A DNN with k = 3 layers, i.e., two hidden layers and one output layer,
is illustrated in Fig. 2. Since each layer hi is itself a parametric function, the parameter set of
the entire network fθ is the union of all of its layers' parameters, and thus fθ denotes a DNN
with parameters θ. In particular, letting Wi, bi denote the configurable parameters of the ith
layer hi(·), the trainable parameters of the DNN are written as

θ = {Wi, bi}_{i=1}^k. (6)

The architecture of a DNN refers to the specification of its layers {hi}_{i=1}^k.
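To make the composition concrete, the following is a minimal NumPy sketch of a forward pass fθ(x) = hk ◦ · · · ◦ h1(x); the layer widths and random parameters are illustrative assumptions, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z  # a linear output layer, e.g., for regression

def layer(x, W, b, activation):
    """One layer of the form (2): an affine map followed by an element-wise activation."""
    return activation(W @ x + b)

# A DNN with k = 3 layers: two hidden layers (widths 4 and 3) and an output layer (width 1)
theta = [
    (rng.standard_normal((4, 2)), rng.standard_normal(4), relu),
    (rng.standard_normal((3, 4)), rng.standard_normal(3), relu),
    (rng.standard_normal((1, 3)), rng.standard_normal(1), identity),
]

def dnn(x, theta):
    """Forward pass: apply the layers in sequence, i.e., f_theta = h_k o ... o h_1."""
    for W, b, activation in theta:
        x = layer(x, W, b, activation)
    return x

print(dnn(np.array([0.5, -1.0]), theta))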

Terminology We use the following terms to describe a DNN:

• Depth - the depth of a network is its number of layers, i.e., k.

• Hidden layers - each layer whose output is not the output of the network is referred to as
hidden. For instance, in a k-layer DNN, layers h1, . . . , hk−1 are hidden layers.

• Output layer - the final layer of a DNN, whose output is the output of the network, i.e., hk,
is referred to as the output layer.

• Width - the width of layer hi is its number of neurons, i.e., for a layer of the generic form (2),
the width equals the number of rows of W (equivalently, the dimensionality of b).

Why Stack More Layers? The benefit of using multi-layered structures follows from their
ability to capture a broader family of mappings compared to single layers. To see this, let us
consider the XOR example, which is considered to be one of the main causes of the reduced
interest in neural networks, known as the AI winter, during the 1970's:

Example 2.1 (Learning XOR). Consider the tuning of a neural network fθ for the task of learning
to carry out an XOR operation on the entries of a vector of two binary elements. In such a case,
the possible inputs are X = {[0, 0], [0, 1], [1, 0], [1, 1]}. The risk function with the MSE loss for
identically distributed inputs (which in this case is also the empirical MSE, assuming we are
given an identical number of data points from each of these inputs) can be written as

L(θ) = (1/4) Σ_{x∈X} (XOR(x) − fθ(x))^2. (7)

Now, we choose the model we use to learn via (7). Suppose that we choose a linear model, i.e.,
fθ(x) = w^T x + b with θ = (w, b). In this case, (7) becomes

L(θ) = (1/4) [ b^2 + (1 − w1 − b)^2 + (1 − w2 − b)^2 + (w1 + w2 + b)^2 ]. (8)
Taking the derivative of the loss with respect to each variable and equating it to zero yields

∂L/∂w1 = (1/2)(−1 + 2w1 + w2 + 2b) = 0,
∂L/∂w2 = (1/2)(−1 + 2w2 + w1 + 2b) = 0,
∂L/∂b = (1/2)(−2 + 2w1 + 2w2 + 4b) = 0,
which is solved by setting w1 = w2 = 0 and b = 1/2. In this case, the output of the model is
fθ ≡ 1/2, indicating its inability to learn to carry out the XOR mapping. This inability follows from
the fact that a perceptron classifier divides the input space using a single line, as illustrated in
Fig. 3(a), which cannot describe an XOR function.
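This conclusion is easy to verify numerically; below is a short NumPy check (our own sketch) that the least-squares fit of a linear model to the four XOR examples indeed yields w1 = w2 = 0 and b = 1/2, so that every prediction equals 1/2.

import numpy as np

# Rows are the four inputs in X, augmented with a constant 1 for the bias term
A = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR targets

# Least-squares solution of (8): minimizes the same objective up to the 1/4 factor
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w1, w2, b)                   # approximately 0, 0, 0.5
print(A @ np.array([w1, w2, b]))   # all four predictions equal 0.5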
Next, let us consider a two-layer network with a hidden layer of width 2 and ReLU activation.
In this case, the network output is

fθ(x) = w^T h1(x) + b2 = w^T max{0, W x + b1} + b2, (9)

where the max operation is taken element-wise, and b1, b2 denote the biases of the hidden and
output layers, respectively. We can now show that under the following setting, the network
mapping fθ in (9) is exactly the XOR function:

W = [1 1; 1 1],  b1 = [0; −1],  w = [1; −2],  b2 = 0. (10)

Figure 3: Attempting to carry out XOR mapping using: a) a perceptron classifier; and b) a two-layer
perceptron classifier.

To see this, we stack the four possible inputs as the columns of a 2 × 4 matrix (with b1 added to
each column) and note that

fθ([0 0 1 1; 0 1 0 1]) = [1 −2] max{0, [1 1; 1 1] [0 0 1 1; 0 1 0 1] + [0; −1]}
                       = [1 −2] max{0, [0 1 1 2; 0 1 1 2] + [0; −1]}
                       = [1 −2] max{0, [0 1 1 2; −1 0 0 1]}
                       = [1 −2] [0 1 1 2; 0 0 0 1]
                       = [0 1 1 0] = XOR([0 0 1 1; 0 1 0 1]), (11)

where semicolons separate matrix rows.

The neural network has obtained the correct answer for every example. An illustration of its
resulting decision regions is given in Fig. 3(b).
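The computation in (11) can also be checked with a few lines of NumPy; this is our own sketch of the parameter setting in (10), not code from the lecture.

import numpy as np

# Parameters from (10): hidden layer (W, b1) with ReLU, followed by a linear output (w, b2)
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b2 = 0.0

def f(x):
    """Two-layer network of (9): f_theta(x) = w^T max{0, W x + b1} + b2."""
    return w @ np.maximum(0.0, W @ x + b1) + b2

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, f(np.array(x)))  # prints 0, 1, 1, 0 -- exactly XOR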

DNNs allow the function space Fθ to capture a broad range of functions. In fact, by making
the network sufficiently large, one can approximate any Borel measurable mapping from X to S
to arbitrary accuracy, as follows from the universal approximation theorem [1, Ch. 6.4.1]. The
expressiveness of DNNs, combined with the fact that the parameters θ can be learned from data,
allows trained DNNs to operate reliably in a model-agnostic manner.

Why are Activations Needed? If all the layers of a DNN were affine, the composition of all such
layers would also be affine, and thus the resulting network would only represent affine functions.

To see this, consider the composition of two affine layers, i.e.,
fθ (x) = W2 (W1 x + b1 ) + b2
= W2 W1 x + W2 b1 + b2 . (12)
We note that by defining W̃ ≜ W2 W1 and b̃ ≜ W2 b1 + b2 we have that
fθ (x) = W̃ x + b̃,
which is also an affine mapping with different parameters.
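A short numerical check of (12), using arbitrary random parameters (our own sketch):

import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
x = rng.standard_normal(2)

two_affine_layers = W2 @ (W1 @ x + b1) + b2          # composition of two affine maps
single_affine_map = (W2 @ W1) @ x + (W2 @ b1 + b2)   # W~ x + b~ as defined above
print(np.allclose(two_affine_layers, single_affine_map))  # True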
For this reason, layers in a DNN are interleaved with activation functions, which are simple
non-linear functions applied to each dimension of the input separately. Activations are often fixed,
i.e., their mapping is not parametric and is thus not optimized in the learning process.

Output Layers The choice of the output layer hk(·) is tightly coupled with the task of the network,
and more specifically, with the loss function L. In particular, the output layer dictates the
possible outputs the parametric mapping can realize. The following are commonly used output
layers based on the system task:
• Regression tasks involve the estimation of a continuous-amplitude vector, e.g., S = Rd . In
this case, the output must be allowed to take any value in Rd , and thus the common output
layer is a linear unit of width d, i.e.,
hk (z) = Wk z + bk , (13)
where the number of rows of Wk and the dimension of bk are set to d.
• Detection is a binary form of classification, i.e., |S| = 2. In classification tasks, one is
typically interested in soft outputs, e.g., P(s|x), hence the output must be a probability
vector over S. As S is binary, a single output taking values in [0, 1], representing P(s =
s1 |x), is sufficient. Thus, the typical output layer is a sigmoid unit as given in (5), and the
output layer is given by
hk (z) = σ(wkT z + bk ). (14)
• Classification in general allows any finite number of different labels, i.e. |S| = d, where d
is a positive integer not smaller than two. Here, to guarantee that the output is a probability
vector over S, classifiers typically employ the softmax function (e.g. on top of the output
layer), given by:
Softmax(z) = [ exp(z1)/Σ_{i=1}^d exp(zi), . . . , exp(zd)/Σ_{i=1}^d exp(zi) ].
The resulting output layer is given by
hk (z) = Softmax(Wk z + bk ), (15)
where the number of rows of Wk and the dimension of bk are set to d. Due to the exponentiation
followed by normalization, the output of the softmax function is guaranteed to be a valid
probability vector.
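The three output-layer choices above can be sketched in a few lines of NumPy; the dimensions below (d = 2 for regression, d = 4 for classification) are illustrative assumptions, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(3)  # output of the last hidden layer (width 3)

def softmax(v):
    """Exponentiation followed by normalization; subtracting max(v) improves numerical stability."""
    e = np.exp(v - v.max())
    return e / e.sum()

# Regression, S = R^d: a linear output layer as in (13), here with d = 2
Wk, bk = rng.standard_normal((2, 3)), rng.standard_normal(2)
print(Wk @ z + bk)

# Detection, |S| = 2: a sigmoid unit as in (14), producing P(s = s1 | x)
wk, bk0 = rng.standard_normal(3), 0.1
print(1.0 / (1.0 + np.exp(-(wk @ z + bk0))))

# Classification, |S| = d: softmax on top of a linear layer as in (15), here with d = 4
Wc, bc = rng.standard_normal((4, 3)), rng.standard_normal(4)
p = softmax(Wc @ z + bc)
print(p, p.sum())  # a valid probability vector; the entries sum to 1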

References
[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[2] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization
in the brain. Psychological Review, 65(6):386, 1958.
