
Ben-Gurion University - School of Electrical and Computer Engineering - 361-1-3040

Lecture 4: Neural Networks


Fall 2024/5
Lecturer: Nir Shlezinger

So far, we have focused on very simple machine learning models – linear architectures and heuristic
models. In practice, we are usually interested in a highly-expressive generic parametric model,
which for some given configuration of θ can approach the true risk minimizer. This model should
also be one which we can actually optimize based on the empirical risk.
This week's lecture introduces the highly-expressive generic parametric model which gave rise
to the deep learning revolution – neural networks. This lecture is mostly based on [1, Ch. 6]. Here
is a nice video on some of the hype around neural networks from the 1960's (which did not last into
the 70's, yet is nothing compared to the hype these days...).
Neural networks refer to a broad class of non-linear models/parametrizations fθ(x) that involve
combinations of matrix multiplications and other entry-wise non-linear operations. We will start
small and slowly build up a neural network, step by step.

1 Perceptron
The basic building block of neural network architectures is the perceptron. The original formulation
of the perceptron stems from a simplified model of a biological neuron, dating back to the
works of Rosenblatt from the 1950's [2].

Neurons The original formulation of the perceptron, also referred to as an artificial neuron or
simply a neuron, is that of a mapping hθ : R^N → {0, 1} which takes the form

hθ(x) = { 1,  w^T x + b > 0;  0,  otherwise },    θ = (w, b).    (1)
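As a quick illustration, the following is a minimal NumPy sketch of the Boolean neuron in (1); the function name and the OR example are our own illustrative choices, not taken from the lecture.

import numpy as np

def perceptron(x, w, b):
    """Boolean perceptron of (1): output 1 if w^T x + b > 0, and 0 otherwise."""
    return int(w @ x + b > 0)

# Example: with these hand-picked parameters the neuron realizes a logical OR
w = np.array([1.0, 1.0])
b = -0.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(np.array(x), w, b))  # prints 0, 1, 1, 1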

Although the perceptron initially seemed promising, it was quickly proved that perceptrons could
not be configured to recognize many classes of patterns, mostly due to their binary output
formulation, which is heavily inspired by the way neurons fire in the human brain. In fact, the
understanding that parameterized models of the form (1) cannot be tuned to carry out simple operations
such as XOR computations, as we will show later in this lecture, is considered to be one of
the main reasons that neural network research was effectively abandoned for over a decade during
the 1970's and the early 1980's.

Figure 1: Neuron illustration

Layers So should we throw perceptrons out the window? Definitely not! We just need to deviate
from the Boolean setting and stack a few more of these neurons to obtain powerful, highly expressive
parameterized mappings. In particular, modern neural networks are comprised of multivariate,
non-binary perceptrons, also referred to as layers. The mapping induced by each layer is given by

hθ (x) = σ(W x + b), θ = {W , b}. (2)

In (2), the function σ is a non-linear mapping applied element-wise, referred to as an activation
function. The activation is applied to an affine transformation of x, parameterized by W and b.

Activations Activation functions are often fixed, i.e., their mapping is not parametric and is thus
not optimized in the learning process. Note that for σ(x) = 1{x > 0}, i.e., the indicator of x > 0,
the formulation in (2) boils down to a stacking of the original Boolean neurons in (1). Some
notable examples of widely-used activation functions include:

Example 1.1 (ReLU). The rectified linear unit (ReLU) is an extremely common type of activation
function, given by:
σ(x) = max{x, 0}. (3)

Example 1.2 (Leaky ReLU). A variation of the ReLU activation includes an additional parameter
α < 1, typically set to 10−2 , and is given by:

σ(x) = max{x, αx}. (4)

Example 1.3 (Sigmoid). Another common activation function is the sigmoid, which is given by:

σ(x) = (1 + exp(−x))−1 . (5)
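As a quick illustration, here is a minimal NumPy sketch of the three activations above (our own code, not part of the lecture notes):

import numpy as np

def relu(x):
    """ReLU of (3): max{x, 0}, applied element-wise."""
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=1e-2):
    """Leaky ReLU of (4): max{x, alpha*x} with a small slope alpha < 1."""
    return np.maximum(x, alpha * x)

def sigmoid(x):
    """Sigmoid of (5): squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 1.5])
print(relu(z))        # [0.  0.  1.5]
print(leaky_relu(z))  # [-0.02  0.    1.5 ]
print(sigmoid(z))     # approximately [0.12 0.5 0.82]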

Figure 2: Multi-layered perceptron illustration

2 Multi-Layered Perceptron
While the layer mapping in (2) may be limited in its ability to capture complex mappings, one
can stack multiple layers to obtain a more flexible family of parameterized mappings. Such
compositions are referred to as multi-layered perceptrons or, more commonly, as deep neural
networks (DNNs) or deep feedforward networks. Specifically, a DNN fθ consisting of k layers
{h1, . . . , hk} maps the input x to the output ŝ = fθ(x) = hk ◦ · · · ◦ h1(x), where ◦ denotes
function composition. A DNN with k = 3 layers, i.e., two hidden layers and one output layer,
is illustrated in Fig. 2. Since each layer hi is itself a parametric function, the parameter set of
the entire network fθ is the union of all of its layers' parameters, and thus fθ denotes a DNN
with parameters θ. In particular, letting Wi, bi denote the configurable parameters of the ith
layer hi(·), the trainable parameters of the DNN are written as

θ = {Wi, bi}_{i=1}^k. (6)

The architecture of a DNN refers to the specification of its layers {hi}_{i=1}^k.
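To make the composition concrete, the following is a minimal NumPy sketch of a forward pass fθ(x) = hk ◦ · · · ◦ h1(x); the layer widths and random parameters are illustrative assumptions, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z  # a linear output layer, e.g., for regression

def layer(x, W, b, activation):
    """One layer of the form (2): an affine map followed by an element-wise activation."""
    return activation(W @ x + b)

# A DNN with k = 3 layers: two hidden layers (widths 4 and 3) and an output layer (width 1)
theta = [
    (rng.standard_normal((4, 2)), rng.standard_normal(4), relu),
    (rng.standard_normal((3, 4)), rng.standard_normal(3), relu),
    (rng.standard_normal((1, 3)), rng.standard_normal(1), identity),
]

def dnn(x, theta):
    """Forward pass: apply the layers in sequence, i.e., f_theta = h_k o ... o h_1."""
    for W, b, activation in theta:
        x = layer(x, W, b, activation)
    return x

print(dnn(np.array([0.5, -1.0]), theta))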

Terminology We use the following terms to describe a DNN:

• Depth - the depth of a network is its number of layers, i.e., k.

• Hidden layers - each layer whose output is not the output of the network is referred to as
hidden. For instance, in a k-layer DNN, layers h1, . . . , hk−1 are hidden layers.

• Output layer - the final layer of a DNN, whose output is the output of the network, i.e., hk,
is referred to as the output layer.

• Width - the width of layer hi is its number of neurons, i.e., for a layer of the generic form (2),
the width equals the number of rows of W (equivalently, the dimensionality of b).

Why Stack More Layers? The benefit of using multi-layered structures follows from their
ability to capture a broader family of mappings compared to single layers. To see this, let us
consider the XOR example, which is considered to be one of the main causes of the reduced
interest in neural networks, known as the AI winter, during the 1970's:

Example 2.1 (Learning XOR). Consider the tuning of a neural network fθ for the task of learning
to carry out an XOR operation on the entries of a vector of two binary elements. In such a case,
the possible inputs are X = {[0, 0], [0, 1], [1, 0], [1, 1]}. The risk function with the MSE loss for
identically distributed inputs (which in this case is also the empirical MSE, assuming we are
given an identical number of data points from each of these inputs) can be written as

L(θ) = (1/4) Σ_{x∈X} (XOR(x) − fθ(x))^2. (7)

Now, we choose the model we use to learn via (7). Suppose that we choose a linear model, i.e.,
fθ(x) = w^T x + b with θ = (w, b). In this case, (7) becomes

L(θ) = (1/4) [ b^2 + (1 − w1 − b)^2 + (1 − w2 − b)^2 + (w1 + w2 + b)^2 ]. (8)
Taking the derivative of the loss with respect to each variable and equating it to zero yields

∂L/∂w1 = (1/2)(−1 + 2w1 + w2 + 2b) = 0,
∂L/∂w2 = (1/2)(−1 + 2w2 + w1 + 2b) = 0,
∂L/∂b = (1/2)(−2 + 2w1 + 2w2 + 4b) = 0,
which is solved by setting w1 = w2 = 0 and b = 1/2. In this case, the output of the model is
fθ ≡ 1/2, indicating its inability to learn to carry out the XOR mapping. This inability follows from
the fact that a perceptron classifier divides the input space using a single line, as illustrated in
Fig. 3(a), which cannot describe an XOR function.
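This conclusion is easy to verify numerically; below is a short NumPy check (our own sketch) that the least-squares fit of a linear model to the four XOR examples indeed yields w1 = w2 = 0 and b = 1/2, so that every prediction equals 1/2.

import numpy as np

# Rows are the four inputs in X, augmented with a constant 1 for the bias term
A = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR targets

# Least-squares solution of (8): minimizes the same objective up to the 1/4 factor
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w1, w2, b)                   # approximately 0, 0, 0.5
print(A @ np.array([w1, w2, b]))   # all four predictions equal 0.5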
Next, let us consider a two-layer network with a hidden layer of width 2 and ReLU activation.
In this case, the network output is

fθ(x) = w^T h1(x) + b2 = w^T max{0, W x + b1} + b2, (9)

where the max operation is taken element-wise, and b1, b2 denote the biases of the hidden and
output layers, respectively. We can now show that under the following setting, the network
mapping fθ in (9) is exactly the XOR function:

W = [1 1; 1 1],  b1 = [0; −1],  w = [1; −2],  b2 = 0. (10)

Figure 3: Attempting to carry out XOR mapping using: a) a perceptron classifier; and b) a two-layer
perceptron classifier.

To see this, we stack the four possible inputs as the columns of a 2 × 4 matrix (with b1 added to
each column) and note that

fθ([0 0 1 1; 0 1 0 1]) = [1 −2] max{0, [1 1; 1 1] [0 0 1 1; 0 1 0 1] + [0; −1]}
                       = [1 −2] max{0, [0 1 1 2; 0 1 1 2] + [0; −1]}
                       = [1 −2] max{0, [0 1 1 2; −1 0 0 1]}
                       = [1 −2] [0 1 1 2; 0 0 0 1]
                       = [0 1 1 0] = XOR([0 0 1 1; 0 1 0 1]), (11)

where semicolons separate matrix rows.

The neural network has obtained the correct answer for every example. An illustration of its
resulting decision regions is given in Fig. 3(b).
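The computation in (11) can also be checked with a few lines of NumPy; this is our own sketch of the parameter setting in (10), not code from the lecture.

import numpy as np

# Parameters from (10): hidden layer (W, b1) with ReLU, followed by a linear output (w, b2)
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b2 = 0.0

def f(x):
    """Two-layer network of (9): f_theta(x) = w^T max{0, W x + b1} + b2."""
    return w @ np.maximum(0.0, W @ x + b1) + b2

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, f(np.array(x)))  # prints 0, 1, 1, 0 -- exactly XOR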

DNNs allow the function space Fθ to capture a broad range of functions. In fact, by making
the network sufficiently large, one can approximate any Borel measurable mapping from X to S
to arbitrary accuracy, as follows from the universal approximation theorem [1, Ch. 6.4.1]. The
expressiveness of DNNs, combined with the fact that the parameters θ can be learned from data,
allows trained DNNs to operate reliably in a model-agnostic manner.

Why are Activations Needed? If all the layers of a DNN were affine, the composition of all such
layers would also be affine, and thus the resulting network would only represent affine functions.

To see this, consider the composition of two affine layers, i.e.,
fθ (x) = W2 (W1 x + b1 ) + b2
= W2 W1 x + W2 b1 + b2 . (12)
We note that by defining W̃ ≜ W2 W1 and b̃ ≜ W2 b1 + b2 we have that
fθ (x) = W̃ x + b̃,
which is also an affine mapping with different parameters.
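A short numerical check of (12), using arbitrary random parameters (our own sketch):

import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
x = rng.standard_normal(2)

two_affine_layers = W2 @ (W1 @ x + b1) + b2          # composition of two affine maps
single_affine_map = (W2 @ W1) @ x + (W2 @ b1 + b2)   # W~ x + b~ as defined above
print(np.allclose(two_affine_layers, single_affine_map))  # True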
For this reason, layers in a DNN are interleaved with activation functions, which are simple
non-linear functions applied to each dimension of the input separately. Activations are often fixed,
i.e., their mapping is not parametric and is thus not optimized in the learning process.

Output Layers The choice of the output layer hk(·) is tightly coupled with the task of the network,
and more specifically, with the loss function L. In particular, the output layer dictates the
possible outputs the parametric mapping can realize. The following are commonly used output
layers based on the system task:
• Regression tasks involve the estimation of a continuous-amplitude vector, e.g., S = Rd . In
this case, the output must be allowed to take any value in Rd , and thus the common output
layer is a linear unit of width d, i.e.,
hk (z) = Wk z + bk , (13)
where the number of rows of Wk and the dimension of bk are set to d.
• Detection is a binary form of classification, i.e., |S| = 2. In classification tasks, one is
typically interested in soft outputs, e.g., P(s|x), hence the output must be a probability
vector over S. As S is binary, a single output taking values in [0, 1], representing P(s =
s1 |x), is sufficient. Thus, the typical output layer is a sigmoid unit as given in (5), and the
output layer is given by
hk (z) = σ(wkT z + bk ). (14)
• Classification in general allows any finite number of different labels, i.e. |S| = d, where d
is a positive integer not smaller than two. Here, to guarantee that the output is a probability
vector over S, classifiers typically employ the softmax function (e.g. on top of the output
layer), given by:
Softmax(z) = [ exp(z1)/Σ_{i=1}^d exp(zi), . . . , exp(zd)/Σ_{i=1}^d exp(zi) ].
The resulting output layer is given by
hk (z) = Softmax(Wk z + bk ), (15)
where the number of rows of Wk and the dimension of bk are set to d. Due to the exponentiation
followed by normalization, the output of the softmax function is guaranteed to be a valid
probability vector.
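The three output-layer choices above can be sketched in a few lines of NumPy; the dimensions below (d = 2 for regression, d = 4 for classification) are illustrative assumptions, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(3)  # output of the last hidden layer (width 3)

def softmax(v):
    """Exponentiation followed by normalization; subtracting max(v) improves numerical stability."""
    e = np.exp(v - v.max())
    return e / e.sum()

# Regression, S = R^d: a linear output layer as in (13), here with d = 2
Wk, bk = rng.standard_normal((2, 3)), rng.standard_normal(2)
print(Wk @ z + bk)

# Detection, |S| = 2: a sigmoid unit as in (14), producing P(s = s1 | x)
wk, bk0 = rng.standard_normal(3), 0.1
print(1.0 / (1.0 + np.exp(-(wk @ z + bk0))))

# Classification, |S| = d: softmax on top of a linear layer as in (15), here with d = 4
Wc, bc = rng.standard_normal((4, 3)), rng.standard_normal(4)
p = softmax(Wc @ z + bc)
print(p, p.sum())  # a valid probability vector; the entries sum to 1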

References
[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[2] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization
in the brain. Psychological Review, 65(6):386, 1958.
