Classification Using a Single Perceptron
Let’s understand the term “deep learning”. The word “deep” refers to the number of layers. So deep
learning is about building complex hierarchical representations from simple building
blocks.
“The hierarchy of concepts allows the computer to learn complicated concepts by
building them out of simpler ones. If we draw a graph showing how these concepts are
built on top of each other, the graph is deep, with many layers. For this reason, we call
this approach to AI deep learning.” - Ian Goodfellow, the inventor of Generative
Adversarial Networks (GANs)
Deep learning models have the ability to perform automatic feature extraction from
raw data, also called feature learning.
Now that we understand deep learning is built using simple building blocks, let’s dive
in and understand what these building blocks are.
[email protected]
ZV0GDF798E
The perceptron is the basic building block of deep learning. Sometimes we also call it a
neuron, so a neural network is nothing but a network of layers of neurons, or
perceptrons.
Let’s recall what the perceptron is:
It’s a function that has several inputs and one output. Let’s say that it has n inputs
{x1, ..., xn}. Then the output of a perceptron is computed in two steps:
Step 1: We compute a linear function of the inputs; the coefficients of this linear
function are called weights. We initialize these weights with random values.
[email protected]
ZV0GDF798EStep 2: We take this linear combination as the output of the first step and compute a
threshold. This threshold takes any value above some cutoff tau and maps it to the
value +1, and maps everything below tau to -1. The second step is the only
non-linearity in the network.
So the above perceptron requires n weights corresponding to the n inputs and one value of
tau. In total, we need n + 1 parameters for each perceptron.
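As a minimal sketch of these two steps (the function and variable names here are illustrative, not from the original text, and NumPy is assumed):

```python
import numpy as np

def perceptron(x, w, tau):
    """Single perceptron: weighted sum of the inputs followed by a hard threshold."""
    z = np.dot(w, x)             # Step 1: linear combination of the inputs
    return 1 if z > tau else -1  # Step 2: threshold at the cutoff tau

# Example with n = 3 inputs and randomly initialized weights
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])
w = rng.normal(size=3)           # n weights
tau = 0.0                        # plus one threshold -> n + 1 parameters
print(perceptron(x, w, tau))
```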
What's the connection to Support Vector Machines?
The perceptron only recognizes linear patterns; SVMs combat this by mapping the input
space to a new space using a kernel. In deep learning, we take a different approach from
SVMs: we layer perceptrons on top of each other to get our non-linearity.
Let’s start with a concrete example - the problem of image classification. Here we need
to classify which images contain a dog and which don’t.
These are color images, so each pixel is associated with three values - Red, Green,
and Blue (RGB). Suppose our input is a 256 x 256 image. The image dimensions will then
be (256, 256, 3), where 256 and 256 represent the height and width of the image, and 3
represents the number of color channels, or depth (R, G, and B - hence 3), for each
pixel. Hence, every image is a 256 x 256 x 3 array of values.
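As a quick sketch of this representation (a synthetic array stands in for a real image, which would normally be loaded with an image library):

```python
import numpy as np

# A synthetic 256 x 256 RGB image with values in [0, 255]
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)

print(image.shape)     # (256, 256, 3) -> height, width, color channels
x = image.reshape(-1)  # flatten into one long input vector for a perceptron
print(x.shape)         # (196608,) = 256 * 256 * 3 inputs
```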
[email protected]
ZV0GDF798E
Now, we want the perceptron to solve our image classification problem. What we're
really asking for is a linear function of pixels that returns positive if the image contains
a dog and negative if it doesn't.
We might imagine a perceptron is just too simple a function to solve such a complex
task efficiently. There probably isn't a nice linear function of the pixels that decides
whether or not there is a dog in the picture. So the key takeaway here is that a simple
linear function cannot identify complex patterns, such as the presence of a dog in
an image. We need more complex functions or non-linearity in the functions to do this.
Here, non-linearity means the function tries to fit a curve instead of a line as the
decision boundary.
So how can we create more complex functions out of perceptrons as building blocks?
As we have discussed, we need to add more layers; in the simplest setup, imagine
that there are just two layers of perceptrons.
Our input is an n-dimensional vector representing a picture. Each perceptron in the
first layer has n inputs, each with its own weight, so each first-layer perceptron
computes its own weighted sum, applies its own threshold, and returns an output. These
outputs become the inputs to the perceptron in the next layer, which has its own set of
weights and its own threshold and returns the final output.
The above diagram can be divided into three regions:
[email protected]
ZV0GDF798EThe red one corresponds to the input layer and the brown one in the middle
corresponds to the hidden layer which is responsible for calculating complex patterns.
The orange one is the output layer.
So,
Total number of layers = Number of hidden layers + output layer
We don’t count the input layer in the total number of layers.
As we know, each perceptron has n + 1 parameters. So in the above network, the first
layer has m perceptrons, which take a total of (n + 1) x m parameters. Similarly, for the
next layer, we have one perceptron that takes m + 1 parameters. So the total number of
parameters this network takes is:

(n + 1) x m + (m + 1)
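As a quick sketch (the helper name is just for illustration), this count can be computed directly:

```python
def count_parameters(n, m):
    """Parameters of a 2-layer network: m hidden perceptrons over n inputs,
    plus one output perceptron over the m hidden outputs."""
    return (n + 1) * m + (m + 1)

print(count_parameters(6, 2))  # 17, matching the worked example below
```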
Let’s understand all these steps with a simple example:
Here we consider 2 perceptrons in the first layer and 1 perceptron in the second layer.
Suppose we have 6 input values: x1 = 10, x2 = 24, x3 = 15, x4 = 18, x5 = 5, x6 = 20.
We have two layers. Let’s start with layer 1, which has two perceptrons. Since we have 6
inputs, each perceptron takes 6 weights (random values for now), and let’s say the
threshold of the first perceptron is 10. So:
Perceptron1_layer_1 = x1w1 + x2w2 + x3w3 + x4w4 + x5w5 + x6w6
= (10 x 0.2) + (24 x 0.1) + (15 x 0.15) + (18 x 0.25) + (5 x 0.4) + (20 x 0.12)
= 15.55
Now if we compare this with the threshold value (which is 10), it’s higher than the
threshold, so the function returns +1 and not -1.
Now we need to calculate this for perceptron 2. It takes six weights, which are random
values different from those of the first perceptron, and let’s say the threshold here is 15.
Perceptron2_layer_1 = x1w'1 + x2w'2 + x3w'3 + x4w'4 + x5w'5 + x6w'6
= (10 x 0.14) + (24 x 0.23) + (15 x 0.2) + (18 x 0.3) + (5 x 0.28) + (20 x 0.1)
= 18.72
Now if we compare this with the threshold value (which is 15), it’s higher than the
threshold, so again, the function returns +1.
So the outputs of the first layer are {+1, +1}. These will be the inputs for the second
layer. For the second layer, which has one perceptron, we do a similar calculation. It
has two inputs, so we need two weights here along with a threshold: let’s say 0.5.
Perceptron1_layer_2 = x1w21 + x2w22 = (1 x 0.5) + (1 x 0.2) = 0.7
Comparing this with the threshold value (which is 0.5), it returns +1 as it’s higher.
The total number of parameters is (n+1)m + (m+1) = (6+1) x 2 + (2+1) = 17.
This is how the values are calculated and can be extended to multiple layers with
multiple perceptrons.
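The same computation can be reproduced in a short NumPy sketch (the weights, thresholds, and values come from the worked example above; the helper names are illustrative):

```python
import numpy as np

def threshold(z, tau):
    """Hard threshold: +1 if the weighted sum exceeds tau, otherwise -1."""
    return 1 if z > tau else -1

x = np.array([10, 24, 15, 18, 5, 20])              # the 6 inputs

# Layer 1: two perceptrons, each with 6 weights and its own threshold
w1 = np.array([0.2, 0.1, 0.15, 0.25, 0.4, 0.12])   # tau = 10
w2 = np.array([0.14, 0.23, 0.2, 0.3, 0.28, 0.1])   # tau = 15
h1 = threshold(np.dot(w1, x), 10)                  # 15.55 > 10 -> +1
h2 = threshold(np.dot(w2, x), 15)                  # 18.72 > 15 -> +1

# Layer 2: one perceptron over the two hidden outputs
w_out = np.array([0.5, 0.2])                       # tau = 0.5
y = threshold(np.dot(w_out, np.array([h1, h2])), 0.5)  # 0.7 > 0.5 -> +1

print(h1, h2, y)                                   # 1 1 1
```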
We can take this idea much further and have more than two layers - the functions that we
get are called deep neural networks, and in practice one usually sets them to have
between 6 and 8 layers with millions of perceptrons in between.
There are several fragments of intuition behind these types of functions. The hope
is that the lower-level layers of the network identify some basic features, like edges and
patterns, and that each layer on top of them builds on the previous layer to create
much more complex features.
Source - Andrew Ng
Let's take an instance where the model has to identify whether or not a dog is present
in an image. First, it would need to identify edges; then it would identify which
arrangements of edges represent the legs, the body, and the head; and then which
arrangements of these parts represent a dog.
Recall that in ImageNet, we want to correctly classify according to 1000 different
labels at once. Even though there are a million total images, that's not actually that
many examples considering the number of labels. So what's important is that the
features that are useful for identifying one breed of dog can be useful in identifying
other breeds of dog as well. In this sense, a deep network can have 1000 outputs, one
for each label, built on top of a common deep network underneath it, one which is
hopefully identifying useful high-level representations that are needed to understand
images.
Another rationalization for deep neural networks is that they parallel what happens in
the visual cortex. There's still a lot about the brain that we don't understand. But it
does seem that the visual cortex has a similar type of hierarchical structure, with
[email protected]
ZV0GDF798Eneurons in the lower layers recognizing lower level features like edges.
Moreover, we can measure how quickly a human can recognize an object and how
quickly a neuron fires. These measurements tell us that even though the visual cortex is
performing some hierarchical computation, it needs at most six to eight layers (with a
sufficient number of neurons in each layer) to solve even complex, high-level recognition
problems.
So let's conclude - first, the perceptron has a linear part and a non-linear threshold
function. If it weren't for the threshold, creating deeper networks would not buy us
anything. We would still be composing linear functions, and no matter how deep we make
the computation, the function we get would still be linear. It's the non-linearity that we
added that makes deep networks so functionally expressive. We have understood that
by using the threshold function we are introducing some kind of non-linearity into the
network, but why do we need non-linearity? It is hard to find any physical-world
phenomenon that follows linearity straightforwardly. We need a non-linear function
that can approximate the non-linear phenomena we observe in the real world. The
image below is an example of such non-linear patterns that need to be identified:
[email protected]
ZV0GDF798E
To introduce non-linearity we use some kind of function, for example, the threshold
function, and these functions are called activation functions. The purpose of the
activation function is to introduce non-linearity into the output of a neuron.
In fact, there are many other non-linear functions, or activation functions, that we could
have chosen instead of a threshold: a logistic sigmoid, a hyperbolic tangent, or any other
smooth approximation to a step function. This and many other aspects of the architecture
of deep neural networks are all valid design choices that have their own merits. There
are many research papers that grapple with
issues like which non-linear functions work best and how we should structure the
internal layers, for example using the convolution operation. We will talk more about
these issues, but it's good to know that they're very important.
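To make the first point concrete, here is a minimal NumPy sketch (the matrices and names are illustrative, not from the text) showing that stacking two layers without an activation collapses to a single linear map, while a threshold non-linearity does not:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # first layer of weights
W2 = rng.normal(size=(2, 4))   # second layer of weights
x = rng.normal(size=3)

# Two linear layers compose into one linear layer: W2 @ (W1 @ x) == (W2 @ W1) @ x
out_stacked = W2 @ (W1 @ x)
out_single = (W2 @ W1) @ x
print(np.allclose(out_stacked, out_single))    # True -> no extra expressive power

# Inserting a non-linearity (here a hard threshold) breaks this collapse
h = np.where(W1 @ x > 0, 1.0, -1.0)
out_nonlinear = W2 @ h
print(np.allclose(out_nonlinear, out_single))  # generally False
```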
Second, we've said nothing about actually finding the parameters of the deep network.
While calculating the perceptron outputs above, we used weights that we defined randomly;
these weights need to be learned by the network. Modern deep networks have millions of
parameters, which is a very large space to search. When we talked about the support
vector machine, we had the perceptron algorithm, which told us that if there is a linear
classifier, we can find it algorithmically. But even if there is a setting of the
parameters of a deep network that really can classify images accurately into, say,
different breeds of dog, how can we find it? There is no simple answer to this question.
There are approaches that seem to work in practice, but why they do is still very much a
mystery, perhaps a phenomenon that has to do with some of the strange properties of
searching in such a high-dimensional space.
Additional Content:
[email protected]
ZV0GDF798ETypes of Activation functions:
1) Sigmoid
The main reason why we use the sigmoid function is that the range of values it
outputs is between 0 and 1. Therefore, it is especially used for models where
we have to predict a probability as the output, since probabilities exist only
in the range 0 to 1 - in such contexts, Sigmoid is the right choice. However,
when used in the hidden layers, the logistic sigmoid function can cause a neural
network to get stuck during training due to the Vanishing / Exploding Gradient
problem. Therefore, the Sigmoid is mostly only used in the output layer,
especially in the case of binary classification.
2) Tanh
Tanh is a scaled and shifted version of the Sigmoid function, with an output range
between -1 and 1. The activations that come out of a hidden layer using Tanh have
a mean closer to zero, so the data is more centered, which makes learning for the
next layer easier and faster. One of the downsides of
both Sigmoid and Tanh is the Vanishing / Exploding Gradient problem; if our
weighted sum input is either very large or very small, then the gradient (also
called the derivative or slope) of this function becomes very small and ends up
being very close to zero. This can slow down learning, and this is why Sigmoid
and Tanh are not preferred in the hidden layers of deep neural networks.
3) ReLU
ReLU is increasingly the default choice of activation function in the hidden layers
of deep neural networks. If you are not sure what to use in the hidden layers,
just use the ReLU activation function or one of its variants. It is a bit faster to
compute than other activation functions, and gradient descent does not get
stuck as much on plateaus, thanks to the fact that it does not saturate for
large input values, unlike the logistic sigmoid function or the hyperbolic
tangent function.
One disadvantage of ReLU is that its derivative is equal to zero when the
weighted input is negative. This problem is known as the dying ReLU: if the
weights in the network always lead to negative inputs into a ReLU neuron, that
neuron won't be effectively contributing to the network's training. There is
another version of the ReLU activation function, called the Leaky ReLU, that
solves the dying ReLU problem. It usually works better than the ReLU activation
function.
4) LeakyReLU
The Leaky ReLU activation function usually works better than ReLU, but it is not
used that much in practice.
5) Softmax
The Softmax activation function is used in neural networks when we want to
build a multi-class classifier, i.e., one that solves the problem of assigning an
instance to one class when the number of possible classes is larger than two
(when there are only two classes, we can simply use Sigmoid). All five of these
functions are sketched in code right after this list.
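As a reference, here are minimal NumPy sketches of the five activation functions above (these are standard textbook definitions rather than code from the original material; the leaky-ReLU slope of 0.01 is a common but arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1); often used for binary outputs."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Like sigmoid but with outputs in (-1, 1), so activations are zero-centered."""
    return np.tanh(z)

def relu(z):
    """max(0, z): cheap to compute and does not saturate for large positive inputs."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but with a small slope alpha for negative inputs, so the
    gradient never becomes exactly zero (avoids the dying ReLU problem)."""
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    """Turns a vector of scores into probabilities that sum to 1 (multi-class output).
    Subtracting the max is a standard trick for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), softmax(z), sep="\n")
```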
In the output layer, the activation functions are selected based on the problem
statement, for example, if it's:
● Regression: Linear / No activation function (because the values are unbounded)
● Classification:
○ Binary classification - Sigmoid
○ Multiclass classification - Softmax
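To illustrate these choices, here is a hedged sketch assuming a TensorFlow/Keras setup (which the text does not prescribe) and an arbitrary input size of 100 features; only the output layer's activation changes with the task:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 100  # illustrative input size

# Binary classification: one output unit with a sigmoid activation
binary_model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Multiclass classification (say 10 classes): softmax over 10 output units
multiclass_model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Regression: a linear (no activation) output unit, since the values are unbounded
regression_model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # default activation is linear
])
```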
In the explanation above, we initialized the weights randomly, but we should be
careful and deliberate about how we define them. Weight initialization in
neural networks is an active research topic in its own right. You may refer to this link to
get a better idea about it > Weight Initialization.
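As a small illustration, here are two widely used initialization schemes, Xavier/Glorot and He, given as examples rather than as the scheme the text has in mind:

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_init(n_in, n_out):
    """Xavier/Glorot initialization: variance scaled by fan-in and fan-out,
    commonly paired with sigmoid or tanh activations."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_init(n_in, n_out):
    """He initialization: variance scaled by fan-in, commonly paired with ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W_hidden = he_init(6, 2)   # e.g. the 2-perceptron hidden layer from the worked example
print(W_hidden.shape)      # (2, 6)
```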
[email protected]
ZV0GDF798E
12
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.