DL Unit 2a

UNIT-2

Feedforward Networks: Multilayer Perceptron, Gradient Descent, Backpropagation, Empirical Risk Minimization.
Autoencoders: Regularized Autoencoders, Representational Power, Layer Size, and Depth of Autoencoders, Stochastic Encoders and Decoders, Contractive Encoders.
Regularization: Bias-Variance Tradeoff, L2 regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout, Greedy
_____________________________________________________________________

Multilayer Perceptron:-

A multilayer perceptron (MLP) is a class of feedforward neural network. It consists of three types of layers: the input layer, the output layer, and the hidden layer. The input layer receives the input signal to be processed.

A basic MLP has three layers, including one hidden layer. If it has more than one hidden layer, it is called a deep ANN. An MLP is a typical example of a feedforward artificial neural network, and the ith activation unit in the lth layer is denoted as a_i^(l).

The number of layers and the number of neurons are referred to as hyperparameters of a neural network, and these need tuning; cross-validation techniques can be used to find good values for them.
Weight adjustment during training is done via backpropagation. Deeper neural networks can model more complex functions, but added depth can lead to the vanishing gradient problem, and special algorithms are required to address it.

The algorithm for the MLP is as follows:


1. Just as with the perceptron, the inputs are pushed forward through the MLP by
taking the dot product of the input with the weights that exist between the input
layer and the hidden layer (WH). This dot product yields a value at the hidden
layer. We do not push this value forward as we would with a perceptron
though.
2. MLPs utilize activation functions at each of their calculated layers. There are many activation functions to choose from, such as the rectified linear unit (ReLU), the sigmoid function, and tanh. Push the calculated output at the current layer through any of these activation functions.
3. Once the calculated output at the hidden layer has been pushed through the
activation function, push it to the next layer in the MLP by taking the dot
product with the corresponding weights.
4. Repeat steps two and three until the output layer is reached.
5. At the output layer, the calculations will either be used for
a backpropagation algorithm that corresponds to the activation function that
was selected for the MLP (in the case of training) or a decision will be made
based on the output (in the case of testing).

Notations

In the notation used here:

 a_i^(in) refers to the ith value in the input layer
 a_i^(h) refers to the ith unit in the hidden layer
 a_i^(out) refers to the ith unit in the output layer
 a_0^(in) is simply the bias unit and is equal to 1; it has the corresponding weight w_0
 The weight coefficient from layer l to layer l+1 is represented by w_{k,j}^(l)

A simplified view of the multilayer perceptron is a fully connected three-layer neural network with 3 input neurons and 3 output neurons, with a bias term added to the input vector.

Forward Propagation

In the following topics, let us look at the forward propagation in detail.

MLP Learning Procedure

The MLP learning procedure is as follows:

 Starting with the input layer, propagate data forward to the output layer. This step
is the forward propagation.
 Based on the output, calculate the error (the difference between the predicted and
known outcome). The error needs to be minimized.
 Backpropagate the error. Find its derivative with respect to each weight in the
network, and update the model.
Repeat the three steps given above over multiple epochs to learn ideal weights.

Finally, the output is taken via a threshold function to obtain the predicted class labels.

Forward Propagation in MLP

In the first step, calculate the activation units a^(h) of the hidden layer.
An activation unit is the result of applying an activation function φ to the net input z. The activation function must be differentiable so that the weights can be learned using gradient descent. The activation function φ is often the sigmoid (logistic) function.

It provides the nonlinearity needed to solve complex problems such as image processing.

Sigmoid Curve

The sigmoid curve is an S-shaped curve.
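It is defined as

φ(z) = 1 / (1 + e^(−z))

Its derivative can be written as φ′(z) = φ(z)(1 − φ(z)), which makes the gradients needed for backpropagation cheap to compute.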

Activation of Hidden Layer

The activation of the hidden layer is represented as:


z^(h) = a^(in) W^(h)

a^(h) = φ(z^(h))

For the output layer:

z^(out) = a^(h) W^(out)

a^(out) = φ(z^(out))
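As an illustration, the following is a minimal NumPy sketch of this forward pass (the shapes, weight values, and variable names are assumed for the example, and the bias terms are omitted for brevity):

import numpy as np

def sigmoid(z):
    # Logistic activation: phi(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_h = rng.normal(size=(3, 4))      # weights between input and hidden layer
W_out = rng.normal(size=(4, 2))    # weights between hidden and output layer

a_in = np.array([0.5, -1.0, 2.0])  # one input sample with 3 features

z_h = a_in @ W_h                   # net input of the hidden layer
a_h = sigmoid(z_h)                 # hidden activation
z_out = a_h @ W_out                # net input of the output layer
a_out = sigmoid(z_out)             # network output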

Gradient Descent
Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks. Training data helps these models learn over time, and the cost function within gradient descent acts as a barometer, gauging the model's accuracy with each iteration of parameter updates. Gradient descent helps find a local minimum of a function.

The behavior of gradient-based updates near a local minimum or local maximum can be described as follows:

o If we move in the direction of the negative gradient, i.e., away from the gradient of the function at the current point, we move toward a local minimum of that function.
o If we move in the direction of the positive gradient, i.e., toward the gradient of the function at the current point, we move toward a local maximum of that function; this procedure is known as gradient ascent.
Moving against the gradient is gradient descent, also called the method of steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient, or slope, at the current point.
o Move in the direction opposite to the gradient, stepping away from the current point by alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning parameter in the optimization process that decides the length of the steps.
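A minimal sketch of these two steps in Python, applied to the illustrative one-dimensional function f(w) = (w − 3)^2 (the function, starting point, and learning rate are all assumed for the example):

# Gradient descent on f(w) = (w - 3)**2, whose derivative is f'(w) = 2 * (w - 3)
w = 0.0        # arbitrary starting point
alpha = 0.1    # learning rate (step size)

for step in range(100):
    grad = 2 * (w - 3)    # step 1: first-order derivative at the current point
    w = w - alpha * grad  # step 2: move against the gradient by alpha times the slope

print(w)  # approaches the minimum at w = 3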

Cost-function

The cost function is defined as the measurement of the difference, or error, between the actual values and the expected values at the current position, expressed as a single real number. It improves the efficiency of the machine learning model by providing feedback so that the model can minimize the error and find the local or global minimum. The algorithm iterates continuously along the direction of the negative gradient until the cost function approaches its minimum, at which point the model stops learning further. Although the cost function and the loss function are often treated as synonymous, there is a minor difference between them: the loss function refers to the error of a single training example, while the cost function calculates the average error across an entire training set.
The cost function is calculated after making a hypothesis with initial parameters; the parameters are then modified using the gradient descent algorithm over known data to reduce the cost function.

Hypothesis: h_θ(x) = θ_0 + θ_1 x

Parameters: θ_0, θ_1

Cost function: J(θ_0, θ_1) = (1/2n) Σ_i (h_θ(x^(i)) − y^(i))²

Goal: minimize J(θ_0, θ_1)

Before examining the working of gradient descent, recall how the slope of a line is expressed in linear regression. The equation for simple linear regression is:

Y = mX + c

where 'm' represents the slope of the line and 'c' represents the intercept on the y-axis.
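For example, with the mean squared error cost J(m, c) = (1/n) Σ_i (Y_i − (mX_i + c))², the partial derivatives that gradient descent follows are

∂J/∂m = −(2/n) Σ_i X_i (Y_i − (mX_i + c))
∂J/∂c = −(2/n) Σ_i (Y_i − (mX_i + c))

and each iteration updates m and c by subtracting alpha times the corresponding derivative.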

The starting point is just an arbitrary point used to evaluate performance. At this starting point, we take the first derivative, or slope, and use a tangent line to measure its steepness. This slope informs the updates to the parameters (the weights and the bias).

The slope is steeper at the starting point, but as new parameters are generated the steepness gradually reduces, until the algorithm reaches the lowest point, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e., the error between the expected and actual values. Minimizing the cost function requires two factors:

o Direction & Learning Rate

These two factors determine the partial-derivative calculations of future iterations and allow the algorithm to arrive at the point of convergence, i.e., the local or global minimum. Let's discuss the learning rate briefly:

Learning Rate:

It is defined as the step size taken to reach the minimum, or lowest, point. It is typically a small value that is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum; a low learning rate takes small steps, which sacrifices overall efficiency but offers more precision.

Types of Gradient Descent

Based on how much of the training data is used to compute each update, gradient descent can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's understand these different types of gradient descent:
1. Batch Gradient Descent:

Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after all training examples have been evaluated. One such full pass over the training set is known as a training epoch. In simple words, it sums over all examples for each single update.

Advantages of Batch gradient descent:

o It produces less noise than the other types of gradient descent.
o It produces stable convergence.
o It is computationally efficient, since one update is computed from all training samples at once.

2. Stochastic gradient descent

Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration; in other words, it updates the parameters after every single example in the dataset. Since it requires only one training example at a time, it is easy to fit in memory. However, it loses some computational efficiency compared to batch gradient descent, because the frequent updates add overhead, and those frequent updates also make the gradient noisy. On the other hand, this noise can sometimes help in escaping local minima and finding the global minimum.

Advantages of Stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, which gives it a few advantages over the other types of gradient descent.

o It is easier to fit in the available memory.
o Each update is faster to compute than in batch gradient descent.
o It is more efficient for large datasets.

3. Mini-Batch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and performs an update on each batch. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, it achieves high computational efficiency with a less noisy gradient.

Advantages of Mini Batch gradient descent:

o It is easier to fit in allocated memory.


o It is computationally efficient.
o It produces stable gradient descent convergence.
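To make the three variants concrete, here is a minimal NumPy sketch showing one update pass of each scheme on a small linear regression problem (all names, shapes, and values are assumed for the example):

import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error for a linear model y_hat = X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # illustrative targets
w = np.zeros(3)                        # parameters to learn
alpha = 0.1                            # learning rate

# Batch gradient descent: one update per epoch, using the full training set
w = w - alpha * grad(w, X, y)

# Stochastic gradient descent: one update per training example
for i in range(len(y)):
    w = w - alpha * grad(w, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update per small batch (batch size 10 here)
for start in range(0, len(y), 10):
    w = w - alpha * grad(w, X[start:start+10], y[start:start+10])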

Back Propagation Algorithm:-

Backpropagation, or backward propagation of errors, is an algorithm designed to trace errors back from the output nodes to the input nodes. It is an important mathematical tool for improving the accuracy of predictions in data mining and machine learning.

Backpropagation is a widely used algorithm for training feedforward neural networks. It computes the gradient of the loss function with respect to the network weights. It is far more efficient than naively computing the gradient with respect to each weight separately. This efficiency makes it feasible to use gradient methods to train multi-layer networks and update the weights to minimize the loss; variants such as gradient descent or stochastic gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss function
with respect to each weight via the chain rule, computing the gradient layer by layer,
and iterating backward from the last layer to avoid redundant computation of
intermediate terms in the chain rule.
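Concretely, for a weight w_jk connecting hidden unit j to output unit k (in the notation of the training algorithm below), the chain rule factorizes the gradient as

∂L/∂w_jk = (∂L/∂y_k) · (∂y_k/∂y_in,k) · (∂y_in,k/∂w_jk) = (∂L/∂y_k) · f′(y_in,k) · z_j

The first two factors, computed once at the output layer, are reused when the gradients of the hidden-layer weights are computed; this reuse is exactly the redundant computation that the backward pass avoids.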

Features of Backpropagation:

1. It is a gradient descent method, as used in the case of a simple perceptron network with differentiable units.
2. It differs from other networks in the process by which the weights are calculated during the learning period of the network.
3. Training is done in three stages:
 the feed-forward of the input training pattern
 the calculation and backpropagation of the error
 the updating of the weights
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from input vectors. The network compares the generated output to the desired output and computes an error whenever the two do not match. It then adjusts the weights according to this error to move the output toward the desired result.

Backpropagation Algorithm:

Step 1: Inputs X arrive through the preconnected path.
Step 2: The input is modeled using weights W, which are usually chosen randomly at initialization.
Step 3: Calculate the output of each neuron from the input layer, through the hidden layer, to the output layer.
Step 4: Calculate the error in the outputs:
Error = Actual Output − Desired Output
Step 5: From the output layer, go back to the hidden layer and adjust the weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.

Parameters :

 x = input training vector, x = (x_1, x_2, …, x_n)
 t = target vector, t = (t_1, t_2, …, t_n)
 δ_k = error term at output unit k
 δ_j = error term at hidden unit j
 α = learning rate
 v_0j = bias of hidden unit j
Training Algorithm :
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 9.
Step 3: For each training pair, do steps 4 to 8 (feed-forward, backpropagation of error, and weight updates).
Step 4: Each input unit receives the input signal x_i and transmits it to all the hidden units.
Step 5: Each hidden unit z_j (j = 1 to p) sums its weighted input signals to calculate its net input
z_in,j = v_0j + Σ_i x_i v_ij (i = 1 to n)
applies its activation function, z_j = f(z_in,j), and sends this signal to all units in the layer above, i.e., the output units.
Each output unit y_k (k = 1 to m) sums its weighted input signals
y_in,k = w_0k + Σ_j z_j w_jk (j = 1 to p)
and applies its activation function to calculate the output signal
y_k = f(y_in,k)
Backpropagation of error:
Step 6: Each output unit y_k (k = 1 to m) receives a target pattern corresponding to the input pattern; its error term is calculated as
δ_k = (t_k − y_k) f′(y_in,k)
Step 7: Each hidden unit z_j (j = 1 to p) sums its delta inputs from the units in the layer above
δ_in,j = Σ_k δ_k w_jk
and the error information term is calculated as
δ_j = δ_in,j f′(z_in,j)
Updating of weights and biases:
Step 8: Each output unit y_k (k = 1 to m) updates its bias and weights (j = 0 to p). The weight correction term is
Δw_jk = α δ_k z_j
and the bias correction term is
Δw_0k = α δ_k
Therefore w_jk(new) = w_jk(old) + Δw_jk and w_0k(new) = w_0k(old) + Δw_0k.
Each hidden unit z_j (j = 1 to p) updates its bias and weights (i = 0 to n). The weight correction term is
Δv_ij = α δ_j x_i
and the bias correction term is
Δv_0j = α δ_j
Therefore v_ij(new) = v_ij(old) + Δv_ij and v_0j(new) = v_0j(old) + Δv_0j.
Step 9: Test the stopping condition. The stopping condition can be the minimization of the error or a fixed number of epochs.
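The following is a minimal NumPy sketch of one training pair passing through this algorithm, assuming the sigmoid f(z) = 1/(1 + e^(−z)) as the activation (so that f′(z) = f(z)(1 − f(z))); the layer sizes, initial weights, and sample values are all illustrative:

import numpy as np

def f(z):
    # Sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n, p, m = 3, 4, 2                       # input, hidden, output layer sizes
v = rng.normal(scale=0.1, size=(n, p))  # input -> hidden weights v_ij
v0 = np.zeros(p)                        # hidden biases v_0j
w = rng.normal(scale=0.1, size=(p, m))  # hidden -> output weights w_jk
w0 = np.zeros(m)                        # output biases w_0k
alpha = 0.5                             # learning rate

x = np.array([0.1, 0.9, 0.3])           # one training input
t = np.array([1.0, 0.0])                # its target

# Feed-forward (steps 4-5)
z_in = v0 + x @ v
z = f(z_in)
y_in = w0 + z @ w
y = f(y_in)

# Backpropagation of error (steps 6-7)
delta_k = (t - y) * f_prime(y_in)        # output error terms
delta_in_j = w @ delta_k                 # summed delta inputs to hidden units
delta_j = delta_in_j * f_prime(z_in)     # hidden error terms

# Weight and bias updates (step 8)
w += alpha * np.outer(z, delta_k)
w0 += alpha * delta_k
v += alpha * np.outer(x, delta_j)
v0 += alpha * delta_j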
Need for Backpropagation:

Backpropagation, the "backward propagation of errors," is very useful for training neural networks. It is fast, simple, and easy to implement. Apart from the number of inputs, it does not require any parameters to be set, and because no prior knowledge of the network is required, it is a flexible method.

Types of Backpropagation

There are two types of backpropagation networks.


 Static backpropagation: Static backpropagation is a network designed to map static inputs to static outputs. These networks can solve static classification problems such as OCR (Optical Character Recognition).
 Recurrent backpropagation: Recurrent backpropagation is another network, used for fixed-point learning. Activation in recurrent backpropagation is fed forward until it reaches a fixed value. Static backpropagation provides an instant mapping, while recurrent backpropagation does not.

Advantages:

 It is simple, fast, and easy to program.
 Apart from the number of inputs, no other parameters need to be tuned.
 It is flexible and efficient.
 Users do not need to learn any special functions.

Disadvantages:

 It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
 Performance is highly dependent on the input data.
 Training can consume a lot of time.
 A matrix-based approach is preferred over a mini-batch approach.

Empirical Risk Minimization:-

Empirical risk minimization (ERM) is a principle in statistical learning


theory which defines a family of learning algorithms and is used to give theoretical
bounds on their performance. The core idea is that we cannot know exactly how well
an algorithm will work in practice (the true "risk") because we don't know the true
distribution of data that the algorithm will work on, but we can instead measure its
performance on a known set of training data (the "empirical" risk).
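In symbols: given a loss function L and a training set {(x_1, y_1), …, (x_n, y_n)} drawn from the unknown distribution, the empirical risk of a hypothesis h is

R_emp(h) = (1/n) Σ_i L(h(x_i), y_i)

and the ERM principle chooses the hypothesis that minimizes R_emp(h) over the hypothesis class, as a stand-in for minimizing the true (unknown) risk.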
