
DEEP LEARNING AND APPLICATIONS

MR20-1CS0158
UNIT III

Prepared by
Dr. M. Narayanan
Professor
Department of CSE
Malla Reddy University, Hyderabad
Artificial Neural Networks: Introduction, Perceptron Training Rule, Gradient Descent Rule.
Gradient Descent and Backpropagation: Gradient Descent, Stochastic Gradient Descent,
Backpropagation, Some Problems in ANN. Optimization and Regularization: Overfitting and
Capacity, Cross-Validation, Feature Selection, Regularization, Hyperparameters.

Text Book
1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016
What is a Perceptron and how was it developed?
 The idea behind the Perceptron goes back to 1943, when Warren McCulloch and Walter
Pitts proposed the first mathematical model of a neuron. Later, Frank Rosenblatt, an American
psychologist and computer scientist, built the first Perceptron machine while doing research
at the Cornell Aeronautical Laboratory on image recognition.
 During this time the Perceptron machine became popular in the AI community and
was considered a fundamental building block of intelligent systems.
 The perceptron algorithm is based on the concept of a single neuron in the human brain.
A single neuron does a very simple thing: it receives some inputs, and if the inputs are
strong enough, it activates and sends a signal to the next neuron.
 The perceptron was designed to mimic this process, with the input data serving as the
input to the neuron and the weights representing the strength of the connections between
the input neurons and the output neuron.
The Perceptron Algorithm: How does it work?
 The Perceptron is a type of linear classifier (binary classifier), which means it can be used to
classify data that is linearly separable.
 A Perceptron looks similar to Logistic Regression at first glance, but it is different.
While Logistic Regression predicts the probability of a data point falling in a particular class,
a Perceptron only tells whether the data point is in a particular class or not, just like saying
"Yes" or "No". Here is the diagrammatic representation of the perceptron algorithm.
 A Perceptron is a kind of single artificial neuron, which is also known as a Threshold
Logic Unit (TLU).
 As you can see in the above diagram, the Perceptron has some input links X1, X2,
and X3.
 Each input has its own corresponding weight W1, W2, and W3. These weights are
the heart of the Perceptron: they determine the strength of each input signal to it.
 The Perceptron or TLU computes the weighted sum of the inputs (z = X1W1 + X2W2 +
... + XnWn), and this weighted sum is then passed through an
activation function, also known as the step function.
 The activation function determines whether the Perceptron needs to be activated
or not.
 Let's see an example to understand this.
 In the above example, three inputs are given to a Perceptron; the weighted sum of the
inputs is calculated and comes to 0.22. This is passed through an activation function
called the Heaviside activation function.
 But you may have noticed that one of the inputs to the perceptron is zero. This can be
a problem, since it affects the training process:
 if you try to change the corresponding weight, it has no effect, because the input is
still zero. Here we need to add a new term to the equation, known as the bias.
 The bias helps to shift the activation to the left or right during the training of the
Perceptron algorithm.
 So the new equation looks like this:

z = (X · W) + bias
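
As a minimal sketch in Python (the input, weight, and bias values below are illustrative assumptions, not values from the example above):

# A minimal sketch of a single Perceptron (TLU) forward pass.
# The inputs, weights, and bias below are illustrative values.
def perceptron(inputs, weights, bias):
    # Weighted sum: z = X1*W1 + X2*W2 + ... + Xn*Wn + bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step (Heaviside) activation: activate only when z >= 0
    return 1 if z >= 0 else 0

# Three inputs X1..X3 with their corresponding weights W1..W3.
print(perceptron([1.0, 0.0, 0.5], [0.4, 0.3, 0.2], bias=0.1))  # -> 1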
Activation functions
 Activation functions are mathematical functions that can be used in Perceptrons to
determine the output for a given input.
 As we said, an activation function determines whether the neuron (Perceptron) needs to be activated or not.
 Activation functions take in a weighted sum of the input data, called the activation, and
produce an output that can be used for prediction.
 Activation functions are an essential part of Perceptrons and neural networks because
they allow the model to learn and make decisions based on the input data.
 They also help to introduce non-linearity into the model, which is necessary for learning
more complex relationships in the data.
 Some common types of activation functions used in Perceptrons are the Sign function,
Heaviside function, Sigmoid function, ReLU function, etc.
 Here the Heaviside and Sign functions are commonly used with Perceptrons, so let's
understand what these activation functions do.
Heaviside function
 The Heaviside activation function returns 0 when the weighted sum of inputs is less
than zero, and returns 1 when it is greater than or equal to 0.

Sign function
 The Sign function returns 0 if the weighted sum of inputs is 0, and returns +1 or -1
when the weighted sum of inputs is greater than or less than 0, respectively.
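
A minimal sketch of both step functions in Python (the test values are illustrative):

def heaviside(z):
    # Return 0 for z < 0 and 1 for z >= 0.
    return 1 if z >= 0 else 0

def sign(z):
    # Return -1 for z < 0, 0 for z == 0, and +1 for z > 0.
    if z > 0:
        return 1
    if z < 0:
        return -1
    return 0

print(heaviside(0.22), sign(-0.5), sign(0.0))  # -> 1 -1 0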
Advantages of Multi-Layer Perceptron:
 A multi-layered perceptron model can be used to solve complex non-linear problems.
 It works well with both small and large input data.
 It gives quick predictions after training.
 It achieves a similar accuracy ratio with large as well as small data.
Disadvantages of Multi-Layer Perceptron:
 In a multi-layer perceptron, computations are difficult and time-consuming.
 In a multi-layer perceptron, it is difficult to determine how much each independent
variable affects the dependent variable.
 The functioning of the model depends on the quality of the training.
Question Bank

1. What is a Perceptron and how was it developed?
2. Explain the Perceptron algorithm. How does it work?
3. Write a Python program to convert a video into frames using hyperparameter tuning.
4. Explain Gradient Descent and Backpropagation with a suitable example.
5. Explain the Stochastic Gradient Descent algorithm.
Gradient Descent
 Gradient Descent is one of the most commonly used optimization algorithms for training
machine learning models, by minimizing the error between actual and
expected results.
 Gradient descent is also used to train Neural Networks.
 In mathematical terminology, an optimization algorithm performs the task of
minimizing/maximizing an objective function f(x) parameterized by x.
 Similarly, in machine learning, optimization is the task of minimizing the cost function
parameterized by the model's parameters.
 The main objective of gradient descent is to minimize a convex function through iterative
parameter updates.
 Once machine learning models are optimized, they can be used as powerful
tools for Artificial Intelligence and various computer science applications.
What is Gradient Descent or Steepest Descent?
 Gradient descent was initially proposed by Augustin-Louis Cauchy in the mid-19th
century (1847).
 Gradient Descent is one of the most commonly used iterative optimization
algorithms in machine learning, used to train machine learning and deep learning models.
 It helps in finding the local minimum of a function.
 If we move towards the negative gradient, i.e. away from the gradient of the function at the
current point, we will reach the local minimum of that function. This procedure is known as
Gradient Descent, which is also called steepest descent.
 Conversely, whenever we move towards the positive gradient, i.e. towards the gradient of the
function at the current point, we will reach the local maximum of that function; that
procedure is known as Gradient Ascent.
 The main objective of using a gradient descent algorithm is to minimize the cost function
through iteration. To achieve this goal, it performs two steps iteratively:
 Calculate the first-order derivative of the function to compute the gradient, or slope,
of that function.
 Move away from the direction of the gradient, i.e. step from the current point in the
opposite direction of the slope, by alpha times the gradient, where alpha is the Learning Rate.
 The learning rate is a tuning parameter in the optimization process which decides the length
of the steps.
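
A minimal sketch of these two steps in Python, assuming a toy one-dimensional objective:

# Gradient descent on the toy convex function f(x) = (x - 3)**2,
# whose minimum is at x = 3.
def grad(x):
    return 2 * (x - 3)        # step 1: first-order derivative of f

x = 0.0                       # arbitrary starting point
alpha = 0.1                   # learning rate
for _ in range(100):
    x = x - alpha * grad(x)   # step 2: move against the gradient

print(round(x, 4))            # -> approximately 3.0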
What is a Cost Function?
 The cost function is defined as the measurement of the difference, or error, between actual
values and predicted values at the current position, expressed as a single real
number.
 It helps to improve machine learning efficiency by providing feedback to the
model so that it can minimize the error and find the local or global minimum.
 The algorithm continuously iterates along the direction of the negative gradient until the cost
function approaches its minimum (ideally zero).
 At this point, the model stops learning further. Although the cost function and the loss
function are often treated as synonymous, there is a minor difference between
them.
 The slight difference between the loss function and the cost function concerns the scope
of the error during the training of machine learning models: the loss function refers to the error of one
training example, while the cost function calculates the average error across an entire
training set.
 The cost function is calculated after making a hypothesis with initial parameters; these
parameters are then modified using the gradient descent algorithm over known data to reduce
the cost function.
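
As an illustration, one common choice of cost function is the mean squared error (MSE); the toy values below are assumptions:

# Mean squared error: the average error over the whole training set,
# as opposed to a loss computed on a single example.
def mse_cost(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values for illustration.
print(mse_cost([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # -> about 0.02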
How does Gradient Descent work?
 Before looking at the working principle of gradient descent, we should know some basic
concepts for finding the slope of a line from linear regression. The equation for simple
linear regression is given as:

Y = mX + c

where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
 The starting point (shown in the above fig.) is just an arbitrary point used to evaluate
the performance.
 At this starting point, we derive the first derivative, or slope, and then use a tangent line
to calculate the steepness of this slope. This slope informs the updates to the
parameters (weights and bias).
 The slope is steep at the starting point or arbitrary point, but as new
parameters are generated, the steepness gradually reduces until the algorithm
approaches the lowest point, which is called the point of convergence.
 The main objective of gradient descent is to minimize the cost function, i.e. the error
between expected and actual values.
 To minimize the cost function, two factors are required:
Direction & Learning Rate
 These two factors determine the partial derivative calculations of future iterations
and guide the algorithm to the point of convergence, i.e. the local or global minimum.
Let's discuss the learning rate factor in brief.
Learning Rate:
 The learning rate is defined as the step size taken to reach the minimum or lowest point.
 It is typically a small value that is evaluated and updated based on the behavior of the
cost function.
 If the learning rate is high, it results in larger steps, but it also carries the risk of
overshooting the minimum.
 A low learning rate, on the other hand, gives small step sizes, which compromises
overall efficiency but gives the advantage of more precision.
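
A minimal sketch of this trade-off, reusing the toy objective f(x) = (x - 3)**2 from the earlier sketch (the three learning rates are illustrative assumptions):

# Effect of the learning rate on gradient descent for f(x) = (x - 3)**2.
def run(alpha, steps=20):
    x = 0.0                           # arbitrary starting point
    for _ in range(steps):
        x = x - alpha * 2 * (x - 3)   # gradient of f is 2*(x - 3)
    return x

print(run(0.05))   # low rate: small, precise steps, still short of 3
print(run(0.4))    # moderate rate: converges quickly to about 3
print(run(1.05))   # too high: overshoots the minimum and diverges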
Types of Gradient Descent
 Based on how much training data is used for each parameter update, the gradient descent
learning algorithm can be divided into batch gradient descent, stochastic gradient descent,
and mini-batch gradient descent.
 Let's understand these different types of gradient descent.
 Batch Gradient Descent:
 Batch gradient descent (BGD) computes the error for each point in the training set and
updates the model only after evaluating all training examples.
 One full pass over the training set is known as a training epoch. In simple words, it is a
greedy approach where we have to sum over all examples for each update.
Advantages of Batch gradient descent:
 It produces less noise in comparison to the other types of gradient descent.
 It produces stable gradient descent convergence.
 It is computationally efficient, as all resources are used to process all training samples
together.
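
A minimal batch gradient descent sketch for the simple linear regression Y = mX + c from above. The toy data and learning rate are illustrative assumptions; note that each update sums over every training example:

# Batch gradient descent fitting Y = m*X + c by minimizing MSE.
# Toy data generated from the line y = 2x + 1.
X = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 3.0, 5.0, 7.0]

m, c = 0.0, 0.0      # arbitrary starting point
alpha = 0.05         # learning rate
n = len(X)

for _ in range(2000):
    # Partial derivatives of MSE with respect to m and c,
    # summed over the entire training set (one epoch per update).
    grad_m = (-2 / n) * sum(x * (y - (m * x + c)) for x, y in zip(X, Y))
    grad_c = (-2 / n) * sum(y - (m * x + c) for x, y in zip(X, Y))
    m -= alpha * grad_m
    c -= alpha * grad_c

print(round(m, 3), round(c, 3))  # -> approximately 2.0 and 1.0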
Stochastic gradient descent
 Stochastic gradient descent (SGD) is a type of gradient descent that processes one training
example per iteration.
 In other words, it updates the model parameters for each training example within the
dataset, one example at a time.
 As it requires only one training example at a time, it is easier to fit in the allocated
memory.
 However, it loses some computational efficiency in comparison to batch gradient
descent, as its frequent updates require more computation overall.
 Further, due to the frequent updates, its gradient is also treated as a noisy gradient.
 However, this noise can sometimes be helpful in finding the global minimum and
escaping local minima.
Advantages of Stochastic gradient descent:
 In stochastic gradient descent (SGD), learning happens on every example, which gives it
a few advantages over the other types of gradient descent.
 It is easier to fit in the available memory.
 Each update is much cheaper to compute than in batch gradient descent.
 It is more efficient for large datasets.
Here's a simplified step-by-step explanation of Stochastic Gradient Descent (a minimal code sketch follows this list):

1. Initialization: Initialize the model parameters randomly.
2. Data Shuffling: Shuffle the training dataset to ensure that the optimization process
encounters a diverse range of data points in each iteration.
3. Iteration: For each iteration, randomly select one training example (or a small batch,
in the mini-batch variant) from the shuffled dataset.
4. Compute Gradient: Compute the gradient of the loss function with respect to the model
parameters using the selected example.
5. Update Parameters: Update the model parameters using the computed gradient. The
update is performed in the opposite direction of the gradient to minimize the loss.
6. Repeat: Repeat steps 3-5 until a predefined number of iterations or a convergence
criterion is met.
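
The promised sketch, using the plain one-example-at-a-time variant; the toy data, learning rate, and epoch count are illustrative assumptions:

# Stochastic gradient descent fitting Y = m*X + c, following the
# numbered steps above. Toy data from the line y = 2x + 1.
import random

X = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 3.0, 5.0, 7.0]

m, c = 0.0, 0.0                # step 1: arbitrary initialization
alpha = 0.05                   # learning rate
data = list(zip(X, Y))

for epoch in range(500):
    random.shuffle(data)       # step 2: shuffle the dataset
    for x, y in data:          # step 3: one example per update
        err = (m * x + c) - y
        grad_m = 2 * err * x   # step 4: per-example gradient
        grad_c = 2 * err
        m -= alpha * grad_m    # step 5: move against the gradient
        c -= alpha * grad_c
                               # step 6: repeat for every epoch

print(round(m, 2), round(c, 2))  # -> approximately 2.0 and 1.0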
Learning gives Creativity,
Creativity leads to Thinking,
Thinking provides Knowledge,
and Knowledge makes you great.
- A. P. J. Abdul Kalam
