
Unit-1 and 2 Deep Learning

The document provides an overview of deep learning and neural networks, detailing their architecture, functioning, and applications. It covers key concepts such as the perceptron algorithm, forward and backpropagation, and various types of neural networks like CNNs and RNNs. Additionally, it discusses the advantages and disadvantages of neural networks and their significance in fields like image recognition and natural language processing.

Uploaded by

wafiyasadiq5
Copyright
© © All Rights Reserved

DEEP LEARNING

UNIT-1
Basics of neural networks - Basic concept of Neurons – Perceptron Algorithm –
Feed Forward and Back Propagation Networks.

What is a Neural Network?

Neural networks are machine learning models that mimic the complex functions
of the human brain. These models consist of interconnected nodes or neurons
that process data, learn patterns, and enable tasks such as pattern recognition
and decision-making.
In this article, we will explore the fundamentals of neural networks, their
architecture, how they work, and their applications in various fields.
Understanding neural networks is essential for anyone interested in the
advancements of artificial intelligence.

Understanding Neural Networks in Deep Learning


Neural networks are capable of learning and identifying patterns directly from
data without pre-defined rules. These networks are built from several key
components:
1. Neurons: The basic units that receive inputs. Each neuron is governed by
a threshold and an activation function.
2. Connections: Links between neurons that carry information, regulated by
weights and biases.
3. Weights and Biases: These parameters determine the strength and
influence of connections.
4. Propagation Functions: Mechanisms that help process and transfer data
across layers of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to
improve accuracy.
Importance of Neural Networks
Neural networks are pivotal in identifying complex patterns, solving intricate
challenges, and adapting to dynamic environments. Their ability to learn from
vast amounts of data is transformative, impacting technologies like natural
language processing, self-driving vehicles, and automated decision-making.
Neural networks streamline processes, increase efficiency, and support
decision-making across various industries. As a backbone of artificial
intelligence, they continue to drive innovation, shaping the future of technology.
Evolution of Neural Networks
Neural networks have undergone significant evolution since their inception in
the mid-20th century. Here’s a concise timeline of the major developments in
the field:
• 1940s-1950s: The concept of neural networks began with McCulloch and
Pitts’ introduction of the first mathematical model for artificial neurons.
However, the lack of computational power during that time posed
significant challenges to further advancements.
• 1960s-1970s: Frank Rosenblatt worked on perceptrons. Perceptrons are
simple single-layer networks that can solve linearly separable problems
but cannot perform more complex tasks.
• 1980s: The development of backpropagation by Rumelhart, Hinton, and
Williams revolutionized neural networks by enabling the training of
multi-layer networks. This period also saw the rise of connectionism,
emphasizing learning through interconnected nodes.
• 1990s: Neural networks experienced a surge in popularity with
applications across image recognition, finance, and more. However, this
growth was tempered by a period known as the “AI winter,” during
which high computational costs and unrealistic expectations dampened
progress.
• 2000s: A resurgence was triggered by the availability of larger datasets,
advances in computational power, and innovative network architectures.
Deep learning, utilizing multiple layers, proved highly effective across
various domains.
• 2010s-Present: The landscape of machine learning has been dominated by
deep learning models such as convolutional neural networks
(CNNs) and recurrent neural networks (RNNs).
Types of Neural Networks
Commonly used types of neural networks include the following.
• Feedforward Networks: A feedforward neural network is a simple
artificial neural network architecture in which data moves from input to
output in a single direction.
• Multilayer Perceptron (MLP): MLP is a type of feedforward neural
network with three or more layers, including an input layer, one or more
hidden layers, and an output layer. It uses nonlinear activation functions.
• Convolutional Neural Network (CNN): A Convolutional Neural Network
(CNN) is a specialized artificial neural network designed for image
processing. It employs convolutional layers to automatically learn
hierarchical features from input images, enabling effective image
recognition and classification.
• Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is a
type of artificial neural network intended for sequential data processing.
Because it uses feedback loops, which let information persist within the
network, it is suited to applications where contextual dependencies are
critical, such as time series prediction and natural language processing.
• Long Short-Term Memory (LSTM): LSTM is a type of RNN that is
designed to overcome the vanishing gradient problem in training RNNs.
It uses memory cells and gates to selectively read, write, and erase
information.
Advantages of Neural Networks
Neural networks are widely used in many different applications because of their
many benefits:
• Adaptability: Neural networks are useful for activities where the link
between inputs and outputs is complex or not well defined because they
can adapt to new situations and learn from data.
• Pattern Recognition: Their proficiency in pattern recognition makes
them effective in tasks such as audio and image identification, natural
language processing, and other problems involving intricate data patterns.
• Parallel Processing: Because neural networks are capable of parallel
processing by nature, they can process numerous jobs at once, which
speeds up and improves the efficiency of computations.
• Non-Linearity: Neural networks are able to model and comprehend
complicated relationships in data by virtue of the non-linear activation
functions found in neurons, which overcome the drawbacks of linear
models.
Disadvantages of Neural Networks
Neural networks, while powerful, are not without drawbacks and difficulties:
• Computational Intensity: Training large neural networks can be
laborious and computationally demanding, requiring a lot of
computing power.
• Black box Nature: As “black box” models, neural networks pose a
problem in important applications since it is difficult to understand how
they make decisions.
• Overfitting: Overfitting is a phenomenon in which neural networks
commit training material to memory rather than identifying patterns in the
data. Although regularization approaches help to alleviate this, the
problem still exists.
• Need for Large datasets: For efficient training, neural networks frequently
need sizable, labeled datasets; otherwise, their performance may suffer
from incomplete or skewed data.
Applications of Neural Networks
Neural networks have numerous applications across various fields:
1. Image and Video Recognition: CNNs are extensively used in applications
such as facial recognition, autonomous driving, and medical image
analysis.
2. Natural Language Processing (NLP): RNNs and transformers power
language translation, chatbots, and sentiment analysis.
3. Finance: Predicting stock prices, fraud detection, and risk management.
4. Healthcare: Neural networks assist in diagnosing diseases, analyzing
medical images, and personalizing treatment plans.
5. Gaming and Autonomous Systems: Neural networks enable real-time
decision-making, enhancing user experience in video games and enabling
autonomous systems like self-driving cars.
Layers in Neural Network Architecture
1. Input Layer: This is where the network receives its input data. Each input
neuron in the layer corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy
lifting. A neural network can have one or multiple hidden layers. Each
layer consists of units (neurons) that transform the inputs into something
that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The
format of these outputs varies depending on the specific task (e.g.,
classification, regression).

What is Perceptron?
A perceptron is a type of neural network that performs binary classification:
it maps input features to an output decision, usually classifying data into one
of two categories, such as 0 or 1.
A perceptron consists of a single layer of input nodes that are fully connected
to a layer of output nodes. It is particularly good at learning linearly separable
patterns. It utilizes a variation of artificial neurons called Threshold Logic Units
(TLU), which were first introduced by Warren McCulloch and Walter Pitts in the 1940s.
This foundational model has played a crucial role in the development of more
advanced neural networks and machine learning algorithms.
Types of Perceptron
1. Single-Layer Perceptron: This type of perceptron is limited to learning
linearly separable patterns. It is effective for tasks where the data can be
divided into distinct categories by a straight line. While powerful in
its simplicity, it struggles with more complex problems where the
relationship between inputs and outputs is non-linear.
2. Multi-Layer Perceptron: This type possesses enhanced processing
capabilities because it consists of two or more layers, making it adept at
handling more complex patterns and relationships within the data.
Advantages:
• A multi-layered perceptron model can solve complex non-linear
problems.
• It works well with both small and large input data.
• Helps us to obtain quick predictions after the training.
• Helps us obtain the same accuracy ratio with big and small data.
Disadvantages:
• In multi-layered perceptron model, computations are time-consuming and
complex.
• It is tough to predict how much each independent variable affects the
dependent variable.
• The model functioning depends on the quality of training.
Perceptron Learning Rule
Perceptron Learning Rule states that the algorithm would automatically learn
the optimal weight coefficients. The input features are then multiplied with
these weights to determine if a neuron fires or not.

The Perceptron receives multiple input signals. If the sum of the input
signals exceeds a certain threshold, it outputs a signal; otherwise it does not
return an output. In the context of supervised learning and classification, this
can then be used to predict the class of a sample.
Next up, let us focus on the perceptron function.
Perceptron Function
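A perceptron is a function that maps its input “x”, multiplied by the
learned weight coefficients, to an output value “f(x)”:

$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$, where $w \cdot x = \sum_{i=1}^{m} w_i x_i$

In the equation given above:
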
• “w” = vector of real-valued weights
• “b” = bias (an element that adjusts the boundary away from origin
without any dependence on the input value)
• “x” = vector of input x values

• “m” = number of inputs to the Perceptron


The output can be represented as “1” or “0.” It can also be represented as “1” or
“-1” depending on which activation function is used.
Let us learn the inputs of a perceptron in the next section.
Inputs of a Perceptron
A Perceptron accepts inputs, moderates them with certain weight values, then
applies the transformation function to output the final result. The image below
shows a Perceptron with a Boolean output.

The Boolean output is based on inputs such as salaried, married, age, past credit
profile, etc. It has only two values: Yes and No, or True and False. The
summation function “∑” multiplies each input “x” by its weight “w” and then
adds the products up as follows:

$\sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

Output of Perceptron
Perceptron with a Boolean output:
Inputs: x1…xn
Output: o(x1…xn)
Weights: wi => contribution of input xi to the Perceptron output;
w0 => bias or threshold
If ∑w.x > 0, the output is +1; otherwise it is -1. The neuron gets triggered
only when the weighted input reaches a certain threshold value.
An output of +1 specifies that the neuron is triggered. An output of -1 specifies
that the neuron did not get triggered.
“sgn” stands for the sign function, with output +1 or -1.
Error in Perceptron
In the Perceptron Learning Rule, the predicted output is compared with the
known output. If it does not match, the error is propagated backward to allow
weight adjustment to happen.
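The learning rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the learning rate of 0.1, the epoch count, and the choice of the logical AND problem are all assumptions made for the example. It uses the +1/-1 output convention with the bias playing the role of w0.

```python
def predict(weights, bias, x):
    """Return +1 if the weighted sum plus bias is positive, else -1 (the sgn rule)."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else -1


def train_perceptron(samples, labels, learning_rate=0.1, epochs=20):
    """Adjust the weights only when a prediction disagrees with the known label."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # 0 if correct, +2 or -2 if wrong
            if error != 0:
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias


# Logical AND with -1/+1 labels: a linearly separable problem (illustrative data).
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, -1, -1, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```

Because AND is linearly separable, the updates stop once every sample is classified correctly. A problem such as XOR would never reach that point with a single-layer perceptron.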
Working of Neural Networks
Forward Propagation
When data is input into the network, it passes through the network in the
forward direction, from the input layer through the hidden layers to the output
layer. This process is known as forward propagation. Here’s what happens
during this phase:
1. Linear Transformation: Each neuron in a layer receives inputs, which are
multiplied by the weights associated with the connections. These products
are summed together, and a bias is added to the sum. This can be
represented mathematically as: $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$,
where $w$ represents the weights, $x$ represents the
inputs, and $b$ is the bias.
2. Activation: The result of the linear transformation (denoted as $z$) is then
passed through an activation function. The activation function is crucial
because it introduces non-linearity into the system, enabling the network
to learn more complex patterns. Popular activation functions include
ReLU, sigmoid, and tanh.
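The two steps above can be sketched as a small Python function. The layer sizes, weight values, and the helper name `dense_forward` are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense_forward(inputs, weights, biases, activation):
    """One layer: each neuron computes z = w . x + b, then applies the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias  # linear transformation
        outputs.append(activation(z))                                  # activation
    return outputs


x = [1.0, 2.0]
# Hidden layer with two ReLU neurons, then an output layer with one sigmoid neuron.
hidden = dense_forward(x, weights=[[0.5, -0.25], [-1.0, 0.5]], biases=[0.0, 0.5], activation=relu)
output = dense_forward(hidden, weights=[[1.0, -1.0]], biases=[0.0], activation=sigmoid)
print(hidden, output)
```

Here the first hidden neuron receives z = 0 and the second z = 0.5, so the hidden activations are 0.0 and 0.5, and the output neuron applies the sigmoid to -0.5.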
Backpropagation
After forward propagation, the network evaluates its performance using a loss
function, which measures the difference between the actual output and the
predicted output. The goal of training is to minimize this loss. This is where
backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a
measure of error in the predictions. The loss function could vary;
common choices are mean squared error for regression tasks or cross-
entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss
function with respect to each weight and bias in the network. This
involves applying the chain rule of calculus to find out how much each
part of the output error can be attributed to each weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases
are updated using an optimization algorithm like stochastic gradient
descent (SGD). The weights are adjusted in the opposite direction of the
gradient to minimize the loss. The size of the step taken in each update is
determined by the learning rate.
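The three steps above can be traced on a deliberately tiny network: one input, one sigmoid hidden neuron, and one linear output neuron trained with squared error. The network shape, starting parameters, learning rate, and the name `train_step` are assumptions made for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x, target, learning_rate):
    """One forward pass, one backward pass, one update. Returns (new_params, loss)."""
    w1, b1, w2, b2 = params
    # Forward propagation.
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    # 1. Loss calculation (squared error for a single example).
    loss = (y - target) ** 2
    # 2. Gradient calculation: chain rule from the loss back to each parameter.
    d_y = 2.0 * (y - target)
    d_w2, d_b2 = d_y * h, d_y
    d_z1 = d_y * w2 * h * (1.0 - h)  # sigmoid'(z) = h * (1 - h)
    d_w1, d_b1 = d_z1 * x, d_z1
    # 3. Weight update: step against the gradient, scaled by the learning rate.
    grads = (d_w1, d_b1, d_w2, d_b2)
    new_params = tuple(p - learning_rate * g for p, g in zip(params, grads))
    return new_params, loss


params = (0.5, 0.0, -0.5, 0.0)
losses = []
for _ in range(50):
    params, loss = train_step(params, x=1.0, target=1.0, learning_rate=0.1)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss recorded at the last step should be far smaller than at the first, which is the behavior the weight update is meant to produce.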
UNIT-II INTRODUCTION TO DEEP LEARNING
Introduction to deep learning - Feed Forward Neural Networks – Gradient
Descent – Back Propagation Algorithm – Vanishing Gradient problem –
Mitigation – ReLU – Heuristics for Avoiding Bad Local Minima – Heuristics for
Faster Training – Nesterov's Accelerated Gradient Descent – Regularization –
Dropout

What is Deep Learning?


Deep learning is the branch of machine learning that is based on artificial
neural network architecture. An artificial neural network, or ANN, uses layers
of interconnected nodes called neurons that work together to process and learn
from the input data.
In a fully connected Deep neural network, there is an input layer and one or
more hidden layers connected one after the other. Each neuron receives input
from the previous layer neurons or the input layer. The output of one neuron
becomes the input to other neurons in the next layer of the network, and this
process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of
nonlinear transformations, allowing the network to learn complex
representations of the input data.
Scope of Deep Learning
Today, deep learning has become one of the most popular and visible areas
of machine learning, due to its success in a variety of applications, such as
computer vision, natural language processing, and reinforcement learning.
Deep learning can be used for supervised, unsupervised, and reinforcement
machine learning, and it processes each in a different way.
• Supervised Machine Learning: Supervised machine learning is
the machine learning technique in which the neural network learns to
make predictions or classify data based on labeled datasets. Here we
provide both the input features and the target variables. The neural
network learns to make predictions based on the cost or error that comes
from the difference between the predicted and the actual target; the
process of propagating this error back through the network to adjust the
weights is known as backpropagation. Deep learning algorithms like
convolutional neural networks and recurrent neural networks are used for
many supervised tasks such as image classification and recognition,
sentiment analysis, and language translation.
• Unsupervised Machine Learning: Unsupervised machine learning is
the machine learning technique in which the neural network learns to
discover patterns or to cluster the dataset based on unlabeled data.
Here there are no target variables, so the machine has to determine the
hidden patterns or relationships within the dataset by itself. Deep
learning algorithms like autoencoders and generative models are used for
unsupervised tasks like clustering, dimensionality reduction, and anomaly
detection.
• Reinforcement Machine Learning: Reinforcement machine
learning is the machine learning technique in which an agent learns to
make decisions in an environment to maximize a reward signal. The
agent interacts with the environment by taking actions and observing the
resulting rewards. Deep learning can be used to learn policies, or sets of
actions, that maximize the cumulative reward over time. Deep
reinforcement learning algorithms like Deep Q-Networks and Deep
Deterministic Policy Gradient (DDPG) are used for tasks such as
robotics and game playing.
Advantages of Deep Learning:
1. High accuracy: Deep Learning algorithms can achieve state-of-the-art
performance in various tasks, such as image recognition and natural
language processing.
2. Automated feature engineering: Deep Learning algorithms can
automatically discover and learn relevant features from data without the
need for manual feature engineering.
3. Scalability: Deep Learning models can scale to handle large and complex
datasets, and can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of
tasks and can handle various types of data, such as images, text, and
speech.
5. Continual improvement: Deep Learning models can continually
improve their performance as more data becomes available.
Disadvantages of Deep Learning:
1. High computational requirements: Deep Learning AI models require
large amounts of data and computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often
require a large amount of labeled data for training, which can be
expensive and time-consuming to acquire.
3. Interpretability: Deep Learning models can be challenging to interpret,
making it difficult to understand how they make decisions.
4. Overfitting: Deep Learning models can sometimes overfit to the training
data, resulting in poor performance on new and unseen data.
5. Black-box nature: Deep Learning models are often treated as black
boxes, making it difficult to understand how they work and how they
arrived at their predictions.
What is a Feedforward Neural Network?
A Feedforward Neural Network (FNN) is a type of artificial neural network
where connections between the nodes do not form cycles. This characteristic
differentiates it from recurrent neural networks (RNNs). The network consists
of an input layer, one or more hidden layers, and an output layer. Information
flows in one direction—from input to output—hence the name "feedforward."
Structure of a Feedforward Neural Network
1. Input Layer: The input layer consists of neurons that receive the input
data. Each neuron in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input
and output layers. These layers are responsible for learning the complex
patterns in the data. Each neuron in a hidden layer applies a weighted sum
of inputs followed by a non-linear activation function.
3. Output Layer: The output layer provides the final output of the network.
The number of neurons in this layer corresponds to the number of classes
in a classification problem or the number of outputs in a regression
problem.
Each connection between neurons in these layers has an associated weight that
is adjusted during the training process to minimize the error in predictions.
Feed Forward Neural Network
Activation Functions
Activation functions introduce non-linearity into the network, enabling it to
learn and model complex data patterns. Common activation functions include:
• Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• ReLU (Rectified Linear Unit): $\mathrm{ReLU}(x) = \max(0, x)$
• Leaky ReLU: $\mathrm{LeakyReLU}(x) = \max(0.01x, x)$
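As a quick illustration, the four functions listed above can be written directly from their formulas. The function names and the sample inputs are choices made for this sketch; the 0.01 slope for Leaky ReLU is the value used in the formula above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return max(slope * x, x)


# Compare the functions on a negative, a zero, and a positive input.
for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
```

Note how sigmoid and tanh stay bounded, ReLU zeroes out the negative input, and Leaky ReLU lets a small fraction of it through.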
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the
neurons to minimize the error between the predicted output and the actual
output. This process is typically performed using backpropagation and gradient
descent.
1. Forward Propagation: During forward propagation, the input data
passes through the network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function
such as Mean Squared Error (MSE) for regression tasks or Cross-Entropy
Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back
through the network to update the weights. The gradient of the loss
function with respect to each weight is calculated, and the weights are
adjusted using gradient descent.

Forward Propagation

Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss
function by iteratively updating the weights in the direction of the negative
gradient. Common variants of gradient descent include:
• Batch Gradient Descent: Updates weights after computing the gradient
over the entire dataset.
• Stochastic Gradient Descent (SGD): Updates weights for each training
example individually.
• Mini-batch Gradient Descent: Updates weights after computing the
gradient over a small batch of training examples.
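The three variants differ only in how many examples contribute to each update, which the following sketch makes explicit through a single `batch_size` argument. The one-weight model y_hat = w * x, the noise-free data on the line y = 3x, and the learning rate and epoch count are all assumptions chosen to keep the example small.

```python
import random


def mse_gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def gradient_descent(data, batch_size, learning_rate=0.05, epochs=100, seed=0):
    """batch_size == len(data): batch GD; 1: SGD; anything in between: mini-batch GD."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= learning_rate * mse_gradient(w, batch)  # step against the gradient
    return w


data = [(x / 4.0, 3.0 * (x / 4.0)) for x in range(1, 9)]  # eight points on y = 3x
for name, size in (("batch", len(data)), ("mini-batch", 4), ("sgd", 1)):
    print(name, round(gradient_descent(data, size), 4))
```

All three runs should approach w = 3. With real, noisy data the smaller batch sizes would give noisier but more frequent updates.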
Evaluation of Feedforward neural network
Evaluating the performance of the trained model involves several metrics:
• Accuracy: The proportion of correctly classified instances out of the total
instances.
• Precision: The ratio of true positive predictions to the total predicted
positives.
• Recall: The ratio of true positive predictions to the actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a
balance between the two.
• Confusion Matrix: A table used to describe the performance of a
classification model, showing the true positives, true negatives, false
positives, and false negatives.
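For a binary classifier, the metrics above all follow from the four confusion-matrix counts. The sketch below assumes labels encoded as 0 and 1; the function names and the example labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

In this example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.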

What is Vanishing Gradient?


The vanishing gradient problem is a challenge that emerges during
backpropagation when the derivatives or slopes of the activation functions
become progressively smaller as we move backward through the layers of a
neural network. This phenomenon is particularly prominent in deep networks
with many layers, hindering the effective training of the model. When the weight
updates become extremely small, even exponentially small, training time is
significantly prolonged, and in the worst-case scenario the training process
halts altogether.
Why the Problem Occurs?
During backpropagation, the gradients shrink significantly as they propagate
back through the layers of the network. This means that as they travel from the
output layer back toward the input layer, the gradients become progressively
smaller. As a result, the weights of the initial layers, which receive
these small gradients, are updated only slightly or not at all at each iteration
of the optimization process.
The vanishing gradient problem is particularly associated with the sigmoid
and hyperbolic tangent (tanh) activation functions because their derivatives fall
within the range of 0 to 0.25 and 0 to 1, respectively. Consequently, the weight
updates become very small, causing the updated weights to closely resemble
the original ones. This persistence of small updates contributes to the vanishing
gradient issue.
The sigmoid and tanh functions squash their inputs into the ranges (0, 1) and
(-1, 1), so they saturate near 0 or 1 for sigmoid and near -1 or 1 for tanh. In
these saturated regions, which are reached when inputs are very negative or very
positive, the derivatives are very close to zero. When the inputs fall in
saturated regions, the gradients approach zero, resulting in little update to the
weights of the previous layer. In simple networks with a few layers this does not
pose much of a problem, but as more layers are added, these small gradients,
which multiply between layers, decay significantly. Consequently the first layers
learn very slowly, which hinders overall model performance and can lead to
convergence failure.
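The multiplicative decay described above is easy to see numerically. The sketch below assumes the best case for sigmoid, where every layer sits at a pre-activation of 0 (derivative 0.25) and every weight equals 1; the function name `gradient_scale` and these default values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0


def gradient_scale(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the (weight * sigmoid'(z)) factors picked up across num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_derivative(pre_activation)
    return scale


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this favorable setting the factor after 10 layers is 0.25 to the power 10, which is below one millionth. In saturated regions the per-layer factor is smaller still.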
How can we identify?
Identifying the vanishing gradient problem typically involves monitoring the
training dynamics of a deep neural network.
• One key indicator is observing model weights converging to 0 or
stagnation in the improvement of the model's performance metrics over
training epochs.
• During training, if the loss function fails to decrease significantly, or if
there is erratic behavior in the learning curves, it suggests that the
gradients may be vanishing.
• Additionally, examining the gradients themselves during backpropagation
can provide insights. Visualization techniques, such as gradient
histograms or norms, can aid in assessing the distribution of gradients
throughout the network.
How can we solve the issue?
• Batch Normalization: Batch normalization normalizes the inputs of
each layer, reducing internal covariate shift. This can help stabilize and
accelerate the training process, allowing for more consistent gradient
flow.
• Activation function: An activation function like the Rectified Linear Unit
(ReLU) can be used. With ReLU, the gradient is 0 for negative and zero
input, and it is 1 for positive input, which helps alleviate the vanishing
gradient issue. In other words, ReLU replaces negative input values
with 0 and passes positive input values through unchanged.
• Skip Connections and Residual Networks (ResNets): Skip
connections, as seen in ResNets, allow the gradient to bypass certain
layers during backpropagation. This facilitates the flow of information
through the network, preventing gradients from vanishing.
• Long Short-Term Memory Networks (LSTMs) and Gated Recurrent
Units (GRUs): In the context of recurrent neural networks (RNNs),
architectures like LSTMs and GRUs are designed to address the
vanishing gradient problem in sequences by incorporating gating
mechanisms.
• Gradient Clipping: Gradient clipping imposes a threshold on the
magnitude of the gradients during backpropagation. This prevents them
from exploding, which can also hinder learning; clipping by itself does
not enlarge gradients that have already become too small.
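One common form of gradient clipping rescales the whole gradient vector when its norm exceeds a threshold. The sketch below shows that variant under the assumption that gradients are held in a plain Python list; the name `clip_by_norm` and the threshold of 1.0 are illustrative.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # already within the threshold: leave untouched
    scale = max_norm / norm
    return [g * scale for g in gradients]


large = clip_by_norm([3.0, 4.0], max_norm=1.0)      # norm 5 is scaled down to norm 1
small = clip_by_norm([0.003, 0.004], max_norm=1.0)  # small gradients pass through unchanged
print(large, small)
```

Rescaling keeps the direction of the update and limits only its size. Gradients below the threshold are not modified.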
What is the Rectified Linear Unit (ReLU) Activation Function?
The Rectified Linear Unit (ReLU) is one of the most popular activation
functions used in neural networks, especially in deep learning models. It has
become the default choice in many architectures due to its simplicity and
efficiency. The ReLU function is a piecewise linear function that outputs the
input directly if it is positive; otherwise, it outputs zero.
In simpler terms, ReLU allows positive values to pass through unchanged while
setting all negative values to zero. This helps the neural network maintain the
necessary complexity to learn patterns while avoiding some of the pitfalls
associated with other activation functions, like the vanishing gradient problem.
Mathematical Formula of ReLU Activation Function
The ReLU function can be described mathematically as follows:
$f(x) = \max(0, x)$
Where:
• x is the input to the neuron.
• The function returns x if x is greater than 0.
• If x is less than or equal to 0, the function returns 0.
In mathematical terms, the ReLU function can be written as:
$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$
This simplicity is what makes ReLU so effective in training deep neural
networks, as it helps to maintain non-linearity without complicated
transformations, allowing models to learn more efficiently.
Drawbacks of ReLU
While ReLU has many advantages, it also comes with its own set of challenges:
1. Dying ReLU Problem: One of the most significant drawbacks of ReLU
is the "dying ReLU" problem, where neurons can sometimes become
inactive and only output 0. This happens when large negative inputs
result in zero gradient, leading to neurons that never activate and cannot
learn further.
2. Unbounded Output: Unlike other activation functions like sigmoid or
tanh, the ReLU activation is unbounded on the positive side, which can
sometimes result in exploding gradients when training deep networks.
3. Noisy Gradients: The gradient of ReLU can be unstable during training,
especially when weights are not properly initialized. In some cases, this
can slow down learning or lead to poor performance.
Pros of Heuristics
• Simplicity and Speed: Heuristics offer straightforward solutions that can
be quickly implemented, providing immediate results. This rapid
deployment cycle allows for agility in responding to business needs.
• Low Resource Requirement: Unlike machine learning models that
necessitate large datasets and significant computational power, heuristics
operate with minimal resources. Their implementation requires neither
extensive data nor complex algorithms, making them a cost-effective
solution.
• Flexibility: Heuristics can be easily tailored to address a wide range of
problems without the need for substantial reprogramming. This
adaptability makes them especially valuable in environments where
business conditions and requirements frequently change.
• High Interpretability: The transparent nature of heuristics makes them
easy to understand and explain. This clarity is crucial for gaining
stakeholder buy-in and for troubleshooting and refining business rules.
Cons of Heuristics
• Inaccuracy: The simplicity of heuristics can be a double-edged sword.
While they provide quick solutions, these may not always be the most
accurate or optimal, especially in complex scenarios.
• Bias and Assumptions: Heuristics are predicated on assumptions and
experiences that may not universally apply. This reliance can introduce
biases into the decision-making process, potentially skewing outcomes.
• Scalability: As businesses grow and problems become more intricate, the
limitations of heuristics become increasingly apparent. They may struggle
to scale effectively, lacking the sophistication required to handle
complex, multifaceted challenges.
What is Regularization?
Before we deep dive into the topic, take a look at this image:

Have you seen this image before? As we move towards the right in this image,
our model tries to learn too well the details and the noise from the training data,
which ultimately results in poor performance on the unseen data.

In other words, while going towards the right, the complexity of the model
increases such that the training error reduces but the testing error doesn’t.
Regularization is a technique which makes slight modifications to the learning
algorithm such that the model generalizes better. This in turn improves the
model’s performance on the unseen data as well.
How does Regularization help reduce Overfitting?
Let’s consider a neural network which is overfitting on the training data.

If you have studied the concept of regularization in machine learning, you will
have a fair idea that regularization penalizes the coefficients. In deep
learning, it actually penalizes the weight matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight
matrices are nearly equal to zero. This will result in a much simpler linear
network and slight underfitting of the training data.
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the
general cost function by adding another term known as the regularization term.
Cost function = Loss (say, binary cross entropy) + Regularization term
Due to the addition of this regularization term, the values of weight matrices
decrease because it assumes that a neural network with smaller weight matrices
leads to simpler models. Therefore, it will also reduce overfitting to quite an
extent.
However, this regularization term differs in L1 and L2.

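In L2, we have:
$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert^{2}$
Here, lambda ($\lambda$) is the regularization parameter and $m$ is the number of
training examples. Lambda is the hyperparameter whose value is optimized for
better results. L2 regularization is also known as weight decay as it forces the
weights to decay towards zero (but not exactly zero).
In L1, we have:
$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert$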
In this, we penalize the absolute value of the weights. Unlike L2, the weights
may be reduced to zero here. Hence, it is very useful when we are trying to
compress our model. Otherwise, we usually prefer L2 over it.
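The difference between the two penalties can be sketched as follows. The scaling factor lambda / (2m), with m taken to be the number of training examples, is an assumption of this sketch (a common convention), as are the function names and the example weights.

```python
def l2_penalty(weights, lam, m):
    """L2 term: (lambda / 2m) times the sum of squared weights (assumed scaling)."""
    return lam / (2 * m) * sum(w * w for w in weights)


def l1_penalty(weights, lam, m):
    """L1 term: (lambda / 2m) times the sum of absolute weights (assumed scaling)."""
    return lam / (2 * m) * sum(abs(w) for w in weights)


def regularized_cost(loss, weights, lam, m, kind="l2"):
    """Cost function = Loss + Regularization term."""
    penalty = l2_penalty if kind == "l2" else l1_penalty
    return loss + penalty(weights, lam, m)


weights = [0.5, -1.0, 2.0]
print(regularized_cost(0.3, weights, lam=0.1, m=10, kind="l2"))
print(regularized_cost(0.3, weights, lam=0.1, m=10, kind="l1"))
```

Because the L2 term grows with the square of each weight, it penalizes the largest weight (2.0) much more heavily than the L1 term does, while the L1 term applies the same pressure per unit of weight, which is what allows it to push weights all the way to zero.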

Dropout
This is one of the most interesting types of regularization techniques. It also
produces very good results and is consequently the most frequently used
regularization technique in the field of deep learning.

So what does dropout do? At every iteration, it randomly selects some nodes
and removes them along with all of their incoming and outgoing connections as
shown below.
So each iteration has a different set of nodes and this results in a different set of
outputs. It can also be thought of as an ensemble technique in machine
learning.
Ensemble models usually perform better than a single model as they capture
more randomness. Similarly, dropout also performs better than a normal neural
network model.
This probability of choosing how many nodes should be dropped is the
hyperparameter of the dropout function. As seen in the image above, dropout
can be applied to both the hidden layers as well as the input layers.

Due to these reasons, dropout is usually preferred when we have a large neural
network structure in order to introduce more randomness.
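A minimal sketch of dropout applied to one layer's activations is shown below. It uses the "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at inference time; that rescaling, the function signature, and the drop probability of 0.5 are assumptions of this example rather than details given in the text.

```python
import random


def dropout(activations, drop_probability, rng, training=True):
    """Zero each unit with probability p; rescale survivors by 1 / (1 - p) (inverted dropout)."""
    if not training or drop_probability == 0.0:
        return list(activations)  # no units are dropped at inference time
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 10
print(dropout(layer_output, 0.5, rng))                  # a different random mask on each call
print(dropout(layer_output, 0.5, rng, training=False))  # unchanged at inference time
```

Each training call draws a fresh mask, which is the "different set of nodes at every iteration" behavior described above.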
