Deep Learning
Dr. Arundhati Das
Module II: Training, Optimization and
Regularization of Deep Neural Network
2.1 Training Feedforward DNN: Multi Layered Feed Forward Neural Network,
Learning Factors, Activation functions: Tanh, Logistic, Linear, Softmax, ReLU,
Leaky ReLU, Loss functions: Squared Error loss, Cross Entropy, Choosing
output function and loss function
2.2 Optimization: Learning with backpropagation, Learning Parameters: Gradient
Descent (GD), Stochastic and Mini Batch GD, Momentum Based GD, Nesterov
Accelerated GD, AdaGrad, Adam, RMSProp
2.3 Regularization: Overview of Overfitting, Types of biases, Bias-Variance
Tradeoff; Regularization Methods: L1, L2 regularization, Parameter sharing,
Dropout, Weight Decay, Batch normalization, Early stopping, Data
Augmentation, Adding noise to input and output
2.1 Training Feedforward DNN: Multi Layered Feed Forward Neural
Network, Learning Factors, Activation functions: Tanh, Logistic, Linear,
Softmax, ReLU, Leaky ReLU, Loss functions: Squared Error loss,
Cross Entropy, Choosing output function and loss function
Feed Forward Network vs. Deep Feed Forward Network
• The MLP architecture is a layered feedforward neural network in which the nonlinear elements (neurons) are arranged in successive layers, and the information flows unidirectionally, from the input layer to the output layer, through the hidden layer(s).
• A multi-layer perceptron (MLP) is a form of feedforward neural network that consists of multiple layers of computational nodes connected in a feed-forward way.
• A deep FFN has a larger number of hidden layers, typically >= 2.
Training Deep FFN (or multi layered FFN)
• Training a deep feedforward neural network (FFN), also known as a deep
multi-layered perceptron (MLP), involves several steps, including data
preparation, defining the network architecture, choosing an optimizer and
loss function, and running the training process.
• Steps for Training a Deep Feedforward Neural Network
1. Data Preparation: Load and preprocess the dataset.
2. Define the Network Architecture
3. Choose Loss Function and Optimizer
4. Training Loop
5. Evaluation
1. Data Preparation: Load and preprocess
the dataset.
Step i: Import the Libraries. Eg: import numpy as np
Step ii: Load the dataset.
Eg: from sklearn import datasets
iris = datasets.load_iris()
Step iii: Clean, prepare and arrange the data.
There are several useful functions (on a pandas DataFrame) for checking, detecting, removing, and replacing missing/null values.
Eg: df.isnull().sum() # returns the number of missing values per column
• isnull(), notnull(), dropna(), fillna(), replace(), interpolate()
Textual and non-numeric data are arranged into numeric data for further processing.
• LabelEncoder(): Encodes target labels with values between 0 and n_classes−1. ['cat', 'dog', 'fish', 'cat', 'dog', 'fish', 'bird'] -> [1 2 3 1 2 3 0] (sklearn.preprocessing.LabelEncoder)
• OneHotEncoder(): For 3 categorical classes (weather is rainy, cloudy, sunny) -> [1,0,0], [0,1,0], [0,0,1] (sklearn.preprocessing.OneHotEncoder)
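Eg (a minimal sketch of both encoders; note that sklearn orders categories alphabetically, and sparse_output requires sklearn >= 1.2 — older versions use sparse=False):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

labels = ['cat', 'dog', 'fish', 'cat', 'dog', 'fish', 'bird']
print(LabelEncoder().fit_transform(labels))        # [1 2 3 1 2 3 0]

weather = np.array(['rainy', 'cloudy', 'sunny']).reshape(-1, 1)
ohe = OneHotEncoder(sparse_output=False)           # use sparse=False on older sklearn
print(ohe.fit_transform(weather))                  # one row per sample, one column per class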
Step iv: Do Scaling or normalization. Feature normalization is a technique used to
transform the values of a dataset into a common scale.
1. Min-Max normalization: from sklearn.preprocessing import MinMaxScaler
2. Z-score normalization: from sklearn.preprocessing import StandardScaler
3. Constant factor normalization: x' = x / k
Step v: Data transform or dimensionality reduction: Feature Selection, Feature
Extraction
Step vi: Split Data into Training set, Test set and Validation Set.
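Eg (a short sketch of scaling and splitting; the 60/20/20 split ratio and random_state are illustrative assumptions, not from the slides):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = iris.data, iris.target                      # from the load_iris() step above
X_scaled = StandardScaler().fit_transform(X)       # z-score normalization
X_train, X_tmp, y_train, y_tmp = train_test_split(X_scaled, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)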
2. Define the Network Architecture
• In this step, we define a deep feedforward neural network with
multiple layers.
• When we vary the different parameters of a NN/deep NN, we say we have designed different architectures. The parameters can be: learning rate, optimizer, loss function, number of epochs, number of hidden layers, activation functions, etc.
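Eg (a minimal Keras definition of one such architecture; the layer widths and depth are illustrative choices, not prescribed by the slides):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),                 # 4 iris features
    keras.layers.Dense(16, activation='relu'),      # hidden layer 1
    keras.layers.Dense(16, activation='relu'),      # hidden layer 2
    keras.layers.Dense(3, activation='softmax'),    # one output node per class
])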
3. Choose Loss Function and Optimizer
4. Training loop
• To update the weight parameters and minimize the error, we run the training loop for a chosen number of epochs.
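Eg (a sketch covering steps 3 and 4 together, continuing the model defined above; the optimizer, epoch count and batch size are illustrative):

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',   # integer class labels
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=16)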
5. Evaluation
• Evaluate the trained model on the test dataset to measure its
performance.
➢Performance evaluation is done using
➢ Error
➢ Confusion matrix
➢ Accuracy
➢ Precision
➢ Recall
➢ F/F1 Score
➢ ROC curve
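Eg (a minimal evaluation sketch using sklearn metrics, assuming the model and data splits defined above):

from sklearn.metrics import confusion_matrix, classification_report

test_loss, test_acc = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test).argmax(axis=1)      # class with highest softmax score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))       # precision, recall, F1 per class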
Activation functions: Tanh, Logistic,
Linear, Softmax, ReLU, Leaky ReLU
Activation Functions
• Activation functions are an extremely important
feature of the artificial neural networks.
• They basically decide whether a neuron should be
activated or not.
• Whether the information that the neuron is receiving
is relevant for the given problem or should it be
ignored.
• Can we do without an activation function?
• When we do not have the activation function the weights
and bias would simply do a linear transformation.
• A linear equation is simple to solve but is limited in its
capacity to solve complex problems.
• A neural network without an activation function is
essentially just a linear regression model.
• The activation function does the non-linear transformation
to the input making it capable to learn and perform more
complex tasks.
Popular activation functions
(Fig: plots of the popular activation functions)
The sigmoid activation function is also known as the logistic function.
Rectified Linear Unit (ReLU)
• ReLU stands for Rectified Linear Unit.
• Although it gives an impression of a linear function, ReLU has a derivative function and allows for backpropagation while simultaneously being computationally efficient.
• The main catch here is that the ReLU function does not activate all the neurons at the same time.
• A neuron is deactivated only if the output of the linear transformation is less than 0.
• Computes f(x) = max(0, x)
• What happens when x = −10?
• What happens when x = 0?
• What happens when x = 10?
ReLU: Advantages and disadvantages
• Advantages
• Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
• ReLU accelerates the convergence of gradient descent towards the global minimum of
the loss function due to its linear, non-saturating property.
• Disadvantages
• The negative side of the graph makes the
gradient value zero. Due to this reason,
during the backpropagation process, the
weights and biases for some neurons are
not updated. This can create dead
neurons which never get activated.
• All the negative input values become zero
immediately, which decreases the model’s
ability to fit or train from the data
properly.
Leaky ReLU
• Leaky ReLU computes f(x) = x for x > 0 and f(x) = αx for x ≤ 0.
• The amount of leak is determined by the value of the hyperparameter α. Its value is small and generally varies between 0.01 and 0.2.
Derivative of Leaky ReLU
Derivative of Leaky ReLU: For negative input values, the derivative is the small constant α, and for positive input values, the derivative is 1.
Q. Apply Sigmoid, Tanh and ReLU, Leaky ReLU
on the following feature map.
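Eg (since the original feature map is not reproduced here, this sketch applies all four activations to a hypothetical 2x2 feature map):

import numpy as np

fmap = np.array([[ 2.0, -1.0],
                 [ 0.0, -3.0]])                    # hypothetical values

sigmoid    = 1 / (1 + np.exp(-fmap))
tanh       = np.tanh(fmap)
relu       = np.maximum(0, fmap)
alpha      = 0.01                                  # Leaky ReLU slope for x <= 0
leaky_relu = np.where(fmap > 0, fmap, alpha * fmap)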
Softmax function
• In a classifier, the output layer holds a number of activation nodes/units equal to the number of classes in the classification problem.
• The softmax function is applied to the nodes of this last layer.
Softmax function
softmax(zᵢ) = exp(zᵢ) / Σⱼ₌₁ᴷ exp(zⱼ), for i = 1, …, K
Softmax
If we take an input of [1,2,3,4,1,2,3], the softmax of that is
[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].
The softmax function highlights the largest values and suppresses the others.
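Eg (a small numpy sketch reproducing the example above):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

print(np.round(softmax([1, 2, 3, 4, 1, 2, 3]), 3))
# -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]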
Remarks:
• It is a common and recommended practice to use one-hot encoding, especially in classification tasks where the neural network's output layer uses the softmax activation function.
• The softmax function outputs a probability distribution over multiple classes. One-hot encoding aligns
with this by representing the true class as a probability distribution where one class has a probability of 1,
and all others have a probability of 0. This facilitates the correct calculation of the gradient during
backpropagation.
• One-hot encoding transforms categorical labels into a binary vector format, where each class is
represented by a unique vector with a 1 in the position corresponding to the class and 0s elsewhere. This
format is compatible with the categorical cross-entropy loss function, which is often used with softmax.
• In some cases, particularly in frameworks like TensorFlow and PyTorch, you might not need to manually
convert labels to one-hot encoding because the loss functions (e.g., SparseCategoricalCrossentropy in
TensorFlow) can handle integer labels directly. These functions internally convert the integer labels to
one-hot representations during loss calculation, streamlining the process.
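Eg (a brief sketch of both label formats in Keras; the array values are illustrative):

import numpy as np
from tensorflow import keras

labels = np.array([0, 2, 1])                               # integer class labels
one_hot = keras.utils.to_categorical(labels, num_classes=3)
# [[1,0,0],[0,0,1],[0,1,0]] -- pair with loss='categorical_crossentropy'

# Or keep integer labels and let the loss convert internally:
loss = keras.losses.SparseCategoricalCrossentropy()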
Loss functions
• A loss function is a function that
compares the target and the
predicted output values.
• While training, we aim to minimize
the loss between the predicted and
target outputs.
• The loss function is also known as the error function.
• A loss function applies to a single training example, whereas a cost function (sometimes called an objective function) is an average of the loss function over an entire training set containing several training examples.
Loss functions
• Types of loss function: Loss functions in machine learning can be categorized based on the machine learning tasks to which they are applicable.
1. Classification loss functions (for discrete values): e.g. binary cross-entropy loss, categorical cross-entropy loss, hinge loss, log loss
2. Regression loss functions (for continuous numeric values): e.g. mean squared error/L2 loss, mean absolute error/L1 loss, Huber loss/smooth mean absolute error
Applicability of loss functions:
• Regression: Mean Square Error (MSE) / L2 Loss; Mean Absolute Error (MAE) / L1 Loss; Huber Loss / Smooth Mean Absolute Error
• Classification: Binary Cross-Entropy Loss / Log Loss; Categorical Cross-Entropy Loss; Hinge Loss; Log Loss
Loss functions: Squared Error loss
• Squared Error loss: (y − ŷ)², where y = actual value, ŷ = predicted value
• Mean Square Error (MSE) / L2 Loss: MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²
• Where:
  • n is the number of samples in the dataset
  • yᵢ is the target value for the i-th sample
  • ŷᵢ is the predicted value for the i-th sample
• It takes the average of the squared differences
• MSE is a standard loss function utilized in most regression tasks
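Eg (a minimal worked sketch of MSE; the target and prediction values are hypothetical):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5])                 # targets
y_pred = np.array([2.5, 5.0, 4.0])                 # predictions
mse = np.mean((y_true - y_pred) ** 2)              # (0.25 + 0 + 2.25) / 3 = 0.833...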
Loss functions: Cross Entropy Loss
• Cross-entropy loss, is a commonly used loss function in machine learning, particularly for
classification problems. It measures the performance of a classification model whose output is a
probability value between 0 and 1.
• Binary Cross Entropy: For a binary classification problem, where the target variable y can be either 0 or 1, the cross-entropy loss is defined as:
  L(y, ŷ) = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
where ŷ is the predicted probability of class 1.
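Eg (a numpy sketch of this formula using the natural log; the labels and probabilities are hypothetical):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)         # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))  # ~0.23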
Loss functions: Cross Entropy Loss
(The worked example on this slide uses logarithms to base 2.)
Loss functions: Cross Entropy Loss
• Categorical cross-entropy is used for multi-class classification tasks, where each instance can belong to one of k different classes.
• The loss function is defined as:
  L(y, ŷ) = −Σₖ yₖ·log(ŷₖ)
where y is the one-hot target vector and ŷ is the predicted probability distribution.
• Q. Apply the categorical cross-entropy loss function:
Loss functions: Cross Entropy Loss
(The worked example on this slide uses logarithms to base 3.)
Cross entropy loss
Binary cross entropy:
• Used for binary classification.
• Sigmoid activation function in the output layer.
Categorical cross entropy:
• Used for multi-class classification.
• Softmax activation function in the output layer.
Note: model.compile(optimizer='rmsprop', loss=None, loss_weights=None, metrics=None, weighted_metrics=None, run_eagerly=False, steps_per_execution=1, jit_compile='auto', auto_scale_loss=True)
Argument loss: the loss function; may be a string name of a loss function or a keras.losses.Loss instance, e.g. loss=keras.losses.BinaryCrossentropy(), CategoricalCrossentropy, CosineSimilarity, MeanAbsoluteError, MeanSquaredError, MeanSquaredLogarithmicError
2.2 Optimization: Learning with backpropagation, Learning
Parameters: Gradient Descent (GD), Stochastic and Mini Batch GD,
Momentum Based GD, Nesterov Accelerated GD, AdaGrad, Adam,
RMSProp
Learning with backpropagation
• Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule layer by layer, from the output back to the input; the optimizers below then use these gradients to update the weights.
Batch gradient descent
• "Batch": Each step of gradient descent uses all the training examples (m = number of training examples).
Repeat until convergence {
  θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
}
• But GD doesn't know how far to travel down the gradient for nonconvex functions.
1. Initialize the weights randomly
2. Loop until convergence:
3.   Pick the data points from the training set
4.   Compute gradient: ∂J(θ)/∂θ = (1/m) · Σₖ₌₁ᵐ ∂Jₖ(θ)/∂θ
5.   Update weights: θ := θ − α · ∂J(θ)/∂θ
6. Return weights
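Eg (a minimal numpy sketch of batch GD for linear regression, assuming X already carries a leading column of ones for the bias; alpha and epochs are illustrative):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        grad = (X @ theta - y) @ X / m             # average gradient over all m examples
        theta -= alpha * grad
    return theta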
Mini Batch (Stochastic) GD
• SGD with mini batch is mini-batch SGD!
• In SGD, we calculate gradient just using a single training example/sample.
• It may not represent the true gradient over the dataset, since it is computed by only
using a single data point, so it can be noisy.
• By noisy we mean that it can make us jump around the landscape of the cost
function J.
• So, a mini-batch of B data points is chosen to compute the average gradient
across the B data points and use that as estimate of our true gradient.
• B is typically between 10 and 100 data points, depending on the dataset considered.
• Mini batch (S)GD characteristics:
• While training it gives more accurate estimation of gradient
• Smoother convergence
• Allows for larger learning rates
• Mini-batches achieve significant speed increase on GPUs, hence lead to fast training
Stochastic GD vs Mini-batch SGD
SGD:
1. Initialize the weights randomly
2. Loop until convergence:
3.   Pick a single data point i
4.   Compute gradient ∂Jᵢ(θ)/∂θ
5.   Update weights: θ := θ − α · ∂Jᵢ(θ)/∂θ
6. Return weights
Mini-batch SGD:
1. Initialize the weights randomly
2. Loop until convergence:
3.   Pick a mini-batch of B data points
4.   Compute gradient ∂J(θ)/∂θ = (1/B) · Σₖ₌₁ᴮ ∂Jₖ(θ)/∂θ
5.   Update weights: θ := θ − α · ∂J(θ)/∂θ
6. Return weights
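Eg (a short mini-batch SGD sketch under the same linear-regression assumptions as the batch GD sketch above; B, alpha and epochs are illustrative):

import numpy as np

def minibatch_sgd(X, y, alpha=0.05, B=32, epochs=50, seed=0):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(m)                   # reshuffle every epoch
        for start in range(0, m, B):
            batch = idx[start:start + B]           # B data points
            grad = (X[batch] @ theta - y[batch]) @ X[batch] / len(batch)
            theta -= alpha * grad
    return theta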
Epoch vs Iteration
Epoch:
• An epoch refers to one complete pass through the entire training dataset.
• During one epoch, every sample in the training dataset has been used once to update the model's weights.
• Training a model typically involves multiple epochs to ensure that the model has sufficient opportunity to learn from the data and improve its performance.
Iteration:
• An iteration refers to one update of the model's weights based on a single batch of data.
• If the dataset is divided into batches, then one iteration corresponds to processing one batch and updating the model's weights based on that batch.
• The number of iterations per epoch is equal to the number of batches in the dataset. For instance, if you have a dataset with 1,000 samples and a batch size of 100, then there are 10 iterations in one epoch.
Momentum Based GD
• Momentum-Based Gradient Descent is an enhancement of the
standard gradient descent algorithm.
• It aims to accelerate convergence and smooth out the updates by
incorporating a momentum term to the update rule.
• To add momentum, we introduce a velocity term v.
• The momentum-based GD weight update rule is:
  vₜ = γ · vₜ₋₁ + α · ∇θJ(θ)
  θ := θ − vₜ
where γ ∈ [0, 1) is the momentum coefficient (typically about 0.9) and α is the learning rate.
Momentum Based GD
• GD:
1. Initialize θ randomly
2. Loop until convergence:
3.   Compute gradient ∂J(θ)/∂θ
4.   Update weights: θ := θ − α · ∂J(θ)/∂θ
5. Return weights
• Momentum GD:
1. Initialize θ randomly, v = 0
2. Loop until convergence:
3.   Compute gradient ∂J(θ)/∂θ
4.   v := γ · v + α · ∂J(θ)/∂θ
5.   Update weights: θ := θ − v
6. Return weights
Momentum Based GD
• GD:
1. Update Rule: Updates parameters directly based on the current gradient.
2. Speed of Convergence: May converge slowly.
3. Oscillation: Can oscillate across the slopes of the cost function.
• Momentum GD:
1. Updates parameters using a combination of the current gradient and a fraction of the previous update (momentum term).
2. Typically converges faster by accelerating updates.
3. Reduces oscillations and provides smoother convergence paths.
Advantages of Momentum-Based Gradient Descent
1.Faster Convergence:
Helps in accelerating the convergence, especially in scenarios where gradients
are small and updates are slow.
2.Reduced Oscillations:
Smoother updates help reduce the oscillations, especially in narrow and curved
valleys of the cost function.
3.Better Handling of Local Minima:
Momentum can help the optimization process escape local minima by carrying
the updates through shallow regions.
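Eg (a minimal sketch of one momentum update, matching the rule above; gamma = 0.9 is a typical but illustrative choice):

def momentum_step(theta, v, grad, alpha=0.1, gamma=0.9):
    v = gamma * v + alpha * grad                   # velocity accumulates past gradients
    theta = theta - v
    return theta, v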
Nesterov Accelerated GD (NAG)
• NAG modifies momentum-based gradient descent by calculating the gradient not at the current parameters but at a look-ahead point based on the velocity.
1. Look-Ahead: Instead of calculating the gradient at the current parameters, NAG first performs a look-ahead step to estimate where the parameters will be if the current velocity were applied.
2. Gradient Calculation: The gradient is then computed at this look-ahead point, providing a more accurate estimate of the direction in which the parameters should be updated.
3. Velocity Update: The velocity term is updated using this more accurate gradient, making the updates more informed and potentially more efficient.
4. Parameter Update: Finally, the parameters are updated using the updated velocity.
Advantages of NAG
1. Faster Convergence: By considering the future position of the
parameters, NAG often converges faster than momentum-based gradient
descent.
2. Reduction of Oscillations: NAG reduces oscillations that may occur
during optimization, particularly in scenarios with high curvature or
noisy gradients.
3. More Informed Updates: The look-ahead mechanism provides more
informed updates, which can lead to better convergence properties.
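Eg (a sketch of one NAG update; grad_fn stands for any function returning the gradient at a given point — an assumed helper, not from the slides):

def nag_step(theta, v, grad_fn, alpha=0.1, gamma=0.9):
    lookahead = theta - gamma * v                  # where the current velocity would take us
    v = gamma * v + alpha * grad_fn(lookahead)     # gradient at the look-ahead point
    theta = theta - v
    return theta, v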
AdaGrad
• AdaGrad is an optimization algorithm designed to adapt the learning rate
for each parameter individually based on the historical gradients.
• This adaptive nature allows AdaGrad to perform well in scenarios with
sparse data and features, where different parameters may have different
degrees of importance and frequency.
• Key Concepts
1.Adaptive Learning Rate: Unlike traditional gradient descent, which uses a
single learning rate for all parameters, AdaGrad adjusts the learning rate for
each parameter dynamically.
2.Accumulation of Squared Gradients: AdaGrad keeps track of the sum of
the squares of the gradients for each parameter. This accumulated value is
then used to adjust the learning rate.
1. Gradient Accumulation: AdaGrad accumulates the sum of the squared gradients for each parameter over all iterations: Gₜ = Gₜ₋₁ + gₜ². This term grows over time, especially for parameters that consistently have large gradients.
2. Learning Rate Adjustment: The learning rate is divided by the square root of Gₜ, i.e. θ := θ − (α/√(Gₜ + ε)) · gₜ. This means that the learning rate decreases over time for parameters that have larger gradients, thereby preventing overly large updates and ensuring more stable convergence.
3. Preventing Division by Zero: The small constant ε ensures that we never divide by zero, even if a parameter has never received a gradient update.
AdaGrad
• Advantages
1.Adaptivity: Automatically adjusts learning rates for each parameter, making it
effective for problems with sparse features.
2.Stability: Reduces the learning rate over time for frequently updated parameters,
which can help stabilize convergence.
• Disadvantages
1.Aggressive Decay: For some problems, the learning rate might decay too
aggressively, causing the learning process to stop too early before reaching the
optimal solution.
2.Improvement Needed for Dense Data: In cases where all features are dense, the
aggressive decay of the learning rate might hinder performance. Extensions like
RMSProp and Adam address this issue by modifying the way gradients are
accumulated.
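Eg (a minimal sketch of one AdaGrad update; G holds the running sum of squared gradients per parameter):

import numpy as np

def adagrad_step(theta, G, grad, alpha=0.01, eps=1e-8):
    G = G + grad ** 2                              # accumulate squared gradients
    theta = theta - alpha * grad / (np.sqrt(G) + eps)
    return theta, G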
RMSProp (Root Mean Square Propagation)
• RMSProp is an adaptive learning rate optimization algorithm designed to address
some of the limitations of AdaGrad, particularly the issue of rapidly decaying
learning rates.
• RMSProp aims to maintain a balance by controlling the learning rate decay, which
allows for more stable and faster convergence, especially in deep learning
applications.
• Key Concepts
1.Exponential Moving Average of Squared Gradients: Instead of accumulating all
past squared gradients like AdaGrad, RMSProp uses an exponential moving
average. This helps to give more weight to recent gradients, allowing the
algorithm to adapt more quickly to changes in the gradient direction.
2.Adaptive Learning Rate: Similar to AdaGrad, RMSProp adjusts the learning rate
for each parameter individually, but it avoids the issue of aggressively decaying
learning rates.
RMSProp (Root Mean Square Propagation)
1. Exponential Moving Average: The term E[g²]ₜ = β · E[g²]ₜ₋₁ + (1 − β) · gₜ² represents the exponentially weighted moving average of the squared gradients. This moving average helps to smooth out the gradient updates, reducing the influence of large but infrequent gradients.
2. Learning Rate Adjustment: The learning rate α is divided by the square root of E[g²]ₜ + ε, i.e. θ := θ − (α/√(E[g²]ₜ + ε)) · gₜ. This ensures that the learning rate for parameters with frequently large gradients decreases, while parameters with smaller gradients retain a relatively larger learning rate.
3. Preventing Division by Zero: The small constant ε ensures numerical stability, preventing division by zero.
RMSProp (Root Mean Square Propagation)
• Advantages
1.Stable Convergence: By controlling the decay of the learning rate,
RMSProp helps to maintain a more consistent convergence rate
compared to AdaGrad.
2.Adaptability: RMSProp adapts to the scale of the gradients, making it
suitable for non-stationary problems (problems where the data
distribution changes over time).
3.Performance in Deep Learning: RMSProp is widely used in training
deep neural networks due to its effectiveness in handling the complex
and noisy gradients often encountered in such applications.
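Eg (a minimal sketch of one RMSProp update; beta = 0.9 is the usual decay rate for the moving average):

import numpy as np

def rmsprop_step(theta, Eg2, grad, alpha=0.001, beta=0.9, eps=1e-8):
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2      # EMA of squared gradients
    theta = theta - alpha * grad / (np.sqrt(Eg2) + eps)
    return theta, Eg2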
Adam (Adaptive Moment Estimation)
• Adam is an optimization algorithm that combines the best properties of
the AdaGrad and RMSProp algorithms to provide an efficient and adaptive
learning rate.
• It is particularly well-suited for problems involving large datasets and high-
dimensional parameter spaces.
• Key Concepts
1.Adaptive Learning Rate: Like RMSProp, Adam adjusts the learning rate for
each parameter based on the first and second moments of the gradients.
2.Momentum: Adam incorporates the concept of momentum, similar to the
Momentum optimization algorithm, which helps to smooth out the
gradient updates.
3.Bias Correction: Adam includes bias correction terms to counteract the
biases introduced in the first few steps of the moment estimates.
Adam (Adaptive Moment Estimation)
• The Adam update rules are:
  mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ (first moment estimate)
  vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ² (second moment estimate)
  m̂ₜ = mₜ / (1 − β₁ᵗ), v̂ₜ = vₜ / (1 − β₂ᵗ) (bias correction)
  θ := θ − α · m̂ₜ / (√v̂ₜ + ε)
• Advantages
1.Efficient: Adam combines the benefits of both AdaGrad and
RMSProp, making it computationally efficient.
2.Adaptive Learning Rates: Adjusts the learning rate for each
parameter individually, which helps in dealing with sparse gradients.
3.Convergence: Adam often converges faster and more reliably
compared to other optimization algorithms.
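Eg (a minimal sketch of one Adam update with bias correction; t counts steps starting at 1, and the default hyperparameters follow common practice):

import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                   # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2              # second moment (RMSProp-style)
    m_hat = m / (1 - b1 ** t)                      # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v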
Numericals done in class regarding GD and its
variants are important.
2.3 Regularization: Overview of Overfitting,
Types of biases, Bias Variance Tradeoff
Regularization Methods: L1, L2 regularization,
Parameter sharing, Dropout, Weight Decay, Batch
normalization, Early stopping, Data
Augmentation, Adding noise to input and output
Overview of Overfitting
• Up to now, we have concentrated on minimizing the cost function through various
optimization algorithms (GD, SGD, Adam etc.).
• Deep learning models usually contain billions (10⁹) of parameters, while the training datasets might only include millions (10⁶) of samples.
• This characteristic classifies them as
over-parameterized models.
• These models are susceptible to a
phenomenon known as over-fitting.
• The bias and variance of a model in
relation to its capacity give some
insights into this overfitting problem.
Underfitting and Overfitting
Overfitting and underfitting are the two main problems that occur in machine learning and degrade the performance of machine learning models.
➢ Underfitting:
  ➢ Model is too simple to represent all the relevant class characteristics
  ➢ High training error, high test error
  ➢ Low variance, high bias
➢ Overfitting:
  ➢ Model is too complex and fits irrelevant characteristics (noise) in the data
  ➢ Low training error, high test error
  ➢ High variance, low bias
(Fig: Underfitting vs Robust (Appropriate) fitting vs Overfitting)
Underfitting
• Underfitting occurs when the
ML model is not able to capture
the underlying trend of the data.
• In the case of underfitting, the model is not able to learn enough from the training data, and hence it reduces the accuracy and produces unreliable predictions.
• An underfitted model has high bias and low variance.
Overfitting
• Overfitting occurs when the ML model
tries to cover all the data points or more
than the required data points present in the
given dataset.
• Because of this, the model starts caching noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
• The overfitted model has low bias and high variance.
• The chance of overfitting increases the more we train the model: the longer training runs, the more likely an overfitted model becomes.
Overfitting in Linear Regression
• x₁ = size of house
• x₂ = no. of bedrooms
• x₃ = no. of floors
• x₄ = age of house
• x₅ = average income in neighborhood
• x₆ = kitchen size
• ⋮
• x₁₀₀
(Fig: house price in $1000's vs size in feet²)
h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯
If we have too many features (i.e. a complex model), the learned hypothesis may fit the training set very well:
J(θ) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² ≈ 0
but fail to generalize to new examples (predict prices on new examples).
Overfitting in Logistic Regression
(Fig: three decision boundaries plotted over Age vs Tumor Size)
• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂) → Underfitting
• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂) → Appropriate fit
• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯) → Overfitting
Bias-Variance trade-off
• What is a trade-off?
• giving up of one thing in return for another
• What is Bias or bias error?
• Bias is a value that represents assumptions taken while designing a ML/DL model.
• Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
• High Bias: A high-bias model makes more assumptions and becomes unable to capture the important features of our dataset.
• For some ML/DL models, we must make some assumptions (limiting the hypothesis space, limiting the range of values for some parameters, imposing ordering, imposing shifts, etc.) to proceed and reach a better conclusion. But the amount of bias should never be very high.
• A high bias value will lead to underfitting of the ML/DL model.
• Some examples of ML algorithms with low bias are Decision Trees, KNN, and SVM. Algorithms with high bias are Linear Regression, Logistic Regression, and Linear Discriminant Analysis.
(Fig: underfitted decision boundary h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂) over Age vs Tumor Size)
Bias-Variance trade-off
• What is Variance or variance error?
• Variance is a value that represents the variability in the model prediction: how much the ML/DL target function can adjust depending on changes in the given data set.
• Low variance means there is a small variation in the prediction of the target function with changes in the training data set.
• High variance shows a large variation in the prediction of the target function with changes in the training dataset.
• If we increase or decrease the percentage of training data, or change the number of features in the training data, the amount of variability in the model prediction is the variance. The variance should always be low.
• A high variance value leads to overfitting of the ML/DL model.
• Some examples of ML algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. Algorithms with high variance are Decision Trees, SVM, and KNN.
(Fig: overfitted decision boundary h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯) over Age vs Tumor Size)
Bias-Variance trade-off
• What is Bias-Variance trade-off?
• Bias-Variance trade-off is the conflict in trying to
simultaneously minimize these two sources of errors to
get better classification accuracy in a ML/DL model.
• While building the machine learning model, it is really
important to take care of bias and variance in order to
avoid overfitting and underfitting in the model.
• If the model is very simple with fewer parameters, it may
have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high
variance and low bias.
• So, it is required to make a balance between bias and
variance errors, and this balance between the bias error
and variance error is known as the Bias-Variance trade-
off.
Bias-Variance trade-off
• For an accurate prediction of the
model, algorithms need a low
variance and low bias. But this
is not possible because bias and
variance are related to each
other:
• If we decrease the variance, it
will increase the bias.
• If we decrease the bias, it will
increase the variance.
Solution for overfitting
• How do we deal with this?
1) Reduce number of features
  • Manually select which features to keep
  • But, in reducing the number of features, we lose some information
  • Ideally select those features which minimize data loss, but even so, some info is lost
2) Regularization
  • Keep all features, but reduce the magnitude of parameters θ
  • Works well when we have a lot of features, each of which contributes a bit to predicting y
3) Dropout
4) Early Stopping
(Fig: overfitted decision boundary over Age vs Tumor Size)
Regularization
• It is a technique to prevent the model from overfitting by adding extra
information to it.
• This technique can be used in such a way that it allows us to maintain all variables or features in the model while reducing their magnitude.
• Hence, it maintains accuracy as well as a generalization of the model.
• It mainly regularizes or reduces the coefficient of features toward zero.
• Regularization works by adding a penalty or complexity term to the complex
model.
Regularization
• There are mainly 3 variations of regularization techniques, which are given
below:
• Ridge Regularization (L2):
  • It is also called L2 regularization.
  • In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regularization penalty. The Ridge Regularization (Regression) penalty is a term with the square of the weights.
• Lasso Regularization (L1):
  • It is also called L1 regularization.
  • It is similar to Ridge Regularization except that the penalty term contains the absolute values of the weights instead of their squares.
  • Some of the features in this technique are completely neglected for model evaluation.
• Some of the features in this technique are completely neglected for model evaluation.
• Elastic Net Regularization:
• It linearly combines the L1 and L2 penalties of the Lasso and Ridge methods.
Intuition
(Fig: two fits of house price ($1000's) vs size (feet²))
h_θ(x) = θ₀ + θ₁x + θ₂x²  vs  h_θ(x) = θ₀ + θ₁x + θ₂x² + θ₃x³ + θ₄x⁴
• Suppose we penalize and make θ₃, θ₄ really small:
min_θ J(θ) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + 1000·θ₃² + 1000·θ₄²
Regularization
• Small values for parameters θ₁, θ₂, ⋯, θₙ
  • "Simpler" hypothesis
  • Less prone to overfitting
• Housing:
  • Features: x₁, x₂, ⋯, x₁₀₀
  • Parameters: θ₀, θ₁, θ₂, ⋯, θ₁₀₀
J(θ) = (1/2m) · [Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ · Σⱼ₌₁ⁿ θⱼ²]
Ridge Regularization (L2)
J(θ) = (1/2m) · [Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ · Σⱼ₌₁ⁿ θⱼ²]
where the second term is the regularization term and λ is the regularization parameter.
• Cost function: min_θ J(θ)
(Fig: regularized fit of house price ($1000's) vs size (feet²))
Ridge Regularized logistic regression
(Fig: decision boundary over Age vs Tumor Size)
h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² + θ₅x₁x₂ + θ₆x₁³x₂ + θ₇x₁x₂³ + ⋯)
• Cost function:
J(θ) = −(1/m) · Σᵢ₌₁ᵐ [y⁽ⁱ⁾·log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − h_θ(x⁽ⁱ⁾))] + (λ/2m) · Σⱼ₌₁ⁿ θⱼ²
Lasso regularization (L1)
J(θ) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ · Σⱼ₌₁ⁿ |θⱼ|
• LASSO stands for Least Absolute Shrinkage and Selection Operator.
• It is similar to Ridge Regression except that the penalty term contains the absolute values of the weights instead of their squares.
Elastic Net Regularization
• It linearly combines the L1 and L2 penalties of the lasso and ridge methods.
• By doing so, it combines the advantages of both L2 and L1.
J(θ) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ₂ · Σⱼ₌₁ⁿ θⱼ² + λ₁ · Σⱼ₌₁ⁿ |θⱼ|
Note: λ₁ = 1 − λ₂
Terminology
Regularization function → Name:
• ‖θ‖₂² = Σⱼ₌₁ⁿ θⱼ² → Tikhonov regularization / Ridge regression
• ‖θ‖₁ = Σⱼ₌₁ⁿ |θⱼ| → LASSO regression
• α‖θ‖₁ + (1 − α)‖θ‖₂² → Elastic net regularization
Key Difference between Ridge (L2) and
Lasso (L1) Regularization
• Ridge regression is mostly used to reduce the overfitting in the model, and it
includes all the features present in the model.
• It reduces the complexity of the model by shrinking the coefficients.
• Shrinks them toward 0 (but never exactly 0)
• Lasso regression helps to reduce the overfitting in the model as well as
feature selection.
• It reduces the complexity of the model by shrinking the coefficients.
• Can shrink some coefficients to 0 (feature selection)
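Eg (a short sklearn sketch contrasting the two; the synthetic data and alpha values are illustrative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.random((100, 10))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(100)  # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)                   # all coefficients shrunk, none exactly 0
print(lasso.coef_)                   # irrelevant coefficients driven exactly to 0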
Dropout: Another way to reduce overfitting
• In the pursuit of trying too hard to learn different features from the dataset, deep learning networks sometimes learn the
statistical noise in the dataset. This definitely improves the model performance on the training dataset but fails massively
on new unseen data points (test dataset). This is the problem of overfitting.
• Also, sometimes the learning/training can be biased towards a few features in a deep neural network.
• Dropout serves the purpose of reducing both overfitting in a deep learning network and bias during training.
• The Dropout layer is a mask that nullifies the contribution of some neurons towards the next layer and leaves unmodified
all others.
➢ As the term suggests, dropout
means dropping of some nodes
(neurons) in a neural network.
➢ By dropping means, the
connections of the nodes are
removed from that particular
dropped node.
➢ The choice of which units to drop is
random.
Figure: From the base paper by Srivastava et al.
Dropout
• Applying dropout to a neural network amounts to sampling a "thinned" network from it. The thinned network consists of all the units that survived dropout.
• Dropout has been proven to improve the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology.
(Figure: From the base paper by Srivastava et al.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting)
How Dropout Works
During Training
•Random Deactivation: In each training iteration, dropout randomly sets a fraction of the
neurons in a layer to zero. This fraction is determined by the dropout rate (e.g., 0.2 means
20% of neurons are dropped).
•Independent Activation: The dropout process is independent for each training epoch,
meaning different neurons are dropped in each different epoch.
(Fig.: different neurons are dropped in Epoch 1 vs Epoch 2)
How Dropout Works
• During Inference (Testing) (Fig.: the full network during testing)
•No Dropout: During testing, dropout is not applied. All neurons are active,
and their weights are scaled to account for the dropout applied during
training.
•Scaling Weights: To balance the fact that more neurons are active during
testing compared to training, the weights of the neurons are scaled down by
the dropout rate. For example, if the dropout rate is 0.25, then the weights are
scaled by 0.75 (or multiplied by 0.75 ->0.75*w) during testing.
• Scaling During Testing vs. "Inverted Dropout"
• In the original dropout formulation, the weights are scaled by the keep probability (0.75 in this case) during testing to account for the fact that all neurons are active. In the alternative "inverted dropout" formulation, which most modern frameworks implement, the surviving activations are instead scaled up by 1/keep-probability during training, so no scaling is needed at test time.
Why Scaling is Necessary
• During training, when dropout is applied, the remaining active neurons receive more "weight" to
compensate for the missing ones. For example, if only 75% of the neurons are active, their
outputs are effectively scaled up to maintain the same overall expected output level.
• To make sure the network behaves consistently during testing, we scale down the weights by the
same keep probability during training. This ensures that the expected output remains the same
even when all neurons are active.
• Mathematical Explanation
• Keep probability: 1 − dropout rate = 1 − 0.25 = 0.75.
• Inverted dropout: during training, each surviving activation is effectively scaled up to W/keep probability = W/0.75, so the expected output already matches the test-time network and no test-time scaling is needed.
• Original dropout: no scaling happens during training; at test time, to match the output distribution learned during training, the weights are multiplied by the keep probability: W_test = W × 0.75.
• Either way, the scaling ensures that the output during testing reflects the same distribution as during training.
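Eg (a minimal Keras sketch; the layer sizes are illustrative, and Keras implements the inverted-dropout variant, so no manual scaling is needed):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),      # drops 25% of activations, training mode only
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(1, activation='sigmoid'),
])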
Early stopping: One more way to reduce
overfitting
•Early stopping is a regularization technique used in machine learning and
deep learning to prevent overfitting during the training of a model.
•Monitors the model’s performance on a validation set and stops training
when performance starts to degrade.
•Prevents the model from overfitting by halting training before it starts to
memorize the training data.
How Early Stopping Works
• Training Process
• During the training of a machine learning model, the training data is used to
update the model’s weights over several iterations (or epochs).
Simultaneously, a validation set, which is separate from the training data, is
used to evaluate the model's performance after each epoch.
• Monitoring Performance
• Validation Loss/Accuracy: The key idea behind early stopping is to monitor a
metric like validation loss or validation accuracy.
• Improvement Over Time: In the early stages of training, both the training loss
and the validation loss typically decrease, indicating that the model is
learning. However, after a certain point, the validation loss may stop
decreasing and start increasing, while the training loss continues to decrease.
How Early Stopping Works
• Detecting Overfitting
• Overfitting Sign: An increase in validation loss or a plateau in
validation accuracy is a sign of overfitting. This means the
model is becoming too specialized to the training data and is
losing its ability to generalize to new, unseen data.
• Optimal Stopping Point: Early stopping involves halting the
training process when the validation performance stops
improving. The model at this point is often the one that
generalizes best to new data.
• Stopping Criteria
• Patience: Early stopping usually involves a "patience"
parameter, which specifies how many epochs to wait for an
improvement in the validation metric before stopping. For
example, if patience is set to 10, the training will stop if
there’s no improvement in validation loss or accuracy for 10
consecutive epochs.
• Best Weights: When early stopping is triggered, the model's
weights are typically reverted to the point where the
validation performance was the best, rather than using the
weights from the final epoch.
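Eg (a minimal Keras sketch of these stopping criteria; the monitor and patience values mirror the example above):

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',              # validation metric to watch
    patience=10,                     # epochs to wait for an improvement
    restore_best_weights=True)       # revert to the best epoch's weights

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, callbacks=[early_stop])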
Advantages of Early Stopping
1.Prevents Overfitting
1. By stopping the training before the model starts overfitting, early stopping
helps in maintaining the model's ability to generalize well to unseen data.
2.Efficient Training
1. Early stopping can significantly reduce training time since it avoids
unnecessary epochs where the model's performance is not improving. This
makes it computationally efficient.
3.Simple to Implement
1. Early stopping is straightforward to implement in most machine learning
frameworks, requiring only a few lines of code to monitor validation
performance and stop training when necessary.
Challenges in Early stopping
• Choosing the Patience Parameter
• Balancing Act: Setting the patience parameter too low might stop training
prematurely, leading to an underfitted model. Setting it too high might allow
overfitting. The choice of patience is often based on experimentation and
cross-validation.
• Thresholding: Some implementations of early stopping also use a threshold to
determine if the change in validation performance is significant enough to
continue training.
Data Augmentation: One more way to reduce
overfitting
When to use data augmentation?
•To prevent models from
overfitting. Adding more data
can also prevent models from
overfitting.
•If the initial training set is too
small.
•To improve the model accuracy.
•To reduce the operational cost
of labeling and cleaning the raw
dataset.
Augmented vs. Synthetic data
• Augmented data is driven from original data with some minor
changes. In the case of image augmentation, we make geometric and
color space transformations (flipping, resizing, cropping, brightness,
contrast, mirroring etc.) to increase the size and diversity of the
training set.
• Synthetic data is generated artificially without using the original
dataset. It often uses GANs (Generative Adversarial Networks) to
generate synthetic data.
Image Augmentation
• Geometric transformations: randomly flip, crop, rotate, stretch, and
zoom images. You need to be careful about applying multiple
transformations on the same images, as this can reduce model
performance.
• Color space transformations: randomly change RGB color channels,
contrast, and brightness.
• Kernel filters: randomly change the sharpness or blurring of the
image.
• Random erasing: delete some part of the initial image.
• Mixing images: blending and mixing multiple images.
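Eg (a sketch of a few such transformations using Keras preprocessing layers, available in recent TensorFlow versions; the factors are illustrative):

from tensorflow import keras

augment = keras.Sequential([
    keras.layers.RandomFlip('horizontal'),
    keras.layers.RandomRotation(0.1),    # up to +/-10% of a full turn
    keras.layers.RandomZoom(0.2),
    keras.layers.RandomContrast(0.2),
])
# augmented = augment(images, training=True)   # active only in training mode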
Audio Data Augmentation
• Noise injection: add Gaussian or random noise to the audio dataset
to improve the model performance.
• Shifting: shift audio left (fast forward) or right with random seconds.
• Changing the speed: stretches times series by a fixed rate.
• Changing the pitch: randomly change the pitch of the audio.
Text Data Augmentation
• Word or sentence shuffling: randomly changing the position of a
word or sentence.
• Word replacement: replace words with synonyms.
• Syntax-tree manipulation: paraphrase the sentence using the same
word.
• Random word insertion: inserts words at random.
• Random word deletion: deletes words at random.
Augmentation
• CT scan images generated by a CycleGAN (a variation of GAN) are being used in the medical field to increase dataset size.
• Once the dataset is created, it can be used for classification or any other task.
Neural Style Transfer-based
augmentation
• A series of convolutional layers is trained such that images are deconstructed and content and style can be separated.
• After separation, the content from an image is composed with the
style of another image to create an augmented style image.
• Thus, the content remains the same but the style is changed.
• This increases the robustness of the model as the model is
working independently of the style of the image.
Weight Decay
•Weight decay is a regularization technique used to prevent overfitting in machine
learning models, particularly in deep learning.
•It adds a penalty proportional to the size of the weights to the loss function.
•Helps in preventing weights from growing too large and thus controlling model
complexity.
• Modified loss function after adding weight decay as a penalty:
  L_new = L_original + (λ/2) · Σᵢ wᵢ²
where λ is the weight-decay coefficient and the sum runs over all weights.
How Weight Decay Works
• Weight decay involves adding a regularization term to the loss
function that penalizes large weights.
• The most common form of weight decay is L2 regularization, which adds the
sum of the squared weights to the loss function.
• Relationship with L2 Regularization:
• Weight decay is mathematically equivalent to L2 regularization.
• In L2 regularization, the loss function is augmented with a term proportional to
the sum of the squared weights, which has the same effect as weight decay.
• The term "weight decay" is more commonly used in the context of neural
networks and deep learning, whereas "L2 regularization" is more
frequently used in the context of linear models like Ridge regression.
Despite this difference in terminology, the underlying concept is the same.
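Eg (recent Keras versions expose weight decay directly on the optimizer; the coefficient value here is an arbitrary example):

import tensorflow as tf

opt = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)
# model.compile(optimizer=opt, loss='categorical_crossentropy')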
Batch Normalization
Why normalization is needed?
• In a deep neural network, there is a phenomenon called internal covariate shift,
which is a change in the input distribution in the network's layers due to the ever-
changing network parameters during training.
• The input layer may have certain features which dominate the process, due to
having high numerical values. This can create a bias in the network because only
those features contribute to the outcome of the training.
• Imagine feature_1 having values between 1 and 5, and feature_2 having values
between 100 and 10000.
• During training, due to the difference in scale of both features, feature_2 would
dominate the network and only that feature would have a contribution to the
outcome of the model.
Normalization
• Normalization is an approach which is applied during the preparation of data
• To change the values of numeric columns in a dataset to use a common scale
when the features in the data have different ranges.
• Benefits
• Reducing the internal covariate shift to improve training
• We define Internal Covariate Shift as the change in the distribution of network activations due to
the change in network parameters during training.
• In neural networks, the output of the first layer feeds into the second layer, the output of the
second layer feeds into the third, and so on. When the parameters of a layer change, so does the
distribution of inputs to subsequent layers.
• These shifts in input distributions can be problematic for neural networks, especially deep
neural networks that could have a large number of layers.
• Scaling each feature to a similar range to prevent or reduce bias in the network
• Speeding up the optimization process by preventing weights from exploding all over the
place and limiting them to a specific range
• Reducing overfitting in the network by aiding in regularization
Several types of normalization
• Batch normalization (In Syllabus)
• Layer normalization
• Instance normalization
• Group normalization
We will focus on Batch Normalization.
Batch normalization
• Method that normalizes activations in a network across the mini-batch of definite size.
• Batch Normalization is just another network layer that gets inserted between a hidden layer and the next
hidden layer.
• Its job is to take the outputs from the first hidden layer and normalize them before passing them on as
the input of the next hidden layer.
• For each feature, batch normalization computes the mean and variance of that feature in the mini-batch.
• It then subtracts the mean and divides the feature by its mini-batch standard deviation.
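Eg (a minimal Keras sketch of where the layer sits; placing Batch Norm before the activation is one common choice among several):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64),
    keras.layers.BatchNormalization(),   # normalize over the mini-batch
    keras.layers.Activation('relu'),
    keras.layers.Dense(10, activation='softmax'),
])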
Batch Norm layer also has parameters of its own
• Two learnable parameters called beta and gamma.
• Two non-learnable parameters (Mean Moving Average and Variance
Moving Average) are saved as part of the ‘state’ of the Batch Norm
layer.
• If there are multiple hidden layers, a Batch Norm layer can be inserted after each of them.
• During training, we feed the network one mini-batch of data at a time. During the forward pass, each layer of the network processes that mini-batch of data.
Steps:
1. Activations
• The activations from the previous layer are passed as input to the Batch
Norm. There is one activation vector for each feature in the data.
2. Calculate Mean and Variance
• For each activation vector separately, calculate the mean and variance of all
the values in the mini-batch.
3. Normalize
• Calculate the normalized values for each activation feature vector using the
corresponding mean and variance. These normalized values now have zero
mean and unit variance.
4. Scale and Shift
• This step is the huge innovation introduced by Batch Norm that gives it
its power.
• Unlike the input layer, which requires all normalized values to have
zero mean and unit variance, Batch Norm allows its values to be shifted
(to a different mean) and scaled (to a different variance).
• It does this by multiplying the normalized values by a factor, gamma,
and adding it to a factor, beta.
• Note that this is an element-wise multiply, not a matrix multiply.
• Each Batch Norm layer is able to optimally find the best factors for
itself, and can thus shift and scale the normalized values to get the best
predictions.
5. Moving Average
• In addition, Batch Norm also keeps a running count of
the Exponential Moving Average (EMA) of the mean and
variance.
• During training, it simply calculates this EMA but does
not do anything with it.
• At the end of training, it simply saves this value as part
of the layer’s state, for use during the Inference phase
(Test phase).
• The Moving Average calculation uses a scalar
‘momentum’ denoted by alpha
• This is a hyperparameter that is used only for Batch
Norm moving averages and should not be confused with
the momentum that is used in the Optimizer.
• All feature vectors are computed in a single matrix operation.
• After the forward pass, we do the backward pass as normal.
• Gradients are calculated and updates are done for all layer weights, as
well as for all beta and gamma parameters in the Batch Norm layers.
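Eg (a numpy sketch of the training-time forward pass described above; the moving-average bookkeeping is omitted for brevity):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift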
Batch Norm during Inference (Test phase)
• During Training, Batch Norm starts by calculating the mean and
variance for a mini-batch.
• However, during Inference, we have a single sample, not a mini-batch.
• How do we obtain the mean and variance in that case?
• The two Moving Average parameters come into the picture — the ones that
we calculated during training and saved with the model.
• We use those saved mean and variance values for the Batch Norm during
Inference.
• Moving Average acts as a good proxy for the mean and variance of
the data.
Benefits
• Batch normalization reduces the internal covariate shift (ICS) and
accelerates the training of a deep neural network
• This approach reduces the dependence of gradients on the scale of
the parameters or of their initial values which result in higher learning
rates without the risk of divergence
Adding Noise to Input and Output
•Robust Feature Learning: By adding noise to the input data, the model is
forced to learn robust features that can generalize better to unseen data. The
model learns to focus on the essential patterns and structures in the data rather
than memorizing the exact input values.
•Denoising: In denoising autoencoders, noise is added to the input data, and
the model is trained to reconstruct the original, noise-free data. This process
teaches the model to remove noise from data, which can be useful in tasks
like image denoising or speech enhancement.
•Regularization: Adding noise to the input acts as a form of regularization,
reducing the risk of overfitting by making the model less sensitive to small
variations in the input data.
Adding Noise to Input
• Gaussian Noise: Adding Gaussian noise involves perturbing the input data
with values drawn from a normal distribution. For example, if x is an input
feature, the noisy input x’ can be defined as:
x′ = x + ϵ
where ϵ is a random variable sampled from a Gaussian distribution with
mean 0 and a specified standard deviation.
• Salt-and-Pepper Noise: This type of noise randomly flips some of the pixels
in an image (or elements in the data) to either the minimum or maximum
value.
• Dropout Noise: In some contexts, such as neural networks, dropout can be
viewed as adding noise to the inputs by randomly setting some of the input
units to zero during training.
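Eg (a minimal sketch of Gaussian input noise; sigma is a tunable hyperparameter, and Keras also offers a GaussianNoise layer for the same purpose):

import numpy as np

def add_gaussian_noise(x, sigma=0.1):
    return x + np.random.normal(0.0, sigma, size=x.shape)   # x' = x + eps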
Adding Noise to Output
• Robustness in Output Predictions: Adding noise to the output during
training can help the model become more tolerant to small deviations
in the output, making the predictions more stable.
• Data Augmentation: In some cases, adding noise to the output data
can be a form of data augmentation, increasing the diversity of the
training set.
End of Module 2