Module 2
1. Explain batch normalization with a relevant example.
2. Write a code snippet for transfer learning with Keras (see the sketch after this list).
3. Explain, with relevant illustrations, in which cases you would want to
use each of the following activation functions: ELU, leaky ReLU (and its
variants), ReLU, tanh, logistic, and softmax.
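For question 2, a minimal transfer-learning sketch with tf.keras; the MobileNetV2 base, the 224×224 RGB input size, and the 5 target classes are illustrative assumptions, not specifics from these notes:

```python
# Minimal transfer-learning sketch with tf.keras (illustrative base model,
# input size, and number of classes).
from tensorflow import keras

# Load a convolutional base pre-trained on ImageNet, without its classifier head.
base_model = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base_model.trainable = False  # freeze the pre-trained weights

# Add a new classification head for the target task (5 classes assumed here).
model = keras.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(5, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new head on the target data
```

The pre-trained base is frozen so that only the new head is trained on the target data; the base can later be unfrozen for fine-tuning with a low learning rate.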
Detailed Answer
1. ELU (Exponential Linear Unit):
The Exponential Linear Unit (ELU) addresses the issue of dying
neurons in ReLU by allowing small negative outputs for negative
inputs. It helps improve learning speed and leads to smoother
convergence.
When to Use:
When training deep networks to achieve faster convergence.
In cases where small negative outputs are needed to push the
mean activation closer to zero.
Formula:
f(x) = x if x > 0
f(x) = α(e^x − 1) if x ≤ 0
where α is a positive constant.
Use Case:
Image classification and convolutional neural networks (CNNs)
where fast convergence is crucial.
Works well with batch normalization.
2. Leaky ReLU (and Variants: Parametric ReLU, Randomized Leaky
ReLU):
Leaky ReLU addresses the dying ReLU problem by allowing a small,
non-zero slope for negative values of x. This keeps neurons active and
learning even when the input is negative.
Variants:
Parametric ReLU (PReLU): Allows the slope of negative values to be
learned.
Randomized Leaky ReLU (RReLU): Randomizes the slope during
training for regularization.
When to Use:
When dealing with deep networks prone to the dying ReLU
problem.
For time-series data or speech recognition tasks where some
negative values can hold important information.
Formula:
f(x) = x if x > 0
f(x) = αx if x ≤ 0
where α is a small positive constant (e.g., α = 0.01).
3. ReLU (Rectified Linear Unit):
ReLU is the most commonly used activation function due to its
simplicity and effectiveness. It outputs 0 for negative inputs and the
input itself for positive inputs.
When to Use:
In hidden layers of deep neural networks.
Works well for image-related tasks and object detection.
Formula:
f(x) = max(0, x)
Limitations:
Prone to the dying ReLU problem (neurons stop learning when
stuck in the negative region).
Use Case:
Convolutional Neural Networks (CNNs) for image recognition.
Feedforward networks where computational efficiency is crucial.
4. tanh (Hyperbolic Tangent):
The tanh activation function outputs values between -1 and 1, making it
zero-centered. This helps to avoid shifting gradients in one direction
during backpropagation.
When to Use:
When dealing with classification tasks that require negative values
as well as positive values.
In hidden layers of networks where zero-centered outputs are
beneficial.
Formula:
f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Use Case:
Recurrent Neural Networks (RNNs) for time-series predictions.
Binary classification tasks where outputs need to be zero-
centered.
5. Logistic (Sigmoid) Function:
The sigmoid function outputs values between 0 and 1, making it
suitable for binary classification problems.
When to Use:
For binary classification tasks where the output is a probability
(e.g., predicting whether an email is spam or not).
In output layers when a probability score is required.
Formula:
f(x) = 1 / (1 + e^(−x))
Limitations:
Causes vanishing gradients for very large or very small inputs.
The output is not zero-centered, which can slow down learning.
Use Case:
Logistic regression and binary classification tasks.
Probability-based models where outputs need to be interpreted as
probabilities.
6. Softmax Function:
The softmax function is used to convert the outputs of a neural network
into a probability distribution over multiple classes.
When to Use:
In the output layer of a multiclass classification model.
When the task requires predicting one class out of multiple
possible classes.
Formula:
σ(z_i) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)
where z_i is the output for class i, and K is the total number of classes.
Use Case:
Multiclass classification problems, such as identifying handwritten
digits (0-9) in the MNIST dataset.
Natural language processing (NLP) tasks like named entity
recognition.
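The formulas above can be sketched directly in NumPy; the α values below (0.01 for leaky ReLU, 1.0 for ELU) are illustrative defaults, not part of the original notes:

```python
# NumPy sketch of the six activation functions discussed in this answer.
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def relu(x):
    return np.maximum(0, x)

def tanh(x):
    return np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x))
print(tanh(x), sigmoid(x))
print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1: a probability distribution
```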
4. Discuss how to train the DNN on this training set. For each
image pair, you can simultaneously feed the first image to DNN
A and the second image to DNN B. The whole network will
gradually learn to tell whether two images belong to the same
class or not.
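A hedged Keras sketch of this two-branch setup is shown below; the 28×28 input size, the layer widths, and the names dnn_a/dnn_b are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def make_branch(name):
    # One sub-network (DNN A or DNN B) mapping an image to a feature vector.
    return keras.Sequential([
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
    ], name=name)

input_a = keras.Input(shape=(28, 28))  # first image of the pair
input_b = keras.Input(shape=(28, 28))  # second image of the pair
features_a = make_branch("dnn_a")(input_a)
features_b = make_branch("dnn_b")(input_b)

# Combine both feature vectors and predict same class (1) vs. different (0).
merged = layers.concatenate([features_a, features_b])
output = layers.Dense(1, activation="sigmoid")(merged)

pair_model = keras.Model(inputs=[input_a, input_b], outputs=output)
pair_model.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])
# pair_model.fit([images_a, images_b], same_class_labels, epochs=5)
```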
5. Explain the vanishing gradients problem in neural network.
OR
Explain the vanishing and exploding gradients problem in
neural network.
Vanishing Gradients:
When training a neural network using backpropagation, the error gradients
computed during backpropagation tend to decrease as they are propagated
backward through the network's layers.
In deep networks, especially when using activation functions like the
sigmoid or hyperbolic tangent, this can cause the gradients to shrink
exponentially, making them too small to cause meaningful updates to the
weights in the earlier layers.
As a result:
The earlier layers learn very slowly, or not at all.
The model fails to converge to a good solution.
Exploding Gradients:
In contrast, the exploding gradients problem occurs when the gradients
become excessively large during backpropagation, resulting in large updates to
the network weights. This can cause:
The model parameters to become unstable.
Divergence in the training process, where the loss function increases
instead of decreasing.
These problems are more pronounced in very deep networks and recurrent
neural networks (RNNs).
Causes:
The vanishing/exploding gradients problem can be attributed to factors such
as:
Poor initialization of weights.
Use of saturating activation functions like sigmoid and tanh.
The accumulation of small gradients through many layers.
Solutions:
Several techniques can mitigate these issues:
1. Use of Non-saturating Activation Functions: Functions like ReLU (Rectified
Linear Unit) do not saturate for positive inputs, reducing the risk of
vanishing gradients.
2. Proper Weight Initialization: Xavier and He initialization methods help
maintain a balance in gradient propagation.
3. Batch Normalization: Normalizing inputs within the network can reduce
internal covariate shift, stabilizing gradients during training.
4. Gradient Clipping: This technique caps the gradients during
backpropagation to prevent them from becoming too large, addressing
exploding gradients.
These methods have significantly improved the training of deep neural
networks, enabling them to handle more complex tasks effectively.
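A minimal tf.keras sketch combining these remedies (He initialization, the non-saturating ReLU activation, Batch Normalization, and gradient clipping via the optimizer); the layer sizes and the 784-dimensional input are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),                        # illustrative input size
    layers.Dense(256, kernel_initializer="he_normal"),
    layers.BatchNormalization(),                      # stabilizes the gradients
    layers.Activation("relu"),                        # non-saturating activation
    layers.Dense(128, kernel_initializer="he_normal"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])

# clipnorm caps the gradient norm during backpropagation (gradient clipping).
optimizer = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```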
6. Discuss the problem that Glorot initialization and He
initialization aim to fix.
Both Glorot initialization and He initialization were proposed to address
vanishing and exploding gradients, especially in deep neural networks.
These problems occur because weights are poorly initialized, causing
gradients to shrink (vanish) or grow (explode) as they propagate backward
through the layers.
The vanishing and exploding gradients problem is particularly severe
when:
The network is deep (many layers).
Activations are not properly scaled, leading to either:
Vanishing gradients: When gradients become very small and fail to
update the weights of earlier layers.
Exploding gradients: When gradients grow too large and make the
network unstable.
The root cause is how weights are initialized. If weights are too small or too
large, the signal passing through layers is either diminished or amplified
exponentially, causing instability.
Xavier Initialization:
Proposed by Xavier Glorot and Yoshua Bengio, Xavier initialization works
well for sigmoid and tanh activation functions, which are prone to
saturation (leading to vanishing gradients).
Balances the variance of the activations and gradients to keep them within
a reasonable range across layers.
Formula for weight initialization:
W ∼ N(0, 2/(n_in + n_out))
where n_in and n_out are the numbers of input and output neurons.
He Initialization:
Proposed by Kaiming He et al. for ReLU and variants of ReLU (like Leaky
ReLU).
ReLU does not saturate like sigmoid or tanh, but it can suffer from dying
neurons if weights are not initialized properly.
The scaling factor 2/n_in accounts for the fact that ReLU only activates about
half of the neurons on average, preventing the gradients from vanishing.
Formula for weight initialization:
W ∼ N(0, 2/n_in)
where n_in is the number of input neurons.
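A short tf.keras sketch showing how the two initializations are typically paired with activations, Glorot (Xavier) with tanh or sigmoid and He with ReLU; the layer sizes are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(100,)),
    # Glorot (Xavier) initialization for a saturating activation (tanh).
    layers.Dense(64, activation="tanh", kernel_initializer="glorot_uniform"),
    # He initialization for a non-saturating activation (ReLU).
    layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    layers.Dense(1, activation="sigmoid"),
])
```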
7. Differentiate Non-saturating and Saturating activation
functions with example.
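A small NumPy sketch of the difference: a saturating function such as the sigmoid has a derivative that collapses toward zero for large inputs, while a non-saturating function such as ReLU keeps a constant gradient of 1 for positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):          # derivative of the saturating sigmoid
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):             # derivative of the non-saturating ReLU
    return 1.0 if x > 0 else 0.0

for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid' = {sigmoid_grad(x):.6f}   ReLU' = {relu_grad(x):.1f}")
```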
8. Explain the variants of ReLU activation function.
1. ReLU (Rectified Linear Unit):
The original ReLU is the most widely used activation function in deep
learning. It outputs 0 for negative inputs and x for positive inputs.
Formula:
f(x) = max(0, x)
Pros:
Simple and efficient.
Helps reduce vanishing gradient problems.
Computationally inexpensive.
Cons:
Dying ReLU problem: Neurons can stop learning if they get stuck in the
negative region and always output zero.
2. Leaky ReLU:
Leaky ReLU fixes the dying ReLU problem by allowing a small, non-zero
slope for negative inputs.
Formula:
f(x) = x if x > 0
f(x) = αx if x ≤ 0
where α is a hyperparameter (the slope for negative inputs).
Pros:
Prevents neurons from dying.
Allows negative values to propagate through the network.
Cons:
Choosing the right value of α can be tricky.
Use Case:
Used in GANs (Generative Adversarial Networks), RNNs, and
networks prone to the dying ReLU problem.
3. Parametric ReLU (PReLU):
Parametric ReLU is a variant of Leaky ReLU where the slope for negative
inputs is learnable during training.
Formula:
f(x) = x if x > 0
f(x) = ax if x ≤ 0
where a is a learnable parameter.
Pros:
The network can adapt the slope for negative values based on the data.
Reduces the risk of dying neurons.
Cons:
Can lead to overfitting if not regularized.
Use Case:
Effective in large-scale image classification tasks and deep neural
networks.
4. Randomized Leaky ReLU (RReLU):
Randomized Leaky ReLU randomly chooses the slope α for negative inputs
from a given range during training.
Formula:
f(x) = x if x > 0
f(x) = αx if x ≤ 0
where α is randomly sampled from a range [l, u] during training and fixed
during testing.
Pros:
Helps with regularization.
Reduces the risk of overfitting.
Cons:
The choice of the range [l, u] can impact performance.
Use Case:
Useful in low-resource environments or for noise-tolerant networks.
5. Exponential Linear Unit (ELU):
ELU allows small negative outputs, which helps push the mean activation
closer to zero, improving learning.
Formula:
f(x) = x if x > 0
f(x) = α(e^x − 1) if x ≤ 0
where α is a constant.
Pros:
Pushes the mean activation closer to zero, which improves learning speed.
Reduces vanishing gradients more effectively than ReLU.
Cons:
Computationally expensive compared to ReLU.
Use Case:
Used in convolutional neural networks (CNNs) and deep residual
networks.
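A brief tf.keras sketch using these variants as layers; the slope and α values are illustrative, and RReLU is omitted because it is not a built-in Keras layer:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64,)),
    layers.Dense(32),
    layers.ReLU(),             # plain ReLU
    layers.Dense(32),
    layers.LeakyReLU(0.01),    # Leaky ReLU with a fixed small negative slope
    layers.Dense(32),
    layers.PReLU(),            # Parametric ReLU: the slope is learned
    layers.Dense(32),
    layers.ELU(alpha=1.0),     # Exponential Linear Unit
    layers.Dense(1, activation="sigmoid"),
])
```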
9. Discuss the different strategies to fix the vanishing gradient
issue.
The vanishing gradients problem poses a significant challenge in training deep
neural networks, hindering the effective learning of lower layers as gradients
diminish during backpropagation.
Here are several strategies to mitigate this issue:
1. Xavier and He Initialization:
These weight initialization techniques aim to maintain consistent variance
of both outputs and gradients throughout the network.
Xavier Initialization: Suitable for the logistic activation function, it initializes
weights randomly, ensuring the variance of outputs matches the variance
of inputs.
Formula for weight initialization:
W ∼ N(0, 2/(n_in + n_out))
where n_in and n_out are the numbers of input and output neurons.
He Initialization: Designed for the ReLU activation function and its variants,
it accounts for the fact that ReLU only activates for positive values.
It typically uses a normal distribution with a mean of 0 and a standard
deviation of σ = sqrt(2 / n_inputs).
By employing these initialization methods, training can be accelerated
significantly, and deeper networks can be trained effectively.
2. Non-saturating Activation Functions:
The choice of activation function plays a crucial role in mitigating vanishing
gradients.
The sigmoid activation function, once popular, suffers from saturation for
large input values, leading to gradients close to zero.
ReLU and its variants (Leaky ReLU, ELU, RReLU, PReLU) address this issue
by not saturating for positive values.
The ELU activation function, in particular, has shown promising results in
speeding up training and improving model performance, although it may be
computationally slower than ReLU at test time.
3. Batch Normalization:
This technique tackles the issue of internal covariate shift, where the
distribution of each layer's inputs changes during training.
It normalizes the inputs of each layer, stabilizing the gradients and allowing
the use of higher learning rates.
Batch Normalization has been shown to significantly reduce vanishing
gradients, speed up training, and improve the overall performance of deep
neural networks.
4. Gradient Clipping:
Primarily used for recurrent neural networks, gradient clipping involves
capping gradients during backpropagation to prevent them from exceeding
a certain threshold.
This helps prevent exploding gradients, a problem where gradients become
excessively large.
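A short tf.keras sketch of strategy 4: gradient clipping can be configured directly on the optimizer, either by clipping each gradient component (clipvalue) or by rescaling the whole gradient when its norm exceeds a threshold (clipnorm); the thresholds below are illustrative:

```python
from tensorflow import keras

# Clip every gradient component to the range [-1.0, 1.0].
opt_clip_value = keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Rescale the gradient vector whenever its L2 norm exceeds 1.0.
opt_clip_norm = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# model.compile(optimizer=opt_clip_norm, loss="sparse_categorical_crossentropy")
```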
10. Discuss the various ways in which we can reuse a pre-trained
model.
11. Describe pretraining on an auxiliary task.