
Activation Function

An activation function in a neural network is a mathematical function applied to each neuron's output to determine whether it should be activated or not. It takes the weighted sum of inputs plus bias, applies a transformation (often non-linear), and passes the result to the next layer.
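
A minimal sketch of this idea in NumPy (the function name and the example weights, bias, and inputs below are illustrative, not taken from these notes):

import numpy as np

def neuron_output(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = np.dot(w, x) + b          # pre-activation: weighted sum + bias
    return activation(z)          # transformed output sent to the next layer

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.2, 3.0])    # inputs (illustrative)
w = np.array([0.4, 0.7, -0.2])    # weights (illustrative)
b = 0.1                           # bias (illustrative)
print(neuron_output(x, w, b, sigmoid))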
Uses of activation function
• Introduces Non-Linearity – Helps neural networks learn complex, non-
linear patterns.
• Enables Learning of Complex Patterns – Makes the network capable of
approximating any function.
• Transforms Signal Between Layers – Converts a neuron’s input into
output for the next layer.
• Controls Neuron Activation – Decides whether a neuron should be
active or not.
• Allows Learning of Arbitrary Mappings – Essential for handling tasks
like image, speech, and video processing.
• Supports Backpropagation – Facilitates gradient flow during training.
• Adds Flexibility to the Model – Allows networks to adapt to various
types of data and tasks.
• Improves Model Accuracy and Performance
Types of Activation Functions
Activation Function: Binary Step
The Binary Step activation function is one of the simplest activation functions used in neural networks. It outputs only two possible values, typically 0 or 1, depending on whether the input is below or above a threshold.

Gradient of the Binary Step Function:

• The derivative of f(x) with respect to x is zero: f'(x) = 0 for all x (undefined exactly at the threshold).
• Gradients are used to update the weights and biases; since the gradient of this function is zero, the weights and biases do not update during training.
• This function can be used as an activation function when building a binary classifier.
Limitation:
• This function will not be useful when there are multiple classes in the
target variable.
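
A minimal NumPy sketch of the binary step function and its zero gradient (the threshold of 0 and the function names are illustrative):

import numpy as np

def binary_step(x, threshold=0.0):
    """Outputs 1 if the input is at or above the threshold, otherwise 0."""
    return np.where(x >= threshold, 1, 0)

def binary_step_grad(x):
    """The gradient is 0 everywhere (undefined exactly at the threshold),
    so weights and biases receive no update signal."""
    return np.zeros_like(x)

print(binary_step(np.array([-2.0, 0.5, 3.0])))   # -> [0 1 1]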
Activation Function: Sigmoid
The sigmoid activation function maps any real-valued input to a value between 0
and 1, making it useful for binary classification and as a squashing function in

neural networks.

Takes a real-valued number and “squashes” it into range between 0 and 1.


sigmoid(x) = 1 / (1 + e^(−x)),  mapping ℝ → (0, 1)
Nice interpretation as the firing rate of a neuron
• 0 = not firing at all
• 1 = fully firing
Advantages:
• Smooth gradient
• Output can be treated as probability
• Good for binary classification
Disadvantages:

- Sigmoid neurons saturate and kill gradients, so the network barely learns:
• when the neuron's activations are near 0 or 1 (saturation)
• the gradient in these regions is almost zero
• almost no signal flows back to its weights
• if the initial weights are too large, most neurons saturate
• Not zero-centered
• Slow convergence in deep networks
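
A small NumPy sketch of the sigmoid and its gradient, illustrating the saturation described above (function names and sample inputs are illustrative):

import numpy as np

def sigmoid(x):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative sigmoid(x) * (1 - sigmoid(x)); close to 0 when the neuron saturates."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))        # outputs approach 0 and 1 at the extremes
print(sigmoid_grad(x))   # gradients nearly vanish at saturated inputs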
Activation Function: Tanh
The tanh (hyperbolic tangent) activation function is a nonlinear activation that
maps input values to the range (-1, 1). It is zero-centered, which helps
optimization converge faster compared to sigmoid.

Takes a real-valued number and "squashes" it into the range between −1 and 1:
tanh : ℝ → (−1, 1)

- Like sigmoid, tanh neurons saturate


- Unlike sigmoid, output is zero-centered
- Tanh is a scaled sigmoid: tanh(x) = 2·sigmoid(2x) − 1
Advantages:
• Zero-centered output
• Stronger gradients than sigmoid
Disadvantages:
• Still suffers from vanishing gradients
• Saturates at large values
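
A short NumPy sketch of tanh, its gradient, and its relationship to the sigmoid (sample inputs are illustrative):

import numpy as np

def tanh_grad(x):
    """Derivative 1 - tanh(x)^2; still vanishes for large |x|."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x), tanh_grad(x))   # zero-centered outputs in (-1, 1)

# Scaled-sigmoid relationship: tanh(x) == 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0))   # True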
Activation Function: ReLU
ReLU is the most commonly used activation function in modern neural
networks. It outputs the input directly if it is positive; otherwise, it outputs zero.
• ReLU is the most commonly used activation function in CNNs and ANNs.
• Its output range is from 0 to infinity: [0, ∞).
• It returns the input x if x > 0, otherwise it returns 0.
• Though it appears linear in the positive region, ReLU is a non-linear function overall.
• A combination of ReLU functions is also non-linear and can approximate any function.
• ReLU is considered a good function approximator in neural networks.
• Training with ReLU has been reported to converge about six times faster than with the hyperbolic tangent (tanh) function.
• ReLU should only be used in the hidden layers of a neural network.
• For classification problems, the softmax function should be used in the output layer.
• For regression problems, a linear activation function is preferred in the output layer.
• ReLU has a drawback called the "dying ReLU" problem, where some neurons become inactive and always output 0.
• This occurs when weight updates push a neuron into a state where it never activates on any input again.
Advantages:
• Trains much faster
1. accelerates the convergence of SGD
2. due to its linear, non-saturating form
• Less expensive operations
1. compared to sigmoid/tanh (no exponentials etc.)
2. implemented by simply thresholding a matrix at zero
• More expressive
• Sparse activation
• Works well in CNNs, MLPs, RNNs
• Prevents the vanishing gradient problem for positive inputs

Disadvantages:
• Dead ReLU problem: neurons can die (always output 0 if stuck in
negative region)
• Not zero-centered
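
A minimal NumPy sketch of ReLU and its gradient (function names and sample inputs are illustrative):

import numpy as np

def relu(x):
    """Returns x for x > 0, otherwise 0 (simple thresholding at zero)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient is 1 for positive inputs and 0 otherwise (the 'dead' region)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 4.0])
print(relu(x))       # [0.  0.  0.  1.5 4. ]
print(relu_grad(x))  # gradient flows only for positive inputs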
Activation Function: LeakyReLU
Leaky ReLU is a modified version of ReLU that allows a small negative slope for inputs less than 0. It was designed to fix the "dying ReLU" problem, where neurons output 0 and stop learning.
• x > 0 → behaves like normal ReLU (returns x)
• x ≤ 0 → returns a small negative value (typically 0.01·x) instead of 0
Advantages:
• Fixes dead neuron issue in ReLU
• Allows small gradient when x < 0
Disadvantages:
• Slope value (0.01) is arbitrary and not learned
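
A minimal NumPy sketch of Leaky ReLU with the commonly used slope of 0.01 (names and inputs are illustrative):

import numpy as np

def leaky_relu(x, alpha=0.01):
    """Like ReLU for x > 0; a small negative slope alpha*x for x <= 0."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for x > 0 and alpha otherwise, so neurons never fully die."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 2.0])
print(leaky_relu(x))       # [-0.03  -0.005  2.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.  ]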
Activation Function: Swish
Swish is a smooth, non-monotonic activation function, commonly defined as swish(x) = x · sigmoid(x), that often performs better than ReLU in deep networks.
• For large positive x, Swish behaves like ReLU
• For negative x, Swish is smoothly negative, unlike ReLU
• It is non-monotonic: it dips slightly below zero for small negative inputs before rising again, which can help deeper models learn better
Advantages:
• Smooth and non-monotonic
• Performs better than ReLU in many deep models
Disadvantages:
• Slightly slower than ReLU
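
A minimal NumPy sketch of Swish with beta = 1 (names and inputs are illustrative):

import numpy as np

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); smooth and non-monotonic."""
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(x))
# Large positive x: close to x (ReLU-like); small negative x: smoothly negative.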
GELU (Gaussian Error Linear Unit)
GELU is an advanced activation function that is widely used in transformers
and large language models like BERT and GPT. GELU blends ideas from ReLU
and probability theory. GELU weights inputs by how likely they are to be positive under a standard normal distribution: GELU(x) = x · Φ(x), where Φ is the standard normal CDF.

Advantages:
• Used in BERT, GPT (transformers)
• Improves gradient flow
• Combines linear & nonlinear behaviour (passes positive x almost unchanged and shrinks large negative values toward 0)
• Smooth, differentiable, and better than ReLU
• Probabilistic weighting
Disadvantages:
• More complex to compute
• Nonlinear, non-monotonic
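
A minimal NumPy sketch using the common tanh approximation of GELU; the exact form uses the standard normal CDF (names and inputs are illustrative):

import numpy as np

def gelu(x):
    """Tanh approximation of GELU, as used in many BERT/GPT implementations:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))
# Positive inputs pass nearly unchanged; large negative inputs are shrunk toward 0.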
ELU (Exponential Linear Unit)
ELU returns x for x > 0 and α(eˣ − 1) for x ≤ 0, so negative inputs produce smooth, bounded negative outputs. This keeps mean activations closer to zero and avoids dead neurons, at the cost of computing an exponential.
SoftMax
SoftMax converts a vector of raw scores (logits) into a probability distribution that sums to 1: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ). It is typically used in the output layer for multi-class classification.
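
Minimal NumPy sketches of ELU and SoftMax (the value alpha = 1 and the sample logits are illustrative):

import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) for x <= 0 (smooth, slightly negative)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(logits):
    """Turns a vector of scores into a probability distribution that sums to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(elu(np.array([-2.0, 0.0, 3.0])))
print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g. class probabilities in an output layer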
