Introduction to Neural Networks
Lecture 10-11: Data Science
Outline
• Introduction to Neural Networks
• Mathematical Model of a Neural Network
• Differentiation and its Application to Training Neural Networks
Artificial Neural Network
• An Artificial Neural Network (ANN) is a mathematical model that loosely simulates the structure and functionality of the biological nervous system to map inputs to outputs.
Block Diagram of Biological Nervous System
Stimulus → Receptors → Neural Network (or Brain) → Effectors → Response
Typical Human Brain
[Figure: biological neuron with the cell body labeled]
Human Brain Neuron vs Artificial Neuron
Artificial Neuron
[Figure: the k-th artificial neuron: inputs $x_1, \dots, x_n$ are weighted by $W_{k1}, \dots, W_{kn}$ and summed with the bias $b_k$ to give $V_k$]
$V_k = W_{k1} x_1 + W_{k2} x_2 + W_{k3} x_3 + \dots + W_{kn} x_n + b_k$
Artificial Neuron
[Figure: the k-th artificial neuron: the weighted sum $V_k$ is passed through an activation function to produce the output $y_k$]
$V_k = \sum_{j=1}^{n} W_{kj} x_j + b_k$
$y_k = f(V_k)$
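A minimal sketch of these two formulas in Python (NumPy assumed; the function name and sample numbers are illustrative, not from the slides):

```python
import numpy as np

def neuron_output(x, w, b, f=lambda v: v):
    """Forward pass of a single artificial neuron.

    x : inputs (x1..xn), w : weights (Wk1..Wkn), b : bias bk,
    f : activation function (identity by default).
    """
    v = np.dot(w, x) + b      # Vk = sum_j Wkj * xj + bk
    return f(v)               # yk = f(Vk)

# Example with three inputs
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 0.4
print(neuron_output(x, w, b))   # 0.8
```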
Single Neuron Model
[Figure: single-input neuron with weight $W_{k1}$ and bias $b_k$]
The output $y_k$ is linearly dependent on the input parameters.
$V_k = W_{k1} x_1 + b_k$
$y_k = f(V_k) = W_{k1} x_1 + b_k$
Single Neuron Model
• Application
– For data-fitting applications where we have to fit a straight line $y = mx + c$ to a large data set, where m = slope of the straight line, c = intercept, x = height, and y = weight (a fitting sketch follows the figure below).
[Figure: scatter plot of weight (y-axis) against height (x-axis) with a fitted straight line $y = mx + c$]
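A minimal fitting sketch in Python, assuming NumPy; the height/weight numbers are made up purely for illustration, and `np.polyfit` is used here as one standard way to obtain the least-squares slope and intercept:

```python
import numpy as np

# Illustrative (made-up) height/weight pairs, as in the scatter plot above.
height = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
weight = np.array([18.0, 29.0, 43.0, 55.0, 66.0, 80.0])

# Least-squares fit of y = m*x + c (degree-1 polynomial).
m, c = np.polyfit(height, weight, 1)
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")

# Predicted weight for a new height.
print(m * 3.5 + c)
```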
Single Neuron Model
Error Calculation
• The error $E_i = (\text{actual value} - \text{predicted value}) = (T_i - y_i)$
• To make it positive, we square it: $E_i = (T_i - y_i)^2$ [error for the i-th input instance]
[Figure: weight vs. height scatter plot with the fitted straight line]
Linear Neural Network
• Error Calculation
– The error is calculated so that the slope (m) and intercept (c) can be adjusted for a better fit in the next iteration.
[Figure: weight vs. height scatter plot with the fitted straight line]
Linear Neural Network
$y_k = W_{k1} x_1 + b_k$ corresponds to the straight line $y = m x + c$.
[Figure: single-input neuron with weight $W_{k1}$ and bias $b_k$; the output is linearly dependent on the input]
$V_k = W_{k1} x_1 + b_k$
$y_k = f(V_k) = W_{k1} x_1 + b_k$
Plotting Error
[Figure: the error plotted against the weight $W_{k1}$]
Differentiation…
$y = f(x), \qquad \frac{dy}{dx} = \frac{df}{dx} = y' = f'$
[Figure: curve $y = f(x)$ with a secant line through $(x_1, y_1)$ and $(x_2, y_2)$ making angle $\theta$ with the x-axis]
How much does $y$ change as $x$ changes?
$\frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{p}{b} = \tan(\theta)$
$\frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x}$
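A minimal numerical sketch of this limit in Python (the function and step size are illustrative):

```python
def numerical_derivative(f, x, dx=1e-6):
    """Approximate dy/dx = lim_{dx->0} (f(x+dx) - f(x)) / dx."""
    return (f(x + dx) - f(x)) / dx

# Example: the derivative of f(x) = x^2 at x = 3 is 6.
print(numerical_derivative(lambda x: x**2, 3.0))   # ~6.000001
```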
Differentiation…
$y = f(x), \qquad \frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x}$
As $\Delta x \to 0$ we obtain the tangent at $x$.
[Figure: tangent line to $y = f(x)$ at $x = x_1$, making angle $\theta$ with the x-axis]
$\frac{dy}{dx} = \tan(\theta)$ = slope of the tangent to the x-axis at $x = x_1$
Differentiation…
$y = f(x)$
[Figure: rising tangent at $x = x_1$]
For $0^\circ < \theta < 90^\circ$, $\tan(\theta)$ is positive; at $\theta = 90^\circ$, $\tan(90^\circ)$ is undefined.
Differentiation…
$y = f(x)$
[Figure: falling tangent at $x = x_1$]
For $\theta > 90^\circ$, $\tan(\theta)$ is negative.
Differentiation…
$y = f(x)$
[Figure: horizontal tangent at $x = x_1$]
For $\theta = 0^\circ$, $\tan(\theta) = 0$.
Differentiation…
$y = f(x)$
For $\theta = 0^\circ$, $\tan(\theta) = 0$.
[Figure: curve with a maximum and a minimum, both with horizontal tangents]
Note: at a minimum and at a maximum the slope is 0, i.e. $\tan(\theta) = 0$ and $\frac{dy}{dx} = 0$.
Differentiation…
Distinguishing between a Minimum and a Maximum
Let $f(x) = x^2 - 3x + 2$.
Set $\frac{df}{dx} = 0$: $2x - 3 = 0$, so $x = 1.5$, and $f(1.5) = -0.25$.
Take a point near 1.5, say $x = 1$: $f(1) = 1 - 3 + 2 = 0$, which is larger than $f(1.5)$.
So $x = 1.5$ cannot be a maximum; it is a minimum. (Equivalently, the second derivative $f''(x) = 2 > 0$ confirms a minimum.)
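The same check as a small Python sketch (the nearby points are chosen purely for illustration):

```python
# A quick numeric check that x = 1.5 is a minimum of f(x) = x^2 - 3x + 2:
# nearby points on both sides give larger function values.
def f(x):
    return x**2 - 3*x + 2

print(f(1.5))             # -0.25
print(f(1.4) > f(1.5))    # True
print(f(1.6) > f(1.5))    # True, so x = 1.5 is a minimum
```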
Error Function with a Minimum and No Maximum
[Figure: error function $y = f(x)$ with a single minimum and no maximum]
Error Function with a Maximum and No Minimum
[Figure: error function $y = f(x)$ with a single maximum and no minimum]
Error Function without a Maximum or a Minimum
[Figure: monotonic error function $y = f(x)$ with neither a maximum nor a minimum]
Error Function with multiple Maxima and Minima
[Figure: error function $y = f(x)$ with multiple maxima and minima; the global minimum and a local minimum are marked]
TRAINING A SINGLE-NEURON MODEL
[Figure: the k-th neuron: inputs $x_{i1}, \dots, x_{id}$ weighted by $W_{k1}, \dots, W_{kd}$, summed with the bias $b_k$, and passed through an activation function to give $y'_k$]
$V_k = \sum_{j=1}^{d} W_{kj} x_{ij} + b_k, \qquad y'_k = f(V_k), \qquad L = \sum_{i=1}^{n} \left( y_i - f(w^T x_i + b) \right)^2$
• Step-1: Define the loss function
• Step-2: Define the optimization problem
TRAINING A SINGLE-NEURON MODEL
[Figure: single neuron with inputs $x_{i1}, \dots, x_{in}$, weights $W_1, \dots, W_n$, bias $b_k$, and an activation function producing $y'_k$]
$V_k = \sum_{j=1}^{d} W_j x_{ij} + b_k, \qquad y'_k = f(V_k), \qquad L = (y - y')^2$
• Step-3: Solve the optimization problem (a minimal sketch follows)
– Randomly initialize the weights
– Feed the inputs forward and compute the loss function
– Update the weights using the gradients of the loss, e.g. $\frac{\partial L}{\partial b} = -2(y - y')$ and $\frac{\partial L}{\partial w} = -2(y - y')\,x$
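A minimal sketch of Steps 1-3 in Python for a single-input neuron with an identity activation; NumPy is assumed, and the data, learning rate, and epoch count are illustrative choices, not values from the slides:

```python
import numpy as np

# Illustrative data: fit y = w*x + b (single input, identity activation).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1])

w, b = np.random.randn(), np.random.randn()   # Step 3a: random initialization
lr = 0.01                                     # learning rate (illustrative)

for epoch in range(1000):
    y_pred = w * x + b                        # Step 3b: feed forward
    # Gradients of the squared-error loss, averaged over the data:
    # dL/dw = -2*(y - y')*x, dL/db = -2*(y - y')
    grad_w = np.mean(-2 * (y - y_pred) * x)
    grad_b = np.mean(-2 * (y - y_pred))
    w -= lr * grad_w                          # Step 3c: update the weights
    b -= lr * grad_b

print(w, b)   # close to the slope and intercept of the data
```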
TYPES OF NEURAL NETWORK
WHY MULTILAYER NEURAL NETWORK?
• Biological Inspiration
• Universal Approximators: Can approximate any nonlinear
function to any desired level of accuracy.
• Results in Powerful Models
TRAINING MULTILAYER NEURAL NETWORK
Sample labeled data → Randomly initialize the weights → Forward it through the network, get predictions → Back-propagate the errors → Update the network weights
• Back-Propagation: Chain Rule + Memoization
– In Stochastic Gradient Descent (SGD) you take one point (input vector) per update
– In Mini-Batch SGD, you take a small set of points (input vectors)
– In Gradient Descent, you take all the input vectors (see the sketch below)
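A small Python sketch of how the three variants pick the points used for one weight update (the array shapes and batch size are illustrative):

```python
import numpy as np

n = 1000                       # number of training points (illustrative)
X = np.random.randn(n, 3)      # input vectors
indices = np.arange(n)

# Stochastic Gradient Descent: one point per update.
i = np.random.randint(n)
batch_sgd = X[i:i + 1]

# Mini-Batch SGD: a small set of points per update.
batch_mini = X[np.random.choice(indices, size=32, replace=False)]

# (Full) Gradient Descent: all points per update.
batch_full = X

print(batch_sgd.shape, batch_mini.shape, batch_full.shape)  # (1, 3) (32, 3) (1000, 3)
```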
AI vs Machine Learning vs Deep Learning
Deep Learning
• A type of machine learning based on artificial
neural networks in which multiple layers of
processing are used to extract progressively higher
level features from data.
- "Deep Learning with Python", François Chollet
DEEP LEARNING APPROACH
• Standard Approach (the mathematician's way)
– Build new theories
– Perform experiments
• New Deep Learning Approach (the engineer's way)
– Given a huge amount of computational power,
– people first experiment and then try to build a theory
Why Deep Learning ? Why Now ?
• Computer Vision: Convolutional Neural Networks and backpropagation, well understood since 1989
• Time Series Forecasting: Long Short-Term Memory (LSTM), well understood since 1997
- "Deep Learning with Python", François Chollet
Algorithmic Advancements…
• Better activation functions for neural layers.
• Better weight-initialization schemes, starting with layer-wise pretraining.
• Concepts like Dropout were introduced to avoid overfitting.
• Better optimization schemes, such as RMSProp and Adam.
Activation Functions…
• An Activation Function (Transfer Function) maps the weighted summation of the inputs to the output.
• An activation function is used to add nonlinearity so that the network can learn complex patterns.
Sigmoid Activation Function
• Characteristics:
– Differentiable
– Nonlinear
– Output lies in [0, 1]
– Fast
– Suffers from the Vanishing Gradient Problem
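A minimal Python sketch of the sigmoid and its derivative (NumPy assumed):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: output lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative is sigmoid(x) * (1 - sigmoid(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))              # 0.5
print(sigmoid_derivative(0.0))   # 0.25 (its maximum value)
```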
VANISHING GRADIENT PROBLEM
• With the sigmoid activation function the derivative is less than 1, and when these derivatives are multiplied together along the chain the result is a very small number, which changes the weights only very slightly.
• It usually occurs when the derivatives are less than 1.
• It occurs frequently with the sigmoid and tanh activation functions.
$\frac{dL}{dw} = \frac{dL}{df_1} \times \frac{df_1}{df_2} \times \frac{df_2}{df_3} \times \dots \times \frac{df_n}{dw}$
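A small Python sketch of the effect: the sigmoid derivative is at most 0.25, so a product of many such factors becomes tiny (the depth of 10 layers is an illustrative choice):

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)     # never larger than 0.25

# Multiplying many per-layer derivatives (each < 1) shrinks the gradient.
grad = 1.0
for layer in range(10):
    grad *= sigmoid_derivative(0.0)   # 0.25 at its largest
print(grad)   # 0.25**10 is about 9.5e-7, so the early weights barely change
```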
ReLU Activation Function
• f(x)= x, when x>0
= 0, when x<=0
• Avoids Vanishing Gradient Problem.
• Derivative is Simple
– f’(x)= 1 for x>=0
= 0 for x<0
• Problem:
– Dead ReLU Units
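A minimal Python sketch of ReLU and its derivative, following the definition above (NumPy assumed):

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = x for x > 0, else 0."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """f'(x) = 1 for x >= 0, else 0 (the convention used on the slide)."""
    return np.where(x >= 0, 1.0, 0.0)

print(relu(np.array([-2.0, 0.0, 3.0])))             # [0. 0. 3.]
print(relu_derivative(np.array([-2.0, 0.0, 3.0])))  # [0. 1. 1.]
```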
Leaky ReLU Activation Function
• f(x)= x, when x>0
= 0.1x, when x<=0
• The advantages of Leaky ReLU are the same as those of ReLU.
• In addition, it enables Backpropagation, even for
negative input values.
• Avoids Dead ReLU
• Simple Derivative
– f’(x)= 1 for x>=0
= 0.1 for x<0
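A minimal Python sketch of Leaky ReLU with the 0.1 slope used above (NumPy assumed):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: f(x) = x for x > 0, else alpha * x (alpha = 0.1 as on the slide)."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.1):
    """f'(x) = 1 for x >= 0, else alpha, so gradients still flow for negative inputs."""
    return np.where(x >= 0, 1.0, alpha)

print(leaky_relu(np.array([-2.0, 3.0])))             # [-0.2  3. ]
print(leaky_relu_derivative(np.array([-2.0, 3.0])))  # [0.1 1. ]
```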
WEIGHT INITIALIZATION
[Figure: the error plotted against the weights]
WEIGHT INITIALIZATION
• Mostly used
– We should never initialize to same values.
• Asymmetry is necessary
– We should not initialize to large –ve values
• Vanishing Gradient problems
– Weights should be small (not too small)
– Weights should have good variance
– Weights should come from a Normal distribution with
mean zero and small variance
– Should have some +ve and Some –ve values
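A minimal Python sketch of such an initialization; the layer sizes and the standard deviation of 0.01 are illustrative choices, not prescribed by the slides:

```python
import numpy as np

fan_in, fan_out = 256, 128          # layer sizes (illustrative)

# Small, zero-mean normal initialization: the weights get good variance,
# a mix of positive and negative values, and no two neurons start identical.
W = np.random.normal(loc=0.0, scale=0.01, size=(fan_out, fan_in))
b = np.zeros(fan_out)

print(W.mean(), W.std())   # close to 0 and 0.01
```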