Backpropagation
DA322M: Deep Learning by Dr. P. W. Patil, MFSDSAI
McCulloch-Pitts Neuron and Perceptron
McCulloch-Pitts neuron:
$$y = 1 \;\text{ if } \sum_{i=0}^{n} x_i \ge 0, \qquad y = 0 \;\text{ if } \sum_{i=0}^{n} x_i < 0$$
Perceptron:
$$y = 1 \;\text{ if } \sum_{i=0}^{n} w_i\, x_i \ge 0, \qquad y = 0 \;\text{ if } \sum_{i=0}^{n} w_i\, x_i < 0$$
➢ A perceptron separates the input space into two halves
➢ In other words, a single perceptron can only be used to implement linearly separable functions
➢ The weights (including threshold) can be learned and the inputs can be real valued
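To make the decision rule concrete, here is a minimal sketch (not from the slides; the weights and the AND example are illustrative) of a perceptron with the threshold absorbed as w0 and x0 = 1:

```python
import numpy as np

def perceptron_output(w, x):
    """Perceptron decision rule: fire (1) if the weighted sum is >= 0, else 0.
    w includes the bias/threshold term w0; x is expected to start with x0 = 1."""
    return 1 if np.dot(w, x) >= 0 else 0

# Example: a perceptron implementing the boolean AND of two inputs
# (weights chosen by hand; w0 = -1.5 plays the role of the negative threshold).
w = np.array([-1.5, 1.0, 1.0])      # [w0, w1, w2]
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1, x1, x2])       # x0 = 1 absorbs the threshold
    print(x1, x2, "->", perceptron_output(w, x))
```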
Credit: Mitesh Khapra, IITM
Perceptron Learning
➢ Consider two vectors w and x
w = [w0,w1,w2,...,wn]
x = [1,x1,x2,...,xn]
➢ We can rewrite the perceptron rule as: $y = 1$ if $\mathbf{w}^{T}\mathbf{x} \ge 0$, and $y = 0$ if $\mathbf{w}^{T}\mathbf{x} < 0$
➢ Consider some points (vectors) which lie in the positive half space of this line (i.e., wᵀx > 0).
➢ What will be the angle between any such vector and w?
▪ Obviously, less than 90°
➢ Consider some points (vectors) which lie in the negative half space of this line (i.e., wᵀx < 0).
➢ What will be the angle between any such vector and w?
▪ Obviously, greater than 90°
➢ The algorithm has converged when every input is classified correctly, i.e., no more corrections are needed
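A minimal sketch of the perceptron learning algorithm described above, assuming linearly separable data (function and variable names are illustrative):

```python
import numpy as np

def train_perceptron(P, N, n_features, max_epochs=1000):
    """P: list of positive inputs (label 1), N: list of negative inputs (label 0).
    Each input already has x0 = 1 prepended. Returns the learned weight vector w."""
    w = np.zeros(n_features + 1)
    for _ in range(max_epochs):
        converged = True
        for x in P:                      # want w.x >= 0 for positive points
            if np.dot(w, x) < 0:
                w = w + x                # correction: add the misclassified point
                converged = False
        for x in N:                      # want w.x < 0 for negative points
            if np.dot(w, x) >= 0:
                w = w - x                # correction: subtract the misclassified point
                converged = False
        if converged:                    # all points classified correctly
            break
    return w

# Toy usage: learn boolean OR (positive if x1 or x2 is 1)
P = [np.array([1, 0, 1]), np.array([1, 1, 0]), np.array([1, 1, 1])]
N = [np.array([1, 0, 0])]
print(train_perceptron(P, N, n_features=2))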
Proof of Perceptron Learning Convergence
➢ So far we made corrections whenever 𝑤ᵀ𝑝𝑖 < 0 for a positive point (and analogously for negative points)
➢ cos 𝛽 thus grows proportional to √𝑘
➢ Thus, there can only be a finite number of corrections (k) to w, and the algorithm will converge!
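A hedged sketch of the standard convergence argument, under the usual assumptions (normalized inputs, w initialized to 0, a separating w* with unit norm and margin δ):

```latex
% Sketch of the standard argument (not verbatim from the slides): assume the inputs
% are normalized, w is initialized to 0, and there exists a separating w* with
% unit norm and margin \delta = \min_i |w^{*T} p_i| > 0. After the k-th correction:
\begin{align*}
  w^{*T} w_{k} &\ge w^{*T} w_{k-1} + \delta
      &&\Rightarrow\; w^{*T} w_{k} \ge k\delta \\
  \lVert w_{k}\rVert^{2} &\le \lVert w_{k-1}\rVert^{2} + 1
      &&\Rightarrow\; \lVert w_{k}\rVert \le \sqrt{k} \\
  \cos\beta \;=\; \frac{w^{*T} w_{k}}{\lVert w_{k}\rVert}
      &\ge \frac{k\delta}{\sqrt{k}} \;=\; \sqrt{k}\,\delta
\end{align*}
% Since \cos\beta \le 1, we need \sqrt{k}\,\delta \le 1, i.e. k \le 1/\delta^{2}:
% the number of corrections is finite, so the algorithm converges.
```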
Multilayered Network of Perceptrons
Sigmoid Neuron
➢ Perceptron (step function) output: not smooth, not continuous at the threshold, not differentiable
➢ Sigmoid output: smooth, continuous, differentiable
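A small sketch (illustrative, not from the slides) contrasting the perceptron's step output with the sigmoid and its everywhere-defined derivative:

```python
import numpy as np

def step(z):
    """Perceptron output: a hard threshold, not differentiable at z = 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Sigmoid output: smooth, continuous, differentiable everywhere."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print("step:   ", step(z))
print("sigmoid:", np.round(sigmoid(z), 3))
print("grad:   ", np.round(sigmoid_grad(z), 3))
```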
Sigmoid Neuron
➢ Typical supervised machine learning setup:
▪ Data: $\{(x_i, y_i)\}_{i=1}^{n}$
▪ Model: Approximation of the relation between x and y, for example
$$\hat{y} = \frac{1}{1 + e^{-\mathbf{w}^{T}\mathbf{x}}}$$
▪ Parameters: In all the above cases, w is a parameter which needs to be learned from the data
▪ Learning algorithm: An algorithm for learning the parameters w of the model (for example,
perceptron learning algorithm, gradient descent, etc.)
▪ Objective/Loss/Error function: To guide the learning algorithm
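To make the setup concrete, a minimal sketch (variable names and toy data are illustrative, not from the slides) of the sigmoid model and a squared error loss over the data:

```python
import numpy as np

def model(w, x):
    """Sigmoid model: y_hat = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def loss(w, X, y):
    """Squared error over the data {(x_i, y_i)}: mean of (y_hat_i - y_i)^2."""
    y_hat = np.array([model(w, x) for x in X])
    return np.mean((y_hat - y) ** 2)

# Toy data: two features per example, targets in [0, 1]
X = np.array([[0.5, 2.5], [2.3, 0.9], [1.0, 1.0]])
y = np.array([0.2, 0.9, 0.5])
w = np.array([0.1, -0.1])            # the parameter vector to be learned
print("loss at initial w:", loss(w, X, y))
```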
Typical Supervised Learning: Guesswork
➢ Training the Network: Finding w* and b* manually
▪ With some guesswork, we are able to find the optimal values of w and b.
Error Surface for Guesswork
➢ Geometric interpretation of our “guess work” algorithm in terms of this error surface
Gradient Descent: A More Efficient and Principled Way
➢ Thus, $\mathcal{L}(\theta + \eta u) - \mathcal{L}(\theta) \approx \eta\, u^{T}\nabla_{\theta}\mathcal{L}(\theta) = \eta\, k \cos\beta$ (where $k = \lVert u\rVert\,\lVert\nabla_{\theta}\mathcal{L}(\theta)\rVert$), which is most negative when $\cos\beta = -1$, i.e., $\beta = 180°$
➢ The direction u that we intend to move in should be at 180° w.r.t. the gradient
➢ In other words, move in a direction opposite to the gradient
Gradient Descent
➢ Algorithm: initialize w and b, then repeatedly take a step of size η in the direction opposite to the gradient, until convergence (see the sketch below)
➢ For two training points, the per-point gradients ∇w and ∇b are summed before each update
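A minimal sketch of this loop for a sigmoid neuron with squared error loss on a two-point toy dataset (the data values, initialization, and learning rate η are illustrative choices, not taken from the slides):

```python
import numpy as np

def f(w, b, x):
    """Sigmoid neuron: 1 / (1 + exp(-(w*x + b)))."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    """d/dw of (1/2)(f - y)^2 for one point."""
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    """d/db of (1/2)(f - y)^2 for one point."""
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

X, Y = [0.5, 2.5], [0.2, 0.9]        # two training points
w, b, eta = -2.0, -2.0, 1.0          # initial guess and learning rate
for epoch in range(1000):
    dw = sum(grad_w(w, b, x, y) for x, y in zip(X, Y))
    db = sum(grad_b(w, b, x, y) for x, y in zip(X, Y))
    w -= eta * dw                    # move opposite to the gradient
    b -= eta * db
print(w, b)
```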
Feed-forward Neural Network
➢ The input layer can be called the 0th layer and the output layer can be called the Lth layer
➢ 𝑊𝑖 ∈ ℝ𝑛×𝑛 and 𝑏𝑖 ∈ ℝ𝑛 are the weights and bias between layers 𝑖 − 1 and 𝑖 (0 < 𝑖 < 𝐿)
➢ 𝑊𝐿 ∈ ℝ𝑘×𝑛 and 𝑏𝐿 ∈ ℝ𝑘 are the weights and bias between the last hidden layer and the output layer (L = 3 in this case)
➢ The pre-activation at layer 𝑖 is given by,
𝑎𝑖 = 𝑏𝑖 + 𝑊𝑖 ℎ𝑖−1
➢ The activation at layer 𝑖 is given by,
ℎ𝑖 = 𝑔(𝑎𝑖 )
▪ where g is called the activation function (for example, logistic, tanh, linear, etc.)
➢ The activation at the output layer is given by,
𝑓(𝑥) = ℎ𝐿 = 𝑂(𝑎𝐿)
▪ where O is the output activation function (for example, softmax, linear, etc.)
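A minimal sketch of this forward pass, assuming L = 3 layers, hidden width n, output width k, a logistic g, and a linear output activation O (all illustrative choices):

```python
import numpy as np

def g(a):
    """Hidden activation function (logistic, as one possible choice)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, Ws, bs, output_fn=lambda a: a):
    """Compute a_i = W_i h_{i-1} + b_i and h_i = g(a_i) layer by layer;
    the last layer uses the output activation O instead of g."""
    h = x
    for i, (W, b) in enumerate(zip(Ws, bs), start=1):
        a = W @ h + b                      # pre-activation at layer i
        h = output_fn(a) if i == len(Ws) else g(a)
    return h                               # f(x) = h_L = O(a_L)

n, k = 4, 3                                # hidden width and number of outputs
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(n, n)), rng.normal(size=(n, n)), rng.normal(size=(k, n))]
bs = [rng.normal(size=n), rng.normal(size=n), rng.normal(size=k)]
x = rng.normal(size=n)
print(forward(x, Ws, bs))                  # linear output: O(a_L) = a_L
```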
➢ Now 𝛻𝜃 looks much nastier
➢ 𝛻𝜃 is thus composed of:
𝛻𝑊1 , 𝛻𝑊2 , … , 𝛻𝑊𝐿−1 ∈ ℝ𝑛×𝑛 , 𝛻𝑊𝐿 ∈ ℝ𝑘×𝑛
𝛻𝑏1 , 𝛻𝑏2 , … , 𝛻𝑏𝐿−1 ∈ ℝ𝑛 , 𝛻𝑏𝐿 ∈ ℝ𝑘
➢ We need to answer two questions:
▪ How to choose the loss function ℒ(𝜃)?
▪ How to compute 𝛻𝜃 which is composed of
𝛻𝑊1 , 𝛻𝑊2 , … , 𝛻𝑊𝐿−1 ∈ ℝ𝑛×𝑛 , 𝛻𝑊𝐿 ∈ ℝ𝑘×𝑛
𝛻𝑏1 , 𝛻𝑏2 , … , 𝛻𝑏𝐿−1 ∈ ℝ𝑛 , 𝛻𝑏𝐿 ∈ ℝ𝑘
Output Functions and Loss Functions
Loss function 𝓛(𝜽)?
➢ The choice of loss function depends on the problem at hand.
➢ We will illustrate this with the help of two examples
Loss function 𝓛(𝜽)?
➢ Consider our movie example again, but this time we are interested in predicting ratings
➢ Here, 𝑦𝑗 ∈ ℝ³
➢ The loss function should capture how much ŷ𝑗 deviates from 𝑦𝑗
➢ If 𝑦𝑗 ∈ ℝ𝑘 then the squared error loss can capture this deviation
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k}\left(\hat{y}_{ij} - y_{ij}\right)^{2}$$
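A minimal sketch of computing this loss for N examples with k outputs each (the toy numbers are made up for illustration):

```python
import numpy as np

def squared_error(y_hat, y):
    """(1/N) * sum_i sum_j (y_hat_ij - y_ij)^2 for N examples with k outputs each."""
    return np.sum((y_hat - y) ** 2) / y.shape[0]

# Toy example: N = 2 movies, k = 3 predicted ratings each
y     = np.array([[0.8, 0.1, 0.5],
                  [0.2, 0.9, 0.7]])
y_hat = np.array([[0.7, 0.2, 0.4],
                  [0.1, 0.8, 0.9]])
print(squared_error(y_hat, y))
```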
Loss function 𝓛(𝜽)?
➢ A related question: What should the output function 'O' be if 𝑦𝑗 ∈ ℝ?
➢ More specifically, can it be the logistic function?
▪ No, because it restricts ŷ𝑗 to a value between 0 and 1, but we want 𝑦𝑗 ∈ ℝ
➢ So, in such cases it makes sense to have 'O' as a linear function
➢ ŷ𝑗 is then no longer bounded between 0 and 1
Entropy
➢ Expectation: $\mathbb{E}[V] = \sum_{i=1}^{n} p_i\, V(i)$
➢ Entropy: $H(p) = -\sum_{i=1}^{n} p_i \log p_i$
➢ Cross entropy: $H(p, q) = -\sum_{i=1}^{n} p_i \log q_i$
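A small numeric illustration of the three quantities (the distributions p, q and the values V are arbitrary):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])          # true distribution
q = np.array([0.5, 0.3, 0.2])          # predicted distribution
v = np.array([10.0, 0.0, -5.0])        # values of a random variable

expectation   = np.sum(p * v)              # E[V] = sum_i p_i * V(i)
entropy       = -np.sum(p * np.log(p))     # H(p) = -sum_i p_i log p_i
cross_entropy = -np.sum(p * np.log(q))     # H(p, q) = -sum_i p_i log q_i

print(expectation, entropy, cross_entropy)
# Cross entropy is minimized (and equals the entropy) when q matches p.
```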
Loss function 𝓛(𝜽)?
➢ Now let us consider another problem for which a different loss function would be appropriate
➢ Suppose we want to classify an image into 1 of k classes
➢ Here, again we could use the squared error loss to capture the deviation
➢ But, can you think of a better function?
Loss function 𝓛(𝜽)?
➢ Notice that y is a probability distribution
➢ Therefore, we should also ensure that ŷ is a probability distribution
➢ What choice of the output activation 'O' will ensure this?
$$a_L = W_L h_{L-1} + b_L$$
$$\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$$
➢ $O(a_L)_j$ is the $j$th element of ŷ and $a_{L,j}$ is the $j$th element of the vector $a_L$
➢ This function is called the softmax function
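A minimal sketch of the softmax output activation; the max-subtraction is a standard numerical-stability trick not mentioned on the slides and does not change the result:

```python
import numpy as np

def softmax(a_L):
    """O(a_L)_j = exp(a_L,j) / sum_i exp(a_L,i); the output sums to 1."""
    z = a_L - np.max(a_L)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1, -1.0])   # pre-activation at the output layer
y_hat = softmax(a_L)
print(y_hat, y_hat.sum())                # a valid probability distribution
```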
Loss function 𝓛(𝜽)?
➢ Now that we have ensured that both y and ŷ are probability distributions, can you think of a function which captures the difference between them?
➢ Cross-entropy:
$$\mathcal{L}(\theta) = -\sum_{c=1}^{k} y_c \log \hat{y}_c$$
➢ So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function; since y is one-hot with true class $l$, only the term for $c = l$ survives:
$$\min_{\theta}\; \mathcal{L}(\theta) = -\sum_{c=1}^{k} y_c \log \hat{y}_c = -\log \hat{y}_l$$
or equivalently
$$\max_{\theta}\; -\mathcal{L}(\theta) = \log \hat{y}_l$$
➢ $\log \hat{y}_l$ is called the log-likelihood of the data
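A minimal sketch of the cross-entropy loss for a one-hot label, showing that it reduces to the negative log-likelihood of the true class:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """L(theta) = -sum_c y_c log(y_hat_c); for one-hot y this equals -log(y_hat_l)."""
    return -np.sum(y * np.log(y_hat))

y     = np.array([0.0, 1.0, 0.0, 0.0])        # true class l = 1 (one-hot)
y_hat = np.array([0.1, 0.7, 0.1, 0.1])        # softmax output of the network
print(cross_entropy(y, y_hat))                # equals -log(0.7)
print(-np.log(y_hat[1]))
```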
Loss function 𝓛(𝜽)?
➢ Is ŷ𝑙 a function of 𝜃 = [𝑊1, 𝑊2, …, 𝑊𝐿, 𝑏1, 𝑏2, …, 𝑏𝐿]?
➢ Yes, it is indeed a function of 𝜃:
$$\hat{y}_l = \left[O\!\left(W_3\, g\!\left(W_2\, g\!\left(W_1 x_i + b_1\right) + b_2\right) + b_3\right)\right]_l$$
➢ What does ŷ𝑙 encode?
➢ It is the probability that 𝑥 belongs to the 𝑙th class (we want to bring it as close to 1 as possible).
Loss function 𝓛(𝜽)?

                     Real Values       Probabilities
Output Activation    Linear            Softmax
Loss Function        Squared Error     Cross Entropy

➢ Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often
➢ For the rest of this lecture, we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
Any questions….