Recap
• We have been considering linear discriminant functions.
• Such a linear classifier is given by

      h(X) = 1  if  Σ_{i=1}^{d′} w_i φ_i(X) + w_0 > 0
           = 0  otherwise

  where the φ_i are fixed functions.
• We have been considering the case φ_i(X) = x_i for simplicity.
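As a concrete illustration, here is a minimal Python (NumPy) sketch of such a generalized linear classifier; the function name h, the example weights, and the basis functions below are my own assumptions, not part of the lecture.

    import numpy as np

    def h(X, w, w0, basis):
        # Sketch of the classifier above: returns 1 if sum_i w_i * phi_i(X) + w_0 > 0, else 0.
        s = sum(wi * phi(X) for wi, phi in zip(w, basis)) + w0
        return 1 if s > 0 else 0

    # With phi_i(X) = x_i (the case used in the lecture for simplicity):
    basis = [lambda X, i=i: X[i] for i in range(2)]
    print(h(np.array([1.0, -0.5]), w=[2.0, 1.0], w0=-1.0, basis=basis))   # prints 1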
Perceptron
• The Perceptron is the earliest such classifier.
• Assuming an augmented feature vector, h(X) = sgn(W^T X).
• ‘Find the weighted sum and threshold it.’
Perceptron Learning Algorithm
• A simple iterative algorithm.
• In each iteration, we locally try to correct errors.
  Let ΔW(k) = W(k+1) − W(k). Then

      ΔW(k) = 0       if W(k)^T X(k) > 0 and y(k) = 1, or
                         W(k)^T X(k) < 0 and y(k) = 0
            = X(k)    if W(k)^T X(k) ≤ 0 and y(k) = 1
            = −X(k)   if W(k)^T X(k) ≥ 0 and y(k) = 0
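A minimal Python (NumPy) sketch of this learning rule, assuming 0/1 labels and augmented feature vectors as in the slides; the function name, the zero initialization, and the epoch limit are assumptions of mine.

    import numpy as np

    def perceptron_train(X, y, max_epochs=100):
        # X: n x (d+1) array of augmented feature vectors (rows), y: n-vector of 0/1 labels.
        n, d1 = X.shape
        W = np.zeros(d1)                    # initial weight vector (an assumption)
        for _ in range(max_epochs):
            corrected = False
            for Xk, yk in zip(X, y):
                s = W @ Xk                  # W(k)^T X(k)
                if yk == 1 and s <= 0:      # misclassified positive pattern: W <- W + X(k)
                    W += Xk
                    corrected = True
                elif yk == 0 and s >= 0:    # misclassified negative pattern: W <- W - X(k)
                    W -= Xk
                    corrected = True
            if not corrected:               # no errors left: a separating hyperplane was found
                return W
        return W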
Perceptron: Geometric view
• The algorithm has a simple geometric view. Consider the data set shown in the figure.
• Suppose W(k) misclassifies a pattern.
• The correction made to W(k) can then be seen geometrically in the figure.
• We showed that if the training set is linearly separable, the algorithm finds a separating hyperplane in finitely many iterations.
• We also saw the ‘batch’ version of the algorithm, which can be shown to be gradient descent on a reasonable cost function.
Perceptron
• A simple ‘device’: weighted sum and threshold.
• A simple learning machine (a neuron model).
• The Perceptron is an interesting algorithm for learning linear classifiers.
• It works only when the data are linearly separable.
• In general, it is not possible to know beforehand whether the data are linearly separable.
• We next look at other linear methods for classification and regression.
Regression Problems
• Recall that the regression (or function learning) problem is closely related to learning classifiers.
• The training set is {(X_i, y_i), i = 1, ..., n} with X_i ∈ ℜ^d, y_i ∈ ℜ, ∀i.
• The main difference is that the ‘target’ or ‘output’ y_i is continuous-valued in a regression problem, while it can take only finitely many distinct values for a classifier.
• In a regression problem, the goal is to learn a function f : ℜ^d → ℜ that captures the relationship between X and y. We write ŷ = f(X).
• Note that any such function can also be viewed as a classifier: we can take h(X) = sgn(f(X)) as the classifier.
• We search over a suitably parameterized class of functions to find the best one.
• Once again, the problem is that of learning the best parameters.
Linear Regression
• We now consider learning a linear function

      f(X) = Σ_{i=1}^{d} w_i x_i + w_0

  where W = (w_1, ..., w_d)^T ∈ ℜ^d and w_0 ∈ ℜ are the parameters.
• Thus a linear model can be expressed as f(X) = W^T X + w_0.
• As earlier, by using an augmented vector X, we can write this as f(X) = W^T X.
• Now, to learn the ‘optimal’ W, we need a criterion function.
• The criterion function assigns a figure of merit, or a cost, to each W ∈ ℜ^{d+1}.
• The optimal W is then the one that optimizes the criterion function.
• The criterion function used most often is the sum of squared errors.
Linear Least Squares Regression
• We want to find a W such that ŷ(X) = f(X) = W^T X is a good fit to the training data.
• Consider the function J : ℜ^{d+1} → ℜ defined by

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• We take the ‘optimal’ W to be the minimizer of J(·).
• This is known as the linear least squares method.
• We want to find W to minimize

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• If we are learning a classifier, we can take y_i ∈ {−1, +1}.
• Note that we would finally use the sign of W^T X as the classifier output.
• Thus minimizing J is also a good way to learn linear discriminant functions.
• We want to find the minimizer of

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• This is a quadratic function, so we can find the minimizer analytically.
• For this we rewrite J(W) in a more convenient form.
• Recall that we take all vectors to be column vectors.
• Hence each training sample X_i is a (d+1) × 1 matrix.
• Let A be the matrix given by

      A = [X_1 · · · X_n]^T

• A is an n × (d+1) matrix whose ith row is X_i^T.
• Hence AW is an n × 1 vector whose ith element is X_i^T W.
• Let Y be the n × 1 vector whose ith element is y_i.
• Hence AW − Y is an n × 1 vector whose ith element is (X_i^T W − y_i).
• Hence we have

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2 = (1/2) (AW − Y)^T (AW − Y)

• To find the minimizer of J(·), we equate its gradient to zero.
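A quick numerical check of this identity in Python (NumPy); the random data below are purely hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 6, 2
    A = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # rows are augmented X_i^T
    Y = rng.normal(size=n)
    W = rng.normal(size=d + 1)

    J_sum    = 0.5 * sum((A[i] @ W - Y[i]) ** 2 for i in range(n))
    J_matrix = 0.5 * (A @ W - Y) @ (A @ W - Y)
    print(np.isclose(J_sum, J_matrix))                          # True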
• We have

      ∇J(W) = A^T (AW − Y)

• Equating the gradient to zero, we get

      (A^T A) W = A^T Y

• The optimal W satisfies this system of linear equations (called the normal equations).
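A minimal sketch of forming A and Y and solving the normal equations with NumPy; the data, noise level, and true weights below are hypothetical, and np.linalg.lstsq would be the more numerically robust choice in practice.

    import numpy as np

    # Hypothetical training data: n samples in d dimensions with a known linear relation.
    rng = np.random.default_rng(0)
    n, d = 50, 3
    X = rng.normal(size=(n, d))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=n)

    # Augmented design matrix A: ith row is X_i^T with a leading 1 for w_0.
    A = np.hstack([np.ones((n, 1)), X])

    # Solve the normal equations (A^T A) W = A^T Y.
    W = np.linalg.solve(A.T @ A, A.T @ Y)
    print(W)          # close to [0.3, 1.0, -2.0, 0.5]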
• A^T A is a (d+1) × (d+1) matrix.
• A^T A is invertible if A has linearly independent columns. (This is because the null space of A is the same as the null space of A^T A.)
• The rows of A are the training samples X_i.
• Hence the jth column of A gives the values of the jth feature across all the examples.
• Hence the columns of A are linearly independent if no feature can be obtained as a linear combination of the other features.
• If we assume the features are linearly independent, then A has linearly independent columns and hence A^T A is invertible.
• This is a reasonable assumption.
• The optimal W is a solution of (A^T A) W = A^T Y.
• When A^T A is invertible, we get the optimal W as

      W* = (A^T A)^{-1} A^T Y = A† Y

  where A† = (A^T A)^{-1} A^T is called the generalized inverse of A.
• This W* is the linear least squares solution to our regression (or classification) problem.
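A short check, on the same kind of hypothetical data as before, that the normal equation solution coincides with A† Y; here I use NumPy's pinv, which computes the Moore-Penrose pseudoinverse and agrees with (A^T A)^{-1} A^T when A has full column rank.

    import numpy as np

    rng = np.random.default_rng(2)
    A = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])   # full column rank (n >> d)
    Y = rng.normal(size=20)

    W_normal = np.linalg.solve(A.T @ A, A.T @ Y)   # solution of the normal equations
    W_geninv = np.linalg.pinv(A) @ Y               # W* = A† Y
    print(np.allclose(W_normal, W_geninv))         # True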
Geometry of Least Squares
• Our least squares method seeks a W that minimizes ||AW − Y||^2.
• A is an n × (d+1) matrix and normally n >> d.
• Consider the (over-determined) system of linear equations AW = Y.
• The system may or may not be consistent, but we seek the W* that minimizes the squared error.
• As we saw, the solution is W* = A† Y; hence the name generalized inverse for A†.
• The least squares method is trying to find a ‘best-fit’ W for the system AW = Y.
• Let C_0, C_1, ..., C_d be the columns of A.
• Then AW = w_0 C_0 + w_1 C_1 + · · · + w_d C_d.
• Thus, for any W, AW is a linear combination of the columns of A.
• Hence, if Y is in the space spanned by the columns of A, there is an exact solution.
• Otherwise, we want the projection of Y onto the column space of A.
• That is, we want to find the vector Z in the column space of A that is closest to Y.
• Any vector in the column space of A can be written as Z = AW for some W.
• Hence we want to find Z to minimize ||Z − Y||^2 subject to the constraint that Z = AW for some W.
• That is exactly the least squares solution.
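A small numerical illustration of this geometric picture, on hypothetical data: the fitted Z = AW* is the projection of Y onto the column space of A, so the residual Y − Z should be orthogonal to every column of A (which is just the normal equations again).

    import numpy as np

    rng = np.random.default_rng(3)
    A = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
    Y = rng.normal(size=30)

    W_star = np.linalg.lstsq(A, Y, rcond=None)[0]     # least squares solution
    Z = A @ W_star                                    # projection of Y onto the column space of A
    residual = Y - Z
    print(np.allclose(A.T @ residual, 0, atol=1e-8))  # residual is orthogonal to the columns of A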
• Let us take the original (not augmented) data vectors and write our model as ŷ(X) = f(X) = W^T X + w_0, where now W ∈ ℜ^d.
• Now we have

      J(W) = (1/2) Σ_{i=1}^{n} (W^T X_i + w_0 − y_i)^2

• For any given W, we can find the best w_0 by equating the partial derivative with respect to w_0 to zero.
We have

      ∂J/∂w_0 = Σ_{i=1}^{n} (W^T X_i + w_0 − y_i)

Equating this partial derivative to zero, we get

      Σ_{i=1}^{n} (W^T X_i + w_0 − y_i) = 0
      ⇒  n w_0 + W^T Σ_{i=1}^{n} X_i = Σ_{i=1}^{n} y_i
This gives us

      w_0 = (1/n) Σ_{i=1}^{n} y_i − W^T ( (1/n) Σ_{i=1}^{n} X_i )

• Thus, w_0 accounts for the difference between the average of y and the average of W^T X.
• So, w_0 is often called the bias term.
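A short numerical check of this relation on hypothetical data: at the least squares optimum, the fitted bias equals the mean of y minus W^T applied to the mean of X.

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 40, 3
    X = rng.normal(size=(n, d)) + 2.0                       # original (non-augmented) data vectors
    y = X @ np.array([1.0, 0.5, -1.0]) + 3.0 + 0.05 * rng.normal(size=n)

    A = np.hstack([np.ones((n, 1)), X])                     # augmented design matrix
    w = np.linalg.lstsq(A, y, rcond=None)[0]
    w0, W = w[0], w[1:]

    # w_0 = mean(y) - W^T mean(X) at the optimum:
    print(np.isclose(w0, y.mean() - W @ X.mean(axis=0)))    # True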
• We have taken our linear model to be

      ŷ(X) = f(X) = Σ_{j=0}^{d} w_j x_j

• As mentioned earlier, we could instead choose any fixed set of basis functions φ_j.
• Then the model would be

      ŷ(X) = f(X) = Σ_{j=0}^{d′} w_j φ_j(X)
• We can use the same criterion of minimizing the sum of squared errors:

      J(W) = (1/2) Σ_{i=1}^{n} (W^T Φ(X_i) − y_i)^2

  where Φ(X_i) = (φ_0(X_i), ..., φ_{d′}(X_i))^T.
• We want the minimizer of J(·).
• We can learn W using the same method as earlier.
• Thus, we again have

      W* = (A^T A)^{-1} A^T Y

• The only difference is that now the ith row of the matrix A is

      [φ_0(X_i)  φ_1(X_i)  · · ·  φ_{d′}(X_i)]
• As an example, let d = 1 (so X_i, y_i ∈ ℜ).
• Take φ_j(X) = X^j, j = 0, 1, ..., m.
• Now the model is

      ŷ(X) = f(X) = w_0 + w_1 X + w_2 X^2 + · · · + w_m X^m

• That is, the model says y is an mth-degree polynomial in X.
• All such problems are tackled in a uniform fashion using the least squares method we have presented.
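A minimal sketch of this polynomial case in NumPy; the data, the noise level, the degree m = 3, and the use of np.vander to build the design matrix are my own illustrative choices.

    import numpy as np

    # Hypothetical 1-d data: y is roughly a cubic in X plus noise.
    rng = np.random.default_rng(5)
    X = rng.uniform(-1, 1, size=30)
    y = 1.0 - 2.0 * X + 0.5 * X**3 + 0.1 * rng.normal(size=30)

    m = 3                                           # degree of the polynomial model
    A = np.vander(X, N=m + 1, increasing=True)      # ith row: [1, X_i, X_i^2, X_i^3]
    W = np.linalg.lstsq(A, y, rcond=None)[0]        # same least squares machinery as before
    print(W)                                        # roughly [1.0, -2.0, 0.0, 0.5]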
LMS Algorithm
• We are finding the W* that minimizes

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• We could also have found the minimum through an iterative scheme using gradient descent.
• The gradient of J is given by

      ∇J(W) = Σ_{i=1}^{n} X_i (X_i^T W − y_i)
• The iterative gradient descent scheme would be

      W(k+1) = W(k) − η Σ_{i=1}^{n} X_i (X_i^T W(k) − y_i)

• In analogy with what we saw for the Perceptron algorithm, this can be viewed as a ‘batch’ version.
• We use the current W to find the errors on all the training data and then make all the ‘corrections’ together.
• We can instead have an incremental version of this algorithm.
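A minimal sketch of this batch iteration in NumPy; the function name, the zero initialization, the step size η, and the fixed iteration count are assumptions, and η must be small enough for the iteration to converge.

    import numpy as np

    def batch_least_squares_gd(A, Y, eta=1e-3, iters=10000):
        # Batch update: W(k+1) = W(k) - eta * sum_i X_i (X_i^T W(k) - y_i)
        #                      = W(k) - eta * A^T (A W(k) - Y).
        W = np.zeros(A.shape[1])
        for _ in range(iters):
            W = W - eta * A.T @ (A @ W - Y)     # all corrections applied together, once per pass
        return W

    # On hypothetical data this approaches the normal equation solution (A^T A)^{-1} A^T Y:
    rng = np.random.default_rng(6)
    A = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
    Y = rng.normal(size=50)
    print(np.allclose(batch_least_squares_gd(A, Y),
                      np.linalg.lstsq(A, Y, rcond=None)[0], atol=1e-4))   # True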
• For the incremental version, at each iteration we pick one of the training samples; call it X(k).
• The error on this sample is (1/2) (X(k)^T W(k) − y(k))^2.
• Using the gradient of only this term, we get the incremental version

      W(k+1) = W(k) − η X(k) (X(k)^T W(k) − y(k))

• This is called the LMS algorithm.
• In the LMS algorithm, we iteratively update W as

      W(k+1) = W(k) − η X(k) (X(k)^T W(k) − y(k))

• Here (X(k), y(k)) is the training example picked at iteration k and W(k) is the weight vector at iteration k.
• We do not need to have all the training examples with us at once; we can learn W from a stream of examples without needing to store them.
• If η is sufficiently small, this algorithm also converges to the minimizer of J(W).
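A minimal sketch of the LMS update processing a stream of examples one at a time; the helper name lms_update, the synthetic data generator, the step size, and the iteration count are all my own assumptions.

    import numpy as np

    def lms_update(W, Xk, yk, eta=0.01):
        # One LMS step: W(k+1) = W(k) - eta * X(k) (X(k)^T W(k) - y(k)).
        return W - eta * Xk * (Xk @ W - yk)

    # Learning from a stream of (hypothetical) examples without storing them:
    rng = np.random.default_rng(7)
    W_true = np.array([0.5, 1.0, -2.0])
    W = np.zeros(3)
    for k in range(20000):
        Xk = np.hstack([1.0, rng.normal(size=2)])       # augmented sample X(k)
        yk = W_true @ Xk + 0.01 * rng.normal()          # noisy target y(k)
        W = lms_update(W, Xk, yk)
    print(W)                                            # close to W_true for small eta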