• We have been discussing the SVM method for learning classifiers.
• The basic idea is to transform the feature space and learn a linear classifier in the new space.
• Using kernel functions we can do this mapping implicitly.
• Thus kernels give us an elegant method to learn nonlinear classifiers.
• We can use the same idea in regression problems also.
Kernel Trick
• We use φ : ℜm → H to map pattern vectors into an appropriate high-dimensional space.
• The kernel function allows us to compute inner products in H implicitly, without using (or even knowing) φ.
• Through kernel functions, many algorithms that use only inner products can be implicitly executed in a high-dimensional H (e.g., Fisher discriminant, regression, etc.).
• We can elegantly construct non-linear versions of linear techniques.
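• For instance, the kernel K(X, X′) = (1 + X^T X′)² on ℜ2 corresponds to an explicit six-dimensional feature map; the small NumPy sketch below (not from the slides; the specific vectors are arbitrary) checks that the kernel value equals the inner product in H:

import numpy as np

def poly_kernel(x, z):
    # K(x, z) = (1 + x^T z)^2, computed directly in the input space
    return (1.0 + x @ z) ** 2

def phi(x):
    # One explicit feature map for this kernel on R^2 (chosen for illustration;
    # any map with phi(x)^T phi(z) = K(x, z) would do)
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(poly_kernel(x, z))    # implicit inner product in H
print(phi(x) @ phi(z))      # explicit inner product: same value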
Support Vector Regression
• Now we consider the regression problem.
• Given training data {(X1, y1), . . . , (Xn, yn)}, Xi ∈ ℜm, yi ∈ ℜ, we want to find the 'best' function to predict y given X.
• We search in a parameterized class of functions
    g(X, W) = w1 φ1(X) + · · · + wm′ φm′(X) + b = W^T Φ(X) + b,
where φi : ℜm → ℜ are some chosen functions.
• If we choose φi(X) = xi (and hence m = m′), then it is a linear model.
• Denoting Z = Φ(X) ∈ ℜm′, we are essentially learning a linear model in a transformed space.
• This is in accordance with the basic idea of the SVM method.
• We want to formulate the problem so that we can use the kernel idea.
• Then, by using a kernel function, we never need to compute or even precisely specify the mapping Φ.
Loss function
• As in a general regression problem, we need to find W to minimize
    Σ_i L(yi, g(Xi, W))
where L is a loss function.
• This is the general strategy of empirical risk minimization.
• We consider a special loss function that allows us to use the kernel trick.
ǫ-insensitive loss
• We employ the ǫ-insensitive loss function:
    Lǫ(yi, g(Xi, W)) = 0,                      if |yi − g(Xi, W)| < ǫ
                     = |yi − g(Xi, W)| − ǫ,    otherwise.
Here, ǫ is a parameter of the loss function.
• If the prediction is within ǫ of the true value, there is no loss.
• Using the absolute value of the error rather than the square of the error allows for better robustness.
• It also gives us an optimization problem with the right structure.
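• A minimal NumPy sketch of this loss (not part of the slides; the function name and sample values are our own), vectorized over a set of examples:

import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    # L_eps = max(|y - y_pred| - eps, 0): zero inside the eps-tube,
    # absolute error minus eps outside it
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])
print(eps_insensitive_loss(y, y_pred, eps=0.1))   # [0.  0.4 1.9]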
• We have chosen the model as:
    g(X, W) = Φ(X)^T W + b.
• Hence empirical risk minimization under the ǫ-insensitive loss function would minimize
    Σ_{i=1}^n max( |yi − Φ(Xi)^T W − b| − ǫ, 0 )
• We can write this as an equivalent constrained optimization problem.
• We can pose the problem as follows.
    min_{W, b, ξ, ξ′}   Σ_{i=1}^n ξi + Σ_{i=1}^n ξi′
    subject to  yi − W^T Φ(Xi) − b ≤ ǫ + ξi,    i = 1, . . . , n
                W^T Φ(Xi) + b − yi ≤ ǫ + ξi′,   i = 1, . . . , n
                ξi ≥ 0, ξi′ ≥ 0,                i = 1, . . . , n
• This does not give a dual with the structure we want.
• So, we reformulate the optimization problem.
The Optimization Problem
• Find W, b and ξi, ξi′ to
    minimize    (1/2) W^T W + C ( Σ_{i=1}^n ξi + Σ_{i=1}^n ξi′ )
    subject to  yi − W^T Φ(Xi) − b ≤ ǫ + ξi,    i = 1, . . . , n
                W^T Φ(Xi) + b − yi ≤ ǫ + ξi′,   i = 1, . . . , n
                ξi ≥ 0, ξi′ ≥ 0,                i = 1, . . . , n
• We have added the term W^T W to the objective function. This is like a model complexity term in a regularization context.
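• As an illustration, this primal problem can be handed directly to a generic convex solver; the sketch below assumes the cvxpy package, a toy one-dimensional data set and the identity map Φ(X) = X (in practice one works with the dual and a kernel instead):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d = 30, 1
Phi = rng.uniform(-3.0, 3.0, size=(n, d))        # toy choice: Phi(X) = X
y = 0.8 * Phi[:, 0] + 0.1 * rng.normal(size=n)   # noisy linear targets
C, eps = 10.0, 0.1

W = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)
xi_p = cp.Variable(n, nonneg=True)

# (1/2) W^T W + C (sum xi + sum xi'), subject to the two eps-tube constraints
objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * (cp.sum(xi) + cp.sum(xi_p)))
constraints = [y - Phi @ W - b <= eps + xi,
               Phi @ W + b - y <= eps + xi_p]
cp.Problem(objective, constraints).solve()
print(W.value, b.value)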
• As earlier, we can form the Lagrangian and then, using the Kuhn-Tucker conditions, get the optimal values of W and b.
• Since this problem is similar to the earlier one, we would get W∗ in terms of the optimal Lagrange multipliers as earlier.
• Essentially, the Lagrange multipliers corresponding to the inequality constraints on the errors would be the determining factors.
• We can use the same technique as earlier to formulate the dual and solve for the optimal Lagrange multipliers.
The dual
• The dual of this problem is
    max_{α, α′}  Σ_{i=1}^n yi (αi − αi′) − ǫ Σ_{i=1}^n (αi + αi′) − (1/2) Σ_{i,j} (αi − αi′)(αj − αj′) Φ(Xi)^T Φ(Xj)
    subject to   Σ_{i=1}^n (αi − αi′) = 0
                 0 ≤ αi, αi′ ≤ C,   i = 1, . . . , n
The solution
• We can use the Kuhn-Tucker conditions to derive the final optimal values of W and b as earlier.
• This gives us
    W∗ = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)
    b∗ = yj − Φ(Xj)^T W∗ + ǫ,   for j s.t. 0 < αj∗ < C/n
• We have
    W∗ = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)
    b∗ = yj − Φ(Xj)^T W∗ + ǫ,   for j s.t. 0 < αj∗ < C/n
• Note that we have αi∗ αi∗′ = 0. Also, αi∗ and αi∗′ are zero for examples where the error is less than ǫ.
• The final W∗ is a linear combination of (the images of) some of the examples – the support vectors.
• Note that the dual and the final solution are such that we can use the kernel trick.
• Let K(X, X′) = Φ(X)^T Φ(X′).
• The optimal model learnt is
    g(X, W∗) = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)^T Φ(X) + b∗
             = Σ_{i=1}^n (αi∗ − αi∗′) K(Xi, X) + b∗
• As earlier, b∗ can also be written in terms of the kernel function.
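• Once the multipliers are known, evaluating the learnt model only needs kernel values; a small sketch (not from the slides; the Gaussian kernel, support vectors and coefficient values are made up purely for illustration):

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2), one common choice of kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svr_predict(x, support_X, beta, b):
    # g(x) = sum_i (alpha_i* - alpha_i*') K(X_i, x) + b*, with beta_i = alpha_i* - alpha_i*'
    return sum(b_i * rbf_kernel(x_i, x) for b_i, x_i in zip(beta, support_X)) + b

support_X = [np.array([0.0]), np.array([1.0]), np.array([2.5])]   # made-up support vectors
beta = [0.7, -0.3, 0.4]                                           # made-up (alpha* - alpha*') values
print(svr_predict(np.array([1.2]), support_X, beta, b=0.1))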
Support vector regression
• Once again, the kernel trick allows us to learn non-linear models using a linear method.
• For example, if we use a Gaussian kernel, we get a Gaussian RBF net as the nonlinear model. The RBF centers are easily learnt here.
• The parameters to be chosen are C, ǫ and the parameters of the kernel function.
• The basic idea of SVR can be used in many related problems.
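• In practice these parameters are simply passed to an off-the-shelf SVR implementation; a minimal sketch assuming scikit-learn and toy data (the particular values of C, ǫ and the kernel width are arbitrary):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# Gaussian (RBF) kernel; C, epsilon and gamma are the parameters to be chosen
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)
print(model.support_vectors_.shape)      # the support vectors act as the RBF centres
print(model.predict(np.array([[0.5]])))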
SV regression
• With the ǫ-insensitive loss function, points whose targets are within ǫ of the prediction do not contribute any 'loss'.
• This gives rise to some interesting robustness of the method. It can be proved that local movements of the target values of points outside the ǫ-tube do not influence the regression.
• The robustness essentially comes through the support vector representation of the regression.
• In our formulation of the regression problem we did not explain why we added the W^T W term to the objective function.
• We are essentially minimizing
    (1/2) W^T W + C Σ_{i=1}^n max( |yi − Φ(Xi)^T W − b| − ǫ, 0 )
• This is 'regularized risk minimization'.
• Here W^T W is the model complexity term, which is intended to favour learning of 'smoother' models.
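• A direct NumPy sketch of this regularized risk (not from the slides; Phi, y and the parameter values below are placeholders):

import numpy as np

def regularized_risk(W, b, Phi, y, C, eps):
    # (1/2) W^T W + C * sum_i max(|y_i - Phi(X_i)^T W - b| - eps, 0)
    residual = np.abs(y - Phi @ W - b)
    return 0.5 * W @ W + C * np.sum(np.maximum(residual - eps, 0.0))

Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # placeholder transformed inputs
y = np.array([0.9, 1.1, 2.3])
print(regularized_risk(np.array([1.0, 1.0]), 0.0, Phi, y, C=10.0, eps=0.1))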
• Next we explain why W^T W is a good term to capture the degree of smoothness of the model being fitted.
• Let f : ℜm → ℜ be a continuous function.
• Continuity means we can make |f(X) − f(X′)| as small as we want by taking ||X − X′|| sufficiently small.
• There are ways to characterize the 'degree of continuity' of a function.
• We consider one such measure now.
ǫ-Margin of a function
• The ǫ-margin of a function f : ℜm → ℜ is
    mǫ(f) = inf{ ||X − X′|| : |f(X) − f(X′)| ≥ 2ǫ }
• The intuitive idea is: how small can ||X − X′|| be while still keeping |f(X) − f(X′)| 'large'?
• The larger mǫ(f) is, the smoother the function.
• Obviously, mǫ(f) = 0 if f is discontinuous.
• mǫ(f) can be zero even for continuous functions, e.g., f(x) = 1/x.
• mǫ(f) > 0 for all ǫ > 0 iff f is uniformly continuous.
• Higher margin would mean the function is 'slowly varying' and hence is a 'smoother' model.
SVR and margin
• Consider regression with linear models. Then, |f(X) − f(X′)| = |W^T(X − X′)|.
• Among all X, X′ with |W^T(X − X′)| ≥ 2ǫ, ||X − X′|| is smallest when |W^T(X − X′)| = 2ǫ and (X − X′) is parallel to W. That is, X − X′ = ± 2ǫW / (W^T W).
• Thus, mǫ(f) = 2ǫ / ||W||.
• Thus, in our optimization problem for SVR, minimizing W^T W promotes learning of smoother models.
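• A tiny NumPy check of this calculation (not from the slides; the particular W and ǫ are arbitrary):

import numpy as np

W = np.array([3.0, 4.0])
eps = 0.2
delta = 2 * eps * W / (W @ W)      # the minimizing displacement X - X'
print(abs(W @ delta))              # 0.4 = 2*eps: the function value changes by exactly 2*eps
print(np.linalg.norm(delta))       # 0.08 = 2*eps/||W||, the eps-margin of f(X) = W^T X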
Solving the SVM optimization problem
• So far we have not considered any algorithms for solving the SVM optimization problem.
• We have to solve a constrained optimization problem to obtain the Lagrange multipliers and hence the SVM.
• Many specialized algorithms have been proposed for this.
• The optimization problem to be solved is
    max_µ  q(µ) = Σ_{i=1}^n µi − (1/2) Σ_{i,j=1}^n µi µj yi yj K(Xi, Xj)
    subject to  0 ≤ µi ≤ C, i = 1, . . . , n,   Σ_{i=1}^n yi µi = 0
• A quadratic programming (QP) problem with interesting structure.
Example
• We will first consider a very simple example problem in ℜ2 to get a feel for the method of obtaining the SVM.
• Suppose we have 3 examples:
    X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0)
with y1 = y2 = +1 and y3 = −1.
• As is easy to see, a linear classifier is not sufficient here.
• Suppose we use the kernel function K(X, X′) = (1 + X^T X′)².
• This example is shown in the figure below.
    (Figure: the three training points plotted in ℜ2.)
• Recall, the examples are X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0).
• The objective function involves K(Xi, Xj). These are given in a matrix below.
    [ (1 + Xi^T Xj)² ]  =  [ 4  0  1 ]
                           [ 0  4  1 ]
                           [ 1  1  1 ]
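• This matrix is easy to verify numerically (a NumPy sketch, not part of the slides):

import numpy as np

X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
K = (1.0 + X @ X.T) ** 2       # K[i, j] = (1 + Xi^T Xj)^2
print(K)
# [[4. 0. 1.]
#  [0. 4. 1.]
#  [1. 1. 1.]]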
• Now the objective function to be maximized is
    q(µ) = Σ_{i=1}^3 µi − (1/2)(4µ1² + 4µ2² + µ3² − 2µ1µ3 − 2µ2µ3)
• The constraints are
    µ1 + µ2 − µ3 = 0;   and µi ≥ 0, i = 1, 2, 3.
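• As a cross-check, this small QP can also be handed to a generic solver; the sketch below assumes SciPy and recovers the multipliers derived on the following slides:

import numpy as np
from scipy.optimize import minimize

def neg_q(mu):
    # negative of the dual objective q(mu), since 'minimize' minimizes
    m1, m2, m3 = mu
    return -(mu.sum() - 0.5 * (4*m1**2 + 4*m2**2 + m3**2 - 2*m1*m3 - 2*m2*m3))

cons = {"type": "eq", "fun": lambda mu: mu[0] + mu[1] - mu[2]}
res = minimize(neg_q, x0=np.zeros(3), method="SLSQP",
               bounds=[(0, None)] * 3, constraints=cons)
print(res.x)    # approximately [1. 1. 2.]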
• The Lagrangian for this problem is
    L(µ, λ, α) = q(µ) + λ(µ1 + µ2 − µ3) − Σ_{i=1}^3 αi µi
• Using the Kuhn-Tucker conditions, we have ∂L/∂µi = 0 and µ1 + µ2 − µ3 = 0.
• This gives us four equations; we have 7 unknowns. We use the complementary slackness conditions on the αi.
• We have αi µi = 0. Essentially, we need to guess which µi > 0.
• In this simple problem we know that all µi > 0.
• This is because all the Xi would be support vectors.
• Hence we take all αi = 0.
• We now have four unknowns: µ1, µ2, µ3, λ.
• Using ∂L/∂µi = 0, i = 1, 2, 3, and feasibility, we can solve for the µi.
• The equations are
    1 − 4µ1 + µ3 + λ = 0
    1 − 4µ2 + µ3 + λ = 0
    1 − µ3 + µ1 + µ2 − λ = 0
    µ1 + µ2 − µ3 = 0
• These give us λ = 1 and µ3 = 2µ1 = 2µ2.
• Thus we get µ1 = µ2 = 1 and µ3 = 2.
• This completely determines the SVM.
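• The same four linear equations can also be solved mechanically (a NumPy sketch, not from the slides), confirming these values:

import numpy as np

# Unknowns ordered as (mu1, mu2, mu3, lambda); each row is one equation above
A = np.array([[-4.0,  0.0,  1.0,  1.0],
              [ 0.0, -4.0,  1.0,  1.0],
              [ 1.0,  1.0, -1.0, -1.0],
              [ 1.0,  1.0, -1.0,  0.0]])
rhs = np.array([-1.0, -1.0, -1.0, 0.0])
print(np.linalg.solve(A, rhs))    # [1. 1. 2. 1.]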
• If we use the penalty constant C with C ≥ 2, we get the same solution. (If C < 2, we cannot get this solution.)
• The classification of any X by this SVM is by the sign of f(X):
    f(X) = Σ_i µi yi K(Xi, X) + b∗
         = K(X1, X) + K(X2, X) − 2K(X3, X) + b∗
• Let us first calculate b∗.
• Recall the formula
    b∗ = yj − Σ_i µi yi K(Xi, Xj),   for j s.t. 0 < µj
• We can take j = 1, 2 or 3.
• With j = 1 we get b∗ = 1 − (4 + 0 − 2) = −1.
• With j = 3 we get b∗ = −1 − (1 + 1 − 2) = −1.
• If we solved our optimization problem correctly, we should get the same b∗ !
• We have X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0) and K(X, X′) = (1 + X^T X′)².
• Hence, taking X = (x1, x2)^T, we have
    f(X) = K(X1, X) + K(X2, X) − 2K(X3, X) + b∗
         = (1 − x1)² + (1 + x1)² − 2(1) − 1
         = 2x1² − 1
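• A short numerical confirmation of b∗ and of this decision function (a NumPy sketch, not part of the slides):

import numpy as np

X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
mu = np.array([1.0, 1.0, 2.0])
K = (1.0 + X @ X.T) ** 2

b = y[0] - np.sum(mu * y * K[:, 0])      # using j = 1 (index 0): b* = -1
print(b)

def f(x):
    # f(X) = sum_i mu_i y_i K(X_i, X) + b*
    k = (1.0 + X @ x) ** 2
    return np.sum(mu * y * k) + b

for x1 in [0.0, 0.5, 1/np.sqrt(2), 1.0]:
    print(x1, f(np.array([x1, 0.0])), 2 * x1**2 - 1)   # the two expressions agree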
• Hence this SVM will assign class +1 to X = (x1, x2)^T if
    2x1² ≥ 1,   or equivalently   |x1| ≥ 1/√2
• Why not |x1| ≥ 1/2?
• Because we are maximizing the margin of the hyperplane in the transformed feature space (where x1² is one of the coordinates), not in the original feature space.
• The final SVM is intuitively very reasonable, and we solve essentially the same problem whether we are seeking a linear classifier or a nonlinear classifier.
• Getting back to the general case, we need to solve
    max_µ  q(µ) = Σ_{i=1}^n µi − (1/2) Σ_{i,j=1}^n µi µj yi yj K(Xi, Xj)
    subject to  0 ≤ µi ≤ C, i = 1, . . . , n,   Σ_{i=1}^n yi µi = 0
• We need a numerical method.
• Due to the special structure, many efficient algorithms have been proposed.
• One interesting idea – Chunking.
• We optimize over only a few variables at a time.
• The dimensionality of the optimization problem is thereby controlled.
• We keep randomly choosing the subset of variables to work on.
• This gave rise to the first specialized algorithm for SVM – SVM Light.
• Taking chunking to the extreme – what is the smallest set of variables we can optimize over?
• We need to consider at least two variables at a time because there is an equality constraint.
• Sequential Minimal Optimization (SMO) works by optimizing two variables at a time.
• We can analytically find the optimum with respect to two variables (a sketch of this update follows below).
• We need to decide which two variables to consider in each iteration.
• This results in a very efficient algorithm.
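• A simplified sketch of that two-variable analytic update (not from the slides; it omits SMO's pair-selection heuristics and threshold update, and the function and variable names are our own):

import numpy as np

def smo_pair_update(i, j, mu, y, K, C, f_values):
    # One SMO step: optimize the dual over (mu_i, mu_j) analytically, holding the
    # rest fixed, then clip to the box [0, C] and the equality constraint.
    E_i = f_values[i] - y[i]                # current prediction errors at the two points
    E_j = f_values[j] - y[j]
    eta = K[i, i] + K[j, j] - 2 * K[i, j]   # curvature along the feasible line
    if eta <= 0:
        return mu                           # degenerate pair; skipped in this sketch
    # Bounds on mu_j from 0 <= mu <= C and sum_k y_k mu_k = 0
    if y[i] == y[j]:
        L, H = max(0.0, mu[i] + mu[j] - C), min(C, mu[i] + mu[j])
    else:
        L, H = max(0.0, mu[j] - mu[i]), min(C, C + mu[j] - mu[i])
    mu_new = mu.copy()
    mu_new[j] = np.clip(mu[j] + y[j] * (E_i - E_j) / eta, L, H)
    mu_new[i] = mu[i] + y[i] * y[j] * (mu[j] - mu_new[j])
    return mu_new

# At the optimum found for the 3-point example, one SMO step changes nothing:
X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2
mu = np.array([1.0, 1.0, 2.0])
f_values = K @ (mu * y) - 1.0          # decision values with b* = -1
print(smo_pair_update(0, 2, mu, y, K, C=10.0, f_values=f_values))   # [1. 1. 2.]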