• We have been discussing the SVM method for learning classifiers.
• The basic idea is to transform the feature space and learn a linear classifier in the new space.
• Using kernel functions we can do this mapping implicitly.
• Thus kernels give us an elegant method to learn nonlinear classifiers.
• We can use the same idea in regression problems also.
Kernel Trick
• We use φ : ℜm → H to map pattern vectors into an appropriate high-dimensional space.
• The kernel function allows us to compute inner products in H implicitly, without using (or even knowing) φ.
• Through kernel functions, many algorithms that use only inner products can be implicitly executed in a high-dimensional H (e.g., Fisher discriminant, regression, etc.).
• We can elegantly construct non-linear versions of linear techniques.
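• For instance, the kernel K(X, X′) = (1 + X^T X′)² on ℜ2 corresponds to an explicit six-dimensional feature map; the small NumPy sketch below (not from the slides; the specific vectors are arbitrary) checks that the kernel value equals the inner product in H:

import numpy as np

def poly_kernel(x, z):
    # K(x, z) = (1 + x^T z)^2, computed directly in the input space
    return (1.0 + x @ z) ** 2

def phi(x):
    # One explicit feature map for this kernel on R^2 (chosen for illustration;
    # any map with phi(x)^T phi(z) = K(x, z) would do)
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(poly_kernel(x, z))    # implicit inner product in H
print(phi(x) @ phi(z))      # explicit inner product: same value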
Support Vector Regression
• Now we consider the regression problem.
• Given training data {(X1, y1), . . . , (Xn, yn)}, Xi ∈ ℜm, yi ∈ ℜ, we want to find the 'best' function to predict y given X.
• We search in a parameterized class of functions
    g(X, W) = w1 φ1(X) + · · · + wm′ φm′(X) + b = W^T Φ(X) + b,
where φi : ℜm → ℜ are some chosen functions.
• If we choose φi(X) = xi (and hence m = m′), then it is a linear model.
• Denoting Z = Φ(X) ∈ ℜm′, we are essentially learning a linear model in a transformed space.
• This is in accordance with the basic idea of the SVM method.
• We want to formulate the problem so that we can use the kernel idea.
• Then, by using a kernel function, we never need to compute or even precisely specify the mapping Φ.
Loss function
• As in a general regression problem, we need to find W to minimize
    Σ_i L(yi, g(Xi, W))
where L is a loss function.
• This is the general strategy of empirical risk minimization.
• We consider a special loss function that allows us to use the kernel trick.
ǫ-insensitive loss
• We employ the ǫ-insensitive loss function:
    Lǫ(yi, g(Xi, W)) = 0,                      if |yi − g(Xi, W)| < ǫ
                     = |yi − g(Xi, W)| − ǫ,    otherwise.
Here, ǫ is a parameter of the loss function.
• If the prediction is within ǫ of the true value, there is no loss.
• Using the absolute value of the error rather than the square of the error allows for better robustness.
• It also gives us an optimization problem with the right structure.
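• A minimal NumPy sketch of this loss (not part of the slides; the function name and sample values are our own), vectorized over a set of examples:

import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    # L_eps = max(|y - y_pred| - eps, 0): zero inside the eps-tube,
    # absolute error minus eps outside it
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

y      = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])
print(eps_insensitive_loss(y, y_pred, eps=0.1))   # [0.  0.4 1.9]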
• We have chosen the model as:
    g(X, W) = Φ(X)^T W + b.
• Hence empirical risk minimization under the ǫ-insensitive loss function would minimize
    Σ_{i=1}^n max( |yi − Φ(Xi)^T W − b| − ǫ, 0 )
• We can write this as an equivalent constrained optimization problem.
• We can pose the problem as follows.
    min_{W, b, ξ, ξ′}   Σ_{i=1}^n ξi + Σ_{i=1}^n ξi′
    subject to  yi − W^T Φ(Xi) − b ≤ ǫ + ξi,    i = 1, . . . , n
                W^T Φ(Xi) + b − yi ≤ ǫ + ξi′,   i = 1, . . . , n
                ξi ≥ 0, ξi′ ≥ 0,                i = 1, . . . , n
• This does not give a dual with the structure we want.
• So, we reformulate the optimization problem.
The Optimization Problem
• Find W, b and ξi, ξi′ to
    minimize    (1/2) W^T W + C ( Σ_{i=1}^n ξi + Σ_{i=1}^n ξi′ )
    subject to  yi − W^T Φ(Xi) − b ≤ ǫ + ξi,    i = 1, . . . , n
                W^T Φ(Xi) + b − yi ≤ ǫ + ξi′,   i = 1, . . . , n
                ξi ≥ 0, ξi′ ≥ 0,                i = 1, . . . , n
• We have added the term W^T W to the objective function. This is like a model complexity term in a regularization context.
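• As an illustration, this primal problem can be handed directly to a generic convex solver; the sketch below assumes the cvxpy package, a toy one-dimensional data set and the identity map Φ(X) = X (in practice one works with the dual and a kernel instead):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d = 30, 1
Phi = rng.uniform(-3.0, 3.0, size=(n, d))        # toy choice: Phi(X) = X
y = 0.8 * Phi[:, 0] + 0.1 * rng.normal(size=n)   # noisy linear targets
C, eps = 10.0, 0.1

W = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)
xi_p = cp.Variable(n, nonneg=True)

# (1/2) W^T W + C (sum xi + sum xi'), subject to the two eps-tube constraints
objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * (cp.sum(xi) + cp.sum(xi_p)))
constraints = [y - Phi @ W - b <= eps + xi,
               Phi @ W + b - y <= eps + xi_p]
cp.Problem(objective, constraints).solve()
print(W.value, b.value)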
• As earlier, we can form the Lagrangian and then, using the Kuhn-Tucker conditions, get the optimal values of W and b.
• Since this problem is similar to the earlier one, we would get W∗ in terms of the optimal Lagrange multipliers as earlier.
• Essentially, the Lagrange multipliers corresponding to the inequality constraints on the errors would be the determining factors.
• We can use the same technique as earlier to formulate the dual and solve for the optimal Lagrange multipliers.
The dual
• The dual of this problem is
    max_{α, α′}  Σ_{i=1}^n yi (αi − αi′) − ǫ Σ_{i=1}^n (αi + αi′) − (1/2) Σ_{i,j} (αi − αi′)(αj − αj′) Φ(Xi)^T Φ(Xj)
    subject to   Σ_{i=1}^n (αi − αi′) = 0
                 0 ≤ αi, αi′ ≤ C,   i = 1, . . . , n
The solution
• We can use the Kuhn-Tucker conditions to derive the final optimal values of W and b as earlier.
• This gives us
    W∗ = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)
    b∗ = yj − Φ(Xj)^T W∗ + ǫ,   for j s.t. 0 < αj∗ < C/n
• We have
    W∗ = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)
    b∗ = yj − Φ(Xj)^T W∗ + ǫ,   for j s.t. 0 < αj∗ < C/n
• Note that we have αi∗ αi∗′ = 0. Also, αi∗ and αi∗′ are zero for examples where the error is less than ǫ.
• The final W∗ is a linear combination of (the images of) some of the examples – the support vectors.
• Note that the dual and the final solution are such that we can use the kernel trick.
• Let K(X, X′) = Φ(X)^T Φ(X′).
• The optimal model learnt is
    g(X, W∗) = Σ_{i=1}^n (αi∗ − αi∗′) Φ(Xi)^T Φ(X) + b∗
             = Σ_{i=1}^n (αi∗ − αi∗′) K(Xi, X) + b∗
• As earlier, b∗ can also be written in terms of the kernel function.
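• Once the multipliers are known, evaluating the learnt model only needs kernel values; a small sketch (not from the slides; the Gaussian kernel, support vectors and coefficient values are made up purely for illustration):

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2), one common choice of kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svr_predict(x, support_X, beta, b):
    # g(x) = sum_i (alpha_i* - alpha_i*') K(X_i, x) + b*, with beta_i = alpha_i* - alpha_i*'
    return sum(b_i * rbf_kernel(x_i, x) for b_i, x_i in zip(beta, support_X)) + b

support_X = [np.array([0.0]), np.array([1.0]), np.array([2.5])]   # made-up support vectors
beta = [0.7, -0.3, 0.4]                                           # made-up (alpha* - alpha*') values
print(svr_predict(np.array([1.2]), support_X, beta, b=0.1))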
Support vector regression
• Once again, the kernel trick allows us to learn non-linear models using a linear method.
• For example, if we use a Gaussian kernel, we get a Gaussian RBF net as the nonlinear model. The RBF centers are easily learnt here.
• The parameters to be chosen are C, ǫ and the parameters of the kernel function.
• The basic idea of SVR can be used in many related problems.
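• In practice these parameters are simply passed to an off-the-shelf SVR implementation; a minimal sketch assuming scikit-learn and toy data (the particular values of C, ǫ and the kernel width are arbitrary):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# Gaussian (RBF) kernel; C, epsilon and gamma are the parameters to be chosen
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)
print(model.support_vectors_.shape)      # the support vectors act as the RBF centres
print(model.predict(np.array([[0.5]])))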
SV regression
• With the ǫ-insensitive loss function, points whose targets are within ǫ of the prediction do not contribute any 'loss'.
• This gives rise to some interesting robustness of the method. It can be proved that local movements of the target values of points outside the ǫ-tube do not influence the regression.
• The robustness essentially comes through the support vector representation of the regression.
• In our formulation of the regression problem we did not explain why we added the W^T W term to the objective function.
• We are essentially minimizing
    (1/2) W^T W + C Σ_{i=1}^n max( |yi − Φ(Xi)^T W − b| − ǫ, 0 )
• This is 'regularized risk minimization'.
• Here W^T W is the model complexity term, which is intended to favour learning of 'smoother' models.
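• A direct NumPy sketch of this regularized risk (not from the slides; Phi, y and the parameter values below are placeholders):

import numpy as np

def regularized_risk(W, b, Phi, y, C, eps):
    # (1/2) W^T W + C * sum_i max(|y_i - Phi(X_i)^T W - b| - eps, 0)
    residual = np.abs(y - Phi @ W - b)
    return 0.5 * W @ W + C * np.sum(np.maximum(residual - eps, 0.0))

Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # placeholder transformed inputs
y = np.array([0.9, 1.1, 2.3])
print(regularized_risk(np.array([1.0, 1.0]), 0.0, Phi, y, C=10.0, eps=0.1))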
• Next we explain why W^T W is a good term to capture the degree of smoothness of the model being fitted.
• Let f : ℜm → ℜ be a continuous function.
• Continuity means we can make |f(X) − f(X′)| as small as we want by taking ||X − X′|| sufficiently small.
• There are ways to characterize the 'degree of continuity' of a function.
• We consider one such measure now.
ǫ-Margin of a function
• The ǫ-margin of a function f : ℜm → ℜ is
    mǫ(f) = inf{ ||X − X′|| : |f(X) − f(X′)| ≥ 2ǫ }
• The intuitive idea is: how small can ||X − X′|| be while still keeping |f(X) − f(X′)| 'large'?
• The larger mǫ(f) is, the smoother the function.
• Obviously, mǫ(f) = 0 if f is discontinuous.
• mǫ(f) can be zero even for continuous functions, e.g., f(x) = 1/x.
• mǫ(f) > 0 for all ǫ > 0 iff f is uniformly continuous.
• Higher margin would mean the function is 'slowly varying' and hence is a 'smoother' model.
SVR and margin
• Consider regression with linear models. Then, |f(X) − f(X′)| = |W^T(X − X′)|.
• Among all X, X′ with |W^T(X − X′)| ≥ 2ǫ, ||X − X′|| is smallest when |W^T(X − X′)| = 2ǫ and (X − X′) is parallel to W. That is, X − X′ = ± 2ǫW / (W^T W).
• Thus, mǫ(f) = 2ǫ / ||W||.
• Thus, in our optimization problem for SVR, minimizing W^T W promotes learning of smoother models.
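• A tiny NumPy check of this calculation (not from the slides; the particular W and ǫ are arbitrary):

import numpy as np

W = np.array([3.0, 4.0])
eps = 0.2
delta = 2 * eps * W / (W @ W)      # the minimizing displacement X - X'
print(abs(W @ delta))              # 0.4 = 2*eps: the function value changes by exactly 2*eps
print(np.linalg.norm(delta))       # 0.08 = 2*eps/||W||, the eps-margin of f(X) = W^T X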
Solving the SVM optimization problem
• So far we have not considered any algorithms for solving the SVM optimization problem.
• We have to solve a constrained optimization problem to obtain the Lagrange multipliers and hence the SVM.
• Many specialized algorithms have been proposed for this.
• The optimization problem to be solved is
    max_µ  q(µ) = Σ_{i=1}^n µi − (1/2) Σ_{i,j=1}^n µi µj yi yj K(Xi, Xj)
    subject to  0 ≤ µi ≤ C, i = 1, . . . , n,   Σ_{i=1}^n yi µi = 0
• A quadratic programming (QP) problem with interesting structure.
Example
• We will first consider a very simple example problem in ℜ2 to get a feel for the method of obtaining the SVM.
• Suppose we have 3 examples:
    X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0)
with y1 = y2 = +1 and y3 = −1.
• As is easy to see, a linear classifier is not sufficient here.
• Suppose we use the kernel function K(X, X′) = (1 + X^T X′)².
• This example is shown in the figure below.
    (Figure: the three training points plotted in ℜ2.)
• Recall, the examples are X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0).
• The objective function involves K(Xi, Xj). These are given in a matrix below.
    [ (1 + Xi^T Xj)² ]  =  [ 4  0  1 ]
                           [ 0  4  1 ]
                           [ 1  1  1 ]
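• This matrix is easy to verify numerically (a NumPy sketch, not part of the slides):

import numpy as np

X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
K = (1.0 + X @ X.T) ** 2       # K[i, j] = (1 + Xi^T Xj)^2
print(K)
# [[4. 0. 1.]
#  [0. 4. 1.]
#  [1. 1. 1.]]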
• Now the objective function to be maximized is
    q(µ) = Σ_{i=1}^3 µi − (1/2)(4µ1² + 4µ2² + µ3² − 2µ1µ3 − 2µ2µ3)
• The constraints are
    µ1 + µ2 − µ3 = 0;   and µi ≥ 0, i = 1, 2, 3.
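• As a cross-check, this small QP can also be handed to a generic solver; the sketch below assumes SciPy and recovers the multipliers derived on the following slides:

import numpy as np
from scipy.optimize import minimize

def neg_q(mu):
    # negative of the dual objective q(mu), since 'minimize' minimizes
    m1, m2, m3 = mu
    return -(mu.sum() - 0.5 * (4*m1**2 + 4*m2**2 + m3**2 - 2*m1*m3 - 2*m2*m3))

cons = {"type": "eq", "fun": lambda mu: mu[0] + mu[1] - mu[2]}
res = minimize(neg_q, x0=np.zeros(3), method="SLSQP",
               bounds=[(0, None)] * 3, constraints=cons)
print(res.x)    # approximately [1. 1. 2.]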
• The Lagrangian for this problem is
    L(µ, λ, α) = q(µ) + λ(µ1 + µ2 − µ3) − Σ_{i=1}^3 αi µi
• Using the Kuhn-Tucker conditions, we have ∂L/∂µi = 0 and µ1 + µ2 − µ3 = 0.
• This gives us four equations; we have 7 unknowns. We use the complementary slackness conditions on the αi.
• We have αi µi = 0. Essentially, we need to guess which µi > 0.
• In this simple problem we know that all µi > 0.
• This is because all the Xi would be support vectors.
• Hence we take all αi = 0.
• We now have four unknowns: µ1, µ2, µ3, λ.
• Using ∂L/∂µi = 0, i = 1, 2, 3, and feasibility, we can solve for the µi.
• The equations are
    1 − 4µ1 + µ3 + λ = 0
    1 − 4µ2 + µ3 + λ = 0
    1 − µ3 + µ1 + µ2 − λ = 0
    µ1 + µ2 − µ3 = 0
• These give us λ = 1 and µ3 = 2µ1 = 2µ2.
• Thus we get µ1 = µ2 = 1 and µ3 = 2.
• This completely determines the SVM.
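• The same four linear equations can also be solved mechanically (a NumPy sketch, not from the slides), confirming these values:

import numpy as np

# Unknowns ordered as (mu1, mu2, mu3, lambda); each row is one equation above
A = np.array([[-4.0,  0.0,  1.0,  1.0],
              [ 0.0, -4.0,  1.0,  1.0],
              [ 1.0,  1.0, -1.0, -1.0],
              [ 1.0,  1.0, -1.0,  0.0]])
rhs = np.array([-1.0, -1.0, -1.0, 0.0])
print(np.linalg.solve(A, rhs))    # [1. 1. 2. 1.]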
• If we use the penalty constant C with C ≥ 2, we get the same solution. (If C < 2, we cannot get this solution.)
• The classification of any X by this SVM is by the sign of f(X):
    f(X) = Σ_i µi yi K(Xi, X) + b∗
         = K(X1, X) + K(X2, X) − 2K(X3, X) + b∗
• Let us first calculate b∗.
• Recall the formula
    b∗ = yj − Σ_i µi yi K(Xi, Xj),   for j s.t. 0 < µj
• We can take j = 1, 2 or 3.
• With j = 1 we get b∗ = 1 − (4 + 0 − 2) = −1.
• With j = 3 we get b∗ = −1 − (1 + 1 − 2) = −1.
• If we solved our optimization problem correctly, we should get the same b∗ !
• We have X1 = (−1, 0), X2 = (1, 0), X3 = (0, 0) and K(X, X′) = (1 + X^T X′)².
• Hence, taking X = (x1, x2)^T, we have
    f(X) = K(X1, X) + K(X2, X) − 2K(X3, X) + b∗
         = (1 − x1)² + (1 + x1)² − 2(1) − 1
         = 2x1² − 1
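• A short numerical confirmation of b∗ and of this decision function (a NumPy sketch, not part of the slides):

import numpy as np

X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
mu = np.array([1.0, 1.0, 2.0])
K = (1.0 + X @ X.T) ** 2

b = y[0] - np.sum(mu * y * K[:, 0])      # using j = 1 (index 0): b* = -1
print(b)

def f(x):
    # f(X) = sum_i mu_i y_i K(X_i, X) + b*
    k = (1.0 + X @ x) ** 2
    return np.sum(mu * y * k) + b

for x1 in [0.0, 0.5, 1/np.sqrt(2), 1.0]:
    print(x1, f(np.array([x1, 0.0])), 2 * x1**2 - 1)   # the two expressions agree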
• Hence this SVM will assign class +1 to X = (x1, x2)^T if
    2x1² ≥ 1,   or equivalently   |x1| ≥ 1/√2
• Why not |x1| ≥ 1/2?
• Because we are maximizing the margin of the hyperplane in the transformed feature space (where x1² is one of the coordinates), not in the original feature space.
• The final SVM is intuitively very reasonable, and we solve essentially the same problem whether we are seeking a linear classifier or a nonlinear classifier.
• Getting back to the general case, we need to solve
    max_µ  q(µ) = Σ_{i=1}^n µi − (1/2) Σ_{i,j=1}^n µi µj yi yj K(Xi, Xj)
    subject to  0 ≤ µi ≤ C, i = 1, . . . , n,   Σ_{i=1}^n yi µi = 0
• We need a numerical method.
• Due to the special structure, many efficient algorithms have been proposed.
• One interesting idea – Chunking.
• We optimize over only a few variables at a time.
• The dimensionality of the optimization problem is thereby controlled.
• We keep randomly choosing the subset of variables to work on.
• This gave rise to the first specialized algorithm for SVM – SVM Light.
• Taking chunking to the extreme – what is the smallest set of variables we can optimize over?
• We need to consider at least two variables at a time because there is an equality constraint.
• Sequential Minimal Optimization (SMO) works by optimizing two variables at a time.
• We can analytically find the optimum with respect to two variables (a sketch of this update follows below).
• We need to decide which two variables to consider in each iteration.
• This results in a very efficient algorithm.
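• A simplified sketch of that two-variable analytic update (not from the slides; it omits SMO's pair-selection heuristics and threshold update, and the function and variable names are our own):

import numpy as np

def smo_pair_update(i, j, mu, y, K, C, f_values):
    # One SMO step: optimize the dual over (mu_i, mu_j) analytically, holding the
    # rest fixed, then clip to the box [0, C] and the equality constraint.
    E_i = f_values[i] - y[i]                # current prediction errors at the two points
    E_j = f_values[j] - y[j]
    eta = K[i, i] + K[j, j] - 2 * K[i, j]   # curvature along the feasible line
    if eta <= 0:
        return mu                           # degenerate pair; skipped in this sketch
    # Bounds on mu_j from 0 <= mu <= C and sum_k y_k mu_k = 0
    if y[i] == y[j]:
        L, H = max(0.0, mu[i] + mu[j] - C), min(C, mu[i] + mu[j])
    else:
        L, H = max(0.0, mu[j] - mu[i]), min(C, C + mu[j] - mu[i])
    mu_new = mu.copy()
    mu_new[j] = np.clip(mu[j] + y[j] * (E_i - E_j) / eta, L, H)
    mu_new[i] = mu[i] + y[i] * y[j] * (mu[j] - mu_new[j])
    return mu_new

# At the optimum found for the 3-point example, one SMO step changes nothing:
X = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
K = (1.0 + X @ X.T) ** 2
mu = np.array([1.0, 1.0, 2.0])
f_values = K @ (mu * y) - 1.0          # decision values with b* = -1
print(smo_pair_update(0, 2, mu, y, K, C=10.0, f_values=f_values))   # [1. 1. 2.]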