SVM Optimization Explained

The document discusses the Support Vector Machine (SVM) method, focusing on the concept of the optimal hyperplane and the formulation of an optimization problem to learn it under the assumption of linearly separable training data. It outlines the Lagrangian formulation, Kuhn-Tucker conditions, and the dual optimization problem, emphasizing the role of support vectors in determining the optimal hyperplane. Additionally, it addresses the challenges faced when the training data is not linearly separable.

Slides from the PR NPTEL course.

• We have been discussing the SVM method.
• We explained the concept of the optimal hyperplane.
• We formulated an optimization problem to learn the optimal hyperplane, assuming the training set is linearly separable.
• We will quickly review this SVM algorithm before discussing the more general case.


The optimization problem for SVM

• The optimal hyperplane is a solution of the following constrained optimization problem.
• Find W ∈ ℜ^m, b ∈ ℜ to

    minimize    (1/2) W^T W
    subject to  1 − y_i(W^T X_i + b) ≤ 0,  i = 1, …, n

• Quadratic cost function and linear (inequality) constraints.
• Kuhn-Tucker conditions are necessary and sufficient: every local minimum is a global minimum.
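
The problem above is a standard quadratic program and can be handed to any convex-optimization package. A minimal sketch (added here, not part of the original slides), assuming the cvxpy package is available and using a small, hypothetical, linearly separable data set:

```python
# Hard-margin SVM primal: minimize (1/2) W^T W
# subject to 1 - y_i (W^T X_i + b) <= 0 for all i.
# Sketch only: assumes cvxpy is installed and the toy data are separable.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])   # hypothetical X_i (n x m)
y = np.array([1.0, -1.0, 1.0])                          # labels y_i in {+1, -1}
n, m = X.shape

W = cp.Variable(m)
b = cp.Variable()
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(W)),               # (1/2) W^T W
    [cp.multiply(y, X @ W + b) >= 1],                   # y_i (W^T X_i + b) >= 1
)
problem.solve()
print("W* =", W.value, " b* =", b.value)                # approx. (0.5, 0.5) and 0
```
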
Optimal hyperplane (figure)

Non-optimal hyperplane (figure)


• The Lagrangian is given by

    L(W, b, µ) = (1/2) W^T W + Σ_{i=1}^{n} µ_i [1 − y_i(W^T X_i + b)]

• The Kuhn-Tucker conditions give

    ∇_W L = 0  ⇒  W* = Σ_{i=1}^{n} µ*_i y_i X_i
    ∂L/∂b = 0  ⇒  Σ_{i=1}^{n} µ*_i y_i = 0
    1 − y_i(X_i^T W* + b*) ≤ 0,  ∀i
    µ*_i ≥ 0  and  µ*_i [1 − y_i(X_i^T W* + b*)] = 0,  ∀i

• Let S = {i | µ*_i > 0}.
• By the complementary slackness condition,

    i ∈ S  ⇒  y_i(X_i^T W* + b*) = 1,

  which implies that X_i is closest to the separating hyperplane.
• {X_i | i ∈ S} are called support vectors. We have

    W* = Σ_i µ*_i y_i X_i = Σ_{i∈S} µ*_i y_i X_i

• The optimal W* is a linear combination of the support vectors (a small sketch follows below).
• Support vectors constitute a very useful output of the method.
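
To make this concrete, here is a small numpy illustration (added here, not from the slides). For this tiny data set the optimal multipliers work out to µ* = (0.25, 0.25, 0), which we simply state rather than compute:

```python
# Pick out the support vector set S = {i : mu*_i > 0} and form
# W* = sum_{i in S} mu*_i y_i X_i.  The mu_star values are the optimal
# multipliers for this toy data set (stated, not computed here).
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
mu_star = np.array([0.25, 0.25, 0.0])

tol = 1e-8                                # treat tiny values as numerically zero
S = np.where(mu_star > tol)[0]            # indices of the support vectors
W_star = (mu_star[S] * y[S]) @ X[S]       # combination of support vectors only
print("S =", S, " W* =", W_star)          # S = [0 1],  W* = [0.5 0.5]
```
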
Optimal hyperplane (figure)


The SVM solution

• The optimal hyperplane (W*, b*) is given by:

    W* = Σ_i µ*_i y_i X_i = Σ_{i∈S} µ*_i y_i X_i
    b* = y_j − X_j^T W*,  for any j such that µ*_j > 0
    (Note that µ*_j > 0 ⇒ y_j(X_j^T W* + b*) = 1.)

• Thus, W* and b* are determined by the µ*_i, i = 1, …, n.
• We can use the dual of the optimization problem to get the µ*_i.


Dual optimization problem for SVM

• The dual function is

    q(µ) = inf_{W,b} { (1/2) W^T W + Σ_{i=1}^{n} µ_i [1 − y_i(W^T X_i + b)] }

• If Σ_i µ_i y_i ≠ 0 then q(µ) = −∞ (the Lagrangian is linear in b, so the infimum over b is unbounded below).
• Hence we need to maximize q only over those µ such that Σ_i µ_i y_i = 0.
• The infimum w.r.t. W is attained at W = Σ_i µ_i y_i X_i.
• We obtain the dual by substituting W = Σ_i µ_i y_i X_i and imposing Σ_i µ_i y_i = 0.

• By substituting W = Σ_i µ_i y_i X_i and Σ_i µ_i y_i = 0 we get

    q(µ) = (1/2) W^T W + Σ_{i=1}^{n} µ_i − Σ_{i=1}^{n} µ_i y_i (W^T X_i + b)

         = (1/2) (Σ_i µ_i y_i X_i)^T (Σ_j µ_j y_j X_j) + Σ_i µ_i − Σ_i µ_i y_i X_i^T (Σ_j µ_j y_j X_j)

         = Σ_i µ_i − (1/2) Σ_i Σ_j µ_i y_i µ_j y_j X_i^T X_j

  (the b term drops out because Σ_i µ_i y_i = 0).

• Thus, the dual problem is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  µ_i ≥ 0, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• Quadratic cost function and linear constraints.
• The training data vectors appear only through inner products X_i^T X_j.
• The optimization is over ℜ^n, irrespective of the dimension of X_i.

Optimal hyperplane

• The optimal hyperplane is a solution of

    min_{W,b}   (1/2) W^T W
    subject to  y_i(W^T X_i + b) ≥ 1,  i = 1, …, n

• Instead of solving this primal problem, we solve the dual and obtain the optimal Lagrange multipliers, µ*_i.


• The dual problem is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  µ_i ≥ 0, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• Then the final solution is:

    W* = Σ_i µ*_i y_i X_i,   b* = y_j − X_j^T W*,  for any j such that µ*_j > 0
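
As a sketch of this route (added here, again assuming cvxpy and the same toy data as above), note that Σ_{i,j} µ_i µ_j y_i y_j X_i^T X_j = ||Σ_i µ_i y_i X_i||², which keeps the objective in a form the solver accepts; W* and b* are then recovered exactly as stated:

```python
# Solve the hard-margin dual, then recover W* and b*.
# Sketch only: assumes cvxpy is installed; data are a hypothetical toy set.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
n = X.shape[0]

G = y[:, None] * X                       # rows are y_i X_i, so G.T @ mu = sum_i mu_i y_i X_i
mu = cp.Variable(n)
dual = cp.Problem(
    cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu)),
    [mu >= 0, y @ mu == 0],
)
dual.solve()

mu_star = mu.value
W_star = G.T @ mu_star                   # W* = sum_i mu*_i y_i X_i
j = int(np.argmax(mu_star))              # some index with mu*_j > 0
b_star = y[j] - X[j] @ W_star            # b* = y_j - X_j^T W*
print("mu* =", mu_star, " W* =", W_star, " b* =", b_star)
```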


• So far, we assumed that the training data are linearly separable.
• What happens if the data are non-separable?
• The optimization problem has no feasible point (and hence no solution) if the data are not linearly separable.
• Hence we cannot find the optimal hyperplane by this formulation.
• We will modify the problem by introducing slack variables so that we can handle the general case.


Using slack variables

• When the data are not linearly separable, we can try:

    minimize    (1/2) W^T W + C Σ_{i=1}^{n} ξ_i
    subject to  y_i(W^T X_i + b) ≥ 1 − ξ_i,  i = 1, …, n
                ξ_i ≥ 0,  i = 1, …, n

• Optimization variables: W, b, ξ_i.
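
A sketch of this modified primal (added here), under the same assumption that cvxpy is available; the third point below lies between two points of the opposite class, so this toy set is genuinely not linearly separable:

```python
# Soft-margin primal with explicit slack variables xi_i (sketch, assuming cvxpy).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])     # (1.2, 1.2) lies between two +1 points
n, m = X.shape
C = 1.0                                  # user-chosen penalty constant

W, b, xi = cp.Variable(m), cp.Variable(), cp.Variable(n)
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi)),
    [cp.multiply(y, X @ W + b) >= 1 - xi, xi >= 0],
)
problem.solve()
print("W* =", W.value, " b* =", b.value, " xi* =", xi.value)
```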


• A feasible solution always exists.
• ξ_i measures the extent of violation of the separation constraints.
• When ξ_i > 0, there is a 'margin error'. When ξ_i > 1, X_i is wrongly classified.
• C is a user-specified constant (like a regularization parameter).
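
For any fixed (W, b), the smallest feasible slack for point i is ξ_i = max(0, 1 − y_i(W^T X_i + b)); this is not stated on the slide but follows directly from the two constraints on ξ_i. A small numpy illustration (added here) with a hypothetical candidate hyperplane:

```python
# Slacks needed by a fixed candidate hyperplane (W, b), illustration only:
# xi_i in (0, 1] is a margin error; xi_i > 1 means X_i is misclassified.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W, b = np.array([0.5, 0.5]), 0.0              # some candidate hyperplane

margins = y * (X @ W + b)                     # y_i (W^T X_i + b)
xi = np.maximum(0.0, 1.0 - margins)           # smallest feasible slacks
print("xi            :", xi)                  # [0.  0.  2.2 0. ]
print("margin errors :", int(np.sum(xi > 0))) # 1
print("misclassified :", int(np.sum(xi > 1))) # 1
```
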
• The optimization problem now is

    min_{W,b,ξ}  (1/2) W^T W + C Σ_{i=1}^{n} ξ_i
    subject to   1 − ξ_i − y_i(W^T X_i + b) ≤ 0,  i = 1, …, n
                 −ξ_i ≤ 0,  i = 1, …, n


• The Lagrangian now is

    L(W, b, ξ, µ, λ) = (1/2) W^T W + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} µ_i (1 − ξ_i − y_i(W^T X_i + b)) − Σ_{i=1}^{n} λ_i ξ_i

• µ_i are the Lagrange multipliers for the separability constraints, as earlier.
• λ_i are the Lagrange multipliers for the constraints −ξ_i ≤ 0.

The Kuhn-Tucker conditions give us

• ∇_W L = 0  ⇒  W* = Σ_i µ*_i y_i X_i
• ∂L/∂b = 0  ⇒  Σ_i µ*_i y_i = 0
• ∂L/∂ξ_i = 0  ⇒  µ*_i + λ*_i = C,  ∀i
• 1 − ξ_i − y_i(W^T X_i + b) ≤ 0;  ξ_i ≥ 0,  ∀i
• µ_i ≥ 0;  λ_i ≥ 0,  ∀i
• µ_i (1 − ξ_i − y_i(W^T X_i + b)) = 0;  λ_i ξ_i = 0,  ∀i


• W* is given by the same expression as before.
• We also have µ*_i + λ*_i = C with µ*_i, λ*_i ≥ 0, hence 0 ≤ µ*_i ≤ C, ∀i.
• If 0 < µ_i < C, then λ_i > 0, which implies ξ_i = 0.
• By the complementary slackness condition, we then have 1 − y_i(W^T X_i + b) = 0.
• Thus we get b* as

    b* = y_j − X_j^T W*,  for any j such that 0 < µ*_j < C

• Thus, once again, we need to find all the µ*_i.


• We can derive the dual as before. The dual function is

    q(µ, λ) = inf_{W,b,ξ} L(W, b, ξ, µ, λ)

  where the Lagrangian is given by

    L(W, b, ξ, µ, λ) = (1/2) W^T W + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} µ_i (1 − ξ_i − y_i(W^T X_i + b)) − Σ_{i=1}^{n} λ_i ξ_i


• In the Lagrangian we have the term Σ_i (C − µ_i − λ_i) ξ_i.
• Since we take the infimum w.r.t. ξ_i (over all of ℜ), we need to impose C = µ_i + λ_i, ∀i; otherwise the infimum is −∞.
• When we impose this, all the terms containing λ_i or ξ_i drop out, and hence the q function is the same as earlier.
• We only need to ensure (in the dual) that λ_i ≥ 0 and C = µ_i + λ_i, ∀i.
• This is easily done by ensuring 0 ≤ µ_i ≤ C.


The dual

• The dual problem now is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  0 ≤ µ_i ≤ C, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• The only difference: an upper bound also on the µ_i.
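
In the earlier cvxpy sketch only the constraint list changes (an added illustration, assuming the same non-separable toy data as above); b* is then read off any multiplier strictly between 0 and C:

```python
# Soft-margin dual: same objective, with the extra upper bound mu <= C.
# Sketch only, assuming cvxpy; toy data as in the earlier sketches.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = X.shape[0], 1.0

G = y[:, None] * X                            # rows y_i X_i
mu = cp.Variable(n)
dual = cp.Problem(
    cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu)),
    [mu >= 0, mu <= C, y @ mu == 0],          # only mu <= C is new
)
dual.solve()

W_star = G.T @ mu.value
inside = np.where((mu.value > 1e-6) & (mu.value < C - 1e-6))[0]   # 0 < mu_j < C
b_star = y[inside[0]] - X[inside[0]] @ W_star
print("mu* =", mu.value, " W* =", W_star, " b* =", b_star)
```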


• The primal problem is

    minimize    (1/2) W^T W + C Σ_{i=1}^{n} ξ_i
    subject to  y_i(W^T X_i + b) ≥ 1 − ξ_i,  i = 1, …, n
                ξ_i ≥ 0,  i = 1, …, n

• As C → ∞, we get back the old problem.


• The dual problem is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  0 ≤ µ_i ≤ C, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• We solve the dual, and the final optimal hyperplane is

    W* = Σ_i µ*_i y_i X_i,
    b* = y_j − X_j^T W*,  for any j such that 0 < µ*_j < C.

• By using slack variables ξ_i, we can find the 'best' hyperplane classifier.
• In the dual, the only difference is an upper bound on the µ_i.
• The best linear classifier may not be 'good enough'.
• How can we learn non-linear discriminant functions?
• Recall that the SVM idea is to transform the X_i into some other high-dimensional space and learn a linear classifier there (a small illustration follows below).
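
Here is a minimal illustration of this idea (added here; the lecture's own examples are in the figures that follow): a one-dimensional data set that no single threshold can separate becomes linearly separable after the quadratic map φ(x) = (x, x²).

```python
# Illustration: phi(x) = (x, x^2) makes a 1-D "inside vs. outside" problem
# linearly separable in 2-D.  Not from the lecture; hypothetical data.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0])   # +1 far from zero, -1 near zero

Z = np.stack([x, x ** 2], axis=1)        # Z_i = phi(x_i) in R^2

# In the mapped space the hyperplane z2 = 2 (i.e. W = (0, 1), b = -2)
# separates the classes: every y_i (W^T Z_i + b) is positive.
W, b = np.array([0.0, 1.0]), -2.0
print(y * (Z @ W + b) > 0)               # all True: separable in phi-space
```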


An example (figure)

Another example (figure)


Non-linear discriminant functions

• In general, we can use a mapping φ : ℜ^m → ℜ^{m′}.
• In ℜ^{m′}, the training set is {(Z_i, y_i), i = 1, …, n}, with Z_i = φ(X_i).
• We can find the optimal hyperplane by solving the dual (replacing X_i^T X_j with Z_i^T Z_j).
• The dual problem now would be the following.


    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j φ(X_i)^T φ(X_j)
    subject to  0 ≤ µ_i ≤ C, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• This is an optimization problem over ℜ^n (with a quadratic cost function and linear constraints), irrespective of φ and m′.
• But is it computationally expensive?

Kernel function

• Suppose we have a function K : ℜ^m × ℜ^m → ℜ such that

    K(X_i, X_j) = φ(X_i)^T φ(X_j)

  This is called a Kernel function.
• Suppose the computation of K(X_i, X_j) is about as expensive as that of X_i^T X_j.
• Replacing Z_i^T Z_j by K(X_i, X_j), we can solve the dual without ever computing any φ(X_i). This is efficient for obtaining the optimal hyperplane.
• What about storing W*? Computing φ(X)^T W* for new patterns?
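
A small numpy check of this idea (an added illustration, not from the slides): for m = 2 the kernel K(x, z) = (x^T z)² corresponds to the map φ(x) = (x₁², √2 x₁x₂, x₂²), yet evaluating K needs only one inner product. The Gaussian (RBF) kernel at the end is another commonly used choice, whose φ is infinite dimensional.

```python
# Verify K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for the quadratic map.
import numpy as np

def phi(x):
    # explicit feature map for the homogeneous quadratic kernel, m = 2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return (x @ z) ** 2                  # needs only one inner product

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))          # both print 1.0

def rbf(x, z, gamma=0.5):
    # Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))
```
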
Kernel function based classifier

• Let µ* be the solution of the dual. Then W* = Σ_i µ*_i y_i φ(X_i).
• Then we have

    b* = y_j − φ(X_j)^T W* = y_j − Σ_i µ*_i y_i φ(X_i)^T φ(X_j)

• Given a new pattern X, we only need to compute

    f(X) = Z^T W* + b*      (where Z = φ(X))
         = Σ_i µ*_i y_i φ(X_i)^T φ(X) + b*
         = Σ_i µ*_i y_i K(X_i, X) + (y_j − Σ_i µ*_i y_i K(X_i, X_j))
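
A sketch of such a classifier in numpy (added here; the stored multipliers, support vectors and b* below are hypothetical placeholders, as if produced by the training step):

```python
# Kernel classifier: store only the support vectors, their labels, the
# non-zero mu*_i and b*; phi is never computed.  Stored values are
# hypothetical placeholders for what training would return.
import numpy as np

def K(x, z, gamma=0.5):                       # Gaussian kernel (one possible choice)
    return np.exp(-gamma * np.sum((x - z) ** 2))

SV = np.array([[1.0, 1.0], [-1.0, -1.0]])     # support vectors X_i, i in S
y_sv = np.array([1.0, -1.0])                  # their labels
mu_sv = np.array([0.6, 0.6])                  # non-zero multipliers (assumed)
b_star = 0.0                                  # offset (assumed)

def f(x):
    # f(x) = sum_{i in S} mu*_i y_i K(X_i, x) + b*
    return sum(m * yi * K(xi, x) for m, yi, xi in zip(mu_sv, y_sv, SV)) + b_star

x_new = np.array([0.8, 1.2])
print("class:", 1 if f(x_new) > 0 else -1)    # prints: class: 1
```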


• This is an interesting way of learning nonlinear classifiers.
• We solve the dual, whose dimension is n, the number of examples.
• All we need to store are:
  the non-zero Lagrange multipliers, µ*_i > 0, and
  the support vectors, X_i for i such that µ*_i > 0.
• We never need to enter the 'φ(X)' space!


Support Vector Machine

• Obtain the µ*_i by solving the dual with Z_i^T Z_j replaced by K(X_i, X_j). (Choose a suitable Kernel function; use the 'penalty constant' C if needed.)
• Store the non-zero µ*_i and the corresponding support vectors.
• Classify any new pattern X by the sign of

    f(X) = Σ_i µ*_i y_i K(X_i, X) + (y_j − Σ_i µ*_i y_i K(X_i, X_j))

• If we have a suitable Kernel function, we never need to compute φ(X).
• The range space of φ can even be infinite dimensional!
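
In practice this whole procedure is available in standard libraries; a minimal sketch assuming scikit-learn is installed (its SVC class solves the dual for a chosen kernel and penalty C, and stores the support vectors and their multipliers):

```python
# End-to-end sketch with scikit-learn (assumed installed): fit an SVM with a
# Gaussian kernel, then inspect the stored support vectors and multipliers.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)     # kernel and penalty chosen by the user
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("y_i * mu*_i for the SVs:", clf.dual_coef_)
print("prediction for (0.5, 0.5):", clf.predict([[0.5, 0.5]]))
```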
