• We have been discussing the SVM method.
• We explained the concept of the optimal hyperplane.
• We formulated an optimization problem to learn the optimal hyperplane, assuming the training set is linearly separable.
• We will quickly review this SVM algorithm before discussing the more general case.
The optimization problem for SVM
• The optimal hyperplane is a solution of the following constrained optimization problem.
• Find $W \in \Re^m$, $b \in \Re$ to
$$\min_{W, b} \ \frac{1}{2} W^T W \quad \text{subject to} \quad 1 - y_i (W^T X_i + b) \le 0, \ \ i = 1, \ldots, n$$
• Quadratic cost function and linear (inequality) constraints.
• The Kuhn-Tucker conditions are necessary and sufficient; every local minimum is a global minimum.
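• As an illustration (not part of the lecture), this primal problem can be handed directly to a convex-programming solver. Below is a minimal sketch using the cvxpy library on a made-up, linearly separable toy data set; the data and variable names are assumptions for illustration only.

```python
# Sketch: hard-margin SVM primal solved with cvxpy on assumed toy data.
import numpy as np
import cvxpy as cp

# Small linearly separable toy set: rows of X are the X_i, labels y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, m = X.shape

W = cp.Variable(m)
b = cp.Variable()

# minimize (1/2) W^T W  subject to  1 - y_i (W^T X_i + b) <= 0
objective = cp.Minimize(0.5 * cp.sum_squares(W))
constraints = [cp.multiply(y, X @ W + b) >= 1]
cp.Problem(objective, constraints).solve()

print("W* =", W.value, " b* =", b.value)
# The solver also reports the multipliers of the margin constraints,
# which play the role of the mu_i* discussed next.
print("mu* =", constraints[0].dual_value)
```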
[Figure: Optimal hyperplane]
[Figure: Non-optimal hyperplane]
• The Lagrangian is given by
$$L(W, b, \mu) = \frac{1}{2} W^T W + \sum_{i=1}^{n} \mu_i \left[ 1 - y_i (W^T X_i + b) \right]$$
• The Kuhn-Tucker conditions give
$$\nabla_W L = 0 \ \Rightarrow \ W^* = \sum_{i=1}^{n} \mu_i^* y_i X_i$$
$$\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{n} \mu_i^* y_i = 0$$
$$1 - y_i (X_i^T W^* + b^*) \le 0, \ \forall i$$
$$\mu_i^* \ge 0 \ \text{ and } \ \mu_i^* \left[ 1 - y_i (X_i^T W^* + b^*) \right] = 0, \ \forall i$$
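• As a quick numerical check (a hand-worked toy example, not from the lecture): for $X_1 = (1, 0)$, $y_1 = +1$ and $X_2 = (-1, 0)$, $y_2 = -1$, the optimal hyperplane is $x_1 = 0$ with $W^* = (1, 0)$, $b^* = 0$ and $\mu^* = (0.5, 0.5)$. The sketch below verifies each Kuhn-Tucker condition for these (assumed) values.

```python
# Sketch: checking the Kuhn-Tucker conditions for a hand-worked two-point example.
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])    # X_1, X_2
y = np.array([1.0, -1.0])                   # labels
W_star = np.array([1.0, 0.0])               # candidate W*
b_star = 0.0                                # candidate b*
mu_star = np.array([0.5, 0.5])              # candidate multipliers mu_i*

# Stationarity: W* = sum_i mu_i* y_i X_i  and  sum_i mu_i* y_i = 0
assert np.allclose(W_star, (mu_star * y) @ X)
assert np.isclose(np.sum(mu_star * y), 0.0)

# Primal feasibility: 1 - y_i (X_i^T W* + b*) <= 0 for all i
slack = 1.0 - y * (X @ W_star + b_star)
assert np.all(slack <= 1e-9)

# Dual feasibility and complementary slackness
assert np.all(mu_star >= 0)
assert np.allclose(mu_star * slack, 0.0)
print("All Kuhn-Tucker conditions hold for this example.")
```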
• Let $S = \{ i \mid \mu_i^* > 0 \}$.
• By the complementary slackness condition,
$$i \in S \ \Rightarrow \ y_i (X_i^T W^* + b^*) = 1,$$
which implies that $X_i$ is closest to the separating hyperplane.
• $\{ X_i \mid i \in S \}$ are called support vectors. We have
$$W^* = \sum_i \mu_i^* y_i X_i = \sum_{i \in S} \mu_i^* y_i X_i$$
• The optimal $W$ is a linear combination of the support vectors.
• The support vectors constitute a very useful output of the method.
[Figure: Optimal hyperplane]
The SVM solution
• The optimal hyperplane $(W^*, b^*)$ is given by:
$$W^* = \sum_i \mu_i^* y_i X_i = \sum_{i \in S} \mu_i^* y_i X_i, \qquad b^* = y_j - X_j^T W^*, \ \ j \ \text{s.t.} \ \mu_j^* > 0$$
(Note that $\mu_j^* > 0 \Rightarrow y_j (X_j^T W^* + b^*) = 1$.)
• Thus, $W^*$ and $b^*$ are determined by $\mu_i^*$, $i = 1, \ldots, n$.
• We can use the dual of the optimization problem to get the $\mu_i^*$.
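• A minimal sketch (using the assumed two-point example from above) of recovering $W^*$ and $b^*$ once the $\mu_i^*$ are available:

```python
# Sketch: W* and b* from given multipliers mu_i* (toy values assumed).
import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
mu_star = np.array([0.5, 0.5])              # assumed dual solution

S = np.where(mu_star > 1e-8)[0]             # indices of support vectors
W_star = (mu_star[S] * y[S]) @ X[S]         # W* = sum_{i in S} mu_i* y_i X_i
j = S[0]                                    # any j with mu_j* > 0
b_star = y[j] - X[j] @ W_star               # b* = y_j - X_j^T W*
print("support vectors:", X[S], " W* =", W_star, " b* =", b_star)
```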
Dual optimization problem for SVM
• The dual function is
$$q(\mu) = \inf_{W, b} \left\{ \frac{1}{2} W^T W + \sum_{i=1}^{n} \mu_i \left[ 1 - y_i (W^T X_i + b) \right] \right\}$$
• If $\sum_i \mu_i y_i \ne 0$ then $q(\mu) = -\infty$.
• Hence we need to maximize $q$ only over those $\mu$ such that $\sum_i \mu_i y_i = 0$.
• The infimum w.r.t. $W$ is attained at $W = \sum_i \mu_i y_i X_i$.
• We obtain the dual by substituting $W = \sum_i \mu_i y_i X_i$ and imposing $\sum_i \mu_i y_i = 0$.
• By substituting $W = \sum_i \mu_i y_i X_i$ and $\sum_i \mu_i y_i = 0$ we get
$$q(\mu) = \frac{1}{2} W^T W + \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \mu_i y_i (W^T X_i + b)$$
$$= \frac{1}{2} \Big( \sum_i \mu_i y_i X_i \Big)^T \Big( \sum_j \mu_j y_j X_j \Big) + \sum_i \mu_i - \sum_i \mu_i y_i X_i^T \Big( \sum_j \mu_j y_j X_j \Big)$$
$$= \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j \mu_i y_i \mu_j y_j X_i^T X_j$$
• Thus, the dual problem is:
$$\max_{\mu} \ q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j X_i^T X_j$$
$$\text{subject to} \ \ \mu_i \ge 0, \ i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$
• Quadratic cost function and linear constraints.
• The training data vectors appear only through inner products.
• The optimization is over $\Re^n$, irrespective of the dimension of the $X_i$.
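• The sketch below (an illustration on assumed toy data, not part of the lecture) solves this dual with cvxpy; note that the data enter only through the inner products $X_i^T X_j$, here packed into the rows $y_i X_i$.

```python
# Sketch: the hard-margin SVM dual solved with cvxpy on assumed toy data.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

mu = cp.Variable(n)
G = X * y[:, None]                        # rows are y_i X_i
# q(mu) = sum_i mu_i - (1/2) || sum_i mu_i y_i X_i ||^2  (same quadratic form)
objective = cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu))
constraints = [mu >= 0, y @ mu == 0]
cp.Problem(objective, constraints).solve()

mu_star = mu.value
S = np.where(mu_star > 1e-6)[0]           # support vector indices
W_star = (mu_star[S] * y[S]) @ X[S]       # W* = sum_{i in S} mu_i* y_i X_i
b_star = y[S[0]] - X[S[0]] @ W_star       # b* = y_j - X_j^T W*
print("mu* =", mu_star, " W* =", W_star, " b* =", b_star)
```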
Optimal hyperplane
• The optimal hyperplane is a solution of
$$\min_{W, b} \ \frac{1}{2} W^T W \quad \text{subject to} \quad y_i (W^T X_i + b) \ge 1, \ i = 1, \ldots, n$$
• Instead of solving this primal problem, we solve the dual and obtain the optimal Lagrange multipliers, $\mu_i^*$.
• The dual problem is:
$$\max_{\mu} \ q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j X_i^T X_j$$
$$\text{subject to} \ \ \mu_i \ge 0, \ i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$
• Then the final solution is:
$$W^* = \sum_i \mu_i^* y_i X_i, \qquad b^* = y_j - X_j^T W^*, \ \ j \ \text{such that} \ \mu_j^* > 0$$
• So far, we assumed that the training data is linearly separable.
• What happens if the data is non-separable?
• The optimization problem has no feasible point (and hence no solution) if the data are not linearly separable.
• Hence we cannot find the optimal hyperplane by this formulation.
• We will modify the problem by introducing slack variables so that we can handle the general case.
Using slack variables
• When the data are not linearly separable, we can try:
$$\min \ \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \ \ y_i (W^T X_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n$$
• Optimization variables: $W, b, \xi_i$.
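• A minimal sketch (assumed toy data, for illustration only) of this slack-variable primal in cvxpy:

```python
# Sketch: the slack-variable (soft-margin) primal with cvxpy, on assumed
# toy data that is not linearly separable.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # last point lies on the 'wrong' side
n, m = X.shape
C = 1.0                                       # user-specified constant

W, b, xi = cp.Variable(m), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ W + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("W* =", W.value, " b* =", b.value)
print("slacks xi* =", xi.value)
```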
• A feasible solution always exists.
• $\xi_i$ measures the extent of violation of optimal separation.
• When $\xi_i > 0$, there is a 'margin error'. When $\xi_i > 1$, $X_i$ is wrongly classified.
• $C$ is a user-specified constant (like a regularization parameter).
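• To make this interpretation of the $\xi_i$ concrete, a tiny sketch (the slack values below are assumed, e.g. taken from a solved problem like the one above):

```python
# Sketch: reading off margin errors and misclassifications from slack values.
import numpy as np

xi_star = np.array([0.0, 0.0, 0.0, 0.3, 1.7])   # assumed optimal slacks
print("margin errors (xi_i > 0) at indices:", np.where(xi_star > 0)[0])
print("misclassified (xi_i > 1) at indices:", np.where(xi_star > 1)[0])
```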
• The optimization problem now is
$$\min_{W, b, \xi} \ \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \ \ 1 - \xi_i - y_i (W^T X_i + b) \le 0, \quad -\xi_i \le 0, \quad i = 1, \ldots, n$$
• The Lagrangian now is
$$L(W, b, \xi, \mu, \lambda) = \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \mu_i \left( 1 - \xi_i - y_i (W^T X_i + b) \right) - \sum_{i=1}^{n} \lambda_i \xi_i$$
• The $\mu_i$ are the Lagrange multipliers for the separability constraints, as earlier.
• The $\lambda_i$ are the Lagrange multipliers for the constraints $-\xi_i \le 0$.
The Kuhn-Tucker conditions give us
• $\nabla_W L = 0 \ \Rightarrow \ W^* = \sum_i \mu_i^* y_i X_i$
• $\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_i \mu_i^* y_i = 0$
• $\frac{\partial L}{\partial \xi_i} = 0 \ \Rightarrow \ \mu_i^* + \lambda_i^* = C, \ \forall i$
• $1 - \xi_i^* - y_i ((W^*)^T X_i + b^*) \le 0; \quad \xi_i^* \ge 0, \ \forall i$
• $\mu_i^* \ge 0; \quad \lambda_i^* \ge 0, \ \forall i$
• $\mu_i^* \left( 1 - \xi_i^* - y_i ((W^*)^T X_i + b^*) \right) = 0; \quad \lambda_i^* \xi_i^* = 0, \ \forall i$
• $W^*$ is given by the same expression as before.
• We also have $\mu_i^* + \lambda_i^* = C$ with $\mu_i^*, \lambda_i^* \ge 0$, so $0 \le \mu_i^* \le C, \ \forall i$.
• If $0 < \mu_i^* < C$, then $\lambda_i^* > 0$, which implies $\xi_i^* = 0$.
• By the complementary slackness condition, we then have $1 - y_i ((W^*)^T X_i + b^*) = 0$.
• Thus we get $b^*$ as
$$b^* = y_j - X_j^T W^*, \quad j \ \text{such that} \ 0 < \mu_j^* < C$$
• Thus, once again we need to find all the $\mu_i^*$.
• We can derive the dual as before. The dual function is
$$q(\mu, \lambda) = \inf_{W, b, \xi} L(W, b, \xi, \mu, \lambda)$$
where the Lagrangian is given by
$$L(W, b, \xi, \mu, \lambda) = \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \mu_i \left( 1 - \xi_i - y_i (W^T X_i + b) \right) - \sum_{i=1}^{n} \lambda_i \xi_i$$
• In the Lagrangian we have the term $\sum_i (C - \mu_i - \lambda_i) \xi_i$.
• Since we take the infimum w.r.t. $\xi_i$, we need to impose $C = \mu_i + \lambda_i, \ \forall i$ (otherwise the infimum is $-\infty$).
• When we impose this, all the terms containing $\lambda_i$ or $\xi_i$ drop out, and hence the $q$ function is the same as earlier.
• We only need to ensure (in the dual) that $\lambda_i \ge 0$ and $C = \mu_i + \lambda_i, \ \forall i$.
• This is easily done by ensuring $0 \le \mu_i \le C$.
The dual
• The dual problem now is:
$$\max_{\mu} \ q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j X_i^T X_j$$
$$\text{subject to} \ \ 0 \le \mu_i \le C, \ i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$
• The only difference is the upper bound on the $\mu_i$.
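• A sketch of this box-constrained dual in cvxpy (assumed, non-separable toy data; the recovery of $b^*$ uses a multiplier strictly between $0$ and $C$, as derived above):

```python
# Sketch: the soft-margin dual (0 <= mu_i <= C) with cvxpy on assumed toy data.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
n, C = len(y), 1.0

mu = cp.Variable(n)
G = X * y[:, None]                        # rows are y_i X_i
objective = cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu))
constraints = [mu >= 0, mu <= C, y @ mu == 0]
cp.Problem(objective, constraints).solve()

mu_star = mu.value
W_star = (mu_star * y) @ X
free = np.where((mu_star > 1e-6) & (mu_star < C - 1e-6))[0]
if len(free) > 0:                         # b* from a j with 0 < mu_j* < C
    b_star = y[free[0]] - X[free[0]] @ W_star
    print("mu* =", mu_star, " W* =", W_star, " b* =", b_star)
else:
    print("no multiplier strictly inside (0, C) for this data; mu* =", mu_star)
```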
• The primal problem is
$$\min \ \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \ \ y_i (W^T X_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n$$
• As $C \to \infty$, we get back the old problem.
• The dual problem is:
$$\max_{\mu} \ q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j X_i^T X_j$$
$$\text{subject to} \ \ 0 \le \mu_i \le C, \ i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$
• We solve the dual and the final optimal hyperplane is
$$W^* = \sum_i \mu_i^* y_i X_i, \qquad b^* = y_j - X_j^T W^*, \ \ j \ \text{such that} \ 0 < \mu_j^* < C$$
• By using slack variables, $\xi_i$, we can find the 'best' hyperplane classifier.
• In the dual, the only difference is an upper bound on the $\mu_i$.
• The best linear classifier may not be 'good enough'.
• How can we learn non-linear discriminant functions?
• Recall that the SVM idea is to transform $X_i$ into some other high-dimensional space and learn a linear classifier there.
[Figure: An example]
[Figure: Another example]
Non-linear discriminant functions
• In general, we can use a mapping $\phi : \Re^m \to \Re^{m'}$.
• In $\Re^{m'}$, the training set is $\{ (Z_i, y_i), \ i = 1, \ldots, n \}$, where $Z_i = \phi(X_i)$.
• We can find the optimal hyperplane by solving the dual (replacing $X_i^T X_j$ with $Z_i^T Z_j$).
• The dual problem now would be the following.
$$\max_{\mu} \ q(\mu) = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i,j=1}^{n} \mu_i \mu_j y_i y_j \phi(X_i)^T \phi(X_j)$$
$$\text{subject to} \ \ 0 \le \mu_i \le C, \ i = 1, \ldots, n, \qquad \sum_{i=1}^{n} y_i \mu_i = 0$$
• This is an optimization problem over $\Re^n$ (with quadratic cost function and linear constraints), irrespective of $\phi$ and $m'$.
• But is it computationally expensive?
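• To make the idea concrete, here is a sketch (all choices below, the map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ and the toy data, are assumptions for illustration) of solving this dual with $Z_i = \phi(X_i)$ computed explicitly:

```python
# Sketch: the dual in an explicit feature space Z_i = phi(X_i), with an
# assumed degree-2 map phi: R^2 -> R^3 and toy data not separable in R^2.
import numpy as np
import cvxpy as cp

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

# Inner points labelled +1, outer points labelled -1 (not linearly separable in R^2).
X = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5],
              [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
Z = np.array([phi(x) for x in X])         # Z_i = phi(X_i)
n, C = len(y), 10.0

mu = cp.Variable(n)
G = Z * y[:, None]                        # rows are y_i Z_i
objective = cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu))
constraints = [mu >= 0, mu <= C, y @ mu == 0]
cp.Problem(objective, constraints).solve()
print("mu* =", mu.value)                  # the data are separable in the phi-space
```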
Kernel function
• Suppose we have a function $K : \Re^m \times \Re^m \to \Re$ such that
$$K(X_i, X_j) = \phi(X_i)^T \phi(X_j)$$
This is called a Kernel function.
• Suppose the computation of $K(X_i, X_j)$ is about as expensive as that of $X_i^T X_j$.
• Replacing $Z_i^T Z_j$ by $K(X_i, X_j)$, we can solve the dual without ever computing any $\phi(X_i)$. This is efficient for obtaining the optimal hyperplane.
• What about storing $W^*$? And computing $\phi(X)^T W^*$ for new patterns?
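• As a sanity check (using the assumed degree-2 map from the earlier sketch): the kernel $K(x, z) = (x^T z)^2$ computes exactly $\phi(x)^T \phi(z)$, at roughly the cost of one ordinary inner product and without ever forming $\phi$.

```python
# Sketch: K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for the assumed map
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return (x @ z) ** 2                   # about as cheap as x^T z itself

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(K(x, z), phi(x) @ phi(z))
print("K(x, z) =", K(x, z), "= phi(x)^T phi(z) =", phi(x) @ phi(z))
```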
Kernel function based classifier
• Let $\mu^*$ be the solution of the dual. Then $W^* = \sum_i \mu_i^* y_i \phi(X_i)$.
• Then we have
$$b^* = y_j - \phi(X_j)^T W^* = y_j - \sum_i \mu_i^* y_i \phi(X_i)^T \phi(X_j)$$
• Given a new pattern $X$, we only need to compute
$$f(X) = Z^T W^* + b^* = \sum_i \mu_i^* y_i \phi(X_i)^T \phi(X) + b^* = \sum_i \mu_i^* y_i K(X_i, X) + \Big( y_j - \sum_i \mu_i^* y_i K(X_i, X_j) \Big)$$
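• The sketch below pulls these pieces together on the assumed toy data from earlier (inner ring $+1$, outer ring $-1$) with $K(x, z) = (x^T z)^2$: the dual is solved using only the Gram matrix of kernel values (here via scipy's general-purpose SLSQP solver, just one convenient choice), and a new pattern is classified through $f(X)$ without ever computing $\phi$.

```python
# Sketch: a kernel-based classifier f(X) = sum_i mu_i* y_i K(X_i, X) + b*,
# built from the dual with the polynomial kernel K(x, z) = (x^T z)^2.
import numpy as np
from scipy.optimize import minimize

def K(x, z):
    return (x @ z) ** 2

X = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5],
              [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])
n, C = len(y), 10.0

# Gram matrix of kernel values; phi is never needed anywhere below.
Kmat = np.array([[K(xi, xj) for xj in X] for xi in X])
Q = np.outer(y, y) * Kmat
res = minimize(lambda mu: 0.5 * mu @ Q @ mu - np.sum(mu),
               x0=np.zeros(n),
               jac=lambda mu: Q @ mu - np.ones(n),
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda mu: y @ mu}],
               method="SLSQP")
mu_star = res.x

# b* from some j with 0 < mu_j* < C (assumed to exist for this data).
j = int(np.argmax((mu_star > 1e-6) & (mu_star < C - 1e-6)))
b_star = y[j] - np.sum(mu_star * y * Kmat[:, j])

def f(x_new):
    return np.sum(mu_star * y * np.array([K(xi, x_new) for xi in X])) + b_star

print("f([0.3, 0.3]) =", f(np.array([0.3, 0.3])))   # expected > 0 (inner class)
print("f([1.8, 1.8]) =", f(np.array([1.8, 1.8])))   # expected < 0 (outer class)
```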
• This is an interesting way of learning nonlinear classifiers.
• We solve the dual, whose dimension is $n$, the number of examples.
• All we need to store are:
  the non-zero Lagrange multipliers, $\mu_i^* > 0$, and
  the support vectors, $X_i$ for $i$ s.t. $\mu_i^* > 0$.
• We never need to enter the '$\phi(X)$' space!
Support Vector Machine
• Obtain the $\mu_i^*$ by solving the dual with $Z_i^T Z_j$ replaced by $K(X_i, X_j)$. (Choose a suitable Kernel function; use the 'penalty constant' $C$ if needed.)
• Store the non-zero $\mu_i^*$ and the corresponding support vectors.
• Classify any new pattern $X$ by the sign of
$$f(X) = \sum_i \mu_i^* y_i K(X_i, X) + \Big( y_j - \sum_i \mu_i^* y_i K(X_i, X_j) \Big)$$
• If we have a suitable Kernel function, we never need to compute $\phi(X)$.
• The range space of $\phi$ can even be infinite dimensional!
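• For completeness, an off-the-shelf route (an illustration, not part of the lecture) is scikit-learn's SVC, which solves this same dual internally; the kernel and the penalty constant $C$ are exactly the user choices discussed above. The data below are assumed toy values.

```python
# Sketch: the full SVM recipe with scikit-learn's SVC on assumed toy data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5],
              [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# Polynomial kernel (gamma x^T z + coef0)^degree; with gamma=1, coef0=0,
# degree=2 this is K(x, z) = (x^T z)^2. C is the penalty constant.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=10.0)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the stored X_i, i in S
print("y_i * mu_i* for them:", clf.dual_coef_)      # signed multipliers
print("prediction for [0.3, 0.3]:", clf.predict([[0.3, 0.3]]))
```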