SVM Optimization Explained

The document discusses the Support Vector Machine (SVM) method, focusing on the concept of the optimal hyperplane and the formulation of an optimization problem to learn it under the assumption of linearly separable training data. It outlines the Lagrangian formulation, Kuhn-Tucker conditions, and the dual optimization problem, emphasizing the role of support vectors in determining the optimal hyperplane. Additionally, it addresses the challenges faced when the training data is not linearly separable.

Slides from the PR NPTEL course.

• We have been discussing the SVM method.
• We explained the concept of the optimal hyperplane.
• We formulated an optimization problem to learn the optimal hyperplane, assuming the training set is linearly separable.
• We will quickly review this SVM algorithm before discussing the more general case.


The optimization problem for SVM

• The optimal hyperplane is a solution of the following constrained optimization problem.
• Find W ∈ ℜ^m, b ∈ ℜ to

    minimize    (1/2) W^T W
    subject to  1 − y_i(W^T X_i + b) ≤ 0,  i = 1, …, n

• Quadratic cost function and linear (inequality) constraints.
• Kuhn-Tucker conditions are necessary and sufficient: every local minimum is a global minimum.
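
The problem above is a standard quadratic program and can be handed to any convex-optimization package. A minimal sketch (added here, not part of the original slides), assuming the cvxpy package is available and using a small, hypothetical, linearly separable data set:

```python
# Hard-margin SVM primal: minimize (1/2) W^T W
# subject to 1 - y_i (W^T X_i + b) <= 0 for all i.
# Sketch only: assumes cvxpy is installed and the toy data are separable.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])   # hypothetical X_i (n x m)
y = np.array([1.0, -1.0, 1.0])                          # labels y_i in {+1, -1}
n, m = X.shape

W = cp.Variable(m)
b = cp.Variable()
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(W)),               # (1/2) W^T W
    [cp.multiply(y, X @ W + b) >= 1],                   # y_i (W^T X_i + b) >= 1
)
problem.solve()
print("W* =", W.value, " b* =", b.value)                # approx. (0.5, 0.5) and 0
```
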
Optimal hyperplane (figure)

Non-optimal hyperplane (figure)


• The Lagrangian is given by

    L(W, b, µ) = (1/2) W^T W + Σ_{i=1}^{n} µ_i [1 − y_i(W^T X_i + b)]

• The Kuhn-Tucker conditions give

    ∇_W L = 0  ⇒  W* = Σ_{i=1}^{n} µ*_i y_i X_i
    ∂L/∂b = 0  ⇒  Σ_{i=1}^{n} µ*_i y_i = 0
    1 − y_i(X_i^T W* + b*) ≤ 0,  ∀i
    µ*_i ≥ 0  and  µ*_i [1 − y_i(X_i^T W* + b*)] = 0,  ∀i

• Let S = {i | µ*_i > 0}.
• By the complementary slackness condition,

    i ∈ S  ⇒  y_i(X_i^T W* + b*) = 1,

  which implies that X_i is closest to the separating hyperplane.
• {X_i | i ∈ S} are called support vectors. We have

    W* = Σ_i µ*_i y_i X_i = Σ_{i∈S} µ*_i y_i X_i

• The optimal W* is a linear combination of the support vectors (a small sketch follows below).
• Support vectors constitute a very useful output of the method.
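
To make this concrete, here is a small numpy illustration (added here, not from the slides). For this tiny data set the optimal multipliers work out to µ* = (0.25, 0.25, 0), which we simply state rather than compute:

```python
# Pick out the support vector set S = {i : mu*_i > 0} and form
# W* = sum_{i in S} mu*_i y_i X_i.  The mu_star values are the optimal
# multipliers for this toy data set (stated, not computed here).
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
mu_star = np.array([0.25, 0.25, 0.0])

tol = 1e-8                                # treat tiny values as numerically zero
S = np.where(mu_star > tol)[0]            # indices of the support vectors
W_star = (mu_star[S] * y[S]) @ X[S]       # combination of support vectors only
print("S =", S, " W* =", W_star)          # S = [0 1],  W* = [0.5 0.5]
```
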
Optimal hyperplane (figure)


The SVM solution

• The optimal hyperplane (W*, b*) is given by:

    W* = Σ_i µ*_i y_i X_i = Σ_{i∈S} µ*_i y_i X_i
    b* = y_j − X_j^T W*,  for any j such that µ*_j > 0
    (Note that µ*_j > 0 ⇒ y_j(X_j^T W* + b*) = 1.)

• Thus, W* and b* are determined by the µ*_i, i = 1, …, n.
• We can use the dual of the optimization problem to get the µ*_i.


Dual optimization problem for SVM

• The dual function is

    q(µ) = inf_{W,b} { (1/2) W^T W + Σ_{i=1}^{n} µ_i [1 − y_i(W^T X_i + b)] }

• If Σ_i µ_i y_i ≠ 0 then q(µ) = −∞ (the Lagrangian is linear in b, so the infimum over b is unbounded below).
• Hence we need to maximize q only over those µ such that Σ_i µ_i y_i = 0.
• The infimum w.r.t. W is attained at W = Σ_i µ_i y_i X_i.
• We obtain the dual by substituting W = Σ_i µ_i y_i X_i and imposing Σ_i µ_i y_i = 0.

• By substituting W = Σ_i µ_i y_i X_i and Σ_i µ_i y_i = 0 we get

    q(µ) = (1/2) W^T W + Σ_{i=1}^{n} µ_i − Σ_{i=1}^{n} µ_i y_i (W^T X_i + b)

         = (1/2) (Σ_i µ_i y_i X_i)^T (Σ_j µ_j y_j X_j) + Σ_i µ_i − Σ_i µ_i y_i X_i^T (Σ_j µ_j y_j X_j)

         = Σ_i µ_i − (1/2) Σ_i Σ_j µ_i y_i µ_j y_j X_i^T X_j

  (the b term drops out because Σ_i µ_i y_i = 0).

• Thus, the dual problem is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  µ_i ≥ 0, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• Quadratic cost function and linear constraints.
• The training data vectors appear only through inner products X_i^T X_j.
• The optimization is over ℜ^n, irrespective of the dimension of X_i.

Optimal hyperplane

• The optimal hyperplane is a solution of

    min_{W,b}   (1/2) W^T W
    subject to  y_i(W^T X_i + b) ≥ 1,  i = 1, …, n

• Instead of solving this primal problem, we solve the dual and obtain the optimal Lagrange multipliers, µ*_i.


• The dual problem is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  µ_i ≥ 0, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• Then the final solution is:

    W* = Σ_i µ*_i y_i X_i,   b* = y_j − X_j^T W*,  for any j such that µ*_j > 0
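
As a sketch of this route (added here, again assuming cvxpy and the same toy data as above), note that Σ_{i,j} µ_i µ_j y_i y_j X_i^T X_j = ||Σ_i µ_i y_i X_i||², which keeps the objective in a form the solver accepts; W* and b* are then recovered exactly as stated:

```python
# Solve the hard-margin dual, then recover W* and b*.
# Sketch only: assumes cvxpy is installed; data are a hypothetical toy set.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
n = X.shape[0]

G = y[:, None] * X                       # rows are y_i X_i, so G.T @ mu = sum_i mu_i y_i X_i
mu = cp.Variable(n)
dual = cp.Problem(
    cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu)),
    [mu >= 0, y @ mu == 0],
)
dual.solve()

mu_star = mu.value
W_star = G.T @ mu_star                   # W* = sum_i mu*_i y_i X_i
j = int(np.argmax(mu_star))              # some index with mu*_j > 0
b_star = y[j] - X[j] @ W_star            # b* = y_j - X_j^T W*
print("mu* =", mu_star, " W* =", W_star, " b* =", b_star)
```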


• So far, we assumed that the training data are linearly separable.
• What happens if the data are non-separable?
• The optimization problem has no feasible point (and hence no solution) if the data are not linearly separable.
• Hence we cannot find the optimal hyperplane by this formulation.
• We will modify the problem by introducing slack variables so that we can handle the general case.


Using slack variables

• When the data are not linearly separable, we can try:

    minimize    (1/2) W^T W + C Σ_{i=1}^{n} ξ_i
    subject to  y_i(W^T X_i + b) ≥ 1 − ξ_i,  i = 1, …, n
                ξ_i ≥ 0,  i = 1, …, n

• Optimization variables: W, b, ξ_i.
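
A sketch of this modified primal (added here), under the same assumption that cvxpy is available; the third point below lies between two points of the opposite class, so this toy set is genuinely not linearly separable:

```python
# Soft-margin primal with explicit slack variables xi_i (sketch, assuming cvxpy).
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])     # (1.2, 1.2) lies between two +1 points
n, m = X.shape
C = 1.0                                  # user-chosen penalty constant

W, b, xi = cp.Variable(m), cp.Variable(), cp.Variable(n)
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi)),
    [cp.multiply(y, X @ W + b) >= 1 - xi, xi >= 0],
)
problem.solve()
print("W* =", W.value, " b* =", b.value, " xi* =", xi.value)
```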


• A feasible solution always exists.
• ξ_i measures the extent of violation of the separation constraints.
• When ξ_i > 0, there is a 'margin error'. When ξ_i > 1, X_i is wrongly classified.
• C is a user-specified constant (like a regularization parameter).
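
For any fixed (W, b), the smallest feasible slack for point i is ξ_i = max(0, 1 − y_i(W^T X_i + b)); this is not stated on the slide but follows directly from the two constraints on ξ_i. A small numpy illustration (added here) with a hypothetical candidate hyperplane:

```python
# Slacks needed by a fixed candidate hyperplane (W, b), illustration only:
# xi_i in (0, 1] is a margin error; xi_i > 1 means X_i is misclassified.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W, b = np.array([0.5, 0.5]), 0.0              # some candidate hyperplane

margins = y * (X @ W + b)                     # y_i (W^T X_i + b)
xi = np.maximum(0.0, 1.0 - margins)           # smallest feasible slacks
print("xi            :", xi)                  # [0.  0.  2.2 0. ]
print("margin errors :", int(np.sum(xi > 0))) # 1
print("misclassified :", int(np.sum(xi > 1))) # 1
```
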
• The optimization problem now is

    min_{W,b,ξ}  (1/2) W^T W + C Σ_{i=1}^{n} ξ_i
    subject to   1 − ξ_i − y_i(W^T X_i + b) ≤ 0,  i = 1, …, n
                 −ξ_i ≤ 0,  i = 1, …, n


• The Lagrangian now is

    L(W, b, ξ, µ, λ) = (1/2) W^T W + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} µ_i (1 − ξ_i − y_i(W^T X_i + b)) − Σ_{i=1}^{n} λ_i ξ_i

• µ_i are the Lagrange multipliers for the separability constraints, as earlier.
• λ_i are the Lagrange multipliers for the constraints −ξ_i ≤ 0.

The Kuhn-Tucker conditions give us

• ∇_W L = 0  ⇒  W* = Σ_i µ*_i y_i X_i
• ∂L/∂b = 0  ⇒  Σ_i µ*_i y_i = 0
• ∂L/∂ξ_i = 0  ⇒  µ*_i + λ*_i = C,  ∀i
• 1 − ξ_i − y_i(W^T X_i + b) ≤ 0;  ξ_i ≥ 0,  ∀i
• µ_i ≥ 0;  λ_i ≥ 0,  ∀i
• µ_i (1 − ξ_i − y_i(W^T X_i + b)) = 0;  λ_i ξ_i = 0,  ∀i


• W* is given by the same expression as before.
• We also have µ*_i + λ*_i = C with µ*_i, λ*_i ≥ 0, hence 0 ≤ µ*_i ≤ C, ∀i.
• If 0 < µ_i < C, then λ_i > 0, which implies ξ_i = 0.
• By the complementary slackness condition, we then have 1 − y_i(W^T X_i + b) = 0.
• Thus we get b* as

    b* = y_j − X_j^T W*,  for any j such that 0 < µ*_j < C

• Thus, once again, we need to find all the µ*_i.


• We can derive the dual as before. The dual function is

    q(µ, λ) = inf_{W,b,ξ} L(W, b, ξ, µ, λ)

  where the Lagrangian is given by

    L(W, b, ξ, µ, λ) = (1/2) W^T W + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} µ_i (1 − ξ_i − y_i(W^T X_i + b)) − Σ_{i=1}^{n} λ_i ξ_i


• In the Lagrangian we have the term Σ_i (C − µ_i − λ_i) ξ_i.
• Since we take the infimum w.r.t. ξ_i (over all of ℜ), we need to impose C = µ_i + λ_i, ∀i; otherwise the infimum is −∞.
• When we impose this, all the terms containing λ_i or ξ_i drop out, and hence the q function is the same as earlier.
• We only need to ensure (in the dual) that λ_i ≥ 0 and C = µ_i + λ_i, ∀i.
• This is easily done by ensuring 0 ≤ µ_i ≤ C.


The dual

• The dual problem now is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  0 ≤ µ_i ≤ C, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• The only difference: an upper bound also on the µ_i.
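
In the earlier cvxpy sketch only the constraint list changes (an added illustration, assuming the same non-separable toy data as above); b* is then read off any multiplier strictly between 0 and C:

```python
# Soft-margin dual: same objective, with the extra upper bound mu <= C.
# Sketch only, assuming cvxpy; toy data as in the earlier sketches.
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = X.shape[0], 1.0

G = y[:, None] * X                            # rows y_i X_i
mu = cp.Variable(n)
dual = cp.Problem(
    cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(G.T @ mu)),
    [mu >= 0, mu <= C, y @ mu == 0],          # only mu <= C is new
)
dual.solve()

W_star = G.T @ mu.value
inside = np.where((mu.value > 1e-6) & (mu.value < C - 1e-6))[0]   # 0 < mu_j < C
b_star = y[inside[0]] - X[inside[0]] @ W_star
print("mu* =", mu.value, " W* =", W_star, " b* =", b_star)
```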


• The primal problem is

    minimize    (1/2) W^T W + C Σ_{i=1}^{n} ξ_i
    subject to  y_i(W^T X_i + b) ≥ 1 − ξ_i,  i = 1, …, n
                ξ_i ≥ 0,  i = 1, …, n

• As C → ∞, we get back the old problem.


• The dual problem is:

    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j X_i^T X_j
    subject to  0 ≤ µ_i ≤ C, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• We solve the dual, and the final optimal hyperplane is

    W* = Σ_i µ*_i y_i X_i,
    b* = y_j − X_j^T W*,  for any j such that 0 < µ*_j < C.

• By using slack variables ξ_i, we can find the 'best' hyperplane classifier.
• In the dual, the only difference is an upper bound on the µ_i.
• The best linear classifier may not be 'good enough'.
• How can we learn non-linear discriminant functions?
• Recall that the SVM idea is to transform the X_i into some other high-dimensional space and learn a linear classifier there (a small illustration follows below).
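
Here is a minimal illustration of this idea (added here; the lecture's own examples are in the figures that follow): a one-dimensional data set that no single threshold can separate becomes linearly separable after the quadratic map φ(x) = (x, x²).

```python
# Illustration: phi(x) = (x, x^2) makes a 1-D "inside vs. outside" problem
# linearly separable in 2-D.  Not from the lecture; hypothetical data.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0])   # +1 far from zero, -1 near zero

Z = np.stack([x, x ** 2], axis=1)        # Z_i = phi(x_i) in R^2

# In the mapped space the hyperplane z2 = 2 (i.e. W = (0, 1), b = -2)
# separates the classes: every y_i (W^T Z_i + b) is positive.
W, b = np.array([0.0, 1.0]), -2.0
print(y * (Z @ W + b) > 0)               # all True: separable in phi-space
```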


An example (figure)

Another example (figure)


Non-linear discriminant functions

• In general, we can use a mapping φ : ℜ^m → ℜ^{m′}.
• In ℜ^{m′}, the training set is {(Z_i, y_i), i = 1, …, n}, with Z_i = φ(X_i).
• We can find the optimal hyperplane by solving the dual (replacing X_i^T X_j with Z_i^T Z_j).
• The dual problem now would be the following.


    max_µ  q(µ) = Σ_{i=1}^{n} µ_i − (1/2) Σ_{i,j=1}^{n} µ_i µ_j y_i y_j φ(X_i)^T φ(X_j)
    subject to  0 ≤ µ_i ≤ C, i = 1, …, n,  and  Σ_{i=1}^{n} y_i µ_i = 0

• This is an optimization problem over ℜ^n (with a quadratic cost function and linear constraints), irrespective of φ and m′.
• But is it computationally expensive?

Kernel function

• Suppose we have a function K : ℜ^m × ℜ^m → ℜ such that

    K(X_i, X_j) = φ(X_i)^T φ(X_j)

  This is called a Kernel function.
• Suppose the computation of K(X_i, X_j) is about as expensive as that of X_i^T X_j.
• Replacing Z_i^T Z_j by K(X_i, X_j), we can solve the dual without ever computing any φ(X_i). This is efficient for obtaining the optimal hyperplane.
• What about storing W*? Computing φ(X)^T W* for new patterns?
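
A small numpy check of this idea (an added illustration, not from the slides): for m = 2 the kernel K(x, z) = (x^T z)² corresponds to the map φ(x) = (x₁², √2 x₁x₂, x₂²), yet evaluating K needs only one inner product. The Gaussian (RBF) kernel at the end is another commonly used choice, whose φ is infinite dimensional.

```python
# Verify K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for the quadratic map.
import numpy as np

def phi(x):
    # explicit feature map for the homogeneous quadratic kernel, m = 2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return (x @ z) ** 2                  # needs only one inner product

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))          # both print 1.0

def rbf(x, z, gamma=0.5):
    # Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))
```
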
Kernel function based classifier

• Let µ* be the solution of the dual. Then W* = Σ_i µ*_i y_i φ(X_i).
• Then we have

    b* = y_j − φ(X_j)^T W* = y_j − Σ_i µ*_i y_i φ(X_i)^T φ(X_j)

• Given a new pattern X, we only need to compute

    f(X) = Z^T W* + b*      (where Z = φ(X))
         = Σ_i µ*_i y_i φ(X_i)^T φ(X) + b*
         = Σ_i µ*_i y_i K(X_i, X) + (y_j − Σ_i µ*_i y_i K(X_i, X_j))
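
A sketch of such a classifier in numpy (added here; the stored multipliers, support vectors and b* below are hypothetical placeholders, as if produced by the training step):

```python
# Kernel classifier: store only the support vectors, their labels, the
# non-zero mu*_i and b*; phi is never computed.  Stored values are
# hypothetical placeholders for what training would return.
import numpy as np

def K(x, z, gamma=0.5):                       # Gaussian kernel (one possible choice)
    return np.exp(-gamma * np.sum((x - z) ** 2))

SV = np.array([[1.0, 1.0], [-1.0, -1.0]])     # support vectors X_i, i in S
y_sv = np.array([1.0, -1.0])                  # their labels
mu_sv = np.array([0.6, 0.6])                  # non-zero multipliers (assumed)
b_star = 0.0                                  # offset (assumed)

def f(x):
    # f(x) = sum_{i in S} mu*_i y_i K(X_i, x) + b*
    return sum(m * yi * K(xi, x) for m, yi, xi in zip(mu_sv, y_sv, SV)) + b_star

x_new = np.array([0.8, 1.2])
print("class:", 1 if f(x_new) > 0 else -1)    # prints: class: 1
```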


• This is an interesting way of learning nonlinear classifiers.
• We solve the dual, whose dimension is n, the number of examples.
• All we need to store are:
  the non-zero Lagrange multipliers, µ*_i > 0, and
  the support vectors, X_i for i such that µ*_i > 0.
• We never need to enter the 'φ(X)' space!


Support Vector Machine

• Obtain the µ*_i by solving the dual with Z_i^T Z_j replaced by K(X_i, X_j). (Choose a suitable Kernel function; use the 'penalty constant' C if needed.)
• Store the non-zero µ*_i and the corresponding support vectors.
• Classify any new pattern X by the sign of

    f(X) = Σ_i µ*_i y_i K(X_i, X) + (y_j − Σ_i µ*_i y_i K(X_i, X_j))

• If we have a suitable Kernel function, we never need to compute φ(X).
• The range space of φ can even be infinite dimensional!
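
In practice this whole procedure is available in standard libraries; a minimal sketch assuming scikit-learn is installed (its SVC class solves the dual for a chosen kernel and penalty C, and stores the support vectors and their multipliers):

```python
# End-to-end sketch with scikit-learn (assumed installed): fit an SVM with a
# Gaussian kernel, then inspect the stored support vectors and multipliers.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [1.2, 1.2], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)     # kernel and penalty chosen by the user
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("y_i * mu*_i for the SVs:", clf.dual_coef_)
print("prediction for (0.5, 0.5):", clf.predict([[0.5, 0.5]]))
```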
