Support Vector Machines & Kernels
Lecture 6
David Sontag
New York University
Slides adapted from Luke Zettlemoyer and Carlos Guestrin,
and Vibhav Gogate
Dual SVM derivation (1) – the linearly separable case
Original optimization problem:
  min_{w,b} ½ ||w||²   subject to   y_j (w · x_j + b) ≥ 1   for all j
Rewrite the constraints as y_j (w · x_j + b) − 1 ≥ 0, with one Lagrange multiplier α_j ≥ 0 per example.
Lagrangian:
  L(w, b, α) = ½ ||w||² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]
Our goal now is to solve:
  min_{w,b} max_{α ≥ 0} L(w, b, α)
Dual SVM derivation (2) – the linearly separable case
(Primal)   min_{w,b} max_{α ≥ 0} L(w, b, α)
Swap min and max:
(Dual)   max_{α ≥ 0} min_{w,b} L(w, b, α)
Slater's condition from convex optimization guarantees that
these two optimization problems are equivalent!
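Why the swap is justified, in one line (weak duality always holds; Slater's condition upgrades it to equality for this convex problem):

```latex
% Weak duality (always true):
\max_{\alpha \ge 0} \min_{w,b} L(w,b,\alpha) \;\le\; \min_{w,b} \max_{\alpha \ge 0} L(w,b,\alpha)
% Slater's condition (a strictly feasible point exists for this convex QP)
% implies strong duality: the inequality holds with equality.
```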
Dual SVM derivation (3) – the linearly separable case
[Figure: example feature mapping Φ(x) = (x^(1), …, x^(n), x^(1)x^(2), x^(1)x^(3), …, e^{x^(1)}, …)]
(Dual)   max_{α ≥ 0} min_{w,b} L(w, b, α)
Can solve for optimal w, b as a function of α:
  ∂L/∂w = w − Σ_j α_j y_j x_j = 0   ⟹   w = Σ_j α_j y_j x_j
  ∂L/∂b = − Σ_j α_j y_j = 0   ⟹   Σ_j α_j y_j = 0
Substituting these values back in (and simplifying), we obtain:
(Dual)   max_{α ≥ 0} Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k)   subject to   Σ_j α_j y_j = 0
Sums are over all training examples; α_j α_k y_j y_k are scalars; x_j · x_k is a dot product.
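The simplification referred to above, spelled out (standard algebra using w = Σ_j α_j y_j x_j and Σ_j α_j y_j = 0):

```latex
\begin{aligned}
L(w,b,\alpha) &= \tfrac{1}{2}\|w\|^2 - \sum_j \alpha_j \big[ y_j (w \cdot x_j + b) - 1 \big] \\
&= \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k)
   - \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k)
   - b \underbrace{\textstyle\sum_j \alpha_j y_j}_{=\,0}
   + \sum_j \alpha_j \\
&= \sum_j \alpha_j - \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k).
\end{aligned}
```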
So, in the dual formulation we will solve for α directly!
• w and b are computed from α (if needed)
Dual SVM derivation (3) – the linearly separable case
Lagrangian:
  L(w, b, α) = ½ ||w||² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]
α_j > 0 for some j implies the corresponding constraint is tight. We use this to obtain b:
(1)  y_j (w · x_j + b) = 1
(2)  w · x_j + b = y_j   (multiply both sides by y_j, using y_j² = 1)
(3)  b = y_j − w · x_j
Classification rule using dual solution
Using the dual solution, classify a new example x as:
  ŷ = sign( Σ_j α_j y_j (x_j · x) + b )
The sum only needs the dot product of the feature vector of the new example with the support vectors (the x_j with α_j > 0).
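A minimal end-to-end sketch (the toy data and the use of scipy's SLSQP solver are assumptions, not the lecture's method): solve the separable-case dual numerically, recover w and b, and apply the classification rule.

```python
# Sketch: solve  max_a  sum_j a_j - 1/2 sum_{j,k} a_j a_k y_j y_k (x_j . x_k)
#         s.t.   a_j >= 0,  sum_j a_j y_j = 0
# with a generic constrained optimizer, then recover w, b and predict.
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (assumed for illustration)
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T    # G[j,k] = y_j y_k (x_j . x_k)

def neg_dual(a):                             # minimize the negative dual objective
    return 0.5 * a @ G @ a - a.sum()

n = len(y)
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

w = (alpha * y) @ X                          # w = sum_j a_j y_j x_j
sv = np.argmax(alpha)                        # any j with a_j > 0 lies on the margin
b = y[sv] - w @ X[sv]                        # b = y_j - w . x_j  (tight constraint)

# Classification rule using the dual solution: sign( sum_j a_j y_j (x_j . x_new) + b )
x_new = np.array([1.5, 2.5])
print(np.sign((alpha * y) @ (X @ x_new) + b))
```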
Dual for the non-separable case
Primal: solve for w, b, and the slacks ξ_j:
  min_{w,b,ξ} ½ ||w||² + C Σ_j ξ_j   subject to   y_j (w · x_j + b) ≥ 1 − ξ_j,   ξ_j ≥ 0   for all j
Dual:
  max_α Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k)   subject to   Σ_j α_j y_j = 0,   0 ≤ α_j ≤ C
What changed?
• Added upper bound of C on α_j!
• Intuitive explanation:
  • Without slack, α_j → ∞ when constraints are violated (points misclassified)
  • Upper bound of C limits the α_j, so misclassifications are allowed
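A quick empirical check of the box constraint, as a sketch assuming scikit-learn (not the lecture's code): every α_j (reported as y_j α_j in dual_coef_) is capped at C.

```python
# Sketch: in the soft-margin dual, |alpha_j| <= C for every support vector.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping classes, so some points end up misclassified or inside the margin.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=1)

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)
print(np.abs(clf.dual_coef_).max() <= C + 1e-9)   # True: the alphas are bounded by C
```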
Support vectors
• Complementary slackness conditions:
  α_j* [ y_j (w* · x_j + b) − 1 + ξ_j* ] = 0   and   (C − α_j*) ξ_j* = 0
• Support vectors: points x_j such that y_j (w* · x_j + b) ≤ 1
  (includes all j such that α_j* > 0, but also possibly additional points
  where α_j* = 0 ∧ y_j (w* · x_j + b) = 1)
• Note: the SVM dual solution may not be unique!
Dual SVM interpretation: Sparsity
[Figure: margin boundaries w·x + b = +1 and w·x + b = −1 on either side of the separating hyperplane w·x + b = 0]
Final solution tends to be sparse:
• α_j = 0 for most j
• don't need to store these points to compute w or make predictions
Non-support vectors:
• α_j = 0
• moving them will not change w
Support vectors:
• α_j ≥ 0
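To see the sparsity numerically, a small sketch assuming scikit-learn (not part of the lecture): most α_j come out exactly zero, and only the support vectors need to be stored.

```python
# Sketch: inspect how sparse the dual solution is.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("training points: ", len(X))
print("support vectors: ", len(clf.support_))            # indices j with alpha_j > 0
print("y_j * alpha_j:   ", clf.dual_coef_.ravel()[:5])   # signed dual coefficients
# Predictions only need the support vectors; the rest of X can be discarded.
```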
SVM with kernels
• Never compute features explicitly!!!
  – Compute dot products in closed form
• Predict with:
  ŷ = sign( Σ_j α_j y_j K(x_j, x) + b )
• O(n²) time in size of dataset to compute objective
  – much work on speeding this up
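A sketch of what "predict with" looks like in code (the function and variable names here are illustrative, not from the lecture): the feature vectors Φ(x) are never built; K(x_j, x) is evaluated in closed form instead.

```python
# Sketch of kernelized prediction over the support vectors.
import numpy as np

def predict(x_new, X_sv, y_sv, alpha_sv, b, kernel):
    """sign( sum_j alpha_j y_j K(x_j, x_new) + b )"""
    k = np.array([kernel(x_j, x_new) for x_j in X_sv])
    return np.sign((alpha_sv * y_sv) @ k + b)

# e.g. a quadratic kernel evaluated in closed form:
quadratic = lambda u, v: (u @ v + 1) ** 2
```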
Quadratic kernel
[Tommi Jaakkola]
Quadratic kernel
Feature mapping given by:
[Cynthia Rudin]
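A concrete 2-D check, assuming the quadratic kernel K(u, v) = (u · v + 1)² and its standard explicit feature map: the kernel value equals a dot product in the expanded feature space.

```python
# Numerical check (2-D case): Phi(u) = (1, sqrt(2) u1, sqrt(2) u2, u1^2, u2^2, sqrt(2) u1 u2)
import numpy as np

def phi(u):
    u1, u2 = u
    return np.array([1.0, np.sqrt(2) * u1, np.sqrt(2) * u2,
                     u1 ** 2, u2 ** 2, np.sqrt(2) * u1 * u2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((u @ v + 1) ** 2)    # kernel, computed in closed form
print(phi(u) @ phi(v))     # same value via explicit features
```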
Common kernels
• Polynomials of degree exactly d:
  K(u, v) = (u · v)^d
• Polynomials of degree up to d:
  K(u, v) = (u · v + 1)^d
• Gaussian kernels:
  K(u, v) = exp( − ||u − v||² / 2σ² )   (||u − v||²: Euclidean distance, squared)
• And many others: very active area of research!
  (e.g., structured kernels that use dynamic programming to evaluate, string kernels, …)
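The same kernels written out as code, as a small sketch (σ and d are free parameters):

```python
import numpy as np

def poly_exact(u, v, d):      # polynomial of degree exactly d: (u . v)^d
    return (u @ v) ** d

def poly_up_to(u, v, d):      # polynomial of degree up to d: (u . v + 1)^d
    return (u @ v + 1) ** d

def gaussian(u, v, sigma):    # Gaussian/RBF: exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))
```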
Gaussian kernel
[Figure: decision surface of a Gaussian-kernel SVM; level sets, i.e. points where the decision function equals some constant r; support vectors highlighted]
[Cynthia Rudin] [mblondel.org]
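A minimal sketch (assuming scikit-learn; not the lecture's setup) of fitting a Gaussian-kernel SVM to data that is not linearly separable:

```python
# Gaussian-kernel SVM on concentric circles (gamma plays the role of 1/(2 sigma^2)).
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)
print("training accuracy:       ", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```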
Kernel algebra
Q: How would you prove that the “Gaussian kernel” is a valid kernel?
A: Expand the Euclidean norm as follows:
  exp( − ||u − v||² / 2σ² ) = exp( − ||u||² / 2σ² ) · exp( u · v / σ² ) · exp( − ||v||² / 2σ² )
To see that the middle factor is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c):
  exp( u · v / σ² ) = Σ_{n=0}^{∞} (u · v)^n / (σ^{2n} n!)
Then, apply (e) from above.
The feature mapping is infinite dimensional!
[Justin Domke]
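A small numerical sanity check (a sketch, not a proof): on any finite sample, a valid kernel's Gram matrix must be positive semidefinite.

```python
# Check that a Gaussian-kernel Gram matrix has no (significantly) negative eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True, up to numerical error
```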
Overfitting?
• Huge feature space with kernels: should we worry about
overfitting?
– SVM objective seeks a solution with large margin
• Theory says that large margin leads to good generalization
(we will see this in a couple of lectures)
– But everything overfits sometimes!!!
– Can control overfitting by:
• Setting C
• Choosing a better kernel
• Varying parameters of the kernel (width of Gaussian, etc.)
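A sketch of the usual way to tune these knobs in practice, assuming scikit-learn: cross-validate over C and the Gaussian kernel width (gamma).

```python
# Grid search over C and gamma with 5-fold cross-validation.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```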