Support Vector Machines
- Dual formulation and Kernel Trick
Aarti Singh
Machine Learning 10-315
Oct 28, 2020
Constrained Optimization – Dual Problem

Primal problem (running example):  min over x of x²  s.t.  x ≥ b

[Figure: plot of x² with the constraint region x ≥ b, shown for b positive]

Moving the constraint into the objective function gives the Lagrangian:
  L(x, α) = x² – α(x – b),  with α ≥ 0

Dual problem:  max over α ≥ 0 of  d(α) = min over x of L(x, α)
Connection between Primal and Dual

Primal problem:  p* = min over x of x²  s.t.  x ≥ b
Dual problem:    d* = max over α ≥ 0 of  min over x of L(x, α)

• Weak duality: the dual solution d* lower bounds the primal solution p*, i.e. d* ≤ p*
  Duality gap = p* – d*

• Strong duality: d* = p* holds for many problems of interest, e.g. if the primal is a feasible convex objective with linear constraints (Slater’s condition)
Connection between Primal and Dual

What does strong duality say about α* (the α that achieves the optimal value of the dual) and x* (the x that achieves the optimal value of the primal problem)?

Whenever strong duality holds, the following conditions (known as the KKT conditions) are true for α* and x*:

• 1. ∇L(x*, α*) = 0, i.e. the gradient of the Lagrangian at x* and α* is zero.
• 2. x* ≥ b, i.e. x* is primal feasible
• 3. α* ≥ 0, i.e. α* is dual feasible
• 4. α*(x* – b) = 0 (called complementary slackness)

We use the first one to relate x* and α*. We use the last one (complementary slackness) to argue that α* = 0 if the constraint is inactive and α* > 0 if the constraint is active and tight.
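As a quick instantiation (a sketch, assuming the running example min over x of x² s.t. x ≥ b), conditions 1 and 4 give:

\nabla_x L(x^*,\alpha^*) = 2x^* - \alpha^* = 0 \;\Rightarrow\; x^* = \frac{\alpha^*}{2},
\qquad
\alpha^*(x^* - b) = 0 \;\Rightarrow\; \alpha^* = 0 \;\text{ or }\; x^* = b.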
Solving the dual

Find the dual: the optimization over x is unconstrained.
Solve: now we need to maximize L(x*, α) over α ≥ 0.
Solve the unconstrained problem to get α′ and then take max(α′, 0):
  ⇒ α′ = 2b
α = 0: constraint is inactive;  α > 0: constraint is active (tight)
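Filling in the algebra behind α′ = 2b (a sketch for the running example min over x of x² s.t. x ≥ b):

\begin{aligned}
L(x,\alpha) &= x^2 - \alpha(x-b), \qquad \alpha \ge 0\\
\frac{\partial L}{\partial x} &= 2x - \alpha = 0 \;\Rightarrow\; x^* = \frac{\alpha}{2}\\
d(\alpha) &= L(x^*,\alpha) = -\frac{\alpha^2}{4} + \alpha b\\
\frac{\partial d}{\partial \alpha} &= -\frac{\alpha}{2} + b = 0 \;\Rightarrow\; \alpha' = 2b,
\qquad \alpha^* = \max(2b,\,0)
\end{aligned}

So for b ≤ 0 the constraint is inactive (α* = 0, x* = 0), while for b > 0 it is active and tight (α* = 2b, x* = b).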
Dual SVM – linearly separable case

n training points, d features: (x1, …, xn), where each xi is a d-dimensional vector

• Primal problem:
  w – weights on features (d-dimensional problem)

• Dual problem (derivation):
  α – weights on training points (n-dimensional problem)
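For concreteness, the standard hard-margin primal being referred to (the slide’s own equation did not survive extraction, so this is the usual textbook form):

\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad (w\cdot x_j + b)\,y_j \ge 1 \quad \forall j = 1,\dots,n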
Dual SVM – linearly separable case

• Dual problem (derivation):

If we can solve for the αs (dual problem), then we have a solution for w, b (primal problem)
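A sketch of the derivation the slide refers to, assuming the hard-margin primal above:

\begin{aligned}
L(w,b,\alpha) &= \frac{1}{2}\lVert w\rVert^2 - \sum_{j=1}^{n}\alpha_j\big[(w\cdot x_j + b)\,y_j - 1\big], \qquad \alpha_j \ge 0\\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{j=1}^{n}\alpha_j y_j x_j,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{j=1}^{n}\alpha_j y_j = 0
\end{aligned}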
Dual SVM – linearly separable case
• Dual problem:
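Substituting w = Σj αj yj xj back into the Lagrangian gives the standard dual QP (a sketch, since the slide’s equation is not in the extracted text):

\max_{\alpha}\;\; \sum_{j=1}^{n}\alpha_j - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j\alpha_k\,y_j y_k\,(x_j\cdot x_k)
\quad\text{s.t.}\quad \alpha_j \ge 0 \;\;\forall j, \qquad \sum_{j=1}^{n}\alpha_j y_j = 0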
Dual SVM – linearly separable case

Dual problem is also a QP
Solution gives the αjs
What about b?
Dual SVM: Sparsity of dual solution

[Figure: separating hyperplane w.x + b = 0 with its margin; points on the margin are labeled αj > 0, all other points are labeled αj = 0]

Only a few αjs can be non-zero: where the constraint is active and tight,
  (w.xj + b) yj = 1

Support vectors – training points j whose αjs are non-zero
Dual SVM – linearly separable case

Dual problem is also a QP
Solution gives the αjs

Use any one of the support vectors, with αk > 0, to compute b, since the constraint is tight:  (w.xk + b) yk = 1
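A minimal NumPy sketch of this recovery step (illustrative names, not from the slides; assumes a dual solver has already produced alpha for training data X of shape n×d and labels y in {−1, +1}):

import numpy as np

def recover_primal(alpha, X, y, tol=1e-8):
    """Recover (w, b) of the hard-margin SVM from a dual solution alpha."""
    # w is a weighted sum of training points: w = sum_j alpha_j * y_j * x_j
    w = (alpha * y) @ X
    # Any support vector (alpha_k > 0) has a tight constraint: (w.x_k + b) y_k = 1
    k = np.argmax(alpha > tol)
    b = y[k] - w @ X[k]   # since y_k is +1 or -1, (w.x_k + b) y_k = 1  =>  b = y_k - w.x_k
    return w, b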
Dual SVM – non-separable case

• Primal problem (with slack variables {ξj}):
  min over w, b, {ξj} of  ½ ||w||² + C Σj ξj
  s.t.  (w.xj + b) yj ≥ 1 – ξj  and  ξj ≥ 0  for all j

• Dual problem: introduce Lagrange multipliers α, μ (one αj and one μj per training point) and form the Lagrangian L(w, b, ξ, α, μ); derivation: HW3!
Dual SVM – non-separable case

The upper bound on αj in the dual (αj ≤ C) comes from ∂L/∂ξ = 0

Intuition: if C → ∞, we recover the hard-margin SVM

Dual problem is also a QP
Solution gives the αjs
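For reference, the soft-margin dual that this leads to (a sketch of the standard result; deriving it is the HW3 exercise referenced above):

\max_{\alpha}\;\; \sum_{j=1}^{n}\alpha_j - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j\alpha_k\,y_j y_k\,(x_j\cdot x_k)
\quad\text{s.t.}\quad 0 \le \alpha_j \le C \;\;\forall j, \qquad \sum_{j=1}^{n}\alpha_j y_j = 0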
So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal (especially in high dimensions, d >> n)

• But, more importantly, the “kernel trick”!!!
Separable using higher-order features

[Figure: data that is not linearly separable in the original (x1, x2) space becomes linearly separable using higher-order features, e.g. the radius r = √(x1² + x2²) and angle θ, or x1² plotted against x1]
What if data is not linearly separable?

Use features of features of features of features….

  Φ(x) = (x1², x2², x1x2, …, exp(x1))

Feature space becomes really large very quickly!
Higher Order Polynomials

m – number of input features, d – degree of polynomial

The number of terms grows fast!  For d = 6, m = 100: about 1.6 billion terms
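A quick check on that count (assuming it refers to the number of degree-d monomials in m variables):

\binom{m+d-1}{d} = \binom{105}{6} = 1{,}609{,}344{,}100 \approx 1.6\times 10^{9}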
Dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly, as long as we can compute the dot product fast using some kernel K
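Concretely, replacing every dot product in the dual with a kernel evaluation gives (standard kernelized form, stated as a sketch):

\max_{\alpha}\;\; \sum_{j}\alpha_j - \frac{1}{2}\sum_{j}\sum_{k}\alpha_j\alpha_k\,y_j y_k\,K(x_j,x_k),
\qquad K(x_j,x_k) = \Phi(x_j)\cdot\Phi(x_k)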
Dot Product of Polynomials

d = 1
d = 2
general d
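The equations themselves are missing from the extracted text; as a sketch of the standard calculation for two input features (the √2 scaling on the cross term is the usual convention, assumed here):

\begin{aligned}
d=1:\;& \Phi(u) = (u_1, u_2), & \Phi(u)\cdot\Phi(v) &= u_1 v_1 + u_2 v_2 = u\cdot v\\
d=2:\;& \Phi(u) = (u_1^2,\ \sqrt{2}\,u_1 u_2,\ u_2^2), & \Phi(u)\cdot\Phi(v) &= (u\cdot v)^2\\
\text{general } d:\;& & \Phi(u)\cdot\Phi(v) &= (u\cdot v)^d
\end{aligned}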
Finally: The Kernel Trick!

• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
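A small NumPy check of the trick for the degree-2 polynomial features above (illustrative names; assumes the √2-scaled feature map):

import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for a 2-d input (with sqrt(2) cross term)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = phi2(u) @ phi2(v)   # dot product in the expanded feature space
kernel   = (u @ v) ** 2        # same value, computed without ever building phi

print(explicit, kernel)        # equal up to floating-point rounding: u.v = 1, so (u.v)^2 = 1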
Common Kernels
• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian/Radial kernels (polynomials of all orders – recall the series expansion of exp)
• Sigmoid
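The standard forms of these kernels (a sketch; the parameter names σ, η, ν are assumptions, not from the slides):

\begin{aligned}
\text{Polynomial of degree } d:\quad & K(u,v) = (u\cdot v)^d\\
\text{Polynomial of degree up to } d:\quad & K(u,v) = (1 + u\cdot v)^d\\
\text{Gaussian / RBF:}\quad & K(u,v) = \exp\!\left(-\frac{\lVert u - v\rVert^2}{2\sigma^2}\right)\\
\text{Sigmoid:}\quad & K(u,v) = \tanh(\eta\, u\cdot v + \nu)
\end{aligned}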
Mercer Kernels

What functions are valid kernels that correspond to feature vectors φ(x)?

Answer: Mercer kernels K
• K is continuous
• K is symmetric
• K is positive semi-definite, i.e. the Gram matrix K satisfies zᵀKz ≥ 0 for all z

This ensures the dual optimization is a concave maximization
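A quick numerical way to check the positive semi-definiteness condition on data (a sketch; builds the Gram matrix for an assumed RBF kernel and inspects its eigenvalues):

import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.randn(50, 3)          # 50 points in 3 dimensions
K = rbf_gram(X)
eigvals = np.linalg.eigvalsh(K)     # symmetric matrix -> real eigenvalues
print(eigvals.min() >= -1e-10)      # PSD: all eigenvalues are (numerically) non-negative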
Overfitting

• Huge feature space with kernels, what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for a simple hypothesis with large margin
  – Often robust to overfitting
What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with Kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors αi
• At classification time, compute the decision value below and classify by its sign:
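In kernel form (the standard expression, stated here as a sketch):

w\cdot\Phi(x) + b = \sum_{i}\alpha_i\, y_i\, K(x_i, x) + b,
\qquad
\hat{y} = \operatorname{sign}\!\Big(\sum_{i}\alpha_i\, y_i\, K(x_i, x) + b\Big)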
SVMs with Kernels
• Iris dataset, class 2 vs classes {1, 3}, Linear Kernel
SVMs with Kernels
• Iris dataset, class 1 vs classes {2, 3}, Polynomial Kernel of degree 2
SVMs with Kernels
• Iris dataset, class 1 vs classes {2, 3}, Gaussian RBF kernel
SVMs with Kernels
• Iris dataset, class 1 vs classes {2, 3}, Gaussian RBF kernel
SVMs with Kernels
• Chessboard dataset, Gaussian RBF kernel
SVMs with Kernels
• Chessboard dataset, Polynomial kernel
USPS Handwritten digits
SVMs vs. Logistic Regression

                SVMs         Logistic Regression
Loss function   Hinge loss   Log-loss
SVMs vs. Logistic Regression

SVM: hinge loss
Logistic Regression: log loss (negative log conditional likelihood)

[Figure: 0-1 loss, hinge loss, and log loss plotted against the classifier margin; x-axis ticks at −1, 0, 1]
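Written out in terms of the margin m = (w.x + b) y (standard definitions, stated as a sketch):

\ell_{0/1}(m) = \mathbf{1}[\,m \le 0\,],
\qquad
\ell_{\mathrm{hinge}}(m) = \max(0,\,1-m),
\qquad
\ell_{\mathrm{log}}(m) = \log\!\big(1 + e^{-m}\big)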
SVMs vs. Logistic Regression

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       “Margin”      Real probabilities
Kernels in Logistic Regression

• Define the weights in terms of the features:
• Derive a simple gradient descent rule on the αi
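The missing equation is presumably the kernelized expansion of the weights (an assumption, since only the slide text survived extraction):

w = \sum_{j}\alpha_j\,\Phi(x_j)
\quad\Rightarrow\quad
w\cdot\Phi(x) + b = \sum_{j}\alpha_j\,K(x_j, x) + b

Gradient descent on the log loss can then be carried out directly on the αj, so predictions only ever need kernel evaluations.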
What you need to know
• Maximizing margin
• Derivation of SVM formulation
• Slack variables and hinge loss
• Relationship between SVMs and logistic regression
– 0/1 loss
– Hinge loss
– Log loss
• Tackling multiple classes
– One against All
– Multiclass SVMs
• Dual SVM formulation
– Easier to solve when the dimension is high (d > n)
– Kernel Trick