Support Vector Machines
- Dual formulation and Kernel Trick
Aarti Singh
Machine Learning 10-315
Oct 28, 2020
Constrained Optimization – Dual Problem

Primal problem (running example):  min over x of x²  s.t.  x ≥ b

[Figure: plot of x² with the constraint region x ≥ b, shown for b positive]

Moving the constraint into the objective function gives the Lagrangian:
  L(x, α) = x² – α(x – b),  with α ≥ 0

Dual problem:  max over α ≥ 0 of  d(α) = min over x of L(x, α)
Connection between Primal and Dual

Primal problem:  p* = min over x of x²  s.t.  x ≥ b
Dual problem:    d* = max over α ≥ 0 of  min over x of L(x, α)

• Weak duality: the dual solution d* lower bounds the primal solution p*, i.e. d* ≤ p*
  Duality gap = p* – d*

• Strong duality: d* = p* holds for many problems of interest, e.g. if the primal is a feasible convex objective with linear constraints (Slater’s condition)
Connection between Primal and Dual

What does strong duality say about α* (the α that achieves the optimal value of the dual) and x* (the x that achieves the optimal value of the primal problem)?

Whenever strong duality holds, the following conditions (known as the KKT conditions) are true for α* and x*:

• 1. ∇L(x*, α*) = 0, i.e. the gradient of the Lagrangian at x* and α* is zero.
• 2. x* ≥ b, i.e. x* is primal feasible
• 3. α* ≥ 0, i.e. α* is dual feasible
• 4. α*(x* – b) = 0 (called complementary slackness)

We use the first one to relate x* and α*. We use the last one (complementary slackness) to argue that α* = 0 if the constraint is inactive and α* > 0 if the constraint is active and tight.
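As a quick instantiation (a sketch, assuming the running example min over x of x² s.t. x ≥ b), conditions 1 and 4 give:

\nabla_x L(x^*,\alpha^*) = 2x^* - \alpha^* = 0 \;\Rightarrow\; x^* = \frac{\alpha^*}{2},
\qquad
\alpha^*(x^* - b) = 0 \;\Rightarrow\; \alpha^* = 0 \;\text{ or }\; x^* = b.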
Solving the dual

Find the dual: the optimization over x is unconstrained.
Solve: now we need to maximize L(x*, α) over α ≥ 0.
Solve the unconstrained problem to get α′ and then take max(α′, 0):
  ⇒ α′ = 2b
α = 0: constraint is inactive;  α > 0: constraint is active (tight)
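Filling in the algebra behind α′ = 2b (a sketch for the running example min over x of x² s.t. x ≥ b):

\begin{aligned}
L(x,\alpha) &= x^2 - \alpha(x-b), \qquad \alpha \ge 0\\
\frac{\partial L}{\partial x} &= 2x - \alpha = 0 \;\Rightarrow\; x^* = \frac{\alpha}{2}\\
d(\alpha) &= L(x^*,\alpha) = -\frac{\alpha^2}{4} + \alpha b\\
\frac{\partial d}{\partial \alpha} &= -\frac{\alpha}{2} + b = 0 \;\Rightarrow\; \alpha' = 2b,
\qquad \alpha^* = \max(2b,\,0)
\end{aligned}

So for b ≤ 0 the constraint is inactive (α* = 0, x* = 0), while for b > 0 it is active and tight (α* = 2b, x* = b).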
Dual SVM – linearly separable case

n training points, d features: (x1, …, xn), where each xi is a d-dimensional vector

• Primal problem:
  w – weights on features (d-dimensional problem)

• Dual problem (derivation):
  α – weights on training points (n-dimensional problem)
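For concreteness, the standard hard-margin primal being referred to (the slide’s own equation did not survive extraction, so this is the usual textbook form):

\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^2
\quad\text{s.t.}\quad (w\cdot x_j + b)\,y_j \ge 1 \quad \forall j = 1,\dots,n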
Dual SVM – linearly separable case

• Dual problem (derivation):

If we can solve for the αs (dual problem), then we have a solution for w, b (primal problem)
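A sketch of the derivation the slide refers to, assuming the hard-margin primal above:

\begin{aligned}
L(w,b,\alpha) &= \frac{1}{2}\lVert w\rVert^2 - \sum_{j=1}^{n}\alpha_j\big[(w\cdot x_j + b)\,y_j - 1\big], \qquad \alpha_j \ge 0\\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{j=1}^{n}\alpha_j y_j x_j,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{j=1}^{n}\alpha_j y_j = 0
\end{aligned}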
Dual SVM – linearly separable case
• Dual problem:
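Substituting w = Σj αj yj xj back into the Lagrangian gives the standard dual QP (a sketch, since the slide’s equation is not in the extracted text):

\max_{\alpha}\;\; \sum_{j=1}^{n}\alpha_j - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j\alpha_k\,y_j y_k\,(x_j\cdot x_k)
\quad\text{s.t.}\quad \alpha_j \ge 0 \;\;\forall j, \qquad \sum_{j=1}^{n}\alpha_j y_j = 0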
Dual SVM – linearly separable case

Dual problem is also a QP
Solution gives the αjs
What about b?
Dual SVM: Sparsity of dual solution

[Figure: separating hyperplane w.x + b = 0 with its margin; points on the margin are labeled αj > 0, all other points are labeled αj = 0]

Only a few αjs can be non-zero: where the constraint is active and tight,
  (w.xj + b) yj = 1

Support vectors – training points j whose αjs are non-zero
Dual SVM – linearly separable case

Dual problem is also a QP
Solution gives the αjs

Use any one of the support vectors, with αk > 0, to compute b, since the constraint is tight:  (w.xk + b) yk = 1
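A minimal NumPy sketch of this recovery step (illustrative names, not from the slides; assumes a dual solver has already produced alpha for training data X of shape n×d and labels y in {−1, +1}):

import numpy as np

def recover_primal(alpha, X, y, tol=1e-8):
    """Recover (w, b) of the hard-margin SVM from a dual solution alpha."""
    # w is a weighted sum of training points: w = sum_j alpha_j * y_j * x_j
    w = (alpha * y) @ X
    # Any support vector (alpha_k > 0) has a tight constraint: (w.x_k + b) y_k = 1
    k = np.argmax(alpha > tol)
    b = y[k] - w @ X[k]   # since y_k is +1 or -1, (w.x_k + b) y_k = 1  =>  b = y_k - w.x_k
    return w, b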
Dual SVM – non-separable case

• Primal problem (with slack variables {ξj}):
  min over w, b, {ξj} of  ½ ||w||² + C Σj ξj
  s.t.  (w.xj + b) yj ≥ 1 – ξj  and  ξj ≥ 0  for all j

• Dual problem: introduce Lagrange multipliers α, μ (one αj and one μj per training point) and form the Lagrangian L(w, b, ξ, α, μ); derivation: HW3!
Dual SVM – non-separable case

The upper bound on αj in the dual (αj ≤ C) comes from ∂L/∂ξ = 0

Intuition: if C → ∞, we recover the hard-margin SVM

Dual problem is also a QP
Solution gives the αjs
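For reference, the soft-margin dual that this leads to (a sketch of the standard result; deriving it is the HW3 exercise referenced above):

\max_{\alpha}\;\; \sum_{j=1}^{n}\alpha_j - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j\alpha_k\,y_j y_k\,(x_j\cdot x_k)
\quad\text{s.t.}\quad 0 \le \alpha_j \le C \;\;\forall j, \qquad \sum_{j=1}^{n}\alpha_j y_j = 0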
So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal (especially in high dimensions, d >> n)

• But, more importantly, the “kernel trick”!!!
Separable using higher-order features

[Figure: data that is not linearly separable in the original (x1, x2) space becomes linearly separable using higher-order features, e.g. the radius r = √(x1² + x2²) and angle θ, or x1² plotted against x1]
What if data is not linearly separable?

Use features of features of features of features….

  Φ(x) = (x1², x2², x1x2, …, exp(x1))

Feature space becomes really large very quickly!
Higher Order Polynomials

m – number of input features, d – degree of polynomial

The number of terms grows fast!  For d = 6, m = 100: about 1.6 billion terms
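A quick check on that count (assuming it refers to the number of degree-d monomials in m variables):

\binom{m+d-1}{d} = \binom{105}{6} = 1{,}609{,}344{,}100 \approx 1.6\times 10^{9}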
Dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly, as long as we can compute the dot product fast using some kernel K
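Concretely, replacing every dot product in the dual with a kernel evaluation gives (standard kernelized form, stated as a sketch):

\max_{\alpha}\;\; \sum_{j}\alpha_j - \frac{1}{2}\sum_{j}\sum_{k}\alpha_j\alpha_k\,y_j y_k\,K(x_j,x_k),
\qquad K(x_j,x_k) = \Phi(x_j)\cdot\Phi(x_k)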
Dot Product of Polynomials

d = 1
d = 2
general d
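The equations themselves are missing from the extracted text; as a sketch of the standard calculation for two input features (the √2 scaling on the cross term is the usual convention, assumed here):

\begin{aligned}
d=1:\;& \Phi(u) = (u_1, u_2), & \Phi(u)\cdot\Phi(v) &= u_1 v_1 + u_2 v_2 = u\cdot v\\
d=2:\;& \Phi(u) = (u_1^2,\ \sqrt{2}\,u_1 u_2,\ u_2^2), & \Phi(u)\cdot\Phi(v) &= (u\cdot v)^2\\
\text{general } d:\;& & \Phi(u)\cdot\Phi(v) &= (u\cdot v)^d
\end{aligned}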
Finally: The Kernel Trick!

• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
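A small NumPy check of the trick for the degree-2 polynomial features above (illustrative names; assumes the √2-scaled feature map):

import numpy as np

def phi2(x):
    """Explicit degree-2 feature map for a 2-d input (with sqrt(2) cross term)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = phi2(u) @ phi2(v)   # dot product in the expanded feature space
kernel   = (u @ v) ** 2        # same value, computed without ever building phi

print(explicit, kernel)        # equal up to floating-point rounding: u.v = 1, so (u.v)^2 = 1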
Common Kernels
• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian/Radial kernels (polynomials of all orders – recall the series expansion of exp)
• Sigmoid
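The standard forms of these kernels (a sketch; the parameter names σ, η, ν are assumptions, not from the slides):

\begin{aligned}
\text{Polynomial of degree } d:\quad & K(u,v) = (u\cdot v)^d\\
\text{Polynomial of degree up to } d:\quad & K(u,v) = (1 + u\cdot v)^d\\
\text{Gaussian / RBF:}\quad & K(u,v) = \exp\!\left(-\frac{\lVert u - v\rVert^2}{2\sigma^2}\right)\\
\text{Sigmoid:}\quad & K(u,v) = \tanh(\eta\, u\cdot v + \nu)
\end{aligned}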
Mercer Kernels

What functions are valid kernels that correspond to feature vectors φ(x)?

Answer: Mercer kernels K
• K is continuous
• K is symmetric
• K is positive semi-definite, i.e. the Gram matrix K satisfies zᵀKz ≥ 0 for all z

This ensures the dual optimization is a concave maximization
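A quick numerical way to check the positive semi-definiteness condition on data (a sketch; builds the Gram matrix for an assumed RBF kernel and inspects its eigenvalues):

import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.randn(50, 3)          # 50 points in 3 dimensions
K = rbf_gram(X)
eigvals = np.linalg.eigvalsh(K)     # symmetric matrix -> real eigenvalues
print(eigvals.min() >= -1e-10)      # PSD: all eigenvalues are (numerically) non-negative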
Overfitting

• Huge feature space with kernels, what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for a simple hypothesis with large margin
  – Often robust to overfitting
What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with Kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors αi
• At classification time, compute the decision value below and classify by its sign:
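In kernel form (the standard expression, stated here as a sketch):

w\cdot\Phi(x) + b = \sum_{i}\alpha_i\, y_i\, K(x_i, x) + b,
\qquad
\hat{y} = \operatorname{sign}\!\Big(\sum_{i}\alpha_i\, y_i\, K(x_i, x) + b\Big)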
SVMs with Kernels
• Iris dataset, class 2 vs classes {1, 3}, Linear Kernel
SVMs with Kernels
• Iris dataset, class 1 vs classes {2, 3}, Polynomial Kernel of degree 2
SVMs with Kernels
• Iris dataset, class 1 vs classes {2, 3}, Gaussian RBF kernel
SVMs with Kernels
• Iris dataset, class 1 vs classes {2, 3}, Gaussian RBF kernel
SVMs with Kernels
• Chessboard dataset, Gaussian RBF kernel
SVMs with Kernels
• Chessboard dataset, Polynomial kernel
USPS Handwritten digits
SVMs vs. Logistic Regression

                SVMs         Logistic Regression
Loss function   Hinge loss   Log-loss
SVMs vs. Logistic Regression

SVM: hinge loss
Logistic Regression: log loss (negative log conditional likelihood)

[Figure: 0-1 loss, hinge loss, and log loss plotted against the classifier margin; x-axis ticks at −1, 0, 1]
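Written out in terms of the margin m = (w.x + b) y (standard definitions, stated as a sketch):

\ell_{0/1}(m) = \mathbf{1}[\,m \le 0\,],
\qquad
\ell_{\mathrm{hinge}}(m) = \max(0,\,1-m),
\qquad
\ell_{\mathrm{log}}(m) = \log\!\big(1 + e^{-m}\big)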
SVMs vs. Logistic Regression

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       “Margin”      Real probabilities
Kernels in Logistic Regression

• Define the weights in terms of the features:
• Derive a simple gradient descent rule on the αi
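The missing equation is presumably the kernelized expansion of the weights (an assumption, since only the slide text survived extraction):

w = \sum_{j}\alpha_j\,\Phi(x_j)
\quad\Rightarrow\quad
w\cdot\Phi(x) + b = \sum_{j}\alpha_j\,K(x_j, x) + b

Gradient descent on the log loss can then be carried out directly on the αj, so predictions only ever need kernel evaluations.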
What you need to know
• Maximizing margin
• Derivation of SVM formulation
• Slack variables and hinge loss
• Relationship between SVMs and logistic regression
– 0/1 loss
– Hinge loss
– Log loss
• Tackling multiple classes
– One against All
– Multiclass SVMs
• Dual SVM formulation
– Easier to solve when the dimension is high (d > n)
– Kernel Trick