SVMC
An introduction to Support Vector Machines Classification
6.783, Biomedical Decision Support
Lorenzo Rosasco
([email protected])
Department of Brain and Cognitive Science
MIT
Friday, October 30, 2009
A typical problem
We have a cohort of patients from two groups, say A and B.
We wish to devise a classification rule to
distinguish patients of one group from
patients of the other group.
Learning and Generalization
Goal: correctly classify new patients.
Plan
1. Linear SVM
2. Non Linear SVM: Kernels
3. Tuning SVM
4. Beyond SVM: Regularization Networks
Learning from Data
To make predictions we need information about the patients:
patient 1: $x = (x_1, \dots, x_n)$
patient 2: $x = (x_1, \dots, x_n)$
...
patient $\ell$: $x = (x_1, \dots, x_n)$
Linear model
Patients of class A are labeled y=1
Patients of class B are labeled y=-1
Linear model:
$$ w \cdot x = \sum_{j=1}^{n} w_j x_j $$
Classification rule:
$$ \mathrm{sign}(w \cdot x) $$
1D Case
[Figure: a 1D example; points with $y=1$ lie where $w \cdot x > 0$, points with $y=-1$ where $w \cdot x < 0$, and the decision boundary is $w \cdot x = 0$.]
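To make the rule concrete, here is a minimal numpy sketch of the classification rule (the weights and patient vectors are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical weight vector and patients (n = 3 features each).
w = np.array([0.5, -1.2, 0.3])
patients = np.array([
    [1.0, 0.2, 0.7],   # w.x = 0.47 > 0, classified as y = +1
    [0.1, 1.5, -0.4],  # w.x = -1.87 < 0, classified as y = -1
])

# Classification rule: sign(w . x) for each patient.
predictions = np.sign(patients @ w)
print(predictions)  # [ 1. -1.]
```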
How do we find a good solution?
[Figure: a 2D classification problem; each point is $x = (x_1, x_2)$, labeled $y=1$ or $y=-1$.]
[Figure: a sequence of candidate separating hyperplanes $w \cdot x = 0$, each with $w \cdot x > 0$ on one side and $w \cdot x < 0$ on the other; many different hyperplanes separate the training data equally well.]
[Figure: the maximum margin hyperplane; the margin M is shown between the two classes.]
The margin M measures the distance between the two closest points of the two classes.
Maximum Margin Hyperplane
...with little effort one can show that maximizing the margin M is equivalent to maximizing
$$ \frac{1}{\|w\|} $$

Linear and Separable SVM
$$ \min_{w \in \mathbb{R}^n} \|w\|^2 \quad \text{subject to: } y_i (w \cdot x_i) \ge 1, \; i = 1, \dots, \ell $$

Bias and Slack
The SVM introduced by Vapnik includes an unregularized bias term b, leading to classification via a function of the form:
$$ f(x) = \mathrm{sign}(w \cdot x + b) $$
Typically the off-set term is added to the solution. In practice, we also want to work with datasets that are not linearly separable.
A more general Algorithm

There are two things we would like to improve:
1. Allow for errors
2. Non Linear Models

Measuring errors
[Figure: slack variables $\xi_i$ measure how much each training point violates the margin.]
The New Primal: Linear SVM with Slack Variables
With slack variables, the primal SVM problem becomes:
$$ \min_{w \in \mathbb{R}^n,\; \xi \in \mathbb{R}^\ell,\; b \in \mathbb{R}} \; C \sum_{i=1}^{\ell} \xi_i + \frac{1}{2}\|w\|^2 $$
$$ \text{subject to: } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, \ell $$
$$ \qquad\qquad \xi_i \ge 0, \quad i = 1, \dots, \ell $$
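As an illustration, here is a minimal sketch of this primal problem using the cvxpy solver, a tool choice of mine (the toy data are hypothetical):

```python
import cvxpy as cp
import numpy as np

# Hypothetical toy data: X is (l, n), labels y in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l, n = X.shape
C = 1.0  # regularization parameter

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(l)  # slack variables

# Primal objective: C * sum(xi) + (1/2) ||w||^2
objective = cp.Minimize(C * cp.sum(xi) + 0.5 * cp.sum_squares(w))
# Constraints: y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```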
Optimization
How do we solve this minimization problem?
(...and why do we call it SVM anyway?)
Some facts
Representer Theorem
Dual Formulation
Box Constraints and Support Vectors
Representer Theorem
The solution to the minimization problem
can be written as
$$ w \cdot x = \sum_{i=1}^{\ell} c_i (x \cdot x_i) $$

Dual Problem

In terms of the coefficients c, the primal problem becomes:
$$ \min_{c \in \mathbb{R}^\ell,\; b \in \mathbb{R},\; \xi \in \mathbb{R}^\ell} \; C \sum_{i=1}^{\ell} \xi_i + \frac{1}{2} c^T K c $$
$$ \text{subject to: } y_i \Big( \sum_{j=1}^{\ell} c_j K(x_i, x_j) + b \Big) \ge 1 - \xi_i, \quad i = 1, \dots, \ell $$
$$ \qquad\qquad \xi_i \ge 0, \quad i = 1, \dots, \ell $$

The coefficients can be found by solving the dual:
$$ \max_{\alpha \in \mathbb{R}^\ell} \; \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \alpha^T Q \alpha $$
$$ \text{subject to: } \sum_{i=1}^{\ell} y_i \alpha_i = 0 $$
$$ \qquad\qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, \ell $$

Here $Q_{ij} = y_i y_j (x_i \cdot x_j)$ and $\alpha_i = c_i / y_i$.
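Correspondingly, a sketch of the dual as a quadratic program in $\alpha$ (same hypothetical toy data as above; the small jitter on Q is a numerical convenience for the solver, not part of the formulation):

```python
import cvxpy as cp
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l, C = len(y), 1.0

# Q_ij = y_i y_j (x_i . x_j); the jitter keeps Q numerically PSD.
Q = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(l)

alpha = cp.Variable(l)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
constraints = [cp.sum(cp.multiply(y, alpha)) == 0, alpha >= 0, alpha <= C]
cp.Problem(objective, constraints).solve()

print(alpha.value)  # alpha_i > 0 only for the support vectors
```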
Optimality Conditions: Determining b

With little effort one can show the following. Suppose we have the optimal $\alpha_i$'s. Also suppose (this happens in practice) that there exists an $i$ satisfying $0 < \alpha_i < C$. Then:
$$ \alpha_i < C \implies \xi_i = 0 $$
$$ \alpha_i > 0 \implies y_i \Big( \sum_{j=1}^{\ell} y_j \alpha_j K(x_i, x_j) + b \Big) - 1 + \xi_i = 0 $$
$$ \implies b = y_i - \sum_{j=1}^{\ell} y_j \alpha_j K(x_i, x_j) $$

The solution is sparse: training points with $\alpha_i = 0$ do not contribute to the solution.
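A sketch of how one might recover b and the support vectors from a dual solution (the function name and tolerance are my own illustration):

```python
import numpy as np

def offset_and_support_vectors(alpha, y, K_mat, C, tol=1e-6):
    """Recover b and the support-vector indices from a dual solution alpha."""
    sv = np.where(alpha > tol)[0]  # points with alpha_i > 0 are support vectors
    # Any i with 0 < alpha_i < C sits exactly on the margin, so y_i f(x_i) = 1.
    i = np.where((alpha > tol) & (alpha < C - tol))[0][0]
    b = y[i] - np.sum(y * alpha * K_mat[:, i])
    return b, sv
```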
Sparse Solution
Note that the solution depends only on the training set points (no dependence on the number of features!).
Feature Map
$$ \Phi : X \to F $$
Hyperplanes in the feature space,
$$ f(x) = w \cdot \Phi(x) = \langle w, \Phi(x) \rangle, $$
are non linear functions in the original space.
A Key Observation

$$ \min_{c \in \mathbb{R}^\ell,\; b \in \mathbb{R},\; \xi \in \mathbb{R}^\ell} \; C \sum_{i=1}^{\ell} \xi_i + \frac{1}{2} c^T K c $$
$$ \text{subject to: } y_i \Big( \sum_{j=1}^{\ell} c_j K(x_i, x_j) + b \Big) \ge 1 - \xi_i, \quad i = 1, \dots, \ell $$
$$ \qquad\qquad \xi_i \ge 0, \quad i = 1, \dots, \ell $$

$$ \max_{\alpha \in \mathbb{R}^\ell} \; \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \alpha^T Q \alpha $$
$$ \text{subject to: } \sum_{i=1}^{\ell} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, \ell $$

The solution depends only on $Q_{ij} = y_i y_j (x_i \cdot x_j)$.

Idea: use $Q_{ij} = y_i y_j (\Phi(x_i) \cdot \Phi(x_j))$.
Kernels and Feature Maps

The crucial quantity is the inner product
$$ K(x, t) = \Phi(x) \cdot \Phi(t), $$
called the Kernel.

A function is called a Kernel if it is:
1. symmetric
2. positive definite
Examples of Kernels

Very common examples of symmetric pd kernels are:

Linear kernel:
$$ K(x, x') = x \cdot x' $$
Gaussian kernel:
$$ K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}, \quad \sigma > 0 $$
Polynomial kernel:
$$ K(x, x') = (x \cdot x' + 1)^d, \quad d \in \mathbb{N} $$

For specific applications, designing an effective kernel is a challenging problem.
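A minimal numpy sketch of these three kernels (the function names are mine):

```python
import numpy as np

def linear_kernel(x, t):
    return np.dot(x, t)

def gaussian_kernel(x, t, sigma=1.0):
    return np.exp(-np.linalg.norm(x - t) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x, t, d=2):
    return (np.dot(x, t) + 1) ** d
```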
Non Linear SVM

Summing up:
1. Define the Feature Map either explicitly or via a kernel
2. Find a linear solution in the Feature space
3. Use the same solver as in the linear case

The representer theorem now gives:
$$ w \cdot \Phi(x) = \sum_{i=1}^{\ell} c_i (\Phi(x) \cdot \Phi(x_i)) = \sum_{i=1}^{\ell} c_i K(x, x_i) $$
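In code, the learned function is evaluated through this kernel expansion; a sketch with hypothetical coefficients c and bias b (any kernel function, such as the ones above, can be plugged in):

```python
import numpy as np

def predict(x_new, X_train, c, b, kernel):
    """Evaluate f(x) = sum_i c_i K(x, x_i) + b and classify by its sign."""
    fx = sum(c_i * kernel(x_new, x_i) for c_i, x_i in zip(c, X_train)) + b
    return np.sign(fx)
```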
Example in 1D
[Figure: a 1D example where a non linear function separates the classes $y=1$ and $y=-1$.]
Software
Good Large-Scale Solvers
SVM Light: http://svmlight.joachims.org
SVM Torch: http://www.torch.ch
libSVM:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Model Selection
We have to fix the regularization parameter C.
We have to choose the kernel (and its parameters).

Using default values is usually a BAD, BAD idea.
Regularization Parameter
With slack variables, the primal SVM problem becomes:
$$ \min_{w \in \mathbb{R}^n,\; \xi \in \mathbb{R}^\ell,\; b \in \mathbb{R}} \; C \sum_{i=1}^{\ell} \xi_i + \frac{1}{2}\|w\|^2 $$
$$ \text{subject to: } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, \ell $$

Large C: we try to minimize errors, ignoring the complexity of the solution.
Small C: we ignore the errors to obtain a simple solution.
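One way to see this trade-off empirically; a sketch using scikit-learn (my tool choice, on synthetic data):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))  # noisy linear labels

for C in [0.01, 0.1, 1, 10, 100]:
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C={C:>6}: CV accuracy = {scores.mean():.3f}")
```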
Which Kernel?

For very high dimensional data the linear kernel is often the default choice:
1. it allows computational speed ups
2. it is less prone to overfitting

The Gaussian kernel with proper tuning is another common choice.

Whenever possible, use prior knowledge to build problem specific features or kernels.
2D demo
Large and Small Margin Hyperplanes
[Figure: two panels, (a) and (b), comparing a large margin and a small margin hyperplane.]
Practical Rules
We can choose C (and the kernel parameters) via cross validation:
1. Holdout set: split the data into a Training Set and a Validation Set
2. K-fold cross validation; K = # of examples is called Leave One Out
K-Fold CV
We have to compute several solutions...
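For instance, a cross-validated grid search over C and the Gaussian kernel parameter; a scikit-learn sketch on synthetic data (the grid values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))

# gamma plays the role of 1/(2 sigma^2) in scikit-learn's RBF kernel.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```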
A Rule of Thumb

This is how the CV error typically looks:
[Figure: simulation curves of log10 GCKL, log10 GACV, log10 BR MISCLASS, and log10 BRXA as functions of the tuning parameters; the CV-type error has a minimum over the parameter range.]

Fix a reasonable kernel, then fine tune C.
Which values do we start from?
For the Gaussian kernel, pick $\sigma$ of the order of the average distance between points.

Take min (and max) C as the value for which the training set error does not increase (decrease) anymore.
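A sketch of the $\sigma$ heuristic, reading "average distance" as the mean pairwise distance (the gamma conversion follows scikit-learn's RBF convention):

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(100, 2))  # hypothetical training inputs

# sigma of the order of the average pairwise distance between points
sigma = pdist(X).mean()
gamma = 1.0 / (2 * sigma ** 2)  # equivalent RBF parameter in scikit-learn
print(sigma, gamma)
```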
Computational Considerations
The training time depends on the parameters: the more we fit, the slower the algorithm.

Typically the computational burden is in the selection of the regularization parameter (solvers for the regularization path help here).
Regularization Networks
SVMs are an example of a family of algorithms of the form:
$$ C \sum_{i=1}^{\ell} V(y_i, w \cdot \Phi(x_i)) + \|w\|^2 $$
V is called the loss function.
Hinge Loss
[Figure: the 0-1 loss and the hinge loss $V = \max(0,\, 1 - y\, w \cdot \Phi(x))$, plotted as functions of $y\, w \cdot \Phi(x)$. SVM corresponds to the hinge choice of V.]
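A one-line numpy version of the hinge loss, for reference (the max(0, 1 − yf) form; the slide only shows the plot):

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss max(0, 1 - y*f(x)); compare with the 0-1 loss (y*fx < 0)."""
    return np.maximum(0.0, 1.0 - y * fx)
```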
Loss Functions
[Figure: plots of several common loss functions.]
Representer Theorem
For a LARGE class of loss functions:
$$ w \cdot \Phi(x) = \sum_{i=1}^{\ell} c_i (\Phi(x) \cdot \Phi(x_i)) = \sum_{i=1}^{\ell} c_i K(x, x_i) $$
The way we compute the coefficients depends on the considered loss function.
Regularized LS
The simplest, yet powerful, algorithm is probably RLS.

Square loss:
$$ V(y, w \cdot \Phi(x)) = (y - w \cdot \Phi(x))^2 $$

Algorithm:
$$ \Big( Q + \frac{1}{C} I \Big) c = y, \qquad Q_{i,j} = K(x_i, x_j) $$

Leave one out can be computed at the price of one (!!!) solution.
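A numpy sketch of RLS together with the leave-one-out shortcut (I take the slide to allude to the standard identity $y_i - f^{(-i)}(x_i) = (y_i - f(x_i))/(1 - G_{ii})$ with $G = Q(Q + I/C)^{-1}$):

```python
import numpy as np

def rls_fit_loo(Q, y, C):
    """Solve (Q + I/C) c = y; also return leave-one-out residuals in one shot."""
    A = Q + np.eye(len(y)) / C
    c = np.linalg.solve(A, y)
    G = Q @ np.linalg.inv(A)                 # maps y to the training predictions Qc
    loo = (y - Q @ c) / (1.0 - np.diag(G))   # LOO residuals without refitting
    return c, loo
```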
Summary
Separable, Linear SVM
Non Separable, Linear SVM
Non Separable, Non Linear SVM
How to use SVM