
Lecture Slides for
INTRODUCTION TO Machine Learning

ETHEM ALPAYDIN
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

CHAPTER 10: Linear Discrimination
Likelihood- vs. Discriminant-based Classification

- Likelihood-based: Assume a model for p(x|Ci), use Bayes' rule to calculate P(Ci|x).
  Choose Ci if gi(x) = log P(Ci|x) is maximum.

- Discriminant-based: Assume a model for the discriminant gi(x|Φi); no density estimation.
  - Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.

Linear Discriminant

- Linear discriminant:

  g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = Σ_{j=1..d} w_{ij} x_j + w_{i0}

- Advantages:
  - Simple: O(d) space/computation
  - Knowledge extraction: weighted sum of attributes; positive/negative weights and their magnitudes are interpretable (e.g., credit scoring)
  - Optimal when the p(x|Ci) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable

Generalized Linear Model

- Quadratic discriminant:

  g_i(x | W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}

- Instead of a higher-complexity model, we can still use a linear classifier if we use higher-order (product) terms.

- Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:

  z_1 = x_1,  z_2 = x_2,  z_3 = x_1^2,  z_4 = x_2^2,  z_5 = x_1 x_2

- The linear function defined in z-space corresponds to a non-linear function in x-space:

  g_i(x) = Σ_{j=1..k} w_{ij} φ_j(x)
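
To make this concrete, here is a minimal sketch (not part of the slides), assuming NumPy, that maps a 2-D input to the five basis functions above and evaluates a linear discriminant in z-space; the weights `w` and `w0` are arbitrary placeholders.

```python
import numpy as np

def phi(x):
    """Map x = (x1, x2) to z-space: z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# Hypothetical weights of a linear discriminant in z-space.
w = np.array([1.0, -2.0, 0.5, 0.5, 1.0])
w0 = -1.0

def g(x):
    """Linear in z-space, hence quadratic (non-linear) in x-space."""
    return w @ phi(x) + w0

print(g(np.array([1.0, 2.0])))   # a single discriminant value
```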
Two Classes

choose C1 if g1(x) > g2(x), C2 if g2(x) > g1(x). Define:

g(x) = g1(x) − g2(x)
     = (w_1^T x + w_{10}) − (w_2^T x + w_{20})
     = (w_1 − w_2)^T x + (w_{10} − w_{20})
     = w^T x + w0

choose C1 if g(x) > 0, C2 otherwise

Learning the Discriminants

As we have seen before, when p(x | Ci) ~ N(μ_i, Σ), the optimal discriminant is a linear one:

g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}

w_i = Σ^{-1} μ_i,   w_{i0} = −(1/2) μ_i^T Σ^{-1} μ_i + log P(Ci)

So, estimate μ_i and Σ from data, and plug them into the g_i's to find the linear discriminant functions.

Of course, any other way of learning can be used (e.g., perceptron, gradient descent, logistic regression, ...).
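
A minimal sketch of this plug-in estimation, assuming NumPy and a made-up two-class dataset: it estimates the class means and a shared (pooled) covariance, then forms w_i and w_{i0} exactly as above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two Gaussian classes sharing the same covariance.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])

# Shared (pooled) maximum-likelihood covariance estimate.
Sigma = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c) for c in classes) / len(y)
Sigma_inv = np.linalg.inv(Sigma)

# w_i = Sigma^{-1} mu_i,  w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + log P(C_i)
W = means @ Sigma_inv                                    # one row per class
w0 = -0.5 * np.sum(means @ Sigma_inv * means, axis=1) + np.log(priors)

g = X @ W.T + w0                  # g_i(x) for every x and every class i
pred = classes[np.argmax(g, axis=1)]
print("training accuracy:", np.mean(pred == y))
```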
n When K > 2
¨ Combine K two-class problems, each one separating one
class from all other classes

Multiple Classes

g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}

How to train? How to decide on a test?

Choose Ci if g_i(x) = max_{j=1..K} g_j(x)

Why? Any problem?

(Figure: convex decision regions based on the g_i's, indicated with blue; the distance of x to boundary i is |g_i(x)| / ||w_i||.)

Assumes that the classes are linearly separable; a reject option may be used otherwise.
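
As a small illustration of this test-time rule (not from the slides), a sketch assuming NumPy, with hypothetical weights `W` and `w0` for K = 3 classes in d = 2 dimensions:

```python
import numpy as np

# Hypothetical parameters of K = 3 linear discriminants,
# e.g. obtained by training K one-vs-rest classifiers.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])          # row i holds w_i
w0 = np.array([0.0, 0.5, -0.5])       # w_{i0}

def predict(x):
    g = W @ x + w0                    # g_i(x) = w_i^T x + w_{i0}
    return int(np.argmax(g))          # choose C_i with the maximum g_i(x)

print(predict(np.array([2.0, 1.0])))
```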
Pairwise Separation

If the classes are not linearly separable (one from all the rest), pairwise discriminants can be used:

g_ij(x | w_ij, w_{ij0}) = w_ij^T x + w_{ij0}

g_ij(x)  > 0          if x ∈ Ci
         ≤ 0          if x ∈ Cj
         don't care   otherwise

choose Ci if ∀j ≠ i, g_ij(x) > 0

Uses K(K−1)/2 linear discriminants.
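
A sketch of this pairwise rule, assuming NumPy; the stored discriminants are made up for illustration, and a query that satisfies no class is treated as a reject.

```python
import numpy as np

# Hypothetical pairwise discriminants g_ij(x) = w_ij^T x + w_ij0 for K = 3 classes,
# stored only for i < j; g_ji(x) = -g_ij(x).
pairwise = {
    (0, 1): (np.array([ 1.0, -1.0]), 0.0),
    (0, 2): (np.array([ 1.0,  1.0]), -1.0),
    (1, 2): (np.array([ 0.0,  1.0]), 0.5),
}

def g(i, j, x):
    if (i, j) in pairwise:
        w, w0 = pairwise[(i, j)]
        return w @ x + w0
    w, w0 = pairwise[(j, i)]
    return -(w @ x + w0)

def predict(x, K=3):
    for i in range(K):
        if all(g(i, j, x) > 0 for j in range(K) if j != i):
            return i
    return None   # no class satisfies the rule: treat as "reject"

print(predict(np.array([2.0, 0.0])))
```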

A Bit of Geometry

- Dot product and projection:

  ⟨w, p⟩ = w^T p = ||w|| ||p|| cos θ

- Projection of p onto w:

  ||p|| cos θ = w^T p / ||w||

Geometry

The points x on the separating hyperplane have g(x) = w^T x + w0 = 0. Hence, for the points on the boundary, w^T x = −w0.

Thus, these points all have the same projection onto the weight vector w, namely w^T x / ||w|| (by the definition of projection and the dot product). But this equals −w0 / ||w||. Hence:

- The perpendicular distance of the boundary to the origin is |w0| / ||w||.
- The distance of any point x to the decision boundary is |g(x)| / ||w||.
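
A quick numeric check of these two facts (not from the slides), assuming NumPy; the values of w, w0 and x are arbitrary.

```python
import numpy as np

w = np.array([3.0, 4.0])    # arbitrary weight vector, ||w|| = 5
w0 = -10.0

def g(x):
    return w @ x + w0

# Perpendicular distance of the boundary to the origin: |w0| / ||w||
print(abs(w0) / np.linalg.norm(w))            # 2.0

# Distance of an arbitrary point x to the decision boundary: |g(x)| / ||w||
x = np.array([5.0, 5.0])
print(abs(g(x)) / np.linalg.norm(w))          # |15 + 20 - 10| / 5 = 5.0
```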

Support Vector Machines

- Vapnik and Chervonenkis – 1963
- Boser, Guyon and Vapnik – 1992 (kernel trick)
- Cortes and Vapnik – 1995 (soft margin)

- The SVM is a machine learning algorithm which
  - solves classification problems
  - uses a flexible representation of the class boundaries
  - implements automatic complexity control to reduce overfitting
  - has a single global minimum which can be found in polynomial time

- It is popular because
  - it can be easy to use
  - it often has good generalization performance
  - the same algorithm solves a variety of problems with little tuning
SVM Concepts

- Convex programming and duality
- Using maximum margin to control complexity
- Representing non-linear boundaries with feature expansion
- The kernel trick for efficient optimization

Linear Separators

- Which of the linear separators is optimal?

Classification Margin

- The distance from an example x_i to the separator is  r = (w^T x_i + b) / ||w||.
- Examples closest to the hyperplane are support vectors.
- The margin ρ of the separator is the distance between the support vectors from the two classes.

Maximum Margin Classification

- Maximizing the margin is good according to intuition.
- It implies that only the support vectors matter; the other training examples are ignorable.

SVM as a 2-class Linear Classifier
(Cortes and Vapnik, 1995; Vapnik, 1995)

X = {x^t, r^t},  where  r^t = +1 if x^t ∈ C1  and  r^t = −1 if x^t ∈ C2

Find w and w0 such that

w^T x^t + w0 ≥ +1  for r^t = +1
w^T x^t + w0 ≤ −1  for r^t = −1

Note the condition ≥ 1 (not just > 0). We can always achieve this if the classes are linearly separable, by rescaling w and w0 without affecting the separating hyperplane w^T x + w0 = 0.

Optimal separating hyperplane: the separating hyperplane that maximizes the margin.

Optimal Separating Hyperplane

Must satisfy:

w^T x^t + w0 ≥ +1  for r^t = +1
w^T x^t + w0 ≤ −1  for r^t = −1

which can be rewritten as

r^t (w^T x^t + w0) ≥ +1

(Cortes and Vapnik, 1995; Vapnik, 1995)

Maximizing the Margin

The distance from the discriminant to the closest instances on either side is called the margin.

In general this relationship holds (from the geometry above):  d = |g(x)| / ||w||

So, for the support vectors, where |g(x)| = 1, we have

d = |+1| / ||w|| = |−1| / ||w||,  and the margin is  ρ = 2d = 2 / ||w||

To maximize the margin, minimize the Euclidean norm of the weight vector w.

Maximizing the Margin - Alternate Explanation

- The distance from the discriminant to the closest instances on either side is called the margin.

- The distance of x^t to the hyperplane is  |w^T x^t + w0| / ||w||.

- We require that this distance is at least some value ρ > 0:

  r^t (w^T x^t + w0) / ||w|| ≥ ρ,  ∀t

- We would like to maximize ρ, but we can do so in infinitely many ways by scaling w.

- For a unique solution, we fix ρ ||w|| = 1 and minimize ||w||.

min (1/2) ||w||^2  subject to  r^t (w^T x^t + w0) ≥ +1, ∀t

Turn this into an unconstrained problem using Lagrange multipliers α^t (non-negative numbers):

L_p = (1/2) ||w||^2 − Σ_{t=1..N} α^t [ r^t (w^T x^t + w0) − 1 ]

The solution, if it exists, is always at a saddle point of the Lagrangian:
L_p should be minimized w.r.t. w and w0 and maximized w.r.t. the α^t.
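
As an illustration only (the slides proceed with the dual next), here is a sketch that feeds this constrained primal to a general-purpose solver, assuming SciPy's SLSQP method and a made-up separable dataset; the variable vector packs (w, w0).

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy data with labels r^t in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]
    return 0.5 * w @ w                      # (1/2) ||w||^2

def constraint(theta):
    w, w0 = theta[:2], theta[2]
    return r * (X @ w + w0) - 1.0           # must be >= 0 elementwise

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
w, w0 = res.x[:2], res.x[2]
print("w =", w, " w0 =", w0, " margin =", 2 / np.linalg.norm(w))
```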

min (1/2) ||w||^2  subject to  r^t (w^T x^t + w0) ≥ +1, ∀t

L_p = (1/2) ||w||^2 − Σ_{t=1..N} α^t [ r^t (w^T x^t + w0) − 1 ]
    = (1/2) ||w||^2 − Σ_{t=1..N} α^t r^t (w^T x^t + w0) + Σ_{t=1..N} α^t

∂L_p/∂w = 0   ⇒   w = Σ_t α^t r^t x^t
∂L_p/∂w0 = 0  ⇒   Σ_t α^t r^t = 0

This convex quadratic optimization problem can be solved using the dual form, where we substitute these stationarity constraints and maximize w.r.t. the α^t.

from: http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html

L_p = (1/2) ||w||^2 − Σ_{t=1..N} α^t [ r^t (w^T x^t + w0) − 1 ]
    = (1/2) ||w||^2 − Σ_t α^t r^t (w^T x^t + w0) + Σ_t α^t

∂L_p/∂w = 0   ⇒   w = Σ_t α^t r^t x^t
∂L_p/∂w0 = 0  ⇒   Σ_t α^t r^t = 0

Substituting these into L_p gives the dual:

L_d = (1/2) w^T w − w^T Σ_t α^t r^t x^t − w0 Σ_t α^t r^t + Σ_t α^t
    = −(1/2) w^T w + Σ_t α^t
    = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

subject to  Σ_t α^t r^t = 0  and  α^t ≥ 0, ∀t

- Maximize L_d with respect to the α^t only.
- This is a quadratic programming problem.
- Thanks to the convexity of the problem, the optimal value of L_p equals that of L_d.
n To every convex program corresponds a dual

n Solving original (primal) is equivalent to solving dual

Support vectors are those x^t for which

r^t (w^T x^t + w0) = 1

L_d = (1/2) w^T w − w^T Σ_t α^t r^t x^t − w0 Σ_t α^t r^t + Σ_t α^t
    = −(1/2) w^T w + Σ_t α^t
    = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

- The size of the dual depends on N, not on d.
- Maximize L_d with respect to the α^t only, subject to  Σ_t α^t r^t = 0  and  α^t ≥ 0, ∀t.
- This is a quadratic programming problem.
- Thanks to the convexity of the problem, the optimal value of L_p equals that of L_d.

Calculating the Parameters w and w0

Note that, at the solution, for each instance:
- either the constraint is satisfied with equality (= 1), and then α^t can be non-zero,
- or the constraint is satisfied with strict inequality (> 1), and then α^t must be zero.

L_p = (1/2) ||w||^2 − Σ_{t=1..N} α^t [ r^t (w^T x^t + w0) − 1 ]

- Once we solve for the α^t, we see that most of them are 0 and only a small number have α^t > 0; the corresponding x^t are called the support vectors.

Calculating the Parameters w and w0

Once we have the Lagrange multipliers, we can compute w and w0:

w = Σ_{t=1..N} α^t r^t x^t = Σ_{t ∈ SV} α^t r^t x^t

where SV is the set of support vectors.

w0 = r^t − w^T x^t   (for any support vector x^t)
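
A sketch of these two formulas, assuming NumPy and that the multipliers `alpha` have already been obtained from some QP solver (the values below are made up); w0 is averaged over the support vectors, a common trick for numerical stability.

```python
import numpy as np

def recover_w_w0(alpha, X, r, tol=1e-8):
    """Apply w = sum_t alpha^t r^t x^t and w0 = r^t - w^T x^t (averaged over SVs)."""
    w = (alpha * r) @ X                       # sum over all t (non-SVs contribute 0)
    sv = alpha > tol                          # indices of the support vectors
    w0 = np.mean(r[sv] - X[sv] @ w)           # average over SVs for stability
    return w, w0, np.where(sv)[0]

# Hypothetical output of a QP solver on a 4-point toy problem.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.16, 0.0, 0.16, 0.0])      # made-up multipliers for illustration

w, w0, sv_idx = recover_w_w0(alpha, X, r)
print("w =", w, " w0 =", w0, " support vectors:", sv_idx)
```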

n We make decisions by comparing each query x with only
the support vectors

N
y = sign(wT x + w0 ) = ( ∑ α t r t x t )x + w0
t∈SV

n Choose class C1 if +, C2 if negative

Not-Linearly Separable Case

- In the non-separable case, the previous approach cannot find a feasible solution:
  - the objective function (L_d) grows arbitrarily large.

- Relax the constraints, but only when necessary, and introduce a further cost for doing so.

Soft Margin Hyperplane

- If the data are not linearly separable, introduce slack variables ξ^t:

  r^t (w^T x^t + w0) ≥ 1 − ξ^t,   ξ^t ≥ 0

Three cases (shown in the figure):
- Case 1: ξ^t = 0 (on or outside the margin, correctly classified)
- Case 2: ξ^t ≥ 1 (misclassified)
- Case 3: 0 < ξ^t < 1 (inside the margin but still correctly classified)
Soft Margin Hyperplane

- Define the soft error Σ_t ξ^t, an upper bound on the number of training errors.

- The new primal is

  L_p = (1/2) ||w||^2 + C Σ_t ξ^t − Σ_t α^t [ r^t (w^T x^t + w0) − 1 + ξ^t ] − Σ_t μ^t ξ^t

  where the μ^t are Lagrange multipliers that enforce the positivity of the ξ^t.

- The parameter C can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin and fitting the training data.

Soft Margin Hyperplane

- The new dual is the same as the old one,

  L_d = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

  subject to

  Σ_t α^t r^t = 0  and  0 ≤ α^t ≤ C, ∀t

- As in the separable case, instances that are not support vectors vanish with α^t = 0, and the remaining instances define the boundary.
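
To see the effect of C in practice, a sketch assuming scikit-learn's SVC (a tooling choice, not something the slides prescribe) on a made-up, non-separable dataset; smaller C gives a wider margin and more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:<7} margin={2 / np.linalg.norm(w):.3f} "
          f"#SV={clf.n_support_.sum()} train acc={clf.score(X, y):.3f}")
```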

Kernel Functions in SVM

- We can handle the overfitting problem: even with lots of parameters, large margins yield simple classifiers.
- "All" that is left is efficiency.
- Solution: the kernel trick.

Kernel Functions

- Instead of trying to fit a non-linear model, we can
  - map the problem to a new space through a non-linear transformation, and
  - use a linear model in the new space.

- Say we have the new space calculated by the basis functions z = φ(x), where z_j = φ_j(x), j = 1, ..., k:

  φ(x) = [φ_1(x) φ_2(x) ... φ_k(x)]

  mapping the d-dimensional x-space to the k-dimensional z-space.

Kernel Functions

g(x) = Σ_k w_k φ_k(x) + b
     = Σ_{k ≥ 0} w_k φ_k(x),  if we assume φ_0(x) = 1 for all x (the bias b is absorbed as w_0)

Kernel Machines

- Preprocess input x by the basis functions:

  z = φ(x),  g(z) = w^T z,  so  g(x) = w^T φ(x)

- SVM solution: find kernel functions K(x, y) such that the inner products of basis functions are replaced by a kernel function evaluated in the original input space:

  w = Σ_t α^t r^t z^t = Σ_t α^t r^t φ(x^t)

  g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x)

  g(x) = Σ_t α^t r^t K(x^t, x)
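
A sketch of the last line, assuming NumPy: once the α^t are known, the decision needs only kernel evaluations against the support vectors, never an explicit φ(x). The RBF kernel and the stored α^t, r^t, x^t values are placeholders.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Hypothetical solution of the dual: support vectors with their alpha^t and r^t.
support_vectors = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0]])
alpha = np.array([0.7, 0.3, 1.0])
r = np.array([1.0, 1.0, -1.0])
w0 = 0.1

def g(x):
    """g(x) = sum_t alpha^t r^t K(x^t, x) + w0 -- no explicit phi(x) needed."""
    return sum(a * rt * rbf_kernel(sv, x)
               for a, rt, sv in zip(alpha, r, support_vectors)) + w0

x = np.array([0.5, 0.5])
print("g(x) =", g(x), "-> class", "C1" if g(x) > 0 else "C2")
```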
Kernel Functions

- Consider polynomials of degree q:  K(x, y) = (x^T y + 1)^q

For q = 2 and two-dimensional x:

K(x, y) = (x^T y + 1)^2
        = (x_1 y_1 + x_2 y_2 + 1)^2
        = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2

which corresponds to the basis expansion

φ(x) = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2]^T

(Cherkassky and Mulier, 1998)
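
A quick numerical check of this identity (NumPy assumed): the degree-2 kernel equals the inner product of the expansion φ above.

```python
import numpy as np

def K(x, y):
    return (x @ y + 1.0) ** 2

def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1**2, x2**2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])
print(K(x, y), phi(x) @ phi(y))   # the two numbers agree
```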

Examples of Kernel Functions

- Linear: K(x_i, x_j) = x_i^T x_j
  - Mapping Φ: x → φ(x), where φ(x) is x itself

- Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
  - Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions

- Gaussian (radial-basis function): K(x_i, x_j) = exp( −||x_i − x_j||^2 / (2σ^2) )
  - Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian)

- The higher-dimensional space still has intrinsic dimensionality d, but linear separators in it correspond to non-linear separators in the original space.
n Typically k is much larger than d, and possibly larger
than N
¨ Using the dual where the complexity depends on N rather than k
is advantageous

n We use the soft margin hyperplane


¨ If C is too large, too high a penalty for non-separable points (too
many support vectors)
¨ If C is too small, we may have underfitting

n Decide by cross-validation

Other Kernel Functions

- Polynomials of degree q:
  K(x^t, x) = (x^T x^t)^q   or   K(x^t, x) = (x^T x^t + 1)^q

- Radial-basis functions:
  K(x^t, x) = exp( −||x^t − x||^2 / σ^2 )

- Sigmoidal functions, such as:
  K(x^t, x) = tanh( 2 x^T x^t + 1 )

What Functions are Kernels? (Advanced)

- For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
- Any function that satisfies the Mercer conditions can be a kernel function (Cherkassky and Mulier, 1998): every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

  K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) |
      | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) |
      | …         …         …         …  …        |
      | K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) |
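
A sketch of the practical check implied here, assuming NumPy: build the Gram matrix of a candidate kernel on a sample of points and verify that its eigenvalues are (numerically) non-negative.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # a random sample of points

# Gram matrix K[i, j] = K(x_i, x_j); symmetric by construction.
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)              # eigenvalues of a symmetric matrix
print("smallest eigenvalue:", eigvals.min()) # >= 0 (up to round-off) for a valid kernel
```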

n Informally, kernel methods implicitly define the class of
possible patterns by introducing a notion of similarity
between data
¨ Choice of similarity -> Choice of relevant features

n More formally, kernel methods exploit information about


the inner products between data items
¨ Many standard algorithms can be rewritten so that they only
require inner products between data (inputs)
¨ Kernel functions = inner products in some feature space
(potentially very complex)
¨ If kernel given, no need to specify what features of the data are
being used
¨ Kernel functions make it possible to use infinite dimensions
n efficiently in time / space

String Kernels

- For example, given two documents, D1 and D2, the number of words appearing in both may form a kernel.

- Define φ(D1) as the M-dimensional binary vector whose dimension i is 1 if word w_i appears in D1, and 0 otherwise.
- Then φ(D1)^T φ(D2) is the number of shared words.

- If we define K(D1, D2) directly as the number of shared words:
  - no need to preselect the M words
  - no need to create the bag-of-words model explicitly
  - M can be as large as we want
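
A sketch of such a document kernel: count the shared words directly with a set intersection, never building the M-dimensional bag-of-words vectors; the example documents are made up.

```python
def shared_word_kernel(d1: str, d2: str) -> int:
    """K(D1, D2) = number of distinct words appearing in both documents."""
    return len(set(d1.lower().split()) & set(d2.lower().split()))

d1 = "the kernel trick replaces inner products with kernel evaluations"
d2 = "string kernels compare documents without explicit inner products"
print(shared_word_kernel(d1, d2))   # counts the shared words ("inner", "products")
```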

Projecting into Higher Dimensions

SVM Applications

- Cortes and Vapnik 1995:
  - Handwritten digit classification
  - 16x16 bitmaps → 256 dimensions
  - Polynomial kernel with q = 3 → a feature space with about 10^6 dimensions
  - No overfitting on a training set of 7300 instances
  - An average of 148 support vectors over different training sets

Expected test error rate:

Exp_N[P(error)] = Exp_N[# support vectors] / N   (≈ 148 / 7300 ≈ 0.02 for the above example)

SVM History and Applications

- SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.

- SVMs represent a general methodology for many pattern recognition problems: classification, regression, feature extraction, clustering, novelty detection, etc.

- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.

- SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.

- The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO [Platt '99] and [Joachims '99].
Advantages of SVMs

- There are no problems with local minima, because the solution is a quadratic programming problem with a global minimum.
- The optimal solution can be found in polynomial time.
- There are few model parameters to select: the penalty term C, and the kernel function and its parameters (e.g., the spread σ in the case of RBF kernels).
- The final results are stable and repeatable (e.g., no random initial weights).
- The SVM solution is sparse; it only involves the support vectors.
- SVMs rely on elegant and principled learning methods.
- SVMs provide a method to control complexity independently of dimensionality.
- SVMs have been shown (theoretically and empirically) to have excellent generalization capabilities.

Challenges

- Can the kernel functions be selected in a principled manner?

- SVMs still require the selection of a few parameters, typically through cross-validation.

- How does one incorporate domain knowledge?
  - Currently this is done through the selection of the kernel and the introduction of "artificial" examples.

- How interpretable are the results provided by an SVM?

- What is the optimal data representation for an SVM? What is the effect of feature weighting? How does an SVM handle categorical or missing features?

- Do SVMs always perform best? Can they beat a hand-crafted solution for a particular problem?

- Do SVMs eliminate the model selection problem?

n More explanations or demonstrations can be found at:
¨ http://www.support-vector-machines.org/index.html
¨ Haykin Chp. 6 pp. 318-339
¨ Burges tutorial (under/reading/)
n Burges, CJC "A Tutorial on Support Vector Machines for Pattern Recognition" Data
Mining and Knowledge Discovery, Vol 2 No 2, 1998.
¨ http://www.dtreg.com/svm.htm

n Software
¨ SVMlight, by Joachims, is one of the most widely used SVM classification and
regression package. Distributed as C++ source and binaries for Linux, Windows,
Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).

¨ LibSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM (Library for


Support Vector Machines), is developed by Chang and Lin; also widely used.
Developed in C++ and Java, it supports also multi-class classification, weighted
SVM for unbalanced data, cross-validation and automatic model selection. It has
interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels:
linear, polynomial, radial basis function, and neural (tanh).

n Applet to play with:


¨ http://lcn.epfl.ch/tutorial/english/svm/html/index.html
¨ http://cs.stanford.edu/people/karpathy/svmjs/demo/

