11 Ethem Linear SVM 2015
INTRODUCTION TO
Machine Learning
ETHEM ALPAYDIN
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
CHAPTER 10:
Linear Discrimination
Likelihood- vs. Discriminant-based Classification
Linear Discriminant
• Linear discriminant (see the code sketch below):
  g_i(x | w_i, w_i0) = w_i^T x + w_i0 = ∑_{j=1}^{d} w_ij x_j + w_i0
• Advantages:
  ◦ Simple: O(d) space/computation
  ◦ Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
  ◦ Optimal when the p(x|C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
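Below is a minimal NumPy sketch of evaluating such a discriminant; the weights and the input are made-up numbers for illustration only.

    import numpy as np

    # Linear discriminant g_i(x) = w_i^T x + w_i0, one per class.
    def linear_discriminant(x, w, w0):
        return w @ x + w0

    # Hypothetical 2-feature credit-scoring example: weights are illustrative only.
    w = np.array([0.8, -1.5])   # positive weight helps the score, negative hurts it
    w0 = 0.3
    x = np.array([2.0, 1.0])    # one applicant's (scaled) attributes
    print(linear_discriminant(x, w, w0))   # sign and magnitude are interpretable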
Generalized Linear Model
• Quadratic discriminant:
  g_i(x | W_i, w_i, w_i0) = x^T W_i x + w_i^T x + w_i0
  choose C_1 if g(x) > 0, C_2 otherwise
Learning the Discriminants
g_i(x | w_i, w_i0) = w_i^T x + w_i0
w_i = Σ^{-1} μ_i,   w_i0 = -(1/2) μ_i^T Σ^{-1} μ_i + log P(C_i)
So, estimate μ_i and Σ from data, and plug into the g_i's to find the linear discriminant functions.
Of course, any way of learning can be used (e.g. perceptron, gradient descent, logistic regression, ...).
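A minimal sketch of this plug-in estimation on made-up Gaussian data; the toy data, the equal priors, and the simple pooled-covariance estimate are my own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class C1 samples (toy data)
    X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # class C2 samples (toy data)

    # Estimate the class means and a shared (pooled) covariance matrix.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma = (np.cov(X1.T) + np.cov(X2.T)) / 2.0      # equal class sizes -> simple average
    Sigma_inv = np.linalg.inv(Sigma)
    P1 = P2 = 0.5                                    # equal priors in this toy setup

    # Plug into  w_i = Sigma^{-1} mu_i,  w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + log P(C_i)
    def plug_in(mu, prior):
        w = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        return w, w0

    (w1, w10), (w2, w20) = plug_in(mu1, P1), plug_in(mu2, P2)
    x = np.array([1.0, 1.0])
    print("choose C1" if w1 @ x + w10 > w2 @ x + w20 else "choose C2")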
• When K > 2:
  ◦ Combine K two-class problems, each one separating one class from all other classes.
Multiple Classes
g_i(x | w_i, w_i0) = w_i^T x + w_i0
One-vs-rest: choose C_i if g_i(x) = max_{j=1..K} g_j(x)
Pairwise separation: g_ij(x) > 0 if x ∈ C_i,  g_ij(x) ≤ 0 if x ∈ C_j,  don't care otherwise
choose C_i if ∀j ≠ i, g_ij(x) > 0
Uses K(K-1)/2 linear discriminants.
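A sketch of the one-vs-rest decision rule, choosing the class whose discriminant is largest; the weight vectors here are placeholders, not values from the text.

    import numpy as np

    # One linear discriminant per class: g_i(x) = w_i^T x + w_i0.
    W = np.array([[1.0, 0.0],     # placeholder weight vectors, one row per class
                  [0.0, 1.0],
                  [-1.0, -1.0]])
    w0 = np.array([0.0, 0.1, -0.2])

    def classify(x):
        g = W @ x + w0            # all K discriminants at once
        return int(np.argmax(g))  # choose C_i with the maximum g_i(x)

    print(classify(np.array([2.0, 0.5])))   # -> 0, i.e. class C_1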
A Bit of Geometry
• Dot Product and Projection
Geometry
The points x on the separating hyperplane satisfy g(x) = w^T x + w_0 = 0. Hence, for points on the boundary, w^T x = -w_0.
Support Vector Machines
• Vapnik and Chervonenkis, 1963
• Boser, Guyon and Vapnik, 1992 (kernel trick)
• Cortes and Vapnik, 1995 (soft margin)
• It is popular because:
  ◦ it can be easy to use
  ◦ it often has good generalization performance
  ◦ the same algorithm solves a variety of problems with little tuning
SVM Concepts
• Convex programming and duality
• Using maximum margin to control complexity
• Representing non-linear boundaries with feature expansion
• The kernel trick for efficient optimization
Linear Separators
Classification Margin
• Distance from example x_i to the separator:  r = (w^T x_i + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors from the two classes.
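A small sketch computing r for a few toy points; w, b and the points are made up for illustration.

    import numpy as np

    w = np.array([2.0, 1.0])
    b = -1.0
    X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]])

    # r = (w^T x_i + b) / ||w|| : signed distance of each example to the hyperplane
    r = (X @ w + b) / np.linalg.norm(w)
    print(r)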
Maximum Margin Classification
• Maximizing the margin is good according to intuition.
SVM as 2-class Linear Classifier
(Cortes and Vapnik, 1995; Vapnik, 1995)
X = {x^t, r^t}   where   r^t = +1 if x^t ∈ C_1,   r^t = -1 if x^t ∈ C_2
Find w and w_0 such that
  w^T x^t + w_0 ≥ +1   for r^t = +1
  w^T x^t + w_0 ≤ -1   for r^t = -1
Note the condition ≥ 1 (not just 0). We can always do this if the classes are linearly separable, by rescaling w and w_0 without affecting the separating hyperplane w^T x + w_0 = 0.
Optimal separating hyperplane: the separating hyperplane maximizing the margin.
Optimal Separating Hyperplane
Must satisfy:
  w^T x^t + w_0 ≥ +1   for r^t = +1
  w^T x^t + w_0 ≤ -1   for r^t = -1
which can be rewritten as
  r^t (w^T x^t + w_0) ≥ +1
(Cortes and Vapnik, 1995; Vapnik, 1995)
Maximizing the Margin
d = |+1| / ‖w‖ = |-1| / ‖w‖ = 1 / ‖w‖   (distance from the hyperplane to the closest example on either side)
ρ = 2d = 2 / ‖w‖
To maximize the margin, minimize the Euclidean norm of the weight vector w.
Maximizing the Margin: Alternate Explanation
min (1/2)‖w‖²   subject to   r^t (w^T x^t + w_0) ≥ +1, ∀t
Turn this into an unconstrained problem using Lagrange multipliers α^t ≥ 0:
  L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
min (1/2)‖w‖²   subject to   r^t (w^T x^t + w_0) ≥ +1, ∀t

L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
    = (1/2)‖w‖² − ∑_{t=1}^{N} α^t r^t (w^T x^t + w_0) + ∑_{t=1}^{N} α^t

∂L_p/∂w = 0   ⇒   w = ∑_{t=1}^{N} α^t r^t x^t
∂L_p/∂w_0 = 0   ⇒   ∑_{t=1}^{N} α^t r^t = 0

This convex quadratic optimization problem can be solved in its dual form: substitute these stationarity conditions back into L_p and maximize with respect to the α^t.
from: http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
    = (1/2)‖w‖² − ∑_{t=1}^{N} α^t r^t (w^T x^t + w_0) + ∑_{t=1}^{N} α^t

∂L_p/∂w = 0   ⇒   w = ∑_{t=1}^{N} α^t r^t x^t
∂L_p/∂w_0 = 0   ⇒   ∑_{t=1}^{N} α^t r^t = 0

Substituting back:
L_d = (1/2) w^T w − w^T ∑_t α^t r^t x^t − w_0 ∑_t α^t r^t + ∑_t α^t
    = −(1/2) w^T w + ∑_t α^t
    = −(1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s + ∑_t α^t
subject to   ∑_t α^t r^t = 0   and   α^t ≥ 0, ∀t

• Maximize L_d with respect to the α^t only.
• This is a quadratic programming problem.
• Thanks to the convexity of the problem, the optimal value of L_p equals that of L_d.
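This dual is a standard quadratic program. Below is a minimal sketch that hands it to the off-the-shelf cvxopt QP solver; the choice of cvxopt and the toy data are my own, not from the slides.

    import numpy as np
    from cvxopt import matrix, solvers       # cvxopt is an assumed choice of QP solver

    solvers.options['show_progress'] = False

    # Toy, linearly separable data: labels r^t in {+1, -1}
    X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
                  [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
    r = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    N = len(r)

    # Dual as a QP:  min_a 1/2 a^T P a - 1^T a   s.t.  r^T a = 0,  a >= 0
    P = matrix(np.outer(r, r) * (X @ X.T))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))                   # -a <= 0  encodes  a >= 0
    h = matrix(np.zeros(N))
    A = matrix(r.reshape(1, -1))
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

    sv = alpha > 1e-6                        # non-zero multipliers mark the support vectors
    w = (alpha * r) @ X                      # w = sum_t alpha^t r^t x^t
    w0 = np.mean(r[sv] - X[sv] @ w)          # from r^t (w^T x^t + w0) = 1 on the support vectors
    print("support vectors:", np.where(sv)[0], " w:", w, " w0:", w0)

Only the examples with α^t > 0 contribute to w and w_0; the rest of the training set can be discarded after training.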
• To every convex program corresponds a dual.
Support vectors: the x^t for which r^t (w^T x^t + w_0) = 1.
L_d = (1/2) w^T w − w^T ∑_t α^t r^t x^t − w_0 ∑_t α^t r^t + ∑_t α^t
    = −(1/2) w^T w + ∑_t α^t
    = −(1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s + ∑_t α^t
The size of the dual depends on N, not on d.
Calculating the parameters w and w0
L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
Note that for each training example:
  ◦ either the constraint is exactly satisfied (= 1), and α^t can be non-zero,
  ◦ or the constraint is strictly satisfied (> 1), and then α^t must be zero.
• Once we solve for the α^t, we see that most of them are 0 and only a small number have α^t > 0; the corresponding x^t are called the support vectors.
w_0 = r^t − w^T x^t   for any support vector x^t (in practice, averaging this over all support vectors is numerically more stable)
• We make decisions by comparing each query x with only the support vectors:
  y = sign(w^T x + w_0) = sign( ( ∑_{t∈SV} α^t r^t x^t )^T x + w_0 )
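A self-contained sketch of that support-vector-only decision; the support vectors, multipliers and w_0 below are placeholder values standing in for the output of the optimization.

    import numpy as np

    # Placeholder support vectors, labels, multipliers and bias (illustrative only).
    X_sv = np.array([[2.0, 2.0], [0.5, -0.5]])
    r_sv = np.array([1.0, -1.0])
    alpha_sv = np.array([0.4, 0.4])
    w0 = -0.8

    def predict(x):
        # y = sign( sum_{t in SV} alpha^t r^t (x^t)^T x + w0 )
        return int(np.sign((alpha_sv * r_sv) @ (X_sv @ x) + w0))

    print(predict(np.array([3.0, 1.0])))   # -> 1, i.e. class C_1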
Not Linearly Separable Case
• In the non-separable case, the previous approach has no feasible solution:
  ◦ the dual objective (L_d) grows arbitrarily large.
Soft Margin Hyperplane
• Not linearly separable: allow slack variables ξ^t
  r^t (w^T x^t + w_0) ≥ 1 − ξ^t,   ξ^t ≥ 0
  Case 1: ξ^t = 0          (on or outside the margin, correctly classified)
  Case 2: ξ^t ≥ 1          (misclassified)
  Case 3: 0 < ξ^t < 1      (inside the margin, but still correctly classified)
• Define the soft error ∑_t ξ^t (an upper bound on the number of training errors).
• The new primal is
  L_p = (1/2)‖w‖² + C ∑_t ξ^t − ∑_t α^t [ r^t (w^T x^t + w_0) − 1 + ξ^t ] − ∑_t μ^t ξ^t
  where the μ^t are Lagrange multipliers enforcing the positivity of the ξ^t.
L_d = −(1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s + ∑_t α^t
subject to   ∑_t α^t r^t = 0   and   0 ≤ α^t ≤ C, ∀t
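Relative to a hard-margin dual QP, only the inequality constraints change. A hedged sketch of how the box constraint 0 ≤ α^t ≤ C could be encoded for a generic QP solver; N and C here are assumed example values.

    import numpy as np

    # Soft-margin dual: same P, q, A, b as the hard-margin QP; only the
    # inequality constraints change from  alpha >= 0  to  0 <= alpha <= C.
    N, C = 6, 1.0                                  # assumed example values
    G = np.vstack([-np.eye(N), np.eye(N)])         # -alpha <= 0  and  alpha <= C
    h = np.concatenate([np.zeros(N), C * np.ones(N)])
    print(G.shape, h.shape)                        # (12, 6) (12,)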
Kernel Functions in SVM
• We can handle the overfitting problem: even if we have lots of parameters, large margins keep the classifier simple.
Kernel Functions
g(x) = ∑_{k=1}^{m} w_k φ_k(x) + b   (φ_k: basis functions, k = 1, ..., m)
g(x) = ∑_{k=0}^{m} w_k φ_k(x)   if we define φ_0(x) = 1 for all x
Kernel Machines
w = ∑_t α^t r^t z^t = ∑_t α^t r^t φ(x^t)
g(x) = w^T φ(x) = ∑_t α^t r^t φ(x^t)^T φ(x)
g(x) = ∑_t α^t r^t K(x^t, x)
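A minimal sketch of this kernelized discriminant; the support vectors, multipliers and the RBF kernel are placeholders of my own choosing, not values from the slides.

    import numpy as np

    # Placeholder support vectors, labels and multipliers (stand-ins for the
    # output of the dual optimization), plus an RBF kernel chosen for illustration.
    X_sv = np.array([[0.0, 1.0], [2.0, 2.0]])
    r_sv = np.array([-1.0, 1.0])
    alpha_sv = np.array([0.5, 0.5])

    def K(xt, x, sigma=1.0):
        return np.exp(-np.sum((xt - x) ** 2) / sigma ** 2)

    def g(x):
        # g(x) = sum_t alpha^t r^t K(x^t, x)
        return sum(a * r * K(xt, x) for a, r, xt in zip(alpha_sv, r_sv, X_sv))

    print(np.sign(g(np.array([1.8, 1.9]))))   # classify a query point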
Kernel Functions
K(x, y) = (x^T y + 1)²
        = (x_1 y_1 + x_2 y_2 + 1)²
        = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1² y_1² + x_2² y_2²
φ(x) = [ 1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2² ]^T
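A quick numeric check of this identity, using the φ given above and random 2-d vectors.

    import numpy as np

    def K(x, y):                        # kernel computed in the original 2-d space
        return (x @ y + 1.0) ** 2

    def phi(x):                         # explicit 6-d feature map from the expansion above
        s = np.sqrt(2.0)
        return np.array([1.0, s * x[0], s * x[1], s * x[0] * x[1], x[0] ** 2, x[1] ** 2])

    x, y = np.random.rand(2), np.random.rand(2)
    print(np.isclose(K(x, y), phi(x) @ phi(y)))   # -> True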
Examples of Kernel Functions
• Decide which kernel to use by cross-validation.
Other Kernel Functions
• Polynomials of degree q:
  K(x^t, x) = (x^T x^t)^q   or   K(x^t, x) = (x^T x^t + 1)^q
• Radial-basis functions:
  K(x^t, x) = exp( −‖x^t − x‖² / σ² )
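Both families as plain Python functions; a minimal sketch where q and σ are hyperparameters one would normally tune, e.g. by cross-validation.

    import numpy as np

    def poly_kernel(xt, x, q=2):
        # K(x^t, x) = (x^T x^t + 1)^q
        return (x @ xt + 1.0) ** q

    def rbf_kernel(xt, x, sigma=1.0):
        # K(x^t, x) = exp(-||x^t - x||^2 / sigma^2)
        return np.exp(-np.sum((xt - x) ** 2) / sigma ** 2)

    a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
    print(poly_kernel(a, b), rbf_kernel(a, b))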
What Functions are Kernels? (Advanced)
• For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
• Any function that satisfies the Mercer conditions can be a kernel function (Cherkassky and Mulier, 1998): every positive semi-definite symmetric function is a kernel.
• The Gram (kernel) matrix on the training set:
  K = [ K(x_i, x_j) ]_{i,j=1..n},   i.e. row i is   K(x_i, x_1), K(x_i, x_2), ..., K(x_i, x_n)
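A numeric illustration of the Mercer condition: build the Gram matrix of a kernel on some points and confirm its eigenvalues are (numerically) non-negative. The RBF kernel and the random points are my own choice of example.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))                   # 20 arbitrary points in 3-d

    # Gram (kernel) matrix K_ij = K(x_i, x_j) for an RBF kernel
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / 2.0)

    eigvals = np.linalg.eigvalsh(K)                # K is symmetric
    print(eigvals.min() >= -1e-10)                 # True: (numerically) positive semi-definite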
• Informally, kernel methods implicitly define the class of possible patterns by introducing a notion of similarity between data.
  ◦ Choice of similarity → choice of relevant features
String Kernels
• For example, given two documents, D1 and D2, the number of words appearing in both may form a kernel.
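A toy version of that shared-word count as code; the two documents are made up.

    def shared_word_kernel(d1, d2):
        # Number of distinct words appearing in both documents.
        return len(set(d1.lower().split()) & set(d2.lower().split()))

    D1 = "the cat sat on the mat"
    D2 = "the dog sat on the rug"
    print(shared_word_kernel(D1, D2))   # -> 3  ("the", "sat", "on")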
Projecting into Higher Dimensions
SVM Applications
SVM history and applications
Challenges
• What is the optimal data representation for SVM? What is the effect of feature weighting? How does an SVM handle categorical or missing features?
• Do SVMs always perform best? Can they beat a hand-crafted solution for a particular problem?
• More explanations or demonstrations can be found at:
  ◦ http://www.support-vector-machines.org/index.html
  ◦ Haykin, Chapter 6, pp. 318-339
  ◦ Burges tutorial (under /reading/): Burges, C. J. C., "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998.
  ◦ http://www.dtreg.com/svm.htm
• Software
  ◦ SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. It is distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).