11 Ethem Linear SVM 2015
INTRODUCTION TO
Machine Learning
ETHEM ALPAYDIN
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
CHAPTER 10:
Linear Discrimination
Likelihood- vs. Discriminant-based Classification
Linear Discriminant
• Linear discriminant (see the code sketch below):
  g_i(x | w_i, w_i0) = w_i^T x + w_i0 = ∑_{j=1}^{d} w_ij x_j + w_i0
• Advantages:
  ◦ Simple: O(d) space/computation
  ◦ Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
  ◦ Optimal when the p(x|C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
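Below is a minimal NumPy sketch of evaluating such a discriminant; the weights and the input are made-up numbers for illustration only.

    import numpy as np

    # Linear discriminant g_i(x) = w_i^T x + w_i0, one per class.
    def linear_discriminant(x, w, w0):
        return w @ x + w0

    # Hypothetical 2-feature credit-scoring example: weights are illustrative only.
    w = np.array([0.8, -1.5])   # positive weight helps the score, negative hurts it
    w0 = 0.3
    x = np.array([2.0, 1.0])    # one applicant's (scaled) attributes
    print(linear_discriminant(x, w, w0))   # sign and magnitude are interpretable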
Generalized Linear Model
• Quadratic discriminant:
  g_i(x | W_i, w_i, w_i0) = x^T W_i x + w_i^T x + w_i0
  choose C_1 if g(x) > 0, C_2 otherwise
Learning the Discriminants
g_i(x | w_i, w_i0) = w_i^T x + w_i0
w_i = Σ^{-1} μ_i,   w_i0 = -(1/2) μ_i^T Σ^{-1} μ_i + log P(C_i)
So, estimate μ_i and Σ from data, and plug into the g_i's to find the linear discriminant functions.
Of course, any way of learning can be used (e.g. perceptron, gradient descent, logistic regression, ...).
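A minimal sketch of this plug-in estimation on made-up Gaussian data; the toy data, the equal priors, and the simple pooled-covariance estimate are my own illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class C1 samples (toy data)
    X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # class C2 samples (toy data)

    # Estimate the class means and a shared (pooled) covariance matrix.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma = (np.cov(X1.T) + np.cov(X2.T)) / 2.0      # equal class sizes -> simple average
    Sigma_inv = np.linalg.inv(Sigma)
    P1 = P2 = 0.5                                    # equal priors in this toy setup

    # Plug into  w_i = Sigma^{-1} mu_i,  w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + log P(C_i)
    def plug_in(mu, prior):
        w = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        return w, w0

    (w1, w10), (w2, w20) = plug_in(mu1, P1), plug_in(mu2, P2)
    x = np.array([1.0, 1.0])
    print("choose C1" if w1 @ x + w10 > w2 @ x + w20 else "choose C2")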
• When K > 2:
  ◦ Combine K two-class problems, each one separating one class from all other classes.
Multiple Classes
g_i(x | w_i, w_i0) = w_i^T x + w_i0
One-vs-rest: choose C_i if g_i(x) = max_{j=1..K} g_j(x)
Pairwise separation: g_ij(x) > 0 if x ∈ C_i,  g_ij(x) ≤ 0 if x ∈ C_j,  don't care otherwise
choose C_i if ∀j ≠ i, g_ij(x) > 0
Uses K(K-1)/2 linear discriminants.
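A sketch of the one-vs-rest decision rule, choosing the class whose discriminant is largest; the weight vectors here are placeholders, not values from the text.

    import numpy as np

    # One linear discriminant per class: g_i(x) = w_i^T x + w_i0.
    W = np.array([[1.0, 0.0],     # placeholder weight vectors, one row per class
                  [0.0, 1.0],
                  [-1.0, -1.0]])
    w0 = np.array([0.0, 0.1, -0.2])

    def classify(x):
        g = W @ x + w0            # all K discriminants at once
        return int(np.argmax(g))  # choose C_i with the maximum g_i(x)

    print(classify(np.array([2.0, 0.5])))   # -> 0, i.e. class C_1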
A Bit of Geometry
• Dot Product and Projection
Geometry
The points x on the separating hyperplane satisfy g(x) = w^T x + w_0 = 0. Hence, for points on the boundary, w^T x = -w_0.
Support Vector Machines
• Vapnik and Chervonenkis, 1963
• Boser, Guyon and Vapnik, 1992 (kernel trick)
• Cortes and Vapnik, 1995 (soft margin)
• It is popular because:
  ◦ it can be easy to use
  ◦ it often has good generalization performance
  ◦ the same algorithm solves a variety of problems with little tuning
SVM Concepts
• Convex programming and duality
• Using maximum margin to control complexity
• Representing non-linear boundaries with feature expansion
• The kernel trick for efficient optimization
Linear Separators
Classification Margin
• Distance from example x_i to the separator:  r = (w^T x_i + b) / ‖w‖
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors from the two classes.
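A small sketch computing r for a few toy points; w, b and the points are made up for illustration.

    import numpy as np

    w = np.array([2.0, 1.0])
    b = -1.0
    X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]])

    # r = (w^T x_i + b) / ||w|| : signed distance of each example to the hyperplane
    r = (X @ w + b) / np.linalg.norm(w)
    print(r)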
Maximum Margin Classification
• Maximizing the margin is good according to intuition.
SVM as 2-class Linear Classifier
(Cortes and Vapnik, 1995; Vapnik, 1995)
X = {x^t, r^t}   where   r^t = +1 if x^t ∈ C_1,   r^t = -1 if x^t ∈ C_2
Find w and w_0 such that
  w^T x^t + w_0 ≥ +1   for r^t = +1
  w^T x^t + w_0 ≤ -1   for r^t = -1
Note the condition ≥ 1 (not just 0). We can always do this if the classes are linearly separable, by rescaling w and w_0 without affecting the separating hyperplane w^T x + w_0 = 0.
Optimal separating hyperplane: the separating hyperplane maximizing the margin.
Optimal Separating Hyperplane
Must satisfy:
  w^T x^t + w_0 ≥ +1   for r^t = +1
  w^T x^t + w_0 ≤ -1   for r^t = -1
which can be rewritten as
  r^t (w^T x^t + w_0) ≥ +1
(Cortes and Vapnik, 1995; Vapnik, 1995)
Maximizing the Margin
d = |+1| / ‖w‖ = |-1| / ‖w‖ = 1 / ‖w‖   (distance from the hyperplane to the closest example on either side)
ρ = 2d = 2 / ‖w‖
To maximize the margin, minimize the Euclidean norm of the weight vector w.
Maximizing the Margin: Alternate Explanation
min (1/2)‖w‖²   subject to   r^t (w^T x^t + w_0) ≥ +1, ∀t
Turn this into an unconstrained problem using Lagrange multipliers α^t ≥ 0:
  L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
min (1/2)‖w‖²   subject to   r^t (w^T x^t + w_0) ≥ +1, ∀t

L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
    = (1/2)‖w‖² − ∑_{t=1}^{N} α^t r^t (w^T x^t + w_0) + ∑_{t=1}^{N} α^t

∂L_p/∂w = 0   ⇒   w = ∑_{t=1}^{N} α^t r^t x^t
∂L_p/∂w_0 = 0   ⇒   ∑_{t=1}^{N} α^t r^t = 0

This convex quadratic optimization problem can be solved in its dual form: substitute these stationarity conditions back into L_p and maximize with respect to the α^t.
from: http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
    = (1/2)‖w‖² − ∑_{t=1}^{N} α^t r^t (w^T x^t + w_0) + ∑_{t=1}^{N} α^t

∂L_p/∂w = 0   ⇒   w = ∑_{t=1}^{N} α^t r^t x^t
∂L_p/∂w_0 = 0   ⇒   ∑_{t=1}^{N} α^t r^t = 0

Substituting back:
L_d = (1/2) w^T w − w^T ∑_t α^t r^t x^t − w_0 ∑_t α^t r^t + ∑_t α^t
    = −(1/2) w^T w + ∑_t α^t
    = −(1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s + ∑_t α^t
subject to   ∑_t α^t r^t = 0   and   α^t ≥ 0, ∀t

• Maximize L_d with respect to the α^t only.
• This is a quadratic programming problem.
• Thanks to the convexity of the problem, the optimal value of L_p equals that of L_d.
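This dual is a standard quadratic program. Below is a minimal sketch that hands it to the off-the-shelf cvxopt QP solver; the choice of cvxopt and the toy data are my own, not from the slides.

    import numpy as np
    from cvxopt import matrix, solvers       # cvxopt is an assumed choice of QP solver

    solvers.options['show_progress'] = False

    # Toy, linearly separable data: labels r^t in {+1, -1}
    X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
                  [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
    r = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    N = len(r)

    # Dual as a QP:  min_a 1/2 a^T P a - 1^T a   s.t.  r^T a = 0,  a >= 0
    P = matrix(np.outer(r, r) * (X @ X.T))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))                   # -a <= 0  encodes  a >= 0
    h = matrix(np.zeros(N))
    A = matrix(r.reshape(1, -1))
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

    sv = alpha > 1e-6                        # non-zero multipliers mark the support vectors
    w = (alpha * r) @ X                      # w = sum_t alpha^t r^t x^t
    w0 = np.mean(r[sv] - X[sv] @ w)          # from r^t (w^T x^t + w0) = 1 on the support vectors
    print("support vectors:", np.where(sv)[0], " w:", w, " w0:", w0)

Only the examples with α^t > 0 contribute to w and w_0; the rest of the training set can be discarded after training.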
• To every convex program corresponds a dual.
Support vectors: the x^t for which r^t (w^T x^t + w_0) = 1.
L_d = (1/2) w^T w − w^T ∑_t α^t r^t x^t − w_0 ∑_t α^t r^t + ∑_t α^t
    = −(1/2) w^T w + ∑_t α^t
    = −(1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s + ∑_t α^t
The size of the dual depends on N, not on d.
Calculating the parameters w and w0
L_p = (1/2)‖w‖² − ∑_{t=1}^{N} α^t [ r^t (w^T x^t + w_0) − 1 ]
Note that for each training example:
  ◦ either the constraint is exactly satisfied (= 1), and α^t can be non-zero,
  ◦ or the constraint is strictly satisfied (> 1), and then α^t must be zero.
• Once we solve for the α^t, we see that most of them are 0 and only a small number have α^t > 0; the corresponding x^t are called the support vectors.
w_0 = r^t − w^T x^t   for any support vector x^t (in practice, averaging this over all support vectors is numerically more stable)
• We make decisions by comparing each query x with only the support vectors:
  y = sign(w^T x + w_0) = sign( ( ∑_{t∈SV} α^t r^t x^t )^T x + w_0 )
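A self-contained sketch of that support-vector-only decision; the support vectors, multipliers and w_0 below are placeholder values standing in for the output of the optimization.

    import numpy as np

    # Placeholder support vectors, labels, multipliers and bias (illustrative only).
    X_sv = np.array([[2.0, 2.0], [0.5, -0.5]])
    r_sv = np.array([1.0, -1.0])
    alpha_sv = np.array([0.4, 0.4])
    w0 = -0.8

    def predict(x):
        # y = sign( sum_{t in SV} alpha^t r^t (x^t)^T x + w0 )
        return int(np.sign((alpha_sv * r_sv) @ (X_sv @ x) + w0))

    print(predict(np.array([3.0, 1.0])))   # -> 1, i.e. class C_1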
Not Linearly Separable Case
• In the non-separable case, the previous approach has no feasible solution:
  ◦ the dual objective (L_d) grows arbitrarily large.
Soft Margin Hyperplane
• Not linearly separable: allow slack variables ξ^t
  r^t (w^T x^t + w_0) ≥ 1 − ξ^t,   ξ^t ≥ 0
  Case 1: ξ^t = 0          (on or outside the margin, correctly classified)
  Case 2: ξ^t ≥ 1          (misclassified)
  Case 3: 0 < ξ^t < 1      (inside the margin, but still correctly classified)
• Define the soft error ∑_t ξ^t (an upper bound on the number of training errors).
• The new primal is
  L_p = (1/2)‖w‖² + C ∑_t ξ^t − ∑_t α^t [ r^t (w^T x^t + w_0) − 1 + ξ^t ] − ∑_t μ^t ξ^t
  where the μ^t are Lagrange multipliers enforcing the positivity of the ξ^t.
L_d = −(1/2) ∑_t ∑_s α^t α^s r^t r^s (x^t)^T x^s + ∑_t α^t
subject to   ∑_t α^t r^t = 0   and   0 ≤ α^t ≤ C, ∀t
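Relative to a hard-margin dual QP, only the inequality constraints change. A hedged sketch of how the box constraint 0 ≤ α^t ≤ C could be encoded for a generic QP solver; N and C here are assumed example values.

    import numpy as np

    # Soft-margin dual: same P, q, A, b as the hard-margin QP; only the
    # inequality constraints change from  alpha >= 0  to  0 <= alpha <= C.
    N, C = 6, 1.0                                  # assumed example values
    G = np.vstack([-np.eye(N), np.eye(N)])         # -alpha <= 0  and  alpha <= C
    h = np.concatenate([np.zeros(N), C * np.ones(N)])
    print(G.shape, h.shape)                        # (12, 6) (12,)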
Kernel Functions in SVM
• We can handle the overfitting problem: even if we have lots of parameters, large margins keep the classifier simple.
Kernel Functions
g(x) = ∑_{k=1}^{m} w_k φ_k(x) + b   (φ_k: basis functions, k = 1, ..., m)
g(x) = ∑_{k=0}^{m} w_k φ_k(x)   if we define φ_0(x) = 1 for all x
Kernel Machines
w = ∑_t α^t r^t z^t = ∑_t α^t r^t φ(x^t)
g(x) = w^T φ(x) = ∑_t α^t r^t φ(x^t)^T φ(x)
g(x) = ∑_t α^t r^t K(x^t, x)
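A minimal sketch of this kernelized discriminant; the support vectors, multipliers and the RBF kernel are placeholders of my own choosing, not values from the slides.

    import numpy as np

    # Placeholder support vectors, labels and multipliers (stand-ins for the
    # output of the dual optimization), plus an RBF kernel chosen for illustration.
    X_sv = np.array([[0.0, 1.0], [2.0, 2.0]])
    r_sv = np.array([-1.0, 1.0])
    alpha_sv = np.array([0.5, 0.5])

    def K(xt, x, sigma=1.0):
        return np.exp(-np.sum((xt - x) ** 2) / sigma ** 2)

    def g(x):
        # g(x) = sum_t alpha^t r^t K(x^t, x)
        return sum(a * r * K(xt, x) for a, r, xt in zip(alpha_sv, r_sv, X_sv))

    print(np.sign(g(np.array([1.8, 1.9]))))   # classify a query point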
Kernel Functions
K(x, y) = (x^T y + 1)²
        = (x_1 y_1 + x_2 y_2 + 1)²
        = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1² y_1² + x_2² y_2²
φ(x) = [ 1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2² ]^T
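A quick numeric check of this identity, using the φ given above and random 2-d vectors.

    import numpy as np

    def K(x, y):                        # kernel computed in the original 2-d space
        return (x @ y + 1.0) ** 2

    def phi(x):                         # explicit 6-d feature map from the expansion above
        s = np.sqrt(2.0)
        return np.array([1.0, s * x[0], s * x[1], s * x[0] * x[1], x[0] ** 2, x[1] ** 2])

    x, y = np.random.rand(2), np.random.rand(2)
    print(np.isclose(K(x, y), phi(x) @ phi(y)))   # -> True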
Examples of Kernel Functions
• Decide which kernel to use by cross-validation.
Other Kernel Functions
• Polynomials of degree q:
  K(x^t, x) = (x^T x^t)^q   or   K(x^t, x) = (x^T x^t + 1)^q
• Radial-basis functions:
  K(x^t, x) = exp( −‖x^t − x‖² / σ² )
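Both families as plain Python functions; a minimal sketch where q and σ are hyperparameters one would normally tune, e.g. by cross-validation.

    import numpy as np

    def poly_kernel(xt, x, q=2):
        # K(x^t, x) = (x^T x^t + 1)^q
        return (x @ xt + 1.0) ** q

    def rbf_kernel(xt, x, sigma=1.0):
        # K(x^t, x) = exp(-||x^t - x||^2 / sigma^2)
        return np.exp(-np.sum((xt - x) ** 2) / sigma ** 2)

    a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
    print(poly_kernel(a, b), rbf_kernel(a, b))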
What Functions are Kernels? (Advanced)
• For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
• Any function that satisfies the Mercer conditions can be a kernel function (Cherkassky and Mulier, 1998): every positive semi-definite symmetric function is a kernel.
• The Gram (kernel) matrix on the training set:
  K = [ K(x_i, x_j) ]_{i,j=1..n},   i.e. row i is   K(x_i, x_1), K(x_i, x_2), ..., K(x_i, x_n)
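A numeric illustration of the Mercer condition: build the Gram matrix of a kernel on some points and confirm its eigenvalues are (numerically) non-negative. The RBF kernel and the random points are my own choice of example.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))                   # 20 arbitrary points in 3-d

    # Gram (kernel) matrix K_ij = K(x_i, x_j) for an RBF kernel
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / 2.0)

    eigvals = np.linalg.eigvalsh(K)                # K is symmetric
    print(eigvals.min() >= -1e-10)                 # True: (numerically) positive semi-definite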
• Informally, kernel methods implicitly define the class of possible patterns by introducing a notion of similarity between data.
  ◦ Choice of similarity → choice of relevant features
String Kernels
• For example, given two documents, D1 and D2, the number of words appearing in both may form a kernel.
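A toy version of that shared-word count as code; the two documents are made up.

    def shared_word_kernel(d1, d2):
        # Number of distinct words appearing in both documents.
        return len(set(d1.lower().split()) & set(d2.lower().split()))

    D1 = "the cat sat on the mat"
    D2 = "the dog sat on the rug"
    print(shared_word_kernel(D1, D2))   # -> 3  ("the", "sat", "on")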
Projecting into Higher Dimensions
SVM Applications
SVM history and applications
Challenges
• What is the optimal data representation for SVM? What is the effect of feature weighting? How does an SVM handle categorical or missing features?
• Do SVMs always perform best? Can they beat a hand-crafted solution for a particular problem?
• More explanations or demonstrations can be found at:
  ◦ http://www.support-vector-machines.org/index.html
  ◦ Haykin, Chapter 6, pp. 318-339
  ◦ Burges tutorial (under /reading/): Burges, C. J. C., "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998.
  ◦ http://www.dtreg.com/svm.htm
• Software
  ◦ SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. It is distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).