
Lecture 5: More on Kernels. SVM regression and classification

• How to tell if a function is a kernel
• SVM regression
• SVM classification

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 1


The “kernel trick”

• Recall: kernel functions are ways of expressing dot-products in some feature space:

K(x, z) = φ(x) · φ(z)

• In many cases, K can be computed in time that depends on the size of the inputs x, not the size of the feature space φ(x)
• If we work with a “dual” representation of the learning algorithm, we do not actually have to compute the feature mapping φ. We just have to compute the similarity K.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 2


Polynomial kernels

• More generally, K(x, z) = (x · z)^d is a kernel, for any positive integer d:

K(x, z) = (Σ_{i=1}^n xi zi)^d

• If we expanded the sum above in the obvious way, we get n^d terms (i.e. the explicit feature expansion)
• Terms are monomials (products of xi) with total power equal to d.
• Curse of dimensionality: it is very expensive both to optimize and to predict in primal form
• However, evaluating the dot-product of any two feature vectors can be done using K in O(n)!
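To make the O(n) versus n^d contrast concrete, here is a minimal NumPy sketch (not from the lecture) checking on a tiny example that (x · z)^d equals the dot product of the explicit degree-d monomial features:

```python
# Sketch: the polynomial kernel (x . z)^d equals the dot product of explicit
# degree-d monomial features, but costs only O(n) to evaluate.
import itertools
import numpy as np

def poly_kernel(x, z, d):
    """K(x, z) = (x . z)^d, computed in O(n) time."""
    return np.dot(x, z) ** d

def explicit_features(x, d):
    """All n^d ordered degree-d monomials x_{i1} * ... * x_{id} (exponential blow-up)."""
    n = len(x)
    return np.array([np.prod([x[i] for i in idx])
                     for idx in itertools.product(range(n), repeat=d)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
d = 3
assert np.isclose(poly_kernel(x, z, d),
                  np.dot(explicit_features(x, d), explicit_features(z, d)))
```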

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 3


Some other (fairly generic) kernel functions

• K(x, z) = (1 + x · z)^d – feature expansion has all monomial terms of ≤ d total power.
• Radial basis/Gaussian kernel:

K(x, z) = exp(−‖x − z‖²/(2σ²))

The kernel has an infinite-dimensional feature expansion, but dot-products can still be computed in O(n)!
• Sigmoidal kernel:

K(x, z) = tanh(c1 x · z + c2)
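As a small illustration (not from the lecture), both kernels can be evaluated directly on the raw n-dimensional inputs:

```python
# Sketch: generic kernels from this slide, evaluated in O(n) time even though
# the Gaussian kernel's feature expansion is infinite-dimensional.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def sigmoidal_kernel(x, z, c1=1.0, c2=0.0):
    # Note: tanh(c1 x.z + c2) is only a valid (PSD) kernel for some choices of c1, c2.
    return np.tanh(c1 * np.dot(x, z) + c2)
```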

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 4


Recall: Dual-view regression

• By re-writing the parameter vector as a linear combination of instances and solving, we get:

a = (K + λIm)^{−1} y

• The feature mapping is not needed either to learn or to make predictions!
• This approach is useful if the feature space is very large

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 5


Making predictions in the dual view

• For a new input x, the prediction is:

h(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λIm)^{−1} y

where k(x) is an m-dimensional vector, with the ith element equal to K(x, xi)

• That is, the ith element has the similarity of the input to the ith instance
• The features are not needed for this step either!
• This is a non-parametric representation - its size scales with the number
of instances.
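A minimal NumPy sketch of the two formulas above (not from the lecture; the kernel argument is any function such as the ones defined earlier):

```python
# Sketch: dual-view (kernel) ridge regression.
# Fit: a = (K + lambda*I_m)^{-1} y; predict: h(x) = k(x)^T a.
import numpy as np

def fit_dual(X, y, kernel, lam):
    m = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # m x m Gram matrix
    a = np.linalg.solve(K + lam * np.eye(m), y)                # dual coefficients
    return a

def predict_dual(x, X, a, kernel):
    k = np.array([kernel(x, xi) for xi in X])   # similarities to the m training instances
    return k @ a

# Usage sketch (X_train, y_train, x_new and gaussian_kernel are assumed to be in scope):
# a = fit_dual(X_train, y_train, gaussian_kernel, lam=0.1)
# y_hat = predict_dual(x_new, X_train, a, gaussian_kernel)
```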

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 6


Regularization in the dual view

• We want to penalize the function we are trying to estimate (to keep it simple)
• Assume this is part of a reproducing kernel Hilbert space (Doina will post extra notes for those interested in this)
• We want to minimize:

J(h) = (1/2) Σ_{i=1}^n (yi − h(xi))² + (λ/2) ‖h‖²_H

• If we put a Gaussian prior on h, and solve, we obtain Gaussian process regression

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 7


Logistic regression

• The output of a logistic regression predictor is:

hw(x) = 1 / (1 + e^{w^T φ(x) + w0})

• Again, we can define the weights in terms of support vectors: w = Σ_{i=1}^m αi φ(xi)
• The prediction can now be computed as:

h(x) = 1 / (1 + e^{Σ_{i=1}^m αi K(xi, x) + w0})

• αi are the new parameters (one per instance) and can be derived using gradient descent
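A gradient-descent sketch for the αi (not from the lecture): it uses the standard convention p(y = 1 | x) = 1/(1 + e^{−f(x)}) with f(x) = Σi αi K(xi, x) + w0 and labels in {0, 1}, and all names are illustrative:

```python
# Sketch: kernelized logistic regression trained by full-batch gradient descent
# on the alphas, using the standard sigmoid convention and labels in {0, 1}.
import numpy as np

def fit_kernel_logistic(K, y, lr=0.1, n_iters=500):
    """K: m x m Gram matrix on the training set; y: labels in {0, 1}."""
    m = K.shape[0]
    alpha = np.zeros(m)
    w0 = 0.0
    for _ in range(n_iters):
        f = K @ alpha + w0
        p = 1.0 / (1.0 + np.exp(-f))       # predicted probabilities
        grad_alpha = K.T @ (p - y) / m     # gradient of the average log-loss wrt alpha
        grad_w0 = np.mean(p - y)
        alpha -= lr * grad_alpha
        w0 -= lr * grad_w0
    return alpha, w0
```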

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 8


Kernels

• A lot of current research has to do with defining new kernel functions, suitable to particular tasks / kinds of input objects
• Many kernels are available:
– Information diffusion kernels (Lafferty and Lebanon, 2002)
– Diffusion kernels on graphs (Kondor and Jebara 2003)
– String kernels for text classification (Lodhi et al, 2002)
– String kernels for protein classification (e.g., Leslie et al, 2002)
... and others!

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 9


Example: String kernels

• Very important for DNA matching, text classification, ...


• Example: in DNA matching, we use a sliding window of length k over
the two strings that we want to compare
• The window is of a given size, and inside we can do various things:
– Count exact matches
– Weigh mismatches based on how bad they are
– Count certain markers, e.g. AGT
• The kernel is the sum of these similarities over the two sequences
• How do we prove this is a kernel?
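As a concrete sketch (not from the lecture), a simple “spectrum” variant of such a kernel slides a window of length k over each string, counts the substrings that occur, and takes the dot product of the two count vectors; because it is an explicit inner product of count features, it is a valid kernel by construction:

```python
# Sketch: a k-mer "spectrum" string kernel (counts of length-k substrings).
from collections import Counter

def kmer_counts(s, k):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[w] * ct[w] for w in cs)   # dot product of sparse count vectors

print(spectrum_kernel("AGTCAGT", "CAGTAGT", k=3))   # shared 3-mers: AGT (x2 each), CAG
```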

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 10


Establishing “kernelhood”

• Suppose someone hands you a function K. How do you know that it is a kernel?
• More precisely, given a function K : Rn × Rn → R, under what conditions can K(x, z) be written as a dot product φ(x) · φ(z) for some feature mapping φ?
• We want a general recipe, which does not require explicitly defining φ
every time

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 11


Kernel matrix

• Suppose we have an arbitrary set of input vectors x1, x2, . . . xm


• The kernel matrix (or Gram matrix) K corresponding to kernel function
K is an m×m matrix such that Kij = K(xi, xj ) (notation is overloaded
on purpose).
• What properties does the kernel matrix K have?
• Claims:
1. K is symmetric
2. K is positive semidefinite
• Note that these claims are consistent with the intuition that K is a
“similarity” measure (and will be true regardless of the data)
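A quick empirical sanity check of these two claims (not from the lecture): build the Gram matrix for a sample and test symmetry and positive semidefiniteness via its eigenvalues. A passing check on one sample does not prove K is a kernel; Mercer's theorem, below, requires the properties to hold for every data set.

```python
# Sketch: build a Gram matrix and verify it is symmetric and PSD on one sample.
import numpy as np

def gram_matrix(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = gram_matrix(X, lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2 / 2.0))

assert np.allclose(K, K.T)          # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10      # positive semidefinite up to numerical error
```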

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 12


Proving the first claim
If K is a valid kernel, then the kernel matrix is symmetric

Kij = φ(xi) · φ(xj ) = φ(xj ) · φ(xi) = Kji

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 13


Proving the second claim
If K is a valid kernel, then the kernel matrix is positive semidefinite

Proof: Consider an arbitrary vector z


z^T K z = Σ_i Σ_j zi Kij zj = Σ_i Σ_j zi (φ(xi) · φ(xj)) zj
        = Σ_i Σ_j zi (Σ_k φk(xi) φk(xj)) zj
        = Σ_k Σ_i Σ_j zi φk(xi) φk(xj) zj
        = Σ_k (Σ_i zi φk(xi))² ≥ 0

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 14


Mercer’s theorem
• We have shown that if K is a kernel function, then for any data set,
the corresponding kernel matrix K defined such that Kij = K(xi, xj) is
symmetric and positive semidefinite
• Mercer’s theorem states that the reverse is also true:
Given a function K : Rn × Rn → R, K is a kernel if and only if, for
any data set, the corresponding kernel matrix is symmetric and positive
semidefinite
• The reverse direction of the proof is much harder (see e.g. Vapnik’s
book for details)
• This result gives us a way to check if a given function is a kernel, by
checking these two properties of its kernel matrix.
• Kernels can also be obtained by combining other kernels, or by learning
from data
• Kernel learning may suffer from overfitting (kernel matrix close to
diagonal)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 15


More on RKHS
• Mercer’s theorem tells us that a function K : X × X → R is a kernel
(i.e. it corresponds to the inner product in some feature space) if and
only if it is positive semi-definite.
• The feature space is the reproducing kernel Hilbert space (RKHS)

H = { Σ_j αj K(zj, ·) : zj ∈ X, αj ∈ R }

with inner product ⟨K(z, ·), K(z′, ·)⟩_H = K(z, z′).

• The term reproducing comes from the reproducing property of the kernel function:

∀f ∈ H, x ∈ X : f(x) = ⟨f(·), K(x, ·)⟩_H

• Recall that the solution of the regularized least squares problem in the feature space associated to a kernel function K has the form h(x) = Σ_{i=1}^m αi K(xi, x).
This is a particular case of the representer theorem...

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 16


Representer Theorem
Theorem 1. Let K : X × X → R be a positive definite kernel and let H be the corresponding RKHS.
Then for any training sample S = {(xi, yi)}_{i=1}^m ⊂ X × R, any loss function ℓ : (X × R × R)^m → R and any real-valued non-decreasing function g, the solution of the optimization problem

arg min_{f ∈ H} ℓ((x1, y1, f(x1)), · · · , (xm, ym, f(xm))) + g(‖f‖_H)

admits a representation of the form

f*(·) = Σ_{i=1}^m αi K(xi, ·).

[Schölkopf, Herbrich and Smola. A generalized representer theorem. COLT 2001.]

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 17


Support Vector Regression

• In regression problems, so far we have been trying to minimize mean-squared error:

Σ_i (yi − (w · xi + w0))²

• In SVM regression, we will be interested instead in minimizing absolute error:

Σ_i |yi − (w · xi + w0)|

• This is more robust to outliers than the squared loss
• But we cannot require that all points be approximated correctly (overfitting!)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 18


Loss function for support vector regression
In order to allow for errors in SVM regression (and have robustness to noise), we use the ε-insensitive loss:

J = Σ_{i=1}^m J(xi), where

J(xi) = 0 if |yi − (w · xi + w0)| ≤ ε
J(xi) = |yi − (w · xi + w0)| − ε otherwise

The cost is zero inside the epsilon “tube” (unlike the square loss, which penalizes every deviation).

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 19
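A one-line NumPy version of this loss (not from the lecture):

```python
# Sketch: the epsilon-insensitive loss: zero inside the tube, linear outside it.
import numpy as np

def eps_insensitive_loss(y, y_pred, eps=0.1):
    return np.maximum(0.0, np.abs(y - y_pred) - eps)
```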
Solving SVM regression

• We introduce slack variables, ξi+, ξi− to account for errors outside the
tolerance area
• We need two kinds of variables to account for both positive and negative
errors

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 20


The optimization problem
min (1/2)‖w‖² + C Σ_i (ξi+ + ξi−)
w.r.t. w, w0, ξi+, ξi−
s.t. yi − (w · xi + w0) ≤ ε + ξi+
     yi − (w · xi + w0) ≥ −ε − ξi−
     ξi+, ξi− ≥ 0

• Like before, we can write the Lagrangian and solve the dual form of the
problem
• Kernels can be used as before to get non-linear functions
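In practice this optimization is handled by a library; a sketch (not from the lecture; the toy data and parameter values are illustrative) using scikit-learn's SVR, which exposes C, ε and the kernel directly:

```python
# Sketch: kernelized support vector regression via scikit-learn.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=50)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=2.0)
model.fit(X, y)
y_hat = model.predict(X)
print("number of support vectors:", len(model.support_vectors_))
```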

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 21


Effect of ε

• Parameters are: C, ε, σ (and w is an N-vector)
• [Figure: SVM regression fits of the same sample points for ε = 0.8, ε = 0.5 and ε = 0.01, showing the ideal fit, the validation-set fit and the support vectors]
• The validation set fit is a search over both C and σ
• As ε increases, the function is allowed to move away from the data points: the fit becomes looser, fewer data points are support vectors, and the fit gets worse

Zisserman course notes

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 22
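A small sweep (not from the lecture; toy data and parameter values are illustrative) that reproduces the trend: more support vectors for smaller ε.

```python
# Sketch: counting support vectors as epsilon grows.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=50)

for eps in [0.01, 0.1, 0.5, 0.8]:
    m = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(m.support_vectors_)} support vectors")
```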
Binary classification revisited

• Consider a linearly separable binary classification data set {(xi, yi)}_{i=1}^m.
• There is an infinite number of hyperplanes that separate the classes:

[Figure: positively (+) and negatively (−) labelled points, with many possible separating hyperplanes]

• Which plane is best?


• Relatedly, for a given plane, for which points should we be most confident
in the classification?

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 23


The margin, and linear SVMs

• For a given separating hyperplane, the margin is two times the (Euclidean) distance from the hyperplane to the nearest training example.

[Figure: positively (+) and negatively (−) labelled points and a separating hyperplane, with the margin “strip” around it]

• It is the width of the “strip” around the decision boundary containing no training examples.
• A linear SVM is a perceptron for which we choose w, w0 so that the margin is maximized

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 24


Distance to the decision boundary

• Suppose we have a decision boundary that separates the data.

[Figure: a separating hyperplane with normal vector w, a point A = xi, and its distance γi to the hyperplane]

• Let γi be the distance from instance xi to the decision boundary.


• How can we write γi in term of xi, yi, w, w0?

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 25


Distance to the decision boundary (II)
• The vector w is normal to the decision boundary. Thus, w/‖w‖ is the unit normal.
• The vector from B to A is γi w/‖w‖.
• B, the point on the decision boundary nearest xi, is xi − γi w/‖w‖.
• As B is on the decision boundary,

w · (xi − γi w/‖w‖) + w0 = 0

• Solving for γi yields, for a positive example:

γi = (w/‖w‖) · xi + w0/‖w‖

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 26


The margin

• The margin of the hyperplane is 2M, where M = min_i γi
• The most direct statement of the problem of finding a maximum margin separating hyperplane is thus

max_{w,w0} min_i γi ≡ max_{w,w0} min_i yi ((w/‖w‖) · xi + w0/‖w‖)

• This turns out to be inconvenient for optimization, however. . .

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 27


Treating the γi as constraints

• From the definition of margin, we have:

M ≤ γi = yi ((w/‖w‖) · xi + w0/‖w‖)   ∀i

• This suggests:
  maximize M
  with respect to w, w0
  subject to yi ((w/‖w‖) · xi + w0/‖w‖) ≥ M for all i
• Problems:
  – w appears nonlinearly in the constraints.
  – This problem is underconstrained. If (w, w0, M) is an optimal solution, then so is (βw, βw0, M) for any β > 0.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 28


Adding a constraint

• Let’s try adding the constraint that ‖w‖M = 1.
• This allows us to rewrite the objective function and constraints as:
  min ‖w‖
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
• This is really nice because the constraints are linear.
• The objective function ‖w‖ is still a bit awkward.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 29


Final formulation

• Let’s minimize ‖w‖² instead of ‖w‖.
  (Taking the square is a monotone transformation, as ‖w‖ is positive, so this doesn’t change the optimal solution.)
• This gets us to:
  min ‖w‖²
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
• This we can solve! How?
  – It is a quadratic programming (QP) problem—a standard type of optimization problem for which many efficient packages are available.
  – Better yet, it’s a convex (positive semidefinite) QP

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 30


Example
[Figure: maximum-margin solutions on two data sets in the (x1, x2) plane; left: w = [49.6504 46.8962], w0 = −48.6936; right: w = [11.7959 12.8066], w0 = −12.9174]

We have a solution, but no support vectors yet...

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 31


Lagrange multipliers for inequality constraints (revisited)

• Suppose we have the following optimization problem, called primal:

min_w f(w)
such that gi(w) ≤ 0, i = 1 . . . k

• We define the generalized Lagrangian:

L(w, α) = f(w) + Σ_{i=1}^k αi gi(w),   (1)

where αi, i = 1 . . . k are the Lagrange multipliers.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 32


A different optimization problem

• Consider P(w) = max_{α:αi≥0} L(w, α)
• Observe that the following is true (see extra notes):

P(w) = f(w) if all constraints are satisfied
P(w) = +∞ otherwise

• Hence, instead of computing min_w f(w) subject to the original constraints, we can compute:

p∗ = min_w P(w) = min_w max_{α:αi≥0} L(w, α)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 33


Dual optimization problem

• Let d∗ = maxα:αi≥0 minw L(w, α) (max and min are reversed)


• We can show that d∗ ≤ p∗.
– Let p∗ = L(wp, αp)
– Let d∗ = L(wd, αd)
– Then d∗ = L(wd, αd) ≤ L(wp, αd) ≤ L(wp, αp) = p∗.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 34


Dual optimization problem

• If f , gi are convex and the gi can all be satisfied simultaneously for some
w, then we have equality: d∗ = p∗ = L(w∗, α∗)
• Moreover w∗, α∗ solve the primal and dual if and only if they satisfy the following conditions (called Karush-Kuhn-Tucker):

∂/∂wi L(w∗, α∗) = 0, i = 1 . . . n   (2)
αi∗ gi(w∗) = 0, i = 1 . . . k   (3)
gi(w∗) ≤ 0, i = 1 . . . k   (4)
αi∗ ≥ 0, i = 1 . . . k   (5)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 35


Back to maximum margin perceptron

• We wanted to solve (rewritten slightly):


min (1/2)‖w‖²
w.r.t. w, w0
s.t. 1 − yi(w · xi + w0) ≤ 0
• The Lagrangian is:

L(w, w0, α) = (1/2)‖w‖² + Σ_i αi (1 − yi(w · xi + w0))

• The primal problem is: minw,w0 maxα:αi≥0 L(w, w0, α)


• We will solve the dual problem: maxα:αi≥0 minw,w0 L(w, w0, α)
• In this case, the optimal solutions coincide, because we have a quadratic
objective and linear constraints (both of which are convex).

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 36


Solving the dual
• From KKT (2), the derivatives of L(w, w0, α) wrt w, w0 should be 0
• The condition on the derivative wrt w0 gives Σ_i αi yi = 0
• The condition on the derivative wrt w gives:

w = Σ_i αi yi xi

⇒ Just like for the perceptron with zero initial weights, the optimal solution for w is a linear combination of the xi, and likewise for w0.
• The output is

hw,w0(x) = sign(Σ_{i=1}^m αi yi (xi · x) + w0)

⇒ Output depends on weighted dot product of input vector with training examples

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 37


Solving the dual (II)

• By plugging these back into the expression for L, we get:

max_α Σ_i αi − (1/2) Σ_{i,j} yi yj αi αj (xi · xj)

with constraints: αi ≥ 0 and Σ_i αi yi = 0
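A sketch (not from the lecture) of solving this dual with a generic constrained optimizer (SciPy's SLSQP) rather than a dedicated QP package; function and variable names are illustrative:

```python
# Sketch: solve the hard-margin SVM dual with a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    """X: m x n inputs, y: labels in {-1, +1}. Returns the optimal alphas."""
    m = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j (x_i . x_j)

    def neg_dual(alpha):                               # minimize the negative dual
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
    bounds = [(0.0, None)] * m                               # alpha_i >= 0
    res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x
```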

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 38


The support vectors

• Suppose we find optimal αs (e.g., using a standard QP package)
• The αi will be > 0 only for the points for which 1 − yi(w · xi + w0) = 0
• These are the points lying on the edge of the margin, and they are called support vectors, because they define the decision boundary
• The output of the classifier for query point x is computed as:

sgn(Σ_{i=1}^m αi yi (xi · x) + w0)

Hence, the output is determined by computing the dot product of the point with the support vectors!
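Continuing the previous sketch (not from the lecture): w0 can be recovered from any support vector, since yi(w · xi + w0) = 1 there, and the prediction then needs only dot products with the support vectors.

```python
# Sketch: classify a query point using only the support vectors.
import numpy as np

def predict(x, X, y, alpha, tol=1e-6):
    sv = alpha > tol                              # support vectors have alpha_i > 0
    k = X[sv] @ x                                 # dot products with support vectors
    # w0 from one support vector s: w0 = y_s - sum_i alpha_i y_i (x_i . x_s)
    s = np.flatnonzero(sv)[0]
    w0 = y[s] - np.sum(alpha[sv] * y[sv] * (X[sv] @ X[s]))
    return np.sign(np.sum(alpha[sv] * y[sv] * k) + w0)
```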

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 39


Example

Support vectors are in bold

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 40


Non-linearly separable data
[Figure: a two-dimensional data set in which the two classes are not linearly separable]

• A linear boundary might be too simple to capture the class structure.


• One way of getting a nonlinear decision boundary in the input space is to
find a linear decision boundary in an expanded space (e.g., for polynomial
regression.)
• Thus, xi is replaced by φ(xi), where φ is called a feature mapping

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 41


Margin optimization in feature space

• Replacing xi with φ(xi), the optimization problem to find w and w0 becomes:
  min ‖w‖²
  w.r.t. w, w0
  s.t. yi(w · φ(xi) + w0) ≥ 1
• Dual form:
  max Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m yi yj αi αj φ(xi) · φ(xj)
  w.r.t. αi
  s.t. 0 ≤ αi
       Σ_{i=1}^m αi yi = 0

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 42


Feature space solution

• The optimal weights, in the expanded feature space, are w = Σ_{i=1}^m αi yi φ(xi).
• Classification of an input x is given by:

hw,w0(x) = sign(Σ_{i=1}^m αi yi φ(xi) · φ(x) + w0)

⇒ Note that to solve the SVM optimization problem in dual form and to
make a prediction, we only ever need to compute dot-products of feature
vectors.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 43


Kernel functions

• Whenever a learning algorithm (such as SVMs) can be written in terms of dot-products, it can be generalized to kernels.
• A kernel is any function K : Rn × Rn → R which corresponds to a dot product for some feature mapping φ:

K(x1, x2) = φ(x1) · φ(x2) for some φ.

• Conversely, by choosing a feature mapping φ, we implicitly choose a kernel function
• Recall that φ(x1) · φ(x2) = cos ∠(x1, x2) where ∠ denotes the angle
between the vectors, so a kernel function can be thought of as a notion
of similarity.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 44


The “kernel trick”

• If we work with the dual, we do not actually have to ever compute the
feature mapping φ. We just have to compute the similarity K.
• That is, we can solve the dual for the αi:
  max Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m yi yj αi αj K(xi, xj)
  w.r.t. αi
  s.t. 0 ≤ αi
       Σ_{i=1}^m αi yi = 0
• The class of a new input x is computed as:

hw,w0(x) = sign((Σ_{i=1}^m αi yi φ(xi)) · φ(x) + w0) = sign(Σ_{i=1}^m αi yi K(xi, x) + w0)

• Often, K(·, ·) can be evaluated in O(n) time—a big savings!

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 45


Regularization with SVMs

• Kernels are a powerful tool for allowing non-linear, complex functions


• But now the number of parameters can be as high as the number of
instances!
• With a very specific, non-linear kernel, each data point may be in its own
partition
• With linear and logistic regression, we used regularization to avoid
overfitting
• We need a method for allowing regularization with SVMs as well.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 46


Soft margin classifiers

• Recall that in the linearly separable case, we compute the solution to the following optimization problem:
  min (1/2)‖w‖²
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
• If we want to allow misclassifications, we can relax the constraints to:

yi(w · xi + w0) ≥ 1 − ξi

• If ξi ∈ (0, 1), the data point is within the margin
• If ξi ≥ 1, then the data point is misclassified
• We define the soft error as Σ_i ξi
• We will have to change the criterion to reflect the soft errors

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 47


New problem formulation with soft errors

• Instead of:
  min (1/2)‖w‖²
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
  we want to solve:
  min (1/2)‖w‖² + C Σ_i ξi
  w.r.t. w, w0, ξi
  s.t. yi(w · xi + w0) ≥ 1 − ξi, ξi ≥ 0
• Note that soft errors include points that are misclassified, as well as
points within the margin
• There is a linear penalty for both categories
• The choice of the constant C controls overfitting

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 48


A built-in overfitting knob

min (1/2)‖w‖² + C Σ_i ξi
w.r.t. w, w0, ξi
s.t. yi(w · xi + w0) ≥ 1 − ξi
     ξi ≥ 0

• If C is 0, there is no penalty for soft errors, so the focus is on maximizing the margin, even if this means more mistakes
• If C is very large, the emphasis on the soft errors will cause the margin to decrease, if this helps to classify more examples correctly.
• Internal cross-validation is a good way to choose C appropriately

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 49


Lagrangian for the new problem

• Like before, we can write a Lagrangian for the problem and then use the dual formulation to find the optimal parameters:

L(w, w0, α, ξ, µ) = (1/2)‖w‖² + C Σ_i ξi + Σ_i αi (1 − ξi − yi(w · xi + w0)) − Σ_i µi ξi

• All the previously described machinery can be used to solve this problem
• Note that in addition to αi we have coefficients µi, which ensure that the errors are positive, but do not participate in the decision boundary

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 50


Soft margin optimization with kernels

• Replacing xi with φ(xi), the optimization problem to find w and w0 becomes:
  min ‖w‖² + C Σ_i ζi
  w.r.t. w, w0, ζi
  s.t. yi(w · φ(xi) + w0) ≥ 1 − ζi
       ζi ≥ 0
• Dual form and solution have similar forms to what we described last
time, but in terms of kernels

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 51


Getting SVMs to work in practice

• Two important choices:


– Kernel (and kernel parameters)
– Regularization parameter C
• The parameters may interact!
E.g. for Gaussian kernel, the larger the width of the kernel, the more
biased the classifier, so low C is better
• Together, these control overfitting: always do an internal parameter
search, using a validation set!
• Overfitting symptoms:
– Low margin
– Large fraction of instances are support vectors
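A sketch of the internal parameter search recommended above (not from the lecture; the grid values and toy data are illustrative), using scikit-learn's cross-validated grid search over C and the RBF kernel width:

```python
# Sketch: joint search over C and gamma with internal cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```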

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 52


Solving the quadratic optimization problem

• Many approaches exist


• Because we have constraints, gradient descent does not apply directly
(the optimum might be outside of the feasible region)
• Platt’s algorithm is the fastest current approach, based on coordinate
ascent

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 53


Coordinate ascent

• Suppose you want to find the maximum of some function F (α1, . . . αn)
• Coordinate ascent optimizes the function by repeatedly picking an αi
and optimizing it, while all other parameters are fixed
• There are different ways of looping through the parameters:
– Round-robin
– Repeatedly pick a parameter at random
– Choose next the variable expected to make the largest improvement
– ...
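A toy round-robin version (not from the lecture; the concave quadratic is made up for illustration):

```python
# Sketch: round-robin coordinate ascent on f(a1, a2) = -(a1^2 + a2^2 - a1*a2 - 2*a1 - 3*a2),
# setting one partial derivative to zero at a time while the other variable is fixed.
def coordinate_ascent(n_iters=20, a1=2.0, a2=-2.0):
    for _ in range(n_iters):
        a1 = (a2 + 2.0) / 2.0   # argmax over a1 with a2 fixed: -2*a1 + a2 + 2 = 0
        a2 = (a1 + 3.0) / 2.0   # argmax over a2 with a1 fixed: -2*a2 + a1 + 3 = 0
    return a1, a2

print(coordinate_ascent())      # converges to the global maximum (7/3, 8/3)
```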

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 54


Example

[Figure: contours of a quadratic function, with the path taken by coordinate ascent starting from (2, −2)]

The ellipses in the figure are the contours of a quadratic function that we want to optimize. Coordinate ascent was initialized at (2, −2), and also plotted in the figure is the path that it took on its way to the global maximum. Notice that on each step, coordinate ascent takes a step that’s parallel to one of the axes, since only one variable is being optimized at a time.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 55
Our optimization problem (dual form)

max_α Σ_i αi − (1/2) Σ_{i,j} yi yj αi αj (φ(xi) · φ(xj))

with constraints: 0 ≤ αi ≤ C and Σ_i αi yi = 0

• Suppose we want to optimize for α1 while α2, . . . αn are fixed
• We cannot do it because α1 will be completely determined by the last constraint: α1 = −y1 Σ_{i=2}^m αi yi
• Instead, we have to optimize pairs of parameters αi, αj together

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 56


SMO

• Suppose that we want to optimize α1 and α2 together, while all other parameters are fixed.
• We know that y1α1 + y2α2 = −Σ_{i=3}^m yiαi = ζ, where ζ is a constant
• So α1 = y1(ζ − y2α2) (because y1 is either +1 or −1, so y1² = 1)
• This defines a line, and any pair α1, α2 which is a solution has to be on
the line

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 57


SMO (II)

• We also know that 0 ≤ α1 ≤ C and 0 ≤ α2 ≤ C, so the solution has to be on the line segment inside the rectangle below

[Figure: the box [0, C] × [0, C] in the (α1, α2) plane and the line α1 y(1) + α2 y(2) = ζ, with the resulting bounds L and H on α2]

From the constraints, we know that α1 and α2 must lie within the box [0, C] × [0, C] shown. Also plotted is the line α1 y(1) + α2 y(2) = ζ, on which we know α1 and α2 must lie. Note also that, from these constraints, we know L ≤ α2 ≤ H; otherwise, (α1, α2) can’t simultaneously satisfy both the box and the straight line constraint. In this example, L = 0. But depending on what the line α1 y(1) + α2 y(2) = ζ looks like, this won’t always necessarily be the case; more generally, there will be some lower bound L and some upper bound H on the permissible values of α2 that ensure that α1, α2 lie within the box.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 58
SMO(III)

• By plugging α1 back into the optimization criterion, we obtain a quadratic function of α2, whose optimum we can find exactly
• If the optimum is inside the rectangle, we take it.
• If not, we pick the closest intersection point of the line and the rectangle
• This procedure is very fast because all these are simple computations.
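A sketch of this clipping step (not from the lecture), following the standard SMO derivation; here alpha2_unc is the unconstrained optimum of the one-variable quadratic in α2:

```python
# Sketch: compute the feasible segment [L, H] for alpha2 from the box and line
# constraints, clip the unconstrained optimum onto it, and update alpha1 to stay
# on the constraint line.
def clip_alpha2(alpha2_unc, alpha1, alpha2, y1, y2, C):
    if y1 != y2:
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:
        L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)
    alpha2_new = min(max(alpha2_unc, L), H)                   # project onto [L, H]
    alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)     # keep y1*a1 + y2*a2 constant
    return alpha1_new, alpha2_new
```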

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 59


Interpretability

• SVMs are not very intuitive, but typically are more interpretable than
neural nets, if you look at the machine and the misclassifications
• E.g. Ovarian cancer data (Haussler) - 31 tissue samples of 3 classes,
misclassified examples wrongly labelled
• But no biological plausibility!
• Hard to interpret if the percentage of instances that are recruited as
support vectors is high

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 60


Complexity

• Quadratic programming is expensive in the number of training examples


• Platt’s SMO algorithm is quite fast though, and other fancy optimization
approaches are available
• Best packages can handle 50,000+ instances, but not more than 100,000
• On the other hand, the number of attributes can be very high (a strength compared to neural nets)
• Evaluating an SVM is slow if there are a lot of support vectors.
• Dictionary methods attempt to select a subset of the data on which to
train.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 61


Passive supervised learning

• The environment provides labelled data in the form of pairs (x, y)


• We can process the examples either as a batch or one at a time, with
the goal of producing a predictor of y as a function of x
• We assume that there is an underlying distribution P generating the
examples
• Each example is drawn i.i.d. from P
• What if instead we are allowed to ask for particular examples?
• Intuitively, if we are allowed to ask questions, and if we are smart about
what we want to know, fewer examples may be necessary

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 62


Semi-Supervised and Active Learning

[Figure: the same unlabeled points under supervised learning versus semi-supervised and active learning; labeling data such as speech samples, images and video can be expensive]

• Suppose you had access to a lot of unlabeled data
  E.g. all the documents on the web
  E.g. all the pictures on Instagram
• You can also get some labelled data, but not much
• How can we take advantage of the unlabeled data to improve supervised learning performance?

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 63
