
Lecture 5: More on Kernels. SVM regression and classification

• How to tell if a function is a kernel
• SVM regression
• SVM classification

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 1


The “kernel trick”

• Recall: kernel functions are ways of expressing dot-products in some feature space:

K(x, z) = φ(x) · φ(z)

• In many cases, K can be computed in time that depends on the size of the inputs x, not the size of the feature space φ(x)
• If we work with a “dual” representation of the learning algorithm, we do not actually have to compute the feature mapping φ. We just have to compute the similarity K.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 2


Polynomial kernels

• More generally, K(x, z) = (x · z)^d is a kernel, for any positive integer d:

K(x, z) = (Σ_{i=1}^n xi zi)^d

• If we expanded the sum above in the obvious way, we get n^d terms (i.e. the explicit feature expansion)
• Terms are monomials (products of xi) with total power equal to d.
• Curse of dimensionality: it is very expensive both to optimize and to predict in primal form
• However, evaluating the dot-product of any two feature vectors can be done using K in O(n)!
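To make the O(n) versus n^d contrast concrete, here is a minimal NumPy sketch (not from the lecture) checking on a tiny example that (x · z)^d equals the dot product of the explicit degree-d monomial features:

```python
# Sketch: the polynomial kernel (x . z)^d equals the dot product of explicit
# degree-d monomial features, but costs only O(n) to evaluate.
import itertools
import numpy as np

def poly_kernel(x, z, d):
    """K(x, z) = (x . z)^d, computed in O(n) time."""
    return np.dot(x, z) ** d

def explicit_features(x, d):
    """All n^d ordered degree-d monomials x_{i1} * ... * x_{id} (exponential blow-up)."""
    n = len(x)
    return np.array([np.prod([x[i] for i in idx])
                     for idx in itertools.product(range(n), repeat=d)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
d = 3
assert np.isclose(poly_kernel(x, z, d),
                  np.dot(explicit_features(x, d), explicit_features(z, d)))
```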

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 3


Some other (fairly generic) kernel functions

• K(x, z) = (1 + x · z)^d – feature expansion has all monomial terms of ≤ d total power.
• Radial basis/Gaussian kernel:

K(x, z) = exp(−‖x − z‖²/(2σ²))

The kernel has an infinite-dimensional feature expansion, but dot-products can still be computed in O(n)!
• Sigmoidal kernel:

K(x, z) = tanh(c1 x · z + c2)
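As a small illustration (not from the lecture), both kernels can be evaluated directly on the raw n-dimensional inputs:

```python
# Sketch: generic kernels from this slide, evaluated in O(n) time even though
# the Gaussian kernel's feature expansion is infinite-dimensional.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

def sigmoidal_kernel(x, z, c1=1.0, c2=0.0):
    # Note: tanh(c1 x.z + c2) is only a valid (PSD) kernel for some choices of c1, c2.
    return np.tanh(c1 * np.dot(x, z) + c2)
```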

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 4


Recall: Dual-view regression

• By re-writing the parameter vector as a linear combination of instances and solving, we get:

a = (K + λIm)^{−1} y

• The feature mapping is not needed either to learn or to make predictions!
• This approach is useful if the feature space is very large

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 5


Making predictions in the dual view

• For a new input x, the prediction is:

h(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λIm)^{−1} y

where k(x) is an m-dimensional vector, with the ith element equal to K(x, xi)

• That is, the ith element has the similarity of the input to the ith instance
• The features are not needed for this step either!
• This is a non-parametric representation - its size scales with the number
of instances.
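A minimal NumPy sketch of the two formulas above (not from the lecture; the kernel argument is any function such as the ones defined earlier):

```python
# Sketch: dual-view (kernel) ridge regression.
# Fit: a = (K + lambda*I_m)^{-1} y; predict: h(x) = k(x)^T a.
import numpy as np

def fit_dual(X, y, kernel, lam):
    m = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # m x m Gram matrix
    a = np.linalg.solve(K + lam * np.eye(m), y)                # dual coefficients
    return a

def predict_dual(x, X, a, kernel):
    k = np.array([kernel(x, xi) for xi in X])   # similarities to the m training instances
    return k @ a

# Usage sketch (X_train, y_train, x_new and gaussian_kernel are assumed to be in scope):
# a = fit_dual(X_train, y_train, gaussian_kernel, lam=0.1)
# y_hat = predict_dual(x_new, X_train, a, gaussian_kernel)
```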

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 6


Regularization in the dual view

• We want to penalize the function we are trying to estimate (to keep it simple)
• Assume this is part of a reproducing kernel Hilbert space (Doina will post extra notes for those interested in this)
• We want to minimize:

J(h) = (1/2) Σ_{i=1}^n (yi − h(xi))² + (λ/2) ‖h‖²_H

• If we put a Gaussian prior on h, and solve, we obtain Gaussian process regression

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 7


Logistic regression

• The output of a logistic regression predictor is:

hw(x) = 1 / (1 + e^{w^T φ(x) + w0})

• Again, we can define the weights in terms of support vectors: w = Σ_{i=1}^m αi φ(xi)
• The prediction can now be computed as:

h(x) = 1 / (1 + e^{Σ_{i=1}^m αi K(xi, x) + w0})

• αi are the new parameters (one per instance) and can be derived using gradient descent
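A gradient-descent sketch for the αi (not from the lecture): it uses the standard convention p(y = 1 | x) = 1/(1 + e^{−f(x)}) with f(x) = Σi αi K(xi, x) + w0 and labels in {0, 1}, and all names are illustrative:

```python
# Sketch: kernelized logistic regression trained by full-batch gradient descent
# on the alphas, using the standard sigmoid convention and labels in {0, 1}.
import numpy as np

def fit_kernel_logistic(K, y, lr=0.1, n_iters=500):
    """K: m x m Gram matrix on the training set; y: labels in {0, 1}."""
    m = K.shape[0]
    alpha = np.zeros(m)
    w0 = 0.0
    for _ in range(n_iters):
        f = K @ alpha + w0
        p = 1.0 / (1.0 + np.exp(-f))       # predicted probabilities
        grad_alpha = K.T @ (p - y) / m     # gradient of the average log-loss wrt alpha
        grad_w0 = np.mean(p - y)
        alpha -= lr * grad_alpha
        w0 -= lr * grad_w0
    return alpha, w0
```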

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 8


Kernels

• A lot of current research has to do with defining new kernel functions, suitable to particular tasks / kinds of input objects
• Many kernels are available:
– Information diffusion kernels (Lafferty and Lebanon, 2002)
– Diffusion kernels on graphs (Kondor and Jebara 2003)
– String kernels for text classification (Lodhi et al, 2002)
– String kernels for protein classification (e.g., Leslie et al, 2002)
... and others!

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 9


Example: String kernels

• Very important for DNA matching, text classification, ...


• Example: in DNA matching, we use a sliding window of length k over
the two strings that we want to compare
• The window is of a given size, and inside we can do various things:
– Count exact matches
– Weigh mismatches based on how bad they are
– Count certain markers, e.g. AGT
• The kernel is the sum of these similarities over the two sequences
• How do we prove this is a kernel?
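As a concrete sketch (not from the lecture), a simple “spectrum” variant of such a kernel slides a window of length k over each string, counts the substrings that occur, and takes the dot product of the two count vectors; because it is an explicit inner product of count features, it is a valid kernel by construction:

```python
# Sketch: a k-mer "spectrum" string kernel (counts of length-k substrings).
from collections import Counter

def kmer_counts(s, k):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[w] * ct[w] for w in cs)   # dot product of sparse count vectors

print(spectrum_kernel("AGTCAGT", "CAGTAGT", k=3))   # shared 3-mers: AGT (x2 each), CAG
```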

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 10


Establishing “kernelhood”

• Suppose someone hands you a function K. How do you know that it is a kernel?
• More precisely, given a function K : Rn × Rn → R, under what conditions can K(x, z) be written as a dot product φ(x) · φ(z) for some feature mapping φ?
• We want a general recipe, which does not require explicitly defining φ
every time

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 11


Kernel matrix

• Suppose we have an arbitrary set of input vectors x1, x2, . . . xm


• The kernel matrix (or Gram matrix) K corresponding to kernel function
K is an m×m matrix such that Kij = K(xi, xj ) (notation is overloaded
on purpose).
• What properties does the kernel matrix K have?
• Claims:
1. K is symmetric
2. K is positive semidefinite
• Note that these claims are consistent with the intuition that K is a
“similarity” measure (and will be true regardless of the data)
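A quick empirical sanity check of these two claims (not from the lecture): build the Gram matrix for a sample and test symmetry and positive semidefiniteness via its eigenvalues. A passing check on one sample does not prove K is a kernel; Mercer's theorem, below, requires the properties to hold for every data set.

```python
# Sketch: build a Gram matrix and verify it is symmetric and PSD on one sample.
import numpy as np

def gram_matrix(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = gram_matrix(X, lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2 / 2.0))

assert np.allclose(K, K.T)          # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10      # positive semidefinite up to numerical error
```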

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 12


Proving the first claim
If K is a valid kernel, then the kernel matrix is symmetric

Kij = φ(xi) · φ(xj ) = φ(xj ) · φ(xi) = Kji

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 13


Proving the second claim
If K is a valid kernel, then the kernel matrix is positive semidefinite

Proof: Consider an arbitrary vector z


z^T K z = Σ_i Σ_j zi Kij zj = Σ_i Σ_j zi (φ(xi) · φ(xj)) zj
        = Σ_i Σ_j zi (Σ_k φk(xi) φk(xj)) zj
        = Σ_k Σ_i Σ_j zi φk(xi) φk(xj) zj
        = Σ_k (Σ_i zi φk(xi))² ≥ 0

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 14


Mercer’s theorem
• We have shown that if K is a kernel function, then for any data set,
the corresponding kernel matrix K defined such that Kij = K(xi, xj) is
symmetric and positive semidefinite
• Mercer’s theorem states that the reverse is also true:
Given a function K : Rn × Rn → R, K is a kernel if and only if, for
any data set, the corresponding kernel matrix is symmetric and positive
semidefinite
• The reverse direction of the proof is much harder (see e.g. Vapnik’s
book for details)
• This result gives us a way to check if a given function is a kernel, by
checking these two properties of its kernel matrix.
• Kernels can also be obtained by combining other kernels, or by learning
from data
• Kernel learning may suffer from overfitting (kernel matrix close to
diagonal)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 15


More on RKHS
• Mercer’s theorem tells us that a function K : X × X → R is a kernel
(i.e. it corresponds to the inner product in some feature space) if and
only if it is positive semi-definite.
• The feature space is the reproducing kernel Hilbert space (RKHS)

H = { Σ_j αj K(zj, ·) : zj ∈ X, αj ∈ R }

with inner product ⟨K(z, ·), K(z′, ·)⟩_H = K(z, z′).

• The term reproducing comes from the reproducing property of the kernel function:

∀f ∈ H, x ∈ X : f(x) = ⟨f(·), K(x, ·)⟩_H

• Recall that the solution of the regularized least squares problem in the feature space associated to a kernel function K has the form h(x) = Σ_{i=1}^m αi K(xi, x).
This is a particular case of the representer theorem...

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 16


Representer Theorem
Theorem 1. Let K : X × X → R be a positive definite kernel and let H be the corresponding RKHS.
Then for any training sample S = {(xi, yi)}_{i=1}^m ⊂ X × R, any loss function ℓ : (X × R × R)^m → R and any real-valued non-decreasing function g, the solution of the optimization problem

arg min_{f ∈ H} ℓ((x1, y1, f(x1)), · · · , (xm, ym, f(xm))) + g(‖f‖_H)

admits a representation of the form

f*(·) = Σ_{i=1}^m αi K(xi, ·).

[Schölkopf, Herbrich and Smola. A generalized representer theorem. COLT 2001.]

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 17


Support Vector Regression

• In regression problems, so far we have been trying to minimize mean-squared error:

Σ_i (yi − (w · xi + w0))²

• In SVM regression, we will be interested instead in minimizing absolute error:

Σ_i |yi − (w · xi + w0)|

• This is more robust to outliers than the squared loss
• But we cannot require that all points be approximated correctly (overfitting!)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 18


Loss function for support vector regression
In order to allow for errors in SVM regression (and have robustness to noise), we use the ε-insensitive loss:

J = Σ_{i=1}^m J(xi), where

J(xi) = 0 if |yi − (w · xi + w0)| ≤ ε
J(xi) = |yi − (w · xi + w0)| − ε otherwise

The cost is zero inside the epsilon “tube” (unlike the square loss, which penalizes every deviation).

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 19
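A one-line NumPy version of this loss (not from the lecture):

```python
# Sketch: the epsilon-insensitive loss: zero inside the tube, linear outside it.
import numpy as np

def eps_insensitive_loss(y, y_pred, eps=0.1):
    return np.maximum(0.0, np.abs(y - y_pred) - eps)
```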
Solving SVM regression

• We introduce slack variables, ξi+, ξi− to account for errors outside the
tolerance area
• We need two kinds of variables to account for both positive and negative
errors

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 20


The optimization problem
min (1/2)‖w‖² + C Σ_i (ξi+ + ξi−)
w.r.t. w, w0, ξi+, ξi−
s.t. yi − (w · xi + w0) ≤ ε + ξi+
     yi − (w · xi + w0) ≥ −ε − ξi−
     ξi+, ξi− ≥ 0

• Like before, we can write the Lagrangian and solve the dual form of the
problem
• Kernels can be used as before to get non-linear functions
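In practice this optimization is handled by a library; a sketch (not from the lecture; the toy data and parameter values are illustrative) using scikit-learn's SVR, which exposes C, ε and the kernel directly:

```python
# Sketch: kernelized support vector regression via scikit-learn.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=50)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=2.0)
model.fit(X, y)
y_hat = model.predict(X)
print("number of support vectors:", len(model.support_vectors_))
```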

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 21


Effect of ε

• Parameters are: C, ε, σ (and w is an N-vector)
• [Figure: SVM regression fits of the same sample points for ε = 0.8, ε = 0.5 and ε = 0.01, showing the ideal fit, the validation-set fit and the support vectors]
• The validation set fit is a search over both C and σ
• As ε increases, the function is allowed to move away from the data points: the fit becomes looser, fewer data points are support vectors, and the fit gets worse

Zisserman course notes

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 22
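A small sweep (not from the lecture; toy data and parameter values are illustrative) that reproduces the trend: more support vectors for smaller ε.

```python
# Sketch: counting support vectors as epsilon grows.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=50)

for eps in [0.01, 0.1, 0.5, 0.8]:
    m = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(m.support_vectors_)} support vectors")
```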
Binary classification revisited

• Consider a linearly separable binary classification data set {(xi, yi)}_{i=1}^m.
• There is an infinite number of hyperplanes that separate the classes:

[Figure: positively (+) and negatively (−) labelled points, with many possible separating hyperplanes]

• Which plane is best?


• Relatedly, for a given plane, for which points should we be most confident
in the classification?

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 23


The margin, and linear SVMs

• For a given separating hyperplane, the margin is two times the (Euclidean) distance from the hyperplane to the nearest training example.

[Figure: positively (+) and negatively (−) labelled points and a separating hyperplane, with the margin “strip” around it]

• It is the width of the “strip” around the decision boundary containing no training examples.
• A linear SVM is a perceptron for which we choose w, w0 so that the margin is maximized

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 24


Distance to the decision boundary

• Suppose we have a decision boundary that separates the data.

[Figure: a separating hyperplane with normal vector w, a point A = xi, and its distance γi to the hyperplane]

• Let γi be the distance from instance xi to the decision boundary.


• How can we write γi in term of xi, yi, w, w0?

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 25


Distance to the decision boundary (II)
• The vector w is normal to the decision boundary. Thus, w/‖w‖ is the unit normal.
• The vector from B to A is γi w/‖w‖.
• B, the point on the decision boundary nearest xi, is xi − γi w/‖w‖.
• As B is on the decision boundary,

w · (xi − γi w/‖w‖) + w0 = 0

• Solving for γi yields, for a positive example:

γi = (w/‖w‖) · xi + w0/‖w‖

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 26


The margin

• The margin of the hyperplane is 2M, where M = min_i γi
• The most direct statement of the problem of finding a maximum margin separating hyperplane is thus

max_{w,w0} min_i γi ≡ max_{w,w0} min_i yi ((w/‖w‖) · xi + w0/‖w‖)

• This turns out to be inconvenient for optimization, however. . .

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 27


Treating the γi as constraints

• From the definition of margin, we have:

M ≤ γi = yi ((w/‖w‖) · xi + w0/‖w‖)   ∀i

• This suggests:
  maximize M
  with respect to w, w0
  subject to yi ((w/‖w‖) · xi + w0/‖w‖) ≥ M for all i
• Problems:
  – w appears nonlinearly in the constraints.
  – This problem is underconstrained. If (w, w0, M) is an optimal solution, then so is (βw, βw0, M) for any β > 0.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 28


Adding a constraint

• Let’s try adding the constraint that ‖w‖M = 1.
• This allows us to rewrite the objective function and constraints as:
  min ‖w‖
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
• This is really nice because the constraints are linear.
• The objective function ‖w‖ is still a bit awkward.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 29


Final formulation

• Let’s minimize ‖w‖² instead of ‖w‖.
  (Taking the square is a monotone transformation, as ‖w‖ is positive, so this doesn’t change the optimal solution.)
• This gets us to:
  min ‖w‖²
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
• This we can solve! How?
  – It is a quadratic programming (QP) problem—a standard type of optimization problem for which many efficient packages are available.
  – Better yet, it’s a convex (positive semidefinite) QP

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 30


Example
[Figure: maximum-margin solutions on two data sets in the (x1, x2) plane; left: w = [49.6504 46.8962], w0 = −48.6936; right: w = [11.7959 12.8066], w0 = −12.9174]

We have a solution, but no support vectors yet...

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 31


Lagrange multipliers for inequality constraints (revisited)

• Suppose we have the following optimization problem, called primal:

min_w f(w)
such that gi(w) ≤ 0, i = 1 . . . k

• We define the generalized Lagrangian:

L(w, α) = f(w) + Σ_{i=1}^k αi gi(w),   (1)

where αi, i = 1 . . . k are the Lagrange multipliers.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 32


A different optimization problem

• Consider P(w) = max_{α:αi≥0} L(w, α)
• Observe that the following is true (see extra notes):

P(w) = f(w) if all constraints are satisfied
P(w) = +∞ otherwise

• Hence, instead of computing min_w f(w) subject to the original constraints, we can compute:

p∗ = min_w P(w) = min_w max_{α:αi≥0} L(w, α)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 33


Dual optimization problem

• Let d∗ = maxα:αi≥0 minw L(w, α) (max and min are reversed)


• We can show that d∗ ≤ p∗.
– Let p∗ = L(wp, αp)
– Let d∗ = L(wd, αd)
– Then d∗ = L(wd, αd) ≤ L(wp, αd) ≤ L(wp, αp) = p∗.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 34


Dual optimization problem

• If f , gi are convex and the gi can all be satisfied simultaneously for some
w, then we have equality: d∗ = p∗ = L(w∗, α∗)
• Moreover w∗, α∗ solve the primal and dual if and only if they satisfy the following conditions (called Karush-Kuhn-Tucker):

∂/∂wi L(w∗, α∗) = 0, i = 1 . . . n   (2)
αi∗ gi(w∗) = 0, i = 1 . . . k   (3)
gi(w∗) ≤ 0, i = 1 . . . k   (4)
αi∗ ≥ 0, i = 1 . . . k   (5)

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 35


Back to maximum margin perceptron

• We wanted to solve (rewritten slightly):


min (1/2)‖w‖²
w.r.t. w, w0
s.t. 1 − yi(w · xi + w0) ≤ 0
• The Lagrangian is:

L(w, w0, α) = (1/2)‖w‖² + Σ_i αi (1 − yi(w · xi + w0))

• The primal problem is: minw,w0 maxα:αi≥0 L(w, w0, α)


• We will solve the dual problem: maxα:αi≥0 minw,w0 L(w, w0, α)
• In this case, the optimal solutions coincide, because we have a quadratic
objective and linear constraints (both of which are convex).

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 36


Solving the dual
• From KKT (2), the derivatives of L(w, w0, α) wrt w, w0 should be 0
• The condition on the derivative wrt w0 gives Σ_i αi yi = 0
• The condition on the derivative wrt w gives:

w = Σ_i αi yi xi

⇒ Just like for the perceptron with zero initial weights, the optimal solution for w is a linear combination of the xi, and likewise for w0.
• The output is

hw,w0(x) = sign(Σ_{i=1}^m αi yi (xi · x) + w0)

⇒ Output depends on weighted dot product of input vector with training examples

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 37


Solving the dual (II)

• By plugging these back into the expression for L, we get:

max_α Σ_i αi − (1/2) Σ_{i,j} yi yj αi αj (xi · xj)

with constraints: αi ≥ 0 and Σ_i αi yi = 0
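A sketch (not from the lecture) of solving this dual with a generic constrained optimizer (SciPy's SLSQP) rather than a dedicated QP package; function and variable names are illustrative:

```python
# Sketch: solve the hard-margin SVM dual with a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    """X: m x n inputs, y: labels in {-1, +1}. Returns the optimal alphas."""
    m = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j (x_i . x_j)

    def neg_dual(alpha):                               # minimize the negative dual
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
    bounds = [(0.0, None)] * m                               # alpha_i >= 0
    res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x
```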

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 38


The support vectors

• Suppose we find optimal αs (e.g., using a standard QP package)
• The αi will be > 0 only for the points for which 1 − yi(w · xi + w0) = 0
• These are the points lying on the edge of the margin, and they are called support vectors, because they define the decision boundary
• The output of the classifier for query point x is computed as:

sgn(Σ_{i=1}^m αi yi (xi · x) + w0)

Hence, the output is determined by computing the dot product of the point with the support vectors!
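Continuing the previous sketch (not from the lecture): w0 can be recovered from any support vector, since yi(w · xi + w0) = 1 there, and the prediction then needs only dot products with the support vectors.

```python
# Sketch: classify a query point using only the support vectors.
import numpy as np

def predict(x, X, y, alpha, tol=1e-6):
    sv = alpha > tol                              # support vectors have alpha_i > 0
    k = X[sv] @ x                                 # dot products with support vectors
    # w0 from one support vector s: w0 = y_s - sum_i alpha_i y_i (x_i . x_s)
    s = np.flatnonzero(sv)[0]
    w0 = y[s] - np.sum(alpha[sv] * y[sv] * (X[sv] @ X[s]))
    return np.sign(np.sum(alpha[sv] * y[sv] * k) + w0)
```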

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 39


Example

Support vectors are in bold

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 40


Non-linearly separable data
[Figure: a two-dimensional data set in which the two classes are not linearly separable]

• A linear boundary might be too simple to capture the class structure.


• One way of getting a nonlinear decision boundary in the input space is to
find a linear decision boundary in an expanded space (e.g., for polynomial
regression.)
• Thus, xi is replaced by φ(xi), where φ is called a feature mapping

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 41


Margin optimization in feature space

• Replacing xi with φ(xi), the optimization problem to find w and w0 becomes:
  min ‖w‖²
  w.r.t. w, w0
  s.t. yi(w · φ(xi) + w0) ≥ 1
• Dual form:
  max Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m yi yj αi αj φ(xi) · φ(xj)
  w.r.t. αi
  s.t. 0 ≤ αi
       Σ_{i=1}^m αi yi = 0

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 42


Feature space solution

• The optimal weights, in the expanded feature space, are w = Σ_{i=1}^m αi yi φ(xi).
• Classification of an input x is given by:

hw,w0(x) = sign(Σ_{i=1}^m αi yi φ(xi) · φ(x) + w0)

⇒ Note that to solve the SVM optimization problem in dual form and to
make a prediction, we only ever need to compute dot-products of feature
vectors.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 43


Kernel functions

• Whenever a learning algorithm (such as SVMs) can be written in terms of dot-products, it can be generalized to kernels.
• A kernel is any function K : Rn × Rn → R which corresponds to a dot product for some feature mapping φ:

K(x1, x2) = φ(x1) · φ(x2) for some φ.

• Conversely, by choosing a feature mapping φ, we implicitly choose a kernel function
• Recall that φ(x1) · φ(x2) = cos ∠(x1, x2) where ∠ denotes the angle
between the vectors, so a kernel function can be thought of as a notion
of similarity.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 44


The “kernel trick”

• If we work with the dual, we do not actually have to ever compute the
feature mapping φ. We just have to compute the similarity K.
• That is, we can solve the dual for the αi:
  max Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m yi yj αi αj K(xi, xj)
  w.r.t. αi
  s.t. 0 ≤ αi
       Σ_{i=1}^m αi yi = 0
• The class of a new input x is computed as:

hw,w0(x) = sign((Σ_{i=1}^m αi yi φ(xi)) · φ(x) + w0) = sign(Σ_{i=1}^m αi yi K(xi, x) + w0)

• Often, K(·, ·) can be evaluated in O(n) time—a big savings!

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 45


Regularization with SVMs

• Kernels are a powerful tool for allowing non-linear, complex functions


• But now the number of parameters can be as high as the number of
instances!
• With a very specific, non-linear kernel, each data point may be in its own
partition
• With linear and logistic regression, we used regularization to avoid
overfitting
• We need a method for allowing regularization with SVMs as well.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 46


Soft margin classifiers

• Recall that in the linearly separable case, we compute the solution to the following optimization problem:
  min (1/2)‖w‖²
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
• If we want to allow misclassifications, we can relax the constraints to:

yi(w · xi + w0) ≥ 1 − ξi

• If ξi ∈ (0, 1), the data point is within the margin
• If ξi ≥ 1, then the data point is misclassified
• We define the soft error as Σ_i ξi
• We will have to change the criterion to reflect the soft errors

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 47


New problem formulation with soft errors

• Instead of:
  min (1/2)‖w‖²
  w.r.t. w, w0
  s.t. yi(w · xi + w0) ≥ 1
  we want to solve:
  min (1/2)‖w‖² + C Σ_i ξi
  w.r.t. w, w0, ξi
  s.t. yi(w · xi + w0) ≥ 1 − ξi, ξi ≥ 0
• Note that soft errors include points that are misclassified, as well as
points within the margin
• There is a linear penalty for both categories
• The choice of the constant C controls overfitting

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 48


A built-in overfitting knob

min (1/2)‖w‖² + C Σ_i ξi
w.r.t. w, w0, ξi
s.t. yi(w · xi + w0) ≥ 1 − ξi
     ξi ≥ 0

• If C is 0, there is no penalty for soft errors, so the focus is on maximizing the margin, even if this means more mistakes
• If C is very large, the emphasis on the soft errors will cause the margin to decrease, if this helps to classify more examples correctly.
• Internal cross-validation is a good way to choose C appropriately

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 49


Lagrangian for the new problem

• Like before, we can write a Lagrangian for the problem and then use the dual formulation to find the optimal parameters:

L(w, w0, α, ξ, µ) = (1/2)‖w‖² + C Σ_i ξi + Σ_i αi (1 − ξi − yi(w · xi + w0)) − Σ_i µi ξi

• All the previously described machinery can be used to solve this problem
• Note that in addition to αi we have coefficients µi, which ensure that the errors are positive, but do not participate in the decision boundary

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 50


Soft margin optimization with kernels

• Replacing xi with φ(xi), the optimization problem to find w and w0 becomes:
  min ‖w‖² + C Σ_i ζi
  w.r.t. w, w0, ζi
  s.t. yi(w · φ(xi) + w0) ≥ 1 − ζi
       ζi ≥ 0
• Dual form and solution have similar forms to what we described last
time, but in terms of kernels

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 51


Getting SVMs to work in practice

• Two important choices:


– Kernel (and kernel parameters)
– Regularization parameter C
• The parameters may interact!
E.g. for Gaussian kernel, the larger the width of the kernel, the more
biased the classifier, so low C is better
• Together, these control overfitting: always do an internal parameter
search, using a validation set!
• Overfitting symptoms:
– Low margin
– Large fraction of instances are support vectors
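A sketch of the internal parameter search recommended above (not from the lecture; the grid values and toy data are illustrative), using scikit-learn's cross-validated grid search over C and the RBF kernel width:

```python
# Sketch: joint search over C and gamma with internal cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```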

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 52


Solving the quadratic optimization problem

• Many approaches exist


• Because we have constraints, gradient descent does not apply directly
(the optimum might be outside of the feasible region)
• Platt’s algorithm is the fastest current approach, based on coordinate
ascent

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 53


Coordinate ascent

• Suppose you want to find the maximum of some function F (α1, . . . αn)
• Coordinate ascent optimizes the function by repeatedly picking an αi
and optimizing it, while all other parameters are fixed
• There are different ways of looping through the parameters:
– Round-robin
– Repeatedly pick a parameter at random
– Choose next the variable expected to make the largest improvement
– ...
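A toy round-robin version (not from the lecture; the concave quadratic is made up for illustration):

```python
# Sketch: round-robin coordinate ascent on f(a1, a2) = -(a1^2 + a2^2 - a1*a2 - 2*a1 - 3*a2),
# setting one partial derivative to zero at a time while the other variable is fixed.
def coordinate_ascent(n_iters=20, a1=2.0, a2=-2.0):
    for _ in range(n_iters):
        a1 = (a2 + 2.0) / 2.0   # argmax over a1 with a2 fixed: -2*a1 + a2 + 2 = 0
        a2 = (a1 + 3.0) / 2.0   # argmax over a2 with a1 fixed: -2*a2 + a1 + 3 = 0
    return a1, a2

print(coordinate_ascent())      # converges to the global maximum (7/3, 8/3)
```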

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 54


Example

[Figure: contours of a quadratic function, with the path taken by coordinate ascent starting from (2, −2)]

The ellipses in the figure are the contours of a quadratic function that we want to optimize. Coordinate ascent was initialized at (2, −2), and also plotted in the figure is the path that it took on its way to the global maximum. Notice that on each step, coordinate ascent takes a step that’s parallel to one of the axes, since only one variable is being optimized at a time.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 55
Our optimization problem (dual form)

max_α Σ_i αi − (1/2) Σ_{i,j} yi yj αi αj (φ(xi) · φ(xj))

with constraints: 0 ≤ αi ≤ C and Σ_i αi yi = 0

• Suppose we want to optimize for α1 while α2, . . . αn are fixed
• We cannot do it because α1 will be completely determined by the last constraint: α1 = −y1 Σ_{i=2}^m αi yi
• Instead, we have to optimize pairs of parameters αi, αj together

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 56


SMO

• Suppose that we want to optimize α1 and α2 together, while all other parameters are fixed.
• We know that y1α1 + y2α2 = −Σ_{i=3}^m yiαi = ζ, where ζ is a constant
• So α1 = y1(ζ − y2α2) (because y1 is either +1 or −1, so y1² = 1)
• This defines a line, and any pair α1, α2 which is a solution has to be on
the line

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 57


SMO (II)

• We also know that 0 ≤ α1 ≤ C and 0 ≤ α2 ≤ C, so the solution has to be on the line segment inside the rectangle below

[Figure: the box [0, C] × [0, C] in the (α1, α2) plane and the line α1 y(1) + α2 y(2) = ζ, with the resulting bounds L and H on α2]

From the constraints, we know that α1 and α2 must lie within the box [0, C] × [0, C] shown. Also plotted is the line α1 y(1) + α2 y(2) = ζ, on which we know α1 and α2 must lie. Note also that, from these constraints, we know L ≤ α2 ≤ H; otherwise, (α1, α2) can’t simultaneously satisfy both the box and the straight line constraint. In this example, L = 0. But depending on what the line α1 y(1) + α2 y(2) = ζ looks like, this won’t always necessarily be the case; more generally, there will be some lower bound L and some upper bound H on the permissible values of α2 that ensure that α1, α2 lie within the box.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 58
SMO(III)

• By plugging α1 back into the optimization criterion, we obtain a quadratic function of α2, whose optimum we can find exactly
• If the optimum is inside the rectangle, we take it.
• If not, we pick the closest intersection point of the line and the rectangle
• This procedure is very fast because all these are simple computations.
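A sketch of this clipping step (not from the lecture), following the standard SMO derivation; here alpha2_unc is the unconstrained optimum of the one-variable quadratic in α2:

```python
# Sketch: compute the feasible segment [L, H] for alpha2 from the box and line
# constraints, clip the unconstrained optimum onto it, and update alpha1 to stay
# on the constraint line.
def clip_alpha2(alpha2_unc, alpha1, alpha2, y1, y2, C):
    if y1 != y2:
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:
        L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)
    alpha2_new = min(max(alpha2_unc, L), H)                   # project onto [L, H]
    alpha1_new = alpha1 + y1 * y2 * (alpha2 - alpha2_new)     # keep y1*a1 + y2*a2 constant
    return alpha1_new, alpha2_new
```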

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 59


Interpretability

• SVMs are not very intuitive, but typically are more interpretable than
neural nets, if you look at the machine and the misclassifications
• E.g. Ovarian cancer data (Haussler) - 31 tissue samples of 3 classes,
misclassified examples wrongly labelled
• But no biological plausibility!
• Hard to interpret if the percentage of instances that are recruited as
support vectors is high

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 60


Complexity

• Quadratic programming is expensive in the number of training examples


• Platt’s SMO algorithm is quite fast though, and other fancy optimization
approaches are available
• Best packages can handle 50,000+ instances, but not more than 100,000
• On the other hand, the number of attributes can be very high (a strength compared to neural nets)
• Evaluating an SVM is slow if there are a lot of support vectors.
• Dictionary methods attempt to select a subset of the data on which to
train.

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 61


Passive supervised learning

• The environment provides labelled data in the form of pairs (x, y)


• We can process the examples either as a batch or one at a time, with
the goal of producing a predictor of y as a function of x
• We assume that there is an underlying distribution P generating the
examples
• Each example is drawn i.i.d. from P
• What if instead we are allowed to ask for particular examples?
• Intuitively, if we are allowed to ask questions, and if we are smart about
what we want to know, fewer examples may be necessary

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 62


Semi-Supervised and Active Learning

[Figure: the same unlabeled points under supervised learning versus semi-supervised and active learning; labeling data such as speech samples, images and video can be expensive]

• Suppose you had access to a lot of unlabeled data
  E.g. all the documents on the web
  E.g. all the pictures on Instagram
• You can also get some labelled data, but not much
• How can we take advantage of the unlabeled data to improve supervised learning performance?

COMP-652 and ECSE-608, Lecture 6 - January 28, 2016 63
