Foundations of Machine Learning
Kernel Methods
Mehryar Mohri
Courant Institute and Google Research
[email protected]
Motivation
Efficient computation of inner products in high
dimension.
Non-linear decision boundary.
Non-vectorial inputs.
Flexible selection of more complex features.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Non-Linear Separation
Linear separation impossible in most problems.
Non-linear mapping from input space to high-dimensional feature space: Φ: X → F.
Generalization ability: independent of dim(F ),
depends only on margin and sample size.
Kernel Methods
Idea:
• Define K: X × X → R, called kernel, such that:
Φ(x) · Φ(y) = K(x, y).
• K often interpreted as a similarity measure.
Benefits:
• Efficiency: K is often more efficient to compute than Φ and the dot product.
• Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (PDS condition or Mercer's condition).
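To illustrate the efficiency point, here is a minimal sketch (not from the original slides; names and parameters are mine) comparing the cost of evaluating a polynomial kernel directly with the size of the explicit feature space it implicitly corresponds to:

```python
import numpy as np
from math import comb

def poly_kernel(x, y, c=1.0, d=4):
    """Degree-d polynomial kernel K(x, y) = (x . y + c)^d, computed in O(N) time."""
    return (np.dot(x, y) + c) ** d

rng = np.random.default_rng(0)
N, d = 1000, 4
x, y = rng.normal(size=N), rng.normal(size=N)

print(poly_kernel(x, y, d=d))  # one inner product in R^N
# Dimension of the explicit feature space (all monomials of degree <= d):
print(comb(N + d, d))          # ~4.2e10 coordinates -- far too many to enumerate
```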
PDS Condition
Definition: a kernel K: X × X → R is positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD).
K SPSD if symmetric and one of the 2 equiv. cond.'s:
• its eigenvalues are non-negative.
• for any c ∈ R^{m×1},
c⊤Kc = Σ_{i,j=1}^m c_i c_j K(x_i, x_j) ≥ 0.
Terminology: PDS for kernels, SPSD for kernel
matrices (see (Berg et al., 1984)).
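The SPSD condition can be checked numerically on a sample; the following sketch (mine, using the Gaussian kernel introduced later in the lecture) builds a kernel matrix and verifies symmetry and non-negative eigenvalues:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel, a standard PDS kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # sample {x_1, ..., x_m}
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)                       # K is symmetric
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)   # symmetric and PSD up to rounding
```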
Example - Polynomial Kernels
Definition:
∀x, y ∈ R^N, K(x, y) = (x·y + c)^d, c > 0.
Example: for N = 2 and d = 2,
K(x, y) = (x_1 y_1 + x_2 y_2 + c)²
= (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) · (y_1², y_2², √2 y_1 y_2, √(2c) y_1, √(2c) y_2, c).
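The expansion above can be verified numerically; the sketch below (mine; c is left as a parameter) builds the explicit six-dimensional feature map for N = 2, d = 2 and checks that its inner product matches the kernel:

```python
import numpy as np

def phi(x, c):
    """Explicit feature map for (x . y + c)^2 with N = 2."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

c = 1.0
x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(phi(x, c) @ phi(y, c), (np.dot(x, y) + c) ** 2)
```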
XOR Problem
Use the second-degree polynomial kernel with c = 1:
[Figure: left panel, the four XOR points (±1, ±1) in the input space (x_1, x_2); right panel, their images (1, 1, ±√2, ±√2, ±√2, 1) in the feature space, plotted along the coordinates (x_1, √2 x_1 x_2).]
Linearly non-separable in the input space; linearly separable in the feature space by x_1 x_2 = 0.
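For a concrete check of this slide, a minimal sketch (mine, using scikit-learn rather than anything from the lecture): an SVM with the kernel (x·y + 1)² classifies all four XOR points correctly.

```python
import numpy as np
from sklearn.svm import SVC

# XOR data: the label is the sign of x1 * x2.
X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

# (x . y + 1)^2 corresponds to degree=2, gamma=1, coef0=1 in scikit-learn's parametrization.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(X, y)
print(clf.predict(X))   # expected: [ 1  1 -1 -1]
```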
Normalized Kernels
Definition: the normalized kernel K' associated to a kernel K is defined by
∀x, x' ∈ X, K'(x, x') = 0 if K(x, x) = 0 or K(x', x') = 0, and
K'(x, x') = K(x, x') / √(K(x, x) K(x', x')) otherwise.
• If K is PDS, then K' is PDS:
Σ_{i,j=1}^m c_i c_j K(x_i, x_j) / √(K(x_i, x_i) K(x_j, x_j)) = Σ_{i,j=1}^m c_i c_j ⟨Φ(x_i), Φ(x_j)⟩ / (‖Φ(x_i)‖_H ‖Φ(x_j)‖_H)
= ‖ Σ_{i=1}^m [c_i / ‖Φ(x_i)‖_H] Φ(x_i) ‖²_H ≥ 0.
• By definition, for all x with K(x, x) ≠ 0, K'(x, x) = 1.
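On a kernel matrix, this normalization is a simple element-wise rescaling; a minimal sketch (mine, assuming no zero diagonal entries):

```python
import numpy as np

def normalize_gram(K):
    """K'(x, x') = K(x, x') / sqrt(K(x, x) K(x', x')); assumes a strictly positive diagonal."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

K = np.array([[4.0, 2.0],
              [2.0, 9.0]])
print(normalize_gram(K))   # unit diagonal; off-diagonal entry 2 / (2 * 3) = 1/3
```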
Other Standard PDS Kernels
Gaussian kernels:
K(x, y) = exp(−‖x − y‖² / (2σ²)), σ ≠ 0.
• Normalized kernel of (x, x') ↦ exp(x·x' / σ²).
Sigmoid kernels:
K(x, y) = tanh(a(x·y) + b), a, b ≥ 0.
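A vectorized way to build the Gaussian kernel matrix of a sample from pairwise squared distances (my own sketch); note that its diagonal is identically 1, consistent with the Gaussian kernel being a normalized kernel:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_gram(X, sigma=2.0)
print(np.diag(K))   # all ones
```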
Reproducing Kernel Hilbert Space
(Aronszajn, 1950)
Theorem: let K: X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that
∀x, y ∈ X, K(x, y) = Φ(x) · Φ(y).
Proof: for any x ∈ X, define Φ(x) ∈ R^X as follows:
∀y ∈ X, Φ(x)(y) = K(x, y).
• Let H_0 = {Σ_{i∈I} a_i Φ(x_i) : a_i ∈ R, x_i ∈ X, card(I) < ∞}.
• We are going to define an inner product ⟨·,·⟩ on H_0.
• Definition: for any f = Σ_{i∈I} a_i Φ(x_i), g = Σ_{j∈J} b_j Φ(y_j),
⟨f, g⟩ = Σ_{i∈I, j∈J} a_i b_j K(x_i, y_j) = Σ_{j∈J} b_j f(y_j) = Σ_{i∈I} a_i g(x_i).
• ⟨·,·⟩ does not depend on representations of f and g.
• ⟨·,·⟩ is bilinear and symmetric.
• ⟨·,·⟩ is positive semi-definite since K is PDS: for any f,
⟨f, f⟩ = Σ_{i,j∈I} a_i a_j K(x_i, x_j) ≥ 0.
• note: for any f_1, ..., f_m and c_1, ..., c_m,
Σ_{i,j=1}^m c_i c_j ⟨f_i, f_j⟩ = ⟨Σ_{i=1}^m c_i f_i, Σ_{j=1}^m c_j f_j⟩ ≥ 0.
⟨·,·⟩ is a PDS kernel on H_0.
• ⟨·,·⟩ is definite:
• first, Cauchy-Schwarz inequality for PDS kernels: if K is PDS, then for all x, y ∈ X the matrix
M = [K(x, x)  K(x, y); K(y, x)  K(y, y)]
is SPSD. In particular, the product of its eigenvalues, det(M), is non-negative:
det(M) = K(x, x) K(y, y) − K(x, y)² ≥ 0.
• since ⟨·,·⟩ is a PDS kernel, for any f ∈ H_0 and x ∈ X,
⟨f, Φ(x)⟩² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩.
• observe the reproducing property of ⟨·,·⟩:
∀f ∈ H_0, ∀x ∈ X, f(x) = Σ_{i∈I} a_i K(x_i, x) = ⟨f, Φ(x)⟩.
• thus, [f(x)]² ≤ ⟨f, f⟩ K(x, x) for all x ∈ X, which shows the definiteness of ⟨·,·⟩.
• Thus, ⟨·,·⟩ defines an inner product on H_0, which thereby becomes a pre-Hilbert space.
• H_0 can be completed to form a Hilbert space H in which it is dense.
Notes:
• H is called the reproducing kernel Hilbert space
(RKHS) associated to K.
• A Hilbert space such that there exists Φ: X → H with K(x, y) = Φ(x)·Φ(y) for all x, y ∈ X is also called a feature space associated to K. Φ is called a feature mapping.
• Feature spaces associated to K are in general not
unique.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
SVMs with PDS Kernels
(Boser, Guyon, and Vapnik, 1992)
Constrained optimization:
max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j K(x_i, x_j)    (with K(x_i, x_j) = Φ(x_i)·Φ(x_j))
subject to: 0 ≤ α_i ≤ C ∧ Σ_{i=1}^m α_i y_i = 0, i ∈ [1, m].
Solution:
h(x) = sgn(Σ_{i=1}^m α_i y_i K(x_i, x) + b),
with b = y_i − Σ_{j=1}^m α_j y_j K(x_j, x_i) for any x_i with 0 < α_i < C.
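The dual solution can be used directly with a precomputed Gram matrix; the following sketch (mine, using scikit-learn's "precomputed" kernel option, where dual_coef_ holds α_i y_i for the support vectors and intercept_ holds b) rebuilds the decision function h from the dual coefficients:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # XOR-like labels

K = (X @ X.T + 1.0) ** 2                        # degree-2 polynomial Gram matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

def h(x):
    """h(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b), summing over support vectors only."""
    k = (X[clf.support_] @ x + 1.0) ** 2
    return np.sign(clf.dual_coef_[0] @ k + clf.intercept_[0])

print(np.mean([h(x) == yi for x, yi in zip(X, y)]))   # training accuracy, close to 1
```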
Rad. Complexity of Kernel-Based Hypotheses
Theorem: let K: X × X → R be a PDS kernel and let Φ: X → H be a feature mapping associated to K. Let S ⊆ {x : K(x, x) ≤ R²} be a sample of size m, and let H = {x ↦ w·Φ(x) : ‖w‖_H ≤ Λ}. Then,
R̂_S(H) ≤ Λ √(Tr[K]) / m ≤ √(R² Λ² / m).
Proof:
R̂_S(H) = (1/m) E_σ[ sup_{‖w‖≤Λ} w · Σ_{i=1}^m σ_i Φ(x_i) ] ≤ (Λ/m) E_σ[ ‖Σ_{i=1}^m σ_i Φ(x_i)‖_H ]
≤ (Λ/m) [ E_σ ‖Σ_{i=1}^m σ_i Φ(x_i)‖²_H ]^{1/2}    (Jensen's ineq.)
= (Λ/m) [ Σ_{i=1}^m ‖Φ(x_i)‖²_H ]^{1/2} = (Λ/m) [ Σ_{i=1}^m K(x_i, x_i) ]^{1/2}
= Λ √(Tr[K]) / m ≤ √(R² Λ² / m).
Generalization: Representer Theorem
(Kimeldorf and Wahba, 1971)
Theorem: let K: X × X → R be a PDS kernel with H the corresponding RKHS. Then, for any non-decreasing function G: R → R and any L: R^m → R ∪ {+∞}, the problem
argmin_{h∈H} F(h) = argmin_{h∈H} G(‖h‖_H) + L(h(x_1), ..., h(x_m))
admits a solution of the form h* = Σ_{i=1}^m α_i K(x_i, ·).
If G is further assumed to be increasing, then any solution has this form.
• Proof: let H_1 = span({K(x_i, ·) : i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h_1 + h^⊥ according to H = H_1 ⊕ H_1^⊥.
• Since G is non-decreasing,
G(‖h_1‖_H) ≤ G(√(‖h_1‖²_H + ‖h^⊥‖²_H)) = G(‖h‖_H).
• By the reproducing property, for all i ∈ [1, m],
h(x_i) = ⟨h, K(x_i, ·)⟩ = ⟨h_1, K(x_i, ·)⟩ = h_1(x_i).
• Thus, L(h(x_1), ..., h(x_m)) = L(h_1(x_1), ..., h_1(x_m)) and F(h_1) ≤ F(h).
• If G is increasing, then F(h_1) < F(h) when ‖h^⊥‖_H ≠ 0, and any solution of the optimization problem must be in H_1.
Kernel-Based Algorithms
PDS kernels used to extend a variety of algorithms
in classification and other areas:
• regression.
• ranking.
• dimensionality reduction.
• clustering.
But, how do we define PDS kernels?
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Closure Properties of PDS Kernels
Theorem: Positive definite symmetric (PDS)
kernels are closed under:
• sum,
• product,
• tensor product,
• pointwise limit,
• composition with a power series with non-
negative coefficients.
Closure Properties - Proof
Proof: closure under sum:
c⊤Kc ≥ 0 ∧ c⊤K'c ≥ 0 ⟹ c⊤(K + K')c ≥ 0.
• closure under product: with K = MM⊤,
Σ_{i,j=1}^m c_i c_j (K_{ij} K'_{ij}) = Σ_{i,j=1}^m c_i c_j [Σ_{k=1}^m M_{ik} M_{jk}] K'_{ij}
= Σ_{k=1}^m Σ_{i,j=1}^m (c_i M_{ik}) (c_j M_{jk}) K'_{ij}
= Σ_{k=1}^m [c_1 M_{1k}, ..., c_m M_{mk}] K' [c_1 M_{1k}, ..., c_m M_{mk}]⊤ ≥ 0.
• Closure under tensor product:
• definition: for all x_1, x_2, y_1, y_2 ∈ X,
(K_1 ⊗ K_2)(x_1, y_1, x_2, y_2) = K_1(x_1, x_2) K_2(y_1, y_2).
• thus, it is a PDS kernel as the product of the two PDS kernels
(x_1, y_1, x_2, y_2) ↦ K_1(x_1, x_2) and (x_1, y_1, x_2, y_2) ↦ K_2(y_1, y_2).
• Closure under pointwise limit: if for all x, y ∈ X,
lim_{n→∞} K_n(x, y) = K(x, y),
then (∀n, c⊤K_n c ≥ 0) ⟹ lim_{n→∞} c⊤K_n c = c⊤Kc ≥ 0.
• Closure under composition with power series:
• assumptions: K is a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = Σ_{n=0}^∞ a_n x^n, a_n ≥ 0, is a power series with radius of convergence ρ.
• f ∘ K is a PDS kernel since K^n is PDS by closure under product, Σ_{n=0}^N a_n K^n is PDS by closure under sum, and f ∘ K is their pointwise limit.
Example: for any PDS kernel K, exp(K) is PDS.
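A numerical illustration of these closure properties (my own check, not part of the slides): starting from a PDS Gram matrix, its element-wise square and its element-wise exponential remain positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))
K = X @ X.T                                 # linear-kernel Gram matrix: PDS by construction

def min_eig(M):
    return np.linalg.eigvalsh(M).min()

print(min_eig(K) >= -1e-6)                  # K is PSD
print(min_eig(K * K) >= -1e-6)              # closure under (element-wise) product
print(min_eig(np.exp(K)) >= -1e-6)          # exp(K): composition with a power series
```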
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Sequence Kernels
Definition: Kernels defined over pairs of strings.
• Motivation: computational biology, text and
speech classification.
• Idea: two sequences are related when they share
some common substrings or subsequences.
• Example: bigram kernel;
K(x, y) = Σ_{u bigram} count_x(u) × count_y(u).
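A direct implementation of this bigram kernel (my own sketch; the transducer construction below computes the same quantity):

```python
from collections import Counter

def bigram_counts(s):
    """Multiset of contiguous bigrams of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def bigram_kernel(x, y):
    """K(x, y) = sum over bigrams u of count_x(u) * count_y(u)."""
    cx, cy = bigram_counts(x), bigram_counts(y)
    return sum(cx[u] * cy[u] for u in cx)

print(bigram_kernel("abab", "aabb"))   # shared bigram 'ab': 2 occurrences * 1 occurrence = 2
```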
Weighted Transducers
[Figure: a weighted transducer with states 0, 1, 2, 3 (final weight 0.1) and transitions labeled input:output/weight, such as a:b/0.1, b:a/0.2, a:a/0.4, a:b/0.5, b:a/0.3, b:a/0.6.]
T(x, y) = sum of the weights of all accepting paths with input x and output y.
Example: T(abb, baa) = 0.1 × 0.2 × 0.3 × 0.1 + 0.5 × 0.3 × 0.6 × 0.1.
Rational Kernels over Strings
(Cortes et al., 2004)
Definition: a kernel K: Σ* × Σ* → R is rational if K = T for some weighted transducer T.
Definition: let T_1: Σ* × Σ* → R and T_2: Σ* × Σ* → R be two weighted transducers. Then, the composition of T_1 and T_2 is defined for all x, y ∈ Σ* by
(T_1 ∘ T_2)(x, y) = Σ_z T_1(x, z) T_2(z, y).
Definition: the inverse of a transducer T: Σ* × Σ* → R is the transducer T⁻¹: Σ* × Σ* → R obtained from T by swapping input and output labels.
PDS Rational Kernels
General Construction
Theorem: for any weighted transducer T: Σ* × Σ* → R, the function K = T ∘ T⁻¹ is a PDS rational kernel.
Proof: by definition, for all x, y ∈ Σ*,
K(x, y) = Σ_z T(x, z) T(y, z).
• K is the pointwise limit of (K_n)_{n≥0} defined by
∀x, y ∈ Σ*, K_n(x, y) = Σ_{|z|≤n} T(x, z) T(y, z).
• K_n is PDS since for any sample (x_1, ..., x_m),
K_n = AA⊤ with A = (T(x_i, z_j))_{i∈[1,m], j∈[1,N]}, where z_1, ..., z_N are the strings of length at most n.
PDS Sequence Kernels
PDS sequence kernels in computational biology, text classification, and other applications:
• special instances of PDS rational kernels.
• PDS rational kernels easy to define and modify.
• single general algorithm for their computation:
composition + shortest-distance computation.
• no need for a specific ‘dynamic-programming’
algorithm and proof for each kernel instance.
• general sub-family: based on counting
transducers.
Counting Transducers
[Figure: counting transducer T_X with states 0 and 1 (final weight 1), self-loops a:ε/1 and b:ε/1 on both states, and a transition X:X/1 from 0 to 1. Example: X = ab, Z = bbabaabba; the two accepting alignments are εεabεεεεε and εεεεεabεε.]
X may be a string or an automaton representing a regular expression.
Count of X in Z: sum of the weights of the accepting paths of Z ∘ T_X.
Transducer Counting Bigrams
[Figure: transducer T_bigram with states 0, 1, 2 (final weight 1), self-loops a:ε/1 and b:ε/1 on states 0 and 2, and transitions a:a/1, b:b/1 from 0 to 1 and from 1 to 2.]
Count of the bigram ab in Z given by Z ∘ T_bigram ∘ ab.
Transducer Counting Gappy Bigrams
[Figure: transducer T_gappy bigram, identical to T_bigram except that the self-loops on state 1 are a:ε/λ and b:ε/λ.]
Count of the gappy bigram ab in Z given by Z ∘ T_gappy bigram ∘ ab, with gap penalty λ ∈ (0, 1).
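Reading off the transducer, each occurrence of a (possibly non-contiguous) pair a…b contributes λ raised to the number of skipped symbols; a direct sketch of that count (my own reading, not code from the lecture):

```python
def gappy_bigram_count(z, bigram, lam):
    """Weighted count of the gappy bigram (a, b) in z: every pair of positions i < j with
    z[i] == a and z[j] == b contributes lam ** (j - i - 1), one factor per skipped symbol."""
    a, b = bigram
    return sum(lam ** (j - i - 1)
               for i in range(len(z)) if z[i] == a
               for j in range(i + 1, len(z)) if z[j] == b)

print(gappy_bigram_count("abb", "ab", 0.5))   # 1 (no gap) + 0.5 (one-symbol gap) = 1.5
```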
Composition
Theorem: the composition of two weighted transducers is also a weighted transducer.
Proof: constructive proof based on the composition algorithm.
• states identified with pairs.
• ε-free case: transitions defined by
E = {((q_1, q_1'), a, c, w_1 ⊗ w_2, (q_2, q_2')) : (q_1, a, b, w_1, q_2) ∈ E_1, (q_1', b, c, w_2, q_2') ∈ E_2}.
• general case: use of an intermediate ε-filter.
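A minimal sketch of the ε-free transition rule above (mine; transitions are tuples (source, input, output, weight, destination) and weights are multiplied in the real semiring):

```python
from collections import defaultdict

def compose(E1, E2):
    """epsilon-free composition: match output labels of E1 with input labels of E2."""
    by_input = defaultdict(list)
    for (q1p, b, c, w2, q2p) in E2:
        by_input[b].append((q1p, c, w2, q2p))
    return [((q1, q1p), a, c, w1 * w2, (q2, q2p))
            for (q1, a, b, w1, q2) in E1
            for (q1p, c, w2, q2p) in by_input[b]]

E1 = [(0, "a", "b", 0.5, 1)]    # toy transducer T1
E2 = [(0, "b", "c", 0.4, 1)]    # toy transducer T2
print(compose(E1, E2))          # [((0, 0), 'a', 'c', 0.2, (1, 1))]
```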
Composition Algorithm
ε-Free Case
[Figure: example of ε-free composition: two weighted transducers and their composition, whose states are pairs such as (0, 0), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)/0.42, and whose transition weights are the products of the matched transition weights, e.g. a:b/0.1 composed with b:a/0.2 gives a:a/0.02.]
Complexity: O(|T_1| |T_2|) in general, linear in some cases.
Redundant ε-Paths Problem
(MM, Pereira, and Riley, 1996; Pereira and Riley, 1997)
[Figure: when T_1 has output ε's and T_2 has input ε's, several composition paths can match the same pair of strings. The ε's are first marked, yielding T̃_1 and T̃_2, and a filter transducer F is composed in between so that exactly one of the redundant ε-paths is kept:]
T = T̃_1 ∘ F ∘ T̃_2.
Kernels for Other Discrete Structures
Similarly, PDS kernels can be defined on other
discrete structures:
• Images,
• graphs,
• parse trees,
• automata,
• weighted automata.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Questions
Gaussian kernels have the form exp(−d²) where d is a metric.
• for what other functions d does exp(−d²) define a PDS kernel?
• what other PDS kernels can we construct from a
metric in a Hilbert space?
Negative Definite Kernels
(Schoenberg, 1938)
Definition: a function K: X × X → R is said to be a negative definite symmetric (NDS) kernel if it is symmetric and if for all {x_1, ..., x_m} ⊆ X and c ∈ R^{m×1} with 1⊤c = 0,
c⊤Kc ≤ 0.
Clearly, if K is PDS, then −K is NDS, but the converse does not hold in general.
Examples
The squared distance ‖x − y‖² in a Hilbert space H defines an NDS kernel: if Σ_{i=1}^m c_i = 0,
Σ_{i,j=1}^m c_i c_j ‖x_i − x_j‖² = Σ_{i,j=1}^m c_i c_j (x_i − x_j)·(x_i − x_j)
= Σ_{i,j=1}^m c_i c_j (‖x_i‖² + ‖x_j‖² − 2 x_i·x_j)
= Σ_{i,j=1}^m c_i c_j (‖x_i‖² + ‖x_j‖²) − 2 Σ_{i=1}^m c_i x_i · Σ_{j=1}^m c_j x_j
≤ Σ_{i,j=1}^m c_i c_j (‖x_i‖² + ‖x_j‖²)
= Σ_{j=1}^m c_j Σ_{i=1}^m c_i ‖x_i‖² + Σ_{i=1}^m c_i Σ_{j=1}^m c_j ‖x_j‖² = 0.
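A numerical check of this computation (my own sketch): for any coefficient vector summing to zero, the quadratic form of the squared-distance matrix is non-positive.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # D2[i, j] = ||x_i - x_j||^2

c = rng.normal(size=10)
c -= c.mean()                                                # enforce 1^T c = 0
print(c @ D2 @ c <= 1e-10)                                   # the NDS condition holds
```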
NDS Kernels - Property
(Schoenberg, 1938)
Theorem: let K: X × X → R be an NDS kernel such that for all x, y ∈ X, K(x, y) = 0 iff x = y. Then, there exists a Hilbert space H and a mapping Φ: X → H such that
∀x, y ∈ X, K(x, y) = ‖Φ(x) − Φ(y)‖².
Thus, under the hypothesis of the theorem, √K defines a metric.
PDS and NDS Kernels
(Schoenberg, 1938)
Theorem: let K: X × X → R be a symmetric kernel. Then:
• K is NDS iff exp(−tK) is a PDS kernel for all t > 0.
• Let K' be defined, for any x_0, by
K'(x, y) = K(x, x_0) + K(y, x_0) − K(x, y) − K(x_0, x_0)
for all x, y ∈ X. Then, K is NDS iff K' is PDS.
Example
The kernel defined by K(x, y) = exp(−t‖x − y‖²) is PDS for all t > 0 since ‖x − y‖² is NDS.
The kernel exp(−|x − y|^p) is not PDS for p > 2. Otherwise, for any t > 0, {x_1, ..., x_m} ⊆ X and c ∈ R^{m×1},
Σ_{i,j=1}^m c_i c_j e^{−t|x_i − x_j|^p} = Σ_{i,j=1}^m c_i c_j e^{−|t^{1/p} x_i − t^{1/p} x_j|^p} ≥ 0.
This would imply that |x − y|^p is NDS for p > 2, but that cannot be (see past homework assignments).
Conclusion
PDS kernels:
• rich mathematical theory and foundation.
• general idea for extending many linear
algorithms to non-linear prediction.
• flexible method: any PDS kernel can be used.
• widely used in modern algorithms and
applications.
• can we further learn a PDS kernel and a hypothesis based on that kernel from labeled data? (see tutorial: http://www.cs.nyu.edu/~mohri/icml2011-tutorial/).
References
• N. Aronszajn, Theory of Reproducing Kernels, Trans. Amer. Math. Soc., 68, 337-404, 1950.
• Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector
machines and other pattern classifiers. In Advances in kernel methods: support vector learning,
pages 43–54. MIT Press, Cambridge, MA, USA, 1999.
• Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on
Semigroups. Springer-Verlag: Berlin-New York, 1984.
• Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal
margin classifiers. In proceedings of COLT 1992, pages 144-152, Pittsburgh, PA, 1992.
• Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and
Algorithms. Journal of Machine Learning Research (JMLR), 5:1035-1062, 2004.
• Corinna Cortes and Vladimir Vapnik, Support-Vector Networks, Machine Learning, 20,
1995.
• Kimeldorf, G. and Wahba, G. Some results on Tchebycheffian Spline Functions, J. Mathematical
Analysis and Applications, 33, 1 (1971) 82-95.
References
• James Mercer. Functions of Positive and Negative Type, and Their Connection with the
Theory of Integral Equations. In Proceedings of the Royal Society of London. Series A,
Containing Papers of a Mathematical and Physical Character, Vol. 83, No. 559, pp. 69-70, 1909.
• Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and
Speech Processing, In Proceedings of the 12th biennial European Conference on Artificial
Intelligence (ECAI-96), Workshop on Extended finite state models of language. Budapest,
Hungary, 1996.
• Fernando C. N. Pereira and Michael D. Riley. Speech Recognition by Composition of
Weighted Finite Automata. In Finite-State Language Processing, pages 431-453. MIT Press,
1997.
• I. J. Schoenberg, Metric Spaces and Positive Definite Functions. Transactions of the American
Mathematical Society, Vol. 44, No. 3, pp. 522-536, 1938.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
Appendix
Mercer’s Condition
(Mercer, 1909)
Theorem: let X ⊂ R^N be a compact subset and let K: X × X → R be in L_∞(X × X) and symmetric. Then, K admits a uniformly convergent expansion
K(x, y) = Σ_{n=0}^∞ a_n φ_n(x) φ_n(y), with a_n > 0,
iff for any function c in L_2(X),
∫_{X×X} c(x) c(y) K(x, y) dx dy ≥ 0.
SVMs with PDS Kernels
Constrained optimization (∘: Hadamard product):
max_α 2 · 1⊤α − (α ∘ y)⊤ K (α ∘ y)
subject to: 0 ≤ α ≤ C ∧ α⊤y = 0.
Solution:
h = sgn(Σ_{i=1}^m α_i y_i K(x_i, ·) + b),
with b = y_i − (α ∘ y)⊤ K e_i for any x_i with 0 < α_i < C.