(chapters 1,2,3,4)
Introduction to Kernels
Max Welling
October 1 2004
Introduction
• What is the goal of (pick your favorite name):
- Machine Learning
- Data Mining
- Pattern Recognition
- Data Analysis
- Statistics
Automatic detection of non-coincidental structure in data.
• Desiderata:
- Robust algorithms insensitive to outliers and wrong
model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: large datasets.
Let’s Learn Something
What is the common characteristic (structure) among the following
statistical methods?
1. Principal Components Analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
Answer:
We consider linear combinations of the input vector: $f(x) = w^T x$.
Linear algorithms are very well understood and enjoy strong guarantees
(convexity, generalization bounds).
Can we carry these guarantees over to non-linear algorithms?
Feature Spaces
$\phi: x \mapsto \phi(x)$, $\quad \mathbb{R}^d \to F$
a non-linear mapping to F, where F can be:
1. a high-dimensional space
2. an infinite-dimensional countable space: $\ell_2$
3. a function space (Hilbert space)
example: $\phi(x, y) = (x^2,\; y^2,\; \sqrt{2}\,xy)$
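A quick numerical check (a Python/NumPy sketch; the two input vectors are arbitrary) that the explicit map above realizes a kernel: the feature-space inner product equals the squared input-space inner product, $\langle \phi(a), \phi(b) \rangle = \langle a, b \rangle^2$.

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map phi(x, y) = (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Inner product in feature space ...
lhs = phi(a) @ phi(b)
# ... equals the squared inner product in input space: <a, b>^2
rhs = (a @ b) ** 2
print(lhs, rhs)   # both give 1.0
```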
Ridge Regression (duality)
problem:
$$\min_w \; \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$$
($y_i$: targets, $x_i$: inputs, $\lambda \|w\|^2$: regularization)
solution:
$$w = (X^T X + \lambda I_d)^{-1} X^T y \qquad (d \times d \text{ inverse})$$
$$\;\;\; = X^T (X X^T + \lambda I_\ell)^{-1} y \qquad (\ell \times \ell \text{ inverse})$$
$$\;\;\; = X^T (G + \lambda I_\ell)^{-1} y, \qquad G_{ij} = \langle x_i, x_j \rangle \quad (\text{Gram matrix})$$
$$\;\;\; = \sum_{i=1}^{\ell} \alpha_i x_i \qquad (\text{linear combination of the data: Dual Representation})$$
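A minimal Python/NumPy sketch (random data of assumed shape, $\lambda = 0.1$ chosen arbitrarily) verifying that the primal and dual solutions above coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d, lam = 50, 5, 0.1          # sample size, input dimension, regularization
X = rng.standard_normal((ell, d)) # rows are the inputs x_i
y = rng.standard_normal(ell)      # targets

# Primal solution: (d x d) inverse
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: (ell x ell) inverse via the Gram matrix G = X X^T
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(ell), y)
w_dual = X.T @ alpha              # w = sum_i alpha_i x_i

print(np.allclose(w_primal, w_dual))  # True
```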
Kernel Trick
Note: In the dual representation we used the Gram matrix
to express the solution.
Kernel Trick: replace $x \to \phi(x)$; then
$$G_{ij} = \langle x_i, x_j \rangle \;\;\longrightarrow\;\; G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j)$$
If we use algorithms that only depend on the Gram-matrix, G,
then we never have to know (compute) the actual features $\phi(x)$.
This is the crucial point of kernel methods.
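A sketch of this point in Python/NumPy (the RBF kernel, the toy data, and $\lambda$ are illustrative choices, not from the slides): dual ridge regression run entirely on kernel evaluations, so the features $\phi(x)$ are never computed.

```python
import numpy as np

def rbf_kernel(a, b, c=1.0):
    """k(x, y) = exp(-||x - y||^2 / c); its feature space is infinite-dimensional."""
    return np.exp(-np.sum((a - b) ** 2) / c)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
lam = 0.1

# Gram matrix K_ij = k(x_i, x_j): the only object the algorithm touches
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Prediction at a new point also uses only kernel evaluations: f(x) = sum_i alpha_i k(x_i, x)
x_new = np.array([0.5, -0.3])
f_new = alpha @ np.array([rbf_kernel(xi, x_new) for xi in X])
print(f_new)
```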
Modularity
Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input
Modularity: Any kernel can be used with any kernel-algorithm.
some kernels:
- $k(x, y) = e^{-\|x - y\|^2 / c}$
- $k(x, y) = \langle x, y \rangle^d$
- $k(x, y) = \tanh(\langle x, y \rangle)$
- $k(x, y) = \dfrac{1}{\|x - y\|^2 + c^2}$
some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
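A small Python/NumPy illustration of this modularity (parameter values are arbitrary): each kernel above is just a function of two vectors, and any kernel algorithm only ever asks for the Gram matrix it induces.

```python
import numpy as np

# Kernels as interchangeable functions of two vectors
kernels = {
    "rbf":        lambda x, y, c=1.0: np.exp(-np.sum((x - y) ** 2) / c),
    "polynomial": lambda x, y, d=2:   (x @ y) ** d,
    "sigmoid":    lambda x, y:        np.tanh(x @ y),
    "rational":   lambda x, y, c=1.0: 1.0 / (np.sum((x - y) ** 2) + c ** 2),
}

def gram_matrix(k, X):
    """Any kernel algorithm only needs this matrix, whatever k is."""
    return np.array([[k(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(2).standard_normal((5, 3))
for name, k in kernels.items():
    print(name, gram_matrix(k, X).shape)
```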
What is a proper kernel?
Definition: A finitely positive semi-definite function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is a symmetric function of its arguments for which the matrices formed
by restriction to any finite subset of points are positive semi-definite:
$$\alpha^T K \alpha \ge 0 \quad \forall \alpha$$
Theorem: A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be written
as $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a feature map
$\phi: x \mapsto \phi(x) \in F$, iff $k(x, y)$ satisfies the finitely positive semi-definiteness property.
Relevance: We can now check whether $k(x, y)$ is a proper kernel using
only properties of $k(x, y)$ itself, i.e. without the need to know the feature map!
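A Python/NumPy sketch of this check on a finite subset of points (the points are random and the similarity functions are illustrative): the Gram matrix of a proper kernel has no negative eigenvalues, while a non-kernel similarity fails the test.

```python
import numpy as np

def gram(k, X):
    return np.array([[k(xi, xj) for xj in X] for xi in X])

def is_psd(K, tol=1e-10):
    """A symmetric matrix is PSD iff all eigenvalues are >= 0 (up to tolerance)."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.default_rng(3).standard_normal((20, 4))

rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
print(is_psd(gram(rbf, X)))                       # True: proper kernel

not_kernel = lambda x, y: -np.sum((x - y) ** 2)   # negative squared distance
print(is_psd(gram(not_kernel, X)))                # False: not a proper kernel
```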
Reproducing Kernel Hilbert Spaces
The proof of the above theorem proceeds by constructing a very
special feature map (note that more than one feature map may give rise to the same kernel):
$$\phi: x \mapsto \phi(x) = k(x, \cdot)$$
i.e. we map to a function space.
definition of the function space:
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot) \qquad \text{(any } m, \{x_i\})$$
$$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j)$$
$$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \ge 0 \quad (\text{finitely positive semi-definite})$$
reproducing property:
$$\langle f, \phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i \langle k(x_i, \cdot), k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x)$$
in particular: $\langle \phi(x), \phi(y) \rangle = k(x, y)$
Mercer’s Theorem
Theorem: X is compact, $k(x, y)$ is a symmetric continuous function s.t.
$$T_k f = \int k(\cdot, x)\, f(x)\, dx$$
is a positive semi-definite operator: $T_k \succeq 0$, i.e.
$$\int k(x, y)\, f(x)\, f(y)\, dx\, dy \ge 0 \qquad \forall f \in L_2(X)$$
Then there exists an orthonormal feature basis of eigen-functions
such that:
$$k(x, y) = \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(y)$$
Hence: $k(x, y)$ is a proper kernel.
Note: Here we construct feature vectors in $L_2$, whereas the RKHS
construction was in a function space.
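A finite-sample analogue in Python/NumPy (illustrative RBF kernel and random points): eigendecomposing the Gram matrix gives non-negative eigenvalues, and the eigenvectors scaled by the square roots of the eigenvalues act as feature vectors whose inner products reconstruct the kernel values, mirroring the expansion above.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((25, 3))
k = lambda x, y: np.exp(-np.sum((x - y) ** 2))
K = np.array([[k(xi, xj) for xj in X] for xi in X])

# Eigendecomposition of the (symmetric PSD) Gram matrix
lam, V = np.linalg.eigh(K)
print(lam.min() >= -1e-10)                 # True: no negative eigenvalues

# Finite-dimensional "Mercer features": column j of V scaled by sqrt(lambda_j)
Phi = V * np.sqrt(np.clip(lam, 0, None))   # row i is the feature vector of x_i
print(np.allclose(Phi @ Phi.T, K))         # True: <phi(x_i), phi(x_j)> = K_ij
```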
Learning Kernels
• All information is tunneled through the Gram-matrix information
bottleneck.
• The real art is to pick an appropriate kernel.
e.g. take the RBF kernel: $k(x, y) = e^{-\|x - y\|^2 / c}$
if c is very small: G=I (all data are dissimilar): over-fitting
if c is very large: G=1 (all data are very similar): under-fitting
We need to learn the kernel. Here are some ways to combine
kernels to improve them:
$$k(x, y) = \alpha\, k_1(x, y) + \beta\, k_2(x, y), \qquad \alpha, \beta \ge 0$$
$$k(x, y) = k_1(x, y)\, k_2(x, y)$$
$$k(x, y) = k_1(\phi(x), \phi(y))$$
[figure: the proper kernels form a cone, with $k_1$ and $k_2$ marked inside it;
any positive polynomial combination of kernels is again a kernel]
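A quick Python/NumPy illustration (arbitrary data and bandwidths) of the two RBF regimes, plus a check that a conic combination and a product of two Gram matrices remain positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 3))
sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

for c in (1e-3, 1.0, 1e3):
    G = np.exp(-sqdist / c)
    # c small -> G close to the identity (over-fitting);
    # c large -> G close to the all-ones matrix (under-fitting)
    print(c, np.round(G.mean(), 3))

K1 = np.exp(-sqdist / 1.0)
K2 = (X @ X.T + 1.0) ** 2                        # polynomial-kernel Gram matrix
for K in (0.5 * K1 + 2.0 * K2, K1 * K2):         # conic combination, elementwise product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)  # True: still PSD
```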
Stability of Kernel Algorithms
Our objective for learning is to improve generalization performance:
cross-validation, Bayesian methods, generalization bounds, ...
Call $\hat{E}_S[f(x)] = 0$ a pattern in a sample S.
Is this pattern also likely to be present in new data: $E_P[f(x)] \approx 0$?
We can use concentration inequalities (McDiarmid's theorem)
to prove that:
Theorem: Let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from P and define
the sample mean of $f(x)$ as $\hat{f} = \frac{1}{\ell}\sum_{i=1}^{\ell} f(x_i)$; then it follows that:
$$P\!\left( \big\|\hat{f} - E_P[f]\big\| \le \frac{R}{\sqrt{\ell}}\Big(2 + \sqrt{2 \ln \tfrac{1}{\delta}}\Big) \right) \ge 1 - \delta, \qquad R = \sup_x \|f(x)\|$$
(the probability that the sample mean and the population mean differ by less than this amount
is more than $1 - \delta$, independent of P!)
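A small simulation in Python/NumPy (the distribution, the bounded function f, $\ell$ and $\delta$ are all assumed) illustrating the statement: over repeated samples, the deviation between sample mean and population mean stays below the $\frac{R}{\sqrt{\ell}}(2 + \sqrt{2\ln(1/\delta)})$ level far more often than the guaranteed $1 - \delta$.

```python
import numpy as np

rng = np.random.default_rng(6)
ell, delta, trials = 200, 0.05, 2000
f = lambda x: np.clip(x, -1.0, 1.0)     # bounded function, so R = sup |f(x)| = 1
R = 1.0
bound = (R / np.sqrt(ell)) * (2 + np.sqrt(2 * np.log(1 / delta)))

pop_mean = 0.0                          # f is odd and the distribution is symmetric around 0
deviations = np.array([
    abs(f(rng.standard_normal(ell)).mean() - pop_mean) for _ in range(trials)
])
print(bound, np.mean(deviations <= bound))   # empirical frequency well above 1 - delta
```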
Rademacher Complexity
Problem: we only checked the generalization performance for a
single fixed pattern f(x).
What if we want to search over a function class F?
Intuition: we need to incorporate the complexity of this function class.
Rademacher complexity captures the ability of the function class to
fit random noise ($\sigma_i = \pm 1$, uniformly distributed).
[figure: random $\pm 1$ labels fitted by two functions $f_1$, $f_2$ from the class]
empirical RC:
$$\hat{R}_\ell(F) = E_\sigma\!\left[\, \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \;\Big|\; x_1, \ldots, x_\ell \right]$$
$$R_\ell(F) = E_S E_\sigma\!\left[\, \sup_{f \in F} \Big| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \Big| \right]$$
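A Monte Carlo sketch in Python/NumPy of the empirical Rademacher complexity, following the definition above directly; the function class (a handful of fixed threshold functions on 1-D data) and the sample are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
ell = 100
x = rng.uniform(-1, 1, ell)

# A small function class F: threshold functions f_t(x) = sign(x - t)
thresholds = np.linspace(-1, 1, 21)
F = np.sign(x[None, :] - thresholds[:, None])   # shape (|F|, ell): f(x_i) for each f

def empirical_rc(F_values, n_draws=5000):
    """E_sigma[ sup_f | (2/ell) sum_i sigma_i f(x_i) | ] by Monte Carlo over sigma."""
    ell = F_values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, ell))
    sums = np.abs(sigma @ F_values.T) * (2.0 / ell)   # shape (n_draws, |F|)
    return sums.max(axis=1).mean()

print(empirical_rc(F))   # a small value, shrinking roughly like 1/sqrt(ell) as ell grows
```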
Generalization Bound
Theorem: Let f be a function in F which maps to [0,1] (e.g. loss functions).
Then, with probability at least $1 - \delta$ over random draws of samples of size $\ell$,
every f satisfies:
$$E_P[f(x)] \;\le\; \hat{E}_{\mathrm{data}}[f(x)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}$$
$$\phantom{E_P[f(x)]} \;\le\; \hat{E}_{\mathrm{data}}[f(x)] + \hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$$
Relevance: The expected pattern $E[f] = 0$ will also be present in a new
data set, if the last two terms are small:
- complexity of the function class F small
- number of training data $\ell$ large
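A trivial numerical evaluation in Python of the second bound, with made-up values for the empirical mean, the empirical RC, $\ell$ and $\delta$, just to show the scale of the three terms:

```python
import numpy as np

emp_mean, rc_hat, ell, delta = 0.05, 0.10, 1000, 0.05   # assumed values
slack = 3 * np.sqrt(np.log(2 / delta) / (2 * ell))
print(emp_mean + rc_hat + slack)   # upper bound on E_P[f(x)], here roughly 0.28
```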
Linear Functions (in feature space)
Consider the function class:
$$F_B = \{\, f: x \mapsto \langle w, \phi(x) \rangle \;:\; \|w\| \le B \,\}, \qquad k(x, y) = \langle \phi(x), \phi(y) \rangle$$
and a sample: $S = \{x_1, \ldots, x_\ell\}$.
Then, the empirical RC of $F_B$ is bounded by:
$$\hat{R}_\ell(F_B) \le \frac{2B}{\ell}\sqrt{\mathrm{tr}(K)}$$
Relevance: Since $\{\, x \mapsto \sum_{i=1}^{\ell} \alpha_i k(x_i, x) \;:\; \alpha^T K \alpha \le B^2 \,\} \subseteq F_B$, it follows that
if we control the norm $\alpha^T K \alpha = \|w\|^2$ in kernel algorithms, we control
the complexity of the function class (regularization).
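A Python/NumPy sketch (RBF kernel, random data, B = 1, all assumed) comparing a Monte Carlo estimate of the empirical Rademacher complexity of this kernel-linear class with the $\frac{2B}{\ell}\sqrt{\mathrm{tr}(K)}$ bound; for this class the sup over f has a closed form, which the code uses.

```python
import numpy as np

rng = np.random.default_rng(8)
ell, B = 100, 1.0
X = rng.standard_normal((ell, 2))
sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sqdist)                               # RBF Gram matrix

# For F_B the sup over f is attained in closed form:
#   sup_{||w|| <= B} |(2/ell) sum_i sigma_i <w, phi(x_i)>| = (2B/ell) sqrt(sigma^T K sigma)
sigma = rng.choice([-1.0, 1.0], size=(5000, ell))
rc_mc = (2 * B / ell) * np.mean(np.sqrt(np.einsum("ni,ij,nj->n", sigma, K, sigma)))

bound = (2 * B / ell) * np.sqrt(np.trace(K))
print(rc_mc, bound)                               # the estimate sits below the bound
```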
Margin Bound (classification)
Theorem: Choose $c > 0$ (the margin).
F: $f(x, y) = -y\, g(x)$, $\;y \in \{+1, -1\}$
S: $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$, an IID sample
$\delta \in (0, 1)$: probability of violating the bound.
$$P_P[\, y \ne \mathrm{sign}(g(x)) \,] \;\le\; \frac{1}{\ell c} \sum_{i=1}^{\ell} \xi_i \;+\; \frac{4}{\ell c}\sqrt{\mathrm{tr}(K)} \;+\; 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}$$
(probability of misclassification)
$$\xi_i = \big(c - y_i\, g(x_i)\big)_+ \qquad (\text{slack variable})$$
$$(f)_+ = f \text{ if } f \ge 0, \text{ and } 0 \text{ otherwise}$$
Relevance: We can bound our classification error on new samples. Moreover, we have a
strategy to improve generalization: choose the margin c as large as possible such
that all samples are correctly classified: $\xi_i = 0$ (e.g. support vector machines).
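A Python/NumPy sketch (toy labels, a toy scoring function g, a linear-kernel Gram matrix, and assumed c and $\delta$) that simply evaluates the three terms of the bound for a given classifier:

```python
import numpy as np

rng = np.random.default_rng(9)
ell, c, delta = 200, 0.5, 0.05
X = rng.standard_normal((ell, 2))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(ell))   # toy labels
g = lambda Z: Z[:, 0]                                    # toy real-valued classifier
K = X @ X.T                                              # linear-kernel Gram matrix

xi = np.maximum(0.0, c - y * g(X))                       # slack variables (c - y_i g(x_i))_+
bound = (xi.sum() / (ell * c)
         + 4 * np.sqrt(np.trace(K)) / (ell * c)
         + 3 * np.sqrt(np.log(2 / delta) / (2 * ell)))
print(bound)   # upper bound on misclassification probability (may exceed 1 for a loose setup)
```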