Lecture 13: Kernels
[Videos: Machine Learning Lecture 21 "Model Selection / Kernels"; Video II]
Linear classifiers are great, but what if there exists no linear decision boundary? As it turns out, there is an elegant
way to incorporate non-linearities into most linear classifiers.
Handcrafted Feature Expansion
We can make linear classifiers non-linear by applying a basis function (feature transformation) to the input feature vectors. Formally, for a data vector $\mathbf{x} \in \mathbb{R}^d$, we apply the transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$ where $\phi(\mathbf{x}) \in \mathbb{R}^D$. Usually $D \gg d$ because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well behaved. (i.e. you can still use your original
gradient descent code, just with the higher dimensional representation)
Disadvantage: ϕ(x) might be very high dimensional.
Consider the following example: $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}$, and define $\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}$.
Quiz: What is the dimensionality of ϕ(x)?
This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
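To see the blow-up concretely, here is a minimal sketch (not from the lecture; the helper name `expand_features` is made up) that builds this $\phi(\mathbf{x})$ explicitly, with one entry per subset of features:

```python
import numpy as np
from itertools import combinations

def expand_features(x):
    """Explicitly construct phi(x): one coordinate per subset of features
    (the empty subset contributes the leading 1). The length is 2**d."""
    d = len(x)
    feats = []
    for k in range(d + 1):                       # subsets of size 0, 1, ..., d
        for idx in combinations(range(d), k):
            feats.append(np.prod(x[list(idx)]))  # product of the chosen coordinates
    return np.array(feats)

x = np.array([2.0, 3.0, 5.0])
print(len(expand_features(x)))                   # 2**3 = 8 dimensions for d = 3
```

Even for modest $d$ (say $d = 50$), this vector would have $2^{50}$ entries, which is exactly the problem the kernel trick avoids.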
The Kernel Trick
Gradient Descent with Squared Loss
The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space,
without ever computing a single vector ϕ(x) or ever computing the full vector w. It is a little magical.
It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the
gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:
$$\ell(\mathbf{w}) = \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right)^2$$
The gradient descent rule, with step-size/learning-rate s > 0 (we denoted this as α > 0 in our previous lectures),
updates w over time,
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right) \quad \text{where: } \frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^{n} 2\left(\mathbf{w}^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \gamma_i\, \mathbf{x}_i \qquad (\gamma_i: \text{ function of } \mathbf{x}_i, y_i)$$
We will now show that we can express w as a linear combination of all input vectors,
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0 = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$. For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$ is trivially $\alpha_1 = \cdots = \alpha_n = 0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1, \ldots, \alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\begin{aligned}
\mathbf{w}_1 &= \mathbf{w}_0 - s\sum_{i=1}^{n} 2\left(\mathbf{w}_0^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^0 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^0 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i && \left(\text{with } \alpha_i^1 = \alpha_i^0 - s\gamma_i^0\right)\\
\mathbf{w}_2 &= \mathbf{w}_1 - s\sum_{i=1}^{n} 2\left(\mathbf{w}_1^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^1 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i && \left(\text{with } \alpha_i^2 = \alpha_i^1 - s\gamma_i^1\right)\\
\mathbf{w}_3 &= \mathbf{w}_2 - s\sum_{i=1}^{n} 2\left(\mathbf{w}_2^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^2 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^3 \mathbf{x}_i && \left(\text{with } \alpha_i^3 = \alpha_i^2 - s\gamma_i^2\right)\\
&\;\;\vdots\\
\mathbf{w}_t &= \mathbf{w}_{t-1} - s\sum_{i=1}^{n} 2\left(\mathbf{w}_{t-1}^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^{t-1} \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^{t-1} \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^t \mathbf{x}_i && \left(\text{with } \alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}\right)
\end{aligned}$$
Formally, the argument is by induction: $\mathbf{w}_0$ is trivially a linear combination of the training vectors (base case), and if we apply the inductive hypothesis for $\mathbf{w}_t$, it follows for $\mathbf{w}_{t+1}$.
The update-rule for $\alpha_i^t$ is thus
$$\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}, \quad \text{and we have} \quad \alpha_i^t = -s\sum_{r=0}^{t-1} \gamma_i^r.$$
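As a quick sanity check (my own sketch, not part of the notes; NumPy assumed, all variable names made up), we can run plain gradient descent on the squared loss while tracking the $\alpha_i$ coefficients and verify that $\mathbf{w}$ always stays a linear combination of the training inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))          # rows are the training inputs x_i
y = rng.normal(size=n)
s = 0.01                             # step size

w = np.zeros(d)                      # w_0 = 0
alpha = np.zeros(n)                  # alpha_i^0 = 0

for t in range(100):
    gamma = 2 * (X @ w - y)          # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * (X.T @ gamma)        # ordinary gradient step on w
    alpha = alpha - s * gamma        # equivalent update of the coefficients
    assert np.allclose(w, X.T @ alpha)   # w_t = sum_i alpha_i^t x_i holds throughout
```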
In other words, we can perform the entire gradient descent update rule without ever expressing w explicitly. We
just keep track of the n coefficients α1 , … , αn . Now that w can be written as a linear combination of the training
set, we can also express the inner-product of w with any input x i purely in terms of inner-products between
training inputs:
$$\mathbf{w}^\top \mathbf{x}_j = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i^\top \mathbf{x}_j.$$
Consequently, we can also re-write the squared-loss from $\ell(\mathbf{w}) = \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right)^2$ entirely in terms of inner-products between training inputs:
$$\ell(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \left(\sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_i - y_i\right)^2$$
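In matrix form this is a computation involving only the $n \times n$ matrix of pairwise inner-products; continuing the sketch above:

```python
G = X @ X.T                          # G[i, j] = x_i^T x_j

def squared_loss(alpha, G, y):
    """ell(alpha) = sum_i (sum_j alpha_j x_j^T x_i - y_i)^2."""
    residual = G @ alpha - y
    return float(np.sum(residual ** 2))

print(squared_loss(alpha, G, y))     # equals np.sum((X @ w - y) ** 2) from above
```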
During test-time we also only need these coefficients to make a prediction on a test-input xt , and can write the
entire classifier in terms of inner-products between the test point and training points:
$$h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_t.$$
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the
squared-loss is inner-products between all pairs of data vectors.
Inner-Product Computation
Let's go back to the previous example, $\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}$.
The inner product ϕ(x)⊤ ϕ(z) can be formulated as:
$$\phi(\mathbf{x})^\top \phi(\mathbf{z}) = 1 \cdot 1 + x_1 z_1 + x_2 z_2 + \cdots + x_1 x_2 z_1 z_2 + \cdots + x_1 \cdots x_d\, z_1 \cdots z_d = \prod_{k=1}^{d} (1 + x_k z_k).$$
The sum of $2^d$ terms becomes the product of $d$ terms. We can compute the inner-product from the above formula in time $O(d)$ instead of $O(2^d)$! We define the function
$$\mathsf{k}(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j);$$
this is called the kernel function.
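For this particular feature map, the kernel is a one-liner (a sketch assuming NumPy; `all_subsets_kernel` is a made-up name):

```python
import numpy as np

def all_subsets_kernel(x, z):
    """phi(x)^T phi(z) for the all-products feature map, computed in O(d)."""
    return np.prod(1.0 + x * z)

x = np.array([2.0, 3.0, 5.0])
z = np.array([1.0, -1.0, 0.5])
print(all_subsets_kernel(x, z))      # O(d) via the product formula
# expand_features(x) @ expand_features(z) from the earlier sketch gives the
# same value with O(2**d) work.
```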
With a finite training set of n samples, inner products are often pre-computed and stored in a Kernel Matrix:
Kij = ϕ(xi )⊤ ϕ(xj).
If we store the matrix K , we only need to do simple inner-product look-ups and low-dimensional computations
throughout the gradient descent algorithm. The final classifier becomes:
$$h(\mathbf{x}_t) = \sum_{j=1}^{n} \alpha_j \mathsf{k}(\mathbf{x}_j, \mathbf{x}_t).$$
During training in the new high dimensional space of $\phi(\mathbf{x})$ we want to compute $\gamma_i$ through kernels, without ever computing any $\phi(\mathbf{x}_i)$ or even $\mathbf{w}$. We previously established that $\mathbf{w} = \sum_{j=1}^{n} \alpha_j \phi(\mathbf{x}_j)$, and $\gamma_i = 2\left(\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i\right)$. It follows that $\gamma_i = 2\left(\sum_{j=1}^{n} \alpha_j K_{ij} - y_i\right)$. The gradient update in iteration $t+1$ becomes
$$\alpha_i^{t+1} \leftarrow \alpha_i^t - 2s\left(\sum_{j=1}^{n} \alpha_j^t K_{ij} - y_i\right).$$
As we have $n$ such updates to do, the amount of work per gradient update in the transformed space is $O(n^2)$, far better than $O(2^d)$.
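Putting the pieces together, here is a minimal sketch of kernelized gradient descent for the squared loss (my own illustration, not code from the lecture; NumPy assumed, and the function names, step size, and iteration count are arbitrary placeholders):

```python
import numpy as np

def all_subsets_kernel(x, z):
    return np.prod(1.0 + x * z)                  # k(x, z) = prod_k (1 + x_k z_k)

def train(X, y, s=1e-3, iterations=1000, kernel=all_subsets_kernel):
    """Learn alpha_1, ..., alpha_n by gradient descent on the kernelized squared loss."""
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # precompute the kernel matrix
    alpha = np.zeros(n)
    for _ in range(iterations):
        gamma = 2 * (K @ alpha - y)              # gamma_i = 2 (sum_j alpha_j K_ij - y_i)
        alpha = alpha - s * gamma                # alpha_i <- alpha_i - s * gamma_i
    return alpha

def predict(alpha, X, x_test, kernel=all_subsets_kernel):
    """h(x_test) = sum_j alpha_j k(x_j, x_test)."""
    return sum(a * kernel(xj, x_test) for a, xj in zip(alpha, X))
```

Note that both training and prediction touch the data only through the kernel; $\phi(\mathbf{x})$ and $\mathbf{w}$ never appear.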
General Kernels
Below are some popular kernel functions:
Linear: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$.
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix
if the dimensionality d of the data is high.)
Polynomial: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^d$.
Radial Basis Function (RBF) (aka Gaussian Kernel): $\mathsf{K}(\mathbf{x}, \mathbf{z}) = e^{\frac{-\|\mathbf{x} - \mathbf{z}\|^2}{\sigma^2}}$.
The RBF kernel is the most popular Kernel! It is a Universal approximator!! Its corresponding feature vector is
infinite dimensional and cannot be computed. However, very effective low dimensional approximations exist (see
this paper).
Exponential Kernel: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = e^{\frac{-\|\mathbf{x} - \mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = e^{\frac{-|\mathbf{x} - \mathbf{z}|}{\sigma}}$
Sigmoid Kernel: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = \tanh(a\,\mathbf{x}^\top \mathbf{z} + c)$
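As an illustration (my own sketch, not from the notes; the default parameter values are arbitrary), these kernels translate directly into code:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3):
    return (1.0 + x @ z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def exponential_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - z)) / sigma)

def sigmoid_kernel(x, z, a=1.0, c=0.0):
    return np.tanh(a * (x @ z) + c)
```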
Kernel functions
Can any function K(⋅, ⋅) → R be used as a kernel?
No, the matrix $\mathsf{K}(\mathbf{x}_i, \mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.
Definition: A matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite iff $\forall \mathbf{q} \in \mathbb{R}^n$, $\mathbf{q}^\top A \mathbf{q} \geq 0$.
Remember $K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$. So $K = \Phi^\top \Phi$, where $\Phi = [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)]$. It follows that $K$ is p.s.d., because $\mathbf{q}^\top K \mathbf{q} = (\Phi \mathbf{q})^\top (\Phi \mathbf{q}) = \|\Phi \mathbf{q}\|^2 \geq 0$. Inversely, if any matrix $A$ is p.s.d., it can be decomposed as $A = \Phi^\top \Phi$ for some realization of $\Phi$.
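For intuition, a small numerical check (my own sketch, not from the notes): build an RBF kernel matrix on random points and confirm that its eigenvalues are non-negative up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
sigma = 1.0

# K[i, j] = exp(-||x_i - x_j||^2 / sigma^2): the RBF kernel matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma ** 2)

eigenvalues = np.linalg.eigvalsh(K)      # K is symmetric, so use eigvalsh
print(eigenvalues.min() >= -1e-10)       # True: K is positive semi-definite
```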
You can even define kernels over sets, strings, graphs and molecules.
Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. An RBF kernel produces a good decision boundary in this case.