

Lecture 13: Kernels



Linear classifiers are great, but what if there exists no linear decision boundary? As it turns out, there is an elegant
way to incorporate non-linearities into most linear classifiers.

Handcrafted Feature Expansion


We can make linear classifiers non-linear by applying basis functions (feature transformations) to the input feature vectors. Formally, for a data vector $\mathbf{x} \in \mathbb{R}^d$, we apply the transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$ where $\phi(\mathbf{x}) \in \mathbb{R}^D$. Usually $D \gg d$ because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well behaved. (i.e. you can still use your original
gradient descent code, just with the higher dimensional representation)

Disadvantage: ϕ(x) might be very high dimensional.

Consider the following example: $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}$, and define
$$\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}.$$

Quiz: What is the dimensionality of $\phi(\mathbf{x})$?

This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
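To see the blow-up concretely, here is a minimal sketch (not part of the original notes) that materializes this particular $\phi(\mathbf{x})$, with one coordinate per subset of features; the helper name phi_all_products is ours, and the code is only feasible for tiny $d$:

    import itertools
    import numpy as np

    def phi_all_products(x):
        # One coordinate per subset of the d features: the empty subset
        # contributes the constant 1, singletons contribute x_k, pairs x_i*x_j,
        # and so on up to the product of all features. Output length is 2^d.
        d = len(x)
        feats = []
        for k in range(d + 1):
            for subset in itertools.combinations(range(d), k):
                feats.append(np.prod([x[i] for i in subset]) if subset else 1.0)
        return np.array(feats)

    x = np.array([2.0, 3.0, 5.0])                # d = 3
    print(phi_all_products(x))                   # 2^3 = 8 coordinates
    print(phi_all_products(np.ones(10)).shape)   # (1024,) -- and it doubles with every extra feature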

The Kernel Trick

Gradient Descent with Squared Loss

The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space,
without ever computing a single vector ϕ(x) or ever computing the full vector w. It is a little magical.

It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the
gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:
$$\ell(\mathbf{w}) = \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$$

The gradient descent rule, with step-size/learning-rate s > 0 (we denoted this as α > 0 in our previous lectures),
updates w over time,
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right) \quad \textrm{where: } \frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^{n} \underbrace{2(\mathbf{w}^\top \mathbf{x}_i - y_i)}_{\gamma_i:\ \textrm{function of } \mathbf{x}_i, y_i} \mathbf{x}_i = \sum_{i=1}^{n} \gamma_i \mathbf{x}_i$$

We will now show that we can express w as a linear combination of all input vectors,
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i.$$

Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0 = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$. For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$ is trivially $\alpha_1 = \cdots = \alpha_n = 0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1, \ldots, \alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\mathbf{w}_1 = \mathbf{w}_0 - s\sum_{i=1}^{n} 2(\mathbf{w}_0^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^0 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^0 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i \qquad (\textrm{with } \alpha_i^1 = \alpha_i^0 - s\gamma_i^0)$$
$$\mathbf{w}_2 = \mathbf{w}_1 - s\sum_{i=1}^{n} 2(\mathbf{w}_1^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^1 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i \qquad (\textrm{with } \alpha_i^2 = \alpha_i^1 - s\gamma_i^1)$$
$$\mathbf{w}_3 = \mathbf{w}_2 - s\sum_{i=1}^{n} 2(\mathbf{w}_2^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^2 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^3 \mathbf{x}_i \qquad (\textrm{with } \alpha_i^3 = \alpha_i^2 - s\gamma_i^2)$$
$$\cdots$$
$$\mathbf{w}_t = \mathbf{w}_{t-1} - s\sum_{i=1}^{n} 2(\mathbf{w}_{t-1}^\top \mathbf{x}_i - y_i)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^{t-1} \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^{t-1} \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^t \mathbf{x}_i \qquad (\textrm{with } \alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1})$$

Formally, the argument is by induction: $\mathbf{w}_0$ is trivially a linear combination of the training vectors (base case), and if the claim holds for $\mathbf{w}_t$, the update above shows it also holds for $\mathbf{w}_{t+1}$ (inductive step).

The update rule for $\alpha_i^t$ is thus
$$\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}, \quad \textrm{and we have} \quad \alpha_i^t = -s\sum_{r=0}^{t-1} \gamma_i^r.$$

In other words, we can perform the entire gradient descent update rule without ever expressing w explicitly. We
just keep track of the n coefficients α1 , … , αn . Now that w can be written as a linear combination of the training
set, we can also express the inner-product of w with any input x i purely in terms of inner-products between
training inputs:
$$\mathbf{w}^\top \mathbf{x}_j = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i^\top \mathbf{x}_j.$$

Consequently, we can also re-write the squared loss $\ell(\mathbf{w}) = \sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$ entirely in terms of inner products between training inputs:
$$\ell(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \left(\sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_i - y_i\right)^2$$

During test-time we also only need these coefficients to make a prediction on a test-input xt , and can write the
entire classifier in terms of inner-products between the test point and training points:
$$h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_t.$$

Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the
squared-loss is inner-products between all pairs of data vectors.
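As a sanity check (our own sketch, not from the lecture), the following compares ordinary gradient descent on $\mathbf{w}$ with the equivalent update on the coefficients $\alpha_i$, which touches the data only through the matrix of pairwise inner products; the toy data, step size, and iteration count are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 5))       # n = 20 training points, d = 5 features (rows of X)
    y = rng.standard_normal(20)
    s = 0.005                              # step size (illustrative)

    w = np.zeros(5)                        # version 1: gradient descent on w directly
    alpha = np.zeros(20)                   # version 2: track only the n coefficients alpha_i
    G = X @ X.T                            # G[i, j] = x_i^T x_j -- all the information we need

    for _ in range(200):
        w = w - s * 2.0 * X.T @ (X @ w - y)          # grad = sum_i 2(w^T x_i - y_i) x_i
        alpha = alpha - s * 2.0 * (G @ alpha - y)    # the same update, expressed in alpha

    print(np.allclose(w, X.T @ alpha))     # True: w = sum_i alpha_i x_i throughout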

Inner-Product Computation

Let's go back to the previous example,
$$\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}.$$
The inner product $\phi(\mathbf{x})^\top \phi(\mathbf{z})$ can be formulated as:
$$\phi(\mathbf{x})^\top \phi(\mathbf{z}) = 1 \cdot 1 + x_1 z_1 + x_2 z_2 + \cdots + x_1 x_2 z_1 z_2 + \cdots + x_1 \cdots x_d z_1 \cdots z_d = \prod_{k=1}^{d} (1 + x_k z_k).$$

The sum of $2^d$ terms becomes the product of $d$ terms. We can compute the inner product from the above formula in time $O(d)$ instead of $O(2^d)$! We define the function
$$\underbrace{k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)}_{\textrm{this is called the kernel function}}$$
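For example, here is a minimal sketch of this particular kernel (the function name is ours), which evaluates $\prod_{k=1}^{d}(1 + x_k z_k)$ in $O(d)$ and therefore never touches the $2^d$-dimensional $\phi$:

    import numpy as np

    def k_product(x, z):
        # phi(x)^T phi(z) for the all-subset-products feature map,
        # computed in O(d) as prod_k (1 + x_k * z_k).
        return np.prod(1.0 + x * z)

    x = np.array([2.0, 3.0, 5.0])
    z = np.array([1.0, -1.0, 0.5])
    print(k_product(x, z))   # (1+2)(1-3)(1+2.5) = -21.0, matching the explicit inner product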

With a finite training set of n samples, inner products are often pre-computed and stored in a Kernel Matrix:

$$K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j).$$

If we store the matrix K , we only need to do simple inner-product look-ups and low-dimensional computations
throughout the gradient descent algorithm. The final classifier becomes:
$$h(\mathbf{x}_t) = \sum_{j=1}^{n} \alpha_j k(\mathbf{x}_j, \mathbf{x}_t).$$

During training in the new high-dimensional space of $\phi(\mathbf{x})$ we want to compute $\gamma_i$ through kernels, without ever computing any $\phi(\mathbf{x}_i)$ or even $\mathbf{w}$. We previously established that $\mathbf{w} = \sum_{j=1}^{n} \alpha_j \phi(\mathbf{x}_j)$, and $\gamma_i = 2(\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i)$. It follows that $\gamma_i = 2\left(\sum_{j=1}^{n} \alpha_j K_{ij} - y_i\right)$. The gradient update in iteration $t+1$ becomes
$$\alpha_i^{t+1} \leftarrow \alpha_i^t - 2s\left(\sum_{j=1}^{n} \alpha_j^t K_{ij} - y_i\right).$$

As we have $n$ such updates to do, the amount of work per gradient update in the transformed space is $O(n^2)$, far better than $O(2^d)$.
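Putting the pieces together, here is a minimal sketch of this kernelized gradient descent; the function names, default step size, and iteration count are our own illustrative choices, not the lecture's:

    import numpy as np

    def kernelized_gd(K, y, s=0.01, iterations=1000):
        # Gradient descent on the squared loss, parameterized by alpha.
        # K is the precomputed n x n kernel matrix, K[i, j] = k(x_i, x_j).
        n = len(y)
        alpha = np.zeros(n)                    # corresponds to initializing w_0 = 0
        for _ in range(iterations):
            gamma = 2.0 * (K @ alpha - y)      # gamma_i = 2(sum_j alpha_j K_ij - y_i)
            alpha = alpha - s * gamma          # alpha_i <- alpha_i - s * gamma_i
        return alpha

    def predict(alpha, k_test):
        # h(x_t) = sum_j alpha_j k(x_j, x_t); k_test[j] = k(x_j, x_t).
        return alpha @ k_test

Each pass costs $O(n^2)$ for the matrix-vector product K @ alpha, exactly as stated above.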

General Kernels
Below are some popular kernel functions:

Linear: $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$.

(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix
if the dimensionality d of the data is high.)

Polynomial: $K(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^d$.


Radial Basis Function (RBF) (aka Gaussian Kernel): $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{\sigma^2}}$.

The RBF kernel is the most popular Kernel! It is a Universal approximator!! Its corresponding feature vector is
infinite dimensional and cannot be computed. However, very effective low dimensional approximations exist (see
this paper).

Exponential Kernel: $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{\|\mathbf{x} - \mathbf{z}\|}{2\sigma^2}}$

Laplacian Kernel: $K(\mathbf{x}, \mathbf{z}) = e^{-\frac{|\mathbf{x} - \mathbf{z}|}{\sigma}}$

Sigmoid Kernel: $K(\mathbf{x}, \mathbf{z}) = \tanh(a\,\mathbf{x}^\top \mathbf{z} + c)$
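A minimal sketch of these kernels for a single pair of input vectors follows; hyperparameter names and defaults are illustrative, and the Laplacian kernel is written with the $L_1$ distance for $|\mathbf{x} - \mathbf{z}|$:

    import numpy as np

    def linear_kernel(x, z):
        return x @ z

    def polynomial_kernel(x, z, d=3):
        return (1.0 + x @ z) ** d

    def rbf_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / sigma**2)

    def exponential_kernel(x, z, sigma=1.0):
        return np.exp(-np.linalg.norm(x - z) / (2.0 * sigma**2))

    def laplacian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum(np.abs(x - z)) / sigma)

    def sigmoid_kernel(x, z, a=1.0, c=0.0):
        return np.tanh(a * (x @ z) + c)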

Kernel functions
Can any function $K(\cdot, \cdot) \to \mathbb{R}$ be used as a kernel?
No, the matrix $K(\mathbf{x}_i, \mathbf{x}_j)$ has to correspond to real inner products after some transformation $\mathbf{x} \to \phi(\mathbf{x})$. This is the case if and only if $K$ is positive semi-definite.

Definition: A matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite iff $\forall \mathbf{q} \in \mathbb{R}^n,\ \mathbf{q}^\top A \mathbf{q} \geq 0$.

Remember $K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$, so $K = \Phi^\top \Phi$, where $\Phi = [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)]$. It follows that $K$ is p.s.d., because $\mathbf{q}^\top K \mathbf{q} = (\Phi \mathbf{q})^\top (\Phi \mathbf{q}) = \|\Phi \mathbf{q}\|^2 \geq 0$. Conversely, if any matrix $A$ is p.s.d., it can be decomposed as $A = \Phi^\top \Phi$ for some realization of $\Phi$.
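As an illustration (our own sketch, with arbitrary toy data and bandwidth), one can build a kernel matrix, e.g. with the RBF kernel, and confirm numerically that its eigenvalues are non-negative:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((30, 4))                       # 30 toy points in R^4
    sigma = 1.0                                            # illustrative bandwidth

    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / sigma**2)                       # RBF kernel matrix

    eigvals = np.linalg.eigvalsh(K)                        # K is symmetric
    print(eigvals.min() >= -1e-10)                         # True up to round-off: K is p.s.d.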

You can even define kernels over sets, strings, graphs and molecules.


Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. An RBF kernel works well for the decision boundary in this case.

