Lecture 13: Kernels
[Videos: Machine Learning Lecture 21 "Model Selection / Kernels"; Video II]
Linear classifiers are great, but what if there exists no linear decision boundary? As it turns out, there is an elegant
way to incorporate non-linearities into most linear classifiers.
Handcrafted Feature Expansion
We can make linear classifiers non-linear by applying a basis function (feature transformation) to the input feature vectors. Formally, for a data vector $\mathbf{x} \in \mathbb{R}^d$, we apply the transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$ where $\phi(\mathbf{x}) \in \mathbb{R}^D$. Usually $D \gg d$ because we add dimensions that capture non-linear interactions among the original features.
Advantage: It is simple, and your problem stays convex and well behaved. (i.e. you can still use your original
gradient descent code, just with the higher dimensional representation)
Disadvantage: ϕ(x) might be very high dimensional.
Consider the following example: $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}$, and define $\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}$.
Quiz: What is the dimensionality of ϕ(x)?
This new representation, $\phi(\mathbf{x})$, is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
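To see the blow-up concretely, here is a minimal sketch (not from the lecture; the helper name `expand_features` is made up) that builds this $\phi(\mathbf{x})$ explicitly, with one entry per subset of features:

```python
import numpy as np
from itertools import combinations

def expand_features(x):
    """Explicitly construct phi(x): one coordinate per subset of features
    (the empty subset contributes the leading 1). The length is 2**d."""
    d = len(x)
    feats = []
    for k in range(d + 1):                       # subsets of size 0, 1, ..., d
        for idx in combinations(range(d), k):
            feats.append(np.prod(x[list(idx)]))  # product of the chosen coordinates
    return np.array(feats)

x = np.array([2.0, 3.0, 5.0])
print(len(expand_features(x)))                   # 2**3 = 8 dimensions for d = 3
```

Even for modest $d$ (say $d = 50$), this vector would have $2^{50}$ entries, which is exactly the problem the kernel trick avoids.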
The Kernel Trick
Gradient Descent with Squared Loss
The kernel trick is a way to get around this dilemma by learning a function in the much higher dimensional space,
without ever computing a single vector ϕ(x) or ever computing the full vector w. It is a little magical.
It is based on the following observation: If we use gradient descent with any one of our standard loss functions, the
gradient is a linear combination of the input samples. For example, let us take a look at the squared loss:
$$\ell(\mathbf{w}) = \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right)^2$$
The gradient descent rule, with step-size/learning-rate s > 0 (we denoted this as α > 0 in our previous lectures),
updates w over time,
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - s\left(\frac{\partial \ell}{\partial \mathbf{w}}\right) \quad \text{where: } \frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^{n} 2\left(\mathbf{w}^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \gamma_i\, \mathbf{x}_i \qquad (\gamma_i: \text{ function of } \mathbf{x}_i, y_i)$$
We will now show that we can express w as a linear combination of all input vectors,
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i.$$
Since the loss is convex, the final solution is independent of the initialization, and we can initialize $\mathbf{w}_0$ to be whatever we want. For convenience, let us pick $\mathbf{w}_0 = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$. For this initial choice of $\mathbf{w}_0$, the linear combination in $\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$ is trivially $\alpha_1 = \cdots = \alpha_n = 0$. We now show that throughout the entire gradient descent optimization such coefficients $\alpha_1, \ldots, \alpha_n$ must always exist, as we can re-write the gradient updates entirely in terms of updating the $\alpha_i$ coefficients:
$$\begin{aligned}
\mathbf{w}_1 &= \mathbf{w}_0 - s\sum_{i=1}^{n} 2\left(\mathbf{w}_0^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^0 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^0 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i && \left(\text{with } \alpha_i^1 = \alpha_i^0 - s\gamma_i^0\right)\\
\mathbf{w}_2 &= \mathbf{w}_1 - s\sum_{i=1}^{n} 2\left(\mathbf{w}_1^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^1 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^1 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i && \left(\text{with } \alpha_i^2 = \alpha_i^1 - s\gamma_i^1\right)\\
\mathbf{w}_3 &= \mathbf{w}_2 - s\sum_{i=1}^{n} 2\left(\mathbf{w}_2^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^2 \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^2 \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^3 \mathbf{x}_i && \left(\text{with } \alpha_i^3 = \alpha_i^2 - s\gamma_i^2\right)\\
&\;\;\vdots\\
\mathbf{w}_t &= \mathbf{w}_{t-1} - s\sum_{i=1}^{n} 2\left(\mathbf{w}_{t-1}^\top \mathbf{x}_i - y_i\right)\mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^{t-1} \mathbf{x}_i - s\sum_{i=1}^{n} \gamma_i^{t-1} \mathbf{x}_i = \sum_{i=1}^{n} \alpha_i^t \mathbf{x}_i && \left(\text{with } \alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}\right)
\end{aligned}$$
Formally, the argument is by induction: $\mathbf{w}_0$ is trivially a linear combination of the training vectors (base case), and if we apply the inductive hypothesis for $\mathbf{w}_t$, it follows for $\mathbf{w}_{t+1}$.
The update-rule for $\alpha_i^t$ is thus
$$\alpha_i^t = \alpha_i^{t-1} - s\gamma_i^{t-1}, \quad \text{and we have} \quad \alpha_i^t = -s\sum_{r=0}^{t-1} \gamma_i^r.$$
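As a quick sanity check (my own sketch, not part of the notes; NumPy assumed, all variable names made up), we can run plain gradient descent on the squared loss while tracking the $\alpha_i$ coefficients and verify that $\mathbf{w}$ always stays a linear combination of the training inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))          # rows are the training inputs x_i
y = rng.normal(size=n)
s = 0.01                             # step size

w = np.zeros(d)                      # w_0 = 0
alpha = np.zeros(n)                  # alpha_i^0 = 0

for t in range(100):
    gamma = 2 * (X @ w - y)          # gamma_i = 2 (w^T x_i - y_i)
    w = w - s * (X.T @ gamma)        # ordinary gradient step on w
    alpha = alpha - s * gamma        # equivalent update of the coefficients
    assert np.allclose(w, X.T @ alpha)   # w_t = sum_i alpha_i^t x_i holds throughout
```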
In other words, we can perform the entire gradient descent update rule without ever expressing w explicitly. We
just keep track of the n coefficients α1 , … , αn . Now that w can be written as a linear combination of the training
set, we can also express the inner-product of w with any input x i purely in terms of inner-products between
training inputs:
$$\mathbf{w}^\top \mathbf{x}_j = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i^\top \mathbf{x}_j.$$
Consequently, we can also re-write the squared-loss from $\ell(\mathbf{w}) = \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right)^2$ entirely in terms of inner-products between training inputs:
$$\ell(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \left(\sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_i - y_i\right)^2$$
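In matrix form this is a computation involving only the $n \times n$ matrix of pairwise inner-products; continuing the sketch above:

```python
G = X @ X.T                          # G[i, j] = x_i^T x_j

def squared_loss(alpha, G, y):
    """ell(alpha) = sum_i (sum_j alpha_j x_j^T x_i - y_i)^2."""
    residual = G @ alpha - y
    return float(np.sum(residual ** 2))

print(squared_loss(alpha, G, y))     # equals np.sum((X @ w - y) ** 2) from above
```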
During test-time we also only need these coefficients to make a prediction on a test-input xt , and can write the
entire classifier in terms of inner-products between the test point and training points:
$$h(\mathbf{x}_t) = \mathbf{w}^\top \mathbf{x}_t = \sum_{j=1}^{n} \alpha_j \mathbf{x}_j^\top \mathbf{x}_t.$$
Do you notice a theme? The only information we ever need in order to learn a hyper-plane classifier with the
squared-loss is inner-products between all pairs of data vectors.
Inner-Product Computation
Let's go back to the previous example, $\phi(\mathbf{x}) = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \\ x_1 x_2 \\ \vdots \\ x_{d-1} x_d \\ \vdots \\ x_1 x_2 \cdots x_d \end{pmatrix}$.
The inner product ϕ(x)⊤ ϕ(z) can be formulated as:
$$\phi(\mathbf{x})^\top \phi(\mathbf{z}) = 1 \cdot 1 + x_1 z_1 + x_2 z_2 + \cdots + x_1 x_2 z_1 z_2 + \cdots + x_1 \cdots x_d\, z_1 \cdots z_d = \prod_{k=1}^{d} (1 + x_k z_k).$$
The sum of $2^d$ terms becomes the product of $d$ terms. We can compute the inner-product from the above formula in time $O(d)$ instead of $O(2^d)$! We define the function
$$\mathsf{k}(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j);$$
this is called the kernel function.
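For this particular feature map, the kernel is a one-liner (a sketch assuming NumPy; `all_subsets_kernel` is a made-up name):

```python
import numpy as np

def all_subsets_kernel(x, z):
    """phi(x)^T phi(z) for the all-products feature map, computed in O(d)."""
    return np.prod(1.0 + x * z)

x = np.array([2.0, 3.0, 5.0])
z = np.array([1.0, -1.0, 0.5])
print(all_subsets_kernel(x, z))      # O(d) via the product formula
# expand_features(x) @ expand_features(z) from the earlier sketch gives the
# same value with O(2**d) work.
```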
With a finite training set of n samples, inner products are often pre-computed and stored in a Kernel Matrix:
Kij = ϕ(xi )⊤ ϕ(xj).
If we store the matrix K , we only need to do simple inner-product look-ups and low-dimensional computations
throughout the gradient descent algorithm. The final classifier becomes:
$$h(\mathbf{x}_t) = \sum_{j=1}^{n} \alpha_j \mathsf{k}(\mathbf{x}_j, \mathbf{x}_t).$$
During training in the new high dimensional space of $\phi(\mathbf{x})$ we want to compute $\gamma_i$ through kernels, without ever computing any $\phi(\mathbf{x}_i)$ or even $\mathbf{w}$. We previously established that $\mathbf{w} = \sum_{j=1}^{n} \alpha_j \phi(\mathbf{x}_j)$, and $\gamma_i = 2\left(\mathbf{w}^\top \phi(\mathbf{x}_i) - y_i\right)$. It follows that $\gamma_i = 2\left(\sum_{j=1}^{n} \alpha_j K_{ij} - y_i\right)$. The gradient update in iteration $t+1$ becomes
$$\alpha_i^{t+1} \leftarrow \alpha_i^t - 2s\left(\sum_{j=1}^{n} \alpha_j^t K_{ij} - y_i\right).$$
As we have $n$ such updates to do, the amount of work per gradient update in the transformed space is $O(n^2)$, far better than $O(2^d)$.
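Putting the pieces together, here is a minimal sketch of kernelized gradient descent for the squared loss (my own illustration, not code from the lecture; NumPy assumed, and the function names, step size, and iteration count are arbitrary placeholders):

```python
import numpy as np

def all_subsets_kernel(x, z):
    return np.prod(1.0 + x * z)                  # k(x, z) = prod_k (1 + x_k z_k)

def train(X, y, s=1e-3, iterations=1000, kernel=all_subsets_kernel):
    """Learn alpha_1, ..., alpha_n by gradient descent on the kernelized squared loss."""
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # precompute the kernel matrix
    alpha = np.zeros(n)
    for _ in range(iterations):
        gamma = 2 * (K @ alpha - y)              # gamma_i = 2 (sum_j alpha_j K_ij - y_i)
        alpha = alpha - s * gamma                # alpha_i <- alpha_i - s * gamma_i
    return alpha

def predict(alpha, X, x_test, kernel=all_subsets_kernel):
    """h(x_test) = sum_j alpha_j k(x_j, x_test)."""
    return sum(a * kernel(xj, x_test) for a, xj in zip(alpha, X))
```

Note that both training and prediction touch the data only through the kernel; $\phi(\mathbf{x})$ and $\mathbf{w}$ never appear.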
General Kernels
Below are some popular kernel functions:
Linear: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top \mathbf{z}$.
(The linear kernel is equivalent to just using a good old linear classifier - but it can be faster to use a kernel matrix
if the dimensionality d of the data is high.)
Polynomial: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^\top \mathbf{z})^d$.
Radial Basis Function (RBF) (aka Gaussian Kernel): $\mathsf{K}(\mathbf{x}, \mathbf{z}) = e^{\frac{-\|\mathbf{x} - \mathbf{z}\|^2}{\sigma^2}}$.
The RBF kernel is the most popular Kernel! It is a Universal approximator!! Its corresponding feature vector is
infinite dimensional and cannot be computed. However, very effective low dimensional approximations exist (see
this paper).
Exponential Kernel: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = e^{\frac{-\|\mathbf{x} - \mathbf{z}\|}{2\sigma^2}}$
Laplacian Kernel: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = e^{\frac{-|\mathbf{x} - \mathbf{z}|}{\sigma}}$
Sigmoid Kernel: $\mathsf{K}(\mathbf{x}, \mathbf{z}) = \tanh(a\,\mathbf{x}^\top \mathbf{z} + c)$
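As an illustration (my own sketch, not from the notes; the default parameter values are arbitrary), these kernels translate directly into code:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3):
    return (1.0 + x @ z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def exponential_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - z)) / sigma)

def sigmoid_kernel(x, z, a=1.0, c=0.0):
    return np.tanh(a * (x @ z) + c)
```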
Kernel functions
Can any function K(⋅, ⋅) → R be used as a kernel?
No, the matrix $\mathsf{K}(\mathbf{x}_i, \mathbf{x}_j)$ has to correspond to real inner-products after some transformation $\mathbf{x} \rightarrow \phi(\mathbf{x})$. This is the case if and only if $\mathsf{K}$ is positive semi-definite.
Definition: A matrix $A \in \mathbb{R}^{n \times n}$ is positive semi-definite iff $\forall \mathbf{q} \in \mathbb{R}^n$, $\mathbf{q}^\top A \mathbf{q} \geq 0$.
Remember $K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$. So $K = \Phi^\top \Phi$, where $\Phi = [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)]$. It follows that $K$ is p.s.d., because $\mathbf{q}^\top K \mathbf{q} = (\Phi \mathbf{q})^\top (\Phi \mathbf{q}) = \|\Phi \mathbf{q}\|^2 \geq 0$. Inversely, if any matrix $A$ is p.s.d., it can be decomposed as $A = \Phi^\top \Phi$ for some realization of $\Phi$.
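For intuition, a small numerical check (my own sketch, not from the notes): build an RBF kernel matrix on random points and confirm that its eigenvalues are non-negative up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
sigma = 1.0

# K[i, j] = exp(-||x_i - x_j||^2 / sigma^2): the RBF kernel matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma ** 2)

eigenvalues = np.linalg.eigvalsh(K)      # K is symmetric, so use eigvalsh
print(eigenvalues.min() >= -1e-10)       # True: K is positive semi-definite
```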
You can even define kernels over sets, strings, graphs and molecules.
Figure 1: The demo shows how a kernel function solves a problem that linear classifiers cannot solve. An RBF kernel produces a good decision boundary in this case.