
Chapter 5

Kernel methods

5.1 Feature maps


Recall that in our discussion about linear regression, we considered the problem of predicting the price of a house (denoted by y) from the living area of the house (denoted by x), and we fit a linear function of x to the training data. What if the price y can be more accurately represented as a non-linear function of x? In this case, we need a more expressive family of models than linear models.

We start by considering fitting cubic functions y = θ₃x³ + θ₂x² + θ₁x + θ₀. It turns out that we can view the cubic function as a linear function over a different set of feature variables (defined below). Concretely, let the function φ : ℝ → ℝ⁴ be defined as

 
$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^{2} \\ x^{3} \end{bmatrix} \in \mathbb{R}^{4} \tag{5.1}$$

Let θ ∈ ℝ⁴ be the vector containing θ₀, θ₁, θ₂, θ₃ as entries. Then we can rewrite the cubic function in x as:

$$\theta_3 x^3 + \theta_2 x^2 + \theta_1 x + \theta_0 = \theta^T \phi(x)$$

Thus, a cubic function of the variable x can be viewed as a linear function over the variables φ(x). To distinguish between these two sets of variables, in the context of kernel methods, we will call the original input value the input attributes of a problem (in this case, x, the living area). When the original input is mapped to some new set of quantities φ(x), we will call those new quantities the feature variables. (Unfortunately, different authors use different terms to describe these two things in different contexts.) We will call φ a feature map, which maps the attributes to the features.
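For concreteness, here is a minimal NumPy sketch of this cubic feature map; the coefficient values and the evaluation point below are arbitrary placeholders, not values taken from the notes.

```python
import numpy as np

def phi(x):
    """Cubic feature map of equation (5.1): maps a scalar x to [1, x, x^2, x^3]."""
    return np.array([1.0, x, x**2, x**3])

# Placeholder coefficients theta = (theta_0, theta_1, theta_2, theta_3).
theta = np.array([0.5, -1.0, 2.0, 0.3])

x = 1.7
# theta^T phi(x) agrees with the cubic theta_3 x^3 + theta_2 x^2 + theta_1 x + theta_0.
assert np.isclose(theta @ phi(x),
                  theta[3]*x**3 + theta[2]*x**2 + theta[1]*x + theta[0])
```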

5.2 LMS (least mean squares) with features


We will derive the gradient descent algorithm for fitting the model θᵀφ(x). First recall that for the ordinary least squares problem where we were to fit θᵀx, the batch gradient descent update is (see the first lecture note for its derivation):

$$\begin{aligned}
\theta &:= \theta + \alpha \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right) x^{(i)} \\
&:= \theta + \alpha \sum_{i=1}^{n} \left(y^{(i)} - \theta^T x^{(i)}\right) x^{(i)}
\end{aligned} \tag{5.2}$$

Let φ : ℝᵈ → ℝᵖ be a feature map that maps the attributes x (in ℝᵈ) to the features φ(x) in ℝᵖ. (In the motivating example in the previous subsection, we have d = 1 and p = 4.) Now our goal is to fit the function θᵀφ(x), with θ being a vector in ℝᵖ instead of ℝᵈ. We can replace all the occurrences of x⁽ⁱ⁾ in the algorithm above by φ(x⁽ⁱ⁾) to obtain the new update:

$$\theta := \theta + \alpha \sum_{i=1}^{n} \left(y^{(i)} - \theta^T \phi(x^{(i)})\right)\phi(x^{(i)}) \tag{5.3}$$
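As a sketch of how update (5.3) looks in code, the following NumPy snippet runs batch gradient descent over the feature representation; the feature map, step size, iteration count, and synthetic data are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def lms_with_features(X, y, phi, alpha=0.05, num_iters=5000):
    """Batch gradient descent for fitting theta^T phi(x), i.e. update (5.3).

    X: (n, d) array of input attributes, y: (n,) targets,
    phi: feature map sending a d-vector to a p-vector.
    """
    Phi = np.array([phi(x) for x in X])        # (n, p) matrix of features
    theta = np.zeros(Phi.shape[1])
    for _ in range(num_iters):
        residuals = y - Phi @ theta            # y^(i) - theta^T phi(x^(i))
        theta = theta + alpha * (Phi.T @ residuals)
    return theta

# Example: the cubic feature map (5.1) applied to scalar inputs.
phi_cubic = lambda x: np.array([1.0, x[0], x[0]**2, x[0]**3])
X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2 * X[:, 0]**3 - X[:, 0] + 0.5             # synthetic cubic data
theta = lms_with_features(X, y, phi_cubic)
```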

Similarly, the corresponding stochastic gradient descent update rule is

$$\theta := \theta + \alpha \left(y^{(i)} - \theta^T \phi(x^{(i)})\right)\phi(x^{(i)}) \tag{5.4}$$

5.3 LMS with the kernel trick


The gradient descent update, or the stochastic gradient update above, becomes computationally expensive when the features φ(x) are high-dimensional. For example, consider the direct extension of the feature map in equation (5.1) to high-dimensional input x: suppose x ∈ ℝᵈ, and let φ(x) be the vector that contains all the monomials of x with degree ≤ 3:


 
$$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_1^2 \\ x_1 x_2 \\ x_1 x_3 \\ \vdots \\ x_2 x_1 \\ \vdots \\ x_1^3 \\ x_1^2 x_2 \\ \vdots \end{bmatrix} \tag{5.5}$$

The dimension of the features φ(x) is on the order of d³.¹ This is a prohibitively long vector for computational purposes: when d = 1000, each update requires at least computing and storing a 1000³ = 10⁹ dimensional vector, which is 10⁶ times slower than the ordinary least squares update rule (5.2). It may appear at first that such d³ runtime per update and memory usage are inevitable, because the vector θ itself is of dimension p ≈ d³, and we may need to update every entry of θ and store it. However, we will introduce the kernel trick, with which we will not need to store θ explicitly, and the runtime can be significantly improved.

¹Here, for simplicity, we include all the monomials with repetitions (so that, e.g., x₁x₂x₃ and x₂x₃x₁ both appear in φ(x)). Therefore, there are in total 1 + d + d² + d³ entries in φ(x).
For simplicity, we assume that we initialize the value θ = 0, and we focus on the iterative update (5.3). The main observation is that at any time, θ can be represented as a linear combination of the vectors φ(x⁽¹⁾), …, φ(x⁽ⁿ⁾). Indeed, we can show this inductively as follows. At initialization, $\theta = 0 = \sum_{i=1}^{n} 0 \cdot \phi(x^{(i)})$. Assume that at some point, θ can be represented as

$$\theta = \sum_{i=1}^{n} \beta_i \phi(x^{(i)}) \tag{5.6}$$


for some β₁, …, βₙ ∈ ℝ. Then we claim that in the next round, θ is still a linear combination of φ(x⁽¹⁾), …, φ(x⁽ⁿ⁾) because

$$\begin{aligned}
\theta &:= \theta + \alpha \sum_{i=1}^{n} \left(y^{(i)} - \theta^T \phi(x^{(i)})\right)\phi(x^{(i)}) \\
&= \sum_{i=1}^{n} \beta_i \phi(x^{(i)}) + \alpha \sum_{i=1}^{n} \left(y^{(i)} - \theta^T \phi(x^{(i)})\right)\phi(x^{(i)}) \\
&= \sum_{i=1}^{n} \underbrace{\left(\beta_i + \alpha\left(y^{(i)} - \theta^T \phi(x^{(i)})\right)\right)}_{\text{new } \beta_i}\phi(x^{(i)})
\end{aligned} \tag{5.7}$$

You may realize that our general strategy is to implicitly represent the p-dimensional vector θ by a set of coefficients β₁, …, βₙ. Towards doing this, we derive the update rule of the coefficients β₁, …, βₙ. Using the equation above, we see that the new βᵢ depends on the old one via

$$\beta_i := \beta_i + \alpha\left(y^{(i)} - \theta^T \phi(x^{(i)})\right) \tag{5.8}$$

Here we still have the old θ on the RHS of the equation. Replacing θ by $\theta = \sum_{j=1}^{n} \beta_j \phi(x^{(j)})$ gives

$$\forall i \in \{1, \dots, n\}, \quad \beta_i := \beta_i + \alpha\left(y^{(i)} - \sum_{j=1}^{n} \beta_j \phi(x^{(j)})^T \phi(x^{(i)})\right)$$

We often rewrite φ(x⁽ʲ⁾)ᵀφ(x⁽ⁱ⁾) as 〈φ(x⁽ʲ⁾), φ(x⁽ⁱ⁾)〉 to emphasize that it is the inner product of the two feature vectors. Viewing the βᵢ's as the new representation of θ, we have successfully translated the batch gradient descent algorithm into an algorithm that updates the value of β iteratively. It may appear that at every iteration, we still need to compute the values of 〈φ(x⁽ʲ⁾), φ(x⁽ⁱ⁾)〉 for all pairs of i, j, each of which may take roughly O(p) operations. However, two important properties come to the rescue:

1. We can pre-compute the pairwise inner products 〈φ(x(j) ), φ(x(i) )〉 for all
pairs of i, j before the loop starts.

2. For the feature map φ defined in (5.5) (or many other interesting feature maps), computing 〈φ(x⁽ʲ⁾), φ(x⁽ⁱ⁾)〉 can be efficient and does not necessarily require computing φ(x⁽ⁱ⁾) explicitly. This is because:


$$\begin{aligned}
\langle \phi(x), \phi(z)\rangle &= 1 + \sum_{i=1}^{d} x_i z_i + \sum_{i,j \in \{1,\dots,d\}} x_i x_j z_i z_j + \sum_{i,j,k \in \{1,\dots,d\}} x_i x_j x_k z_i z_j z_k \\
&= 1 + \sum_{i=1}^{d} x_i z_i + \left(\sum_{i=1}^{d} x_i z_i\right)^{2} + \left(\sum_{i=1}^{d} x_i z_i\right)^{3} \\
&= 1 + \langle x, z\rangle + \langle x, z\rangle^{2} + \langle x, z\rangle^{3}
\end{aligned} \tag{5.9}$$

Therefore, to compute 〈φ(x), φ(z)〉, we can first compute 〈x, z〉 in O(d) time and then take another constant number of operations to compute 1 + 〈x, z〉 + 〈x, z〉² + 〈x, z〉³.
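This identity is easy to check numerically. The sketch below, using arbitrary random inputs, compares the O(d) evaluation of the kernel against an explicit (and deliberately wasteful) construction of φ; the helper names are ours, not from the notes.

```python
import numpy as np
from itertools import product

def cubic_kernel(x, z):
    """K(x, z) = 1 + <x,z> + <x,z>^2 + <x,z>^3, computed in O(d) time as in (5.9)."""
    s = np.dot(x, z)
    return 1 + s + s**2 + s**3

def phi_explicit(x):
    """Explicit vector of all monomials of degree <= 3 (with repetitions):
    1 + d + d^2 + d^3 entries, built only to verify the identity."""
    d = len(x)
    feats = [1.0]
    feats += [x[i] for i in range(d)]
    feats += [x[i] * x[j] for i, j in product(range(d), repeat=2)]
    feats += [x[i] * x[j] * x[k] for i, j, k in product(range(d), repeat=3)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
assert np.isclose(cubic_kernel(x, z), phi_explicit(x) @ phi_explicit(z))
```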

As you will see, the inner products between the features 〈φ(x), φ(z)〉 are essential here. We define the kernel corresponding to the feature map φ as a function that maps X × X → ℝ satisfying:²

$$K(x, z) \triangleq \langle \phi(x), \phi(z)\rangle \tag{5.10}$$

²Recall that X is the space of the input x. In our running example, X = ℝᵈ.

To wrap up the discussion, we write down the final algorithm as follows:

1. Compute all the values K(x⁽ⁱ⁾, x⁽ʲ⁾) ≜ 〈φ(x⁽ⁱ⁾), φ(x⁽ʲ⁾)〉 using equation (5.9) for all i, j ∈ {1, …, n}. Set β := 0.

2. Loop:

$$\forall i \in \{1, \dots, n\}, \quad \beta_i := \beta_i + \alpha\left(y^{(i)} - \sum_{j=1}^{n} \beta_j K(x^{(i)}, x^{(j)})\right) \tag{5.11}$$

Or in vector notation, letting K be the n × n matrix with Kᵢⱼ = K(x⁽ⁱ⁾, x⁽ʲ⁾), we have

$$\beta := \beta + \alpha(\vec{y} - K\beta)$$

With the algorithm above, we can update the representation β of the vector θ efficiently with O(n) time per update. Finally, we need to show that the knowledge of the representation β suffices to compute the prediction θᵀφ(x). Indeed, we have

$$\theta^T \phi(x) = \sum_{i=1}^{n} \beta_i \phi(x^{(i)})^T \phi(x) = \sum_{i=1}^{n} \beta_i K(x^{(i)}, x) \tag{5.12}$$

You may realize that fundamentally all we need to know about the feature
map φ(·) is encapsulated in the corresponding kernel function K(·, ·). We
will expand on this in the next section.
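Putting the pieces together, here is a minimal NumPy sketch of the kernelized LMS algorithm of this section: precompute the kernel matrix, iterate the update (5.11) in its vector form, and predict via (5.12). The step size, iteration count, and the tiny data set are placeholder choices for illustration.

```python
import numpy as np

def kernelized_lms(X, y, kernel, alpha=0.1, num_iters=5000):
    """Kernelized LMS: represent theta implicitly by beta and iterate
    beta := beta + alpha * (y - K beta), the vector form of (5.11)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(num_iters):
        beta = beta + alpha * (y - K @ beta)
    return beta

def predict(X_train, beta, kernel, x_new):
    """Prediction (5.12): theta^T phi(x) = sum_i beta_i K(x^(i), x)."""
    return sum(b * kernel(xi, x_new) for b, xi in zip(beta, X_train))

# Usage with the degree-3 polynomial kernel of (5.9) on toy one-dimensional data.
k3 = lambda x, z: 1 + np.dot(x, z) + np.dot(x, z)**2 + np.dot(x, z)**3
X = [np.array([0.1]), np.array([0.4]), np.array([0.7])]
y = np.array([0.2, 0.5, 1.1])
beta = kernelized_lms(X, y, k3)
print(predict(X, beta, k3, np.array([0.5])))
```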

5.4 Properties of kernels


In the last subsection, we started with an explicitly defined feature map φ, which induces the kernel function K(x, z) ≜ 〈φ(x), φ(z)〉. Then we saw that the kernel function is so intrinsic that, as long as the kernel function is defined, the whole training algorithm can be written entirely in the language of the kernel without referring to the feature map φ, and so can the prediction of a test example x (equation (5.12)).

Therefore, it would be tempting to define other kernel functions K(·, ·) and run the algorithm (5.11). Note that the algorithm (5.11) does not need to explicitly access the feature map φ, and therefore we only need to ensure the existence of the feature map φ, but do not necessarily need to be able to explicitly write φ down.
What kinds of functions K(·, ·) can correspond to some feature map φ? In other words, can we tell if there is some feature mapping φ so that K(x, z) = φ(x)ᵀφ(z) for all x, z?
If we can answer this question by giving a precise characterization of valid kernel functions, then we can completely change the interface of selecting feature maps φ to the interface of selecting kernel functions K. Concretely, we can pick a function K, verify that it satisfies the characterization (so that there exists a feature map φ that K corresponds to), and then we can run the update rule (5.11). The benefit here is that we don't have to be able to compute φ or write it down analytically, and we only need to know its existence. We will answer this question at the end of this subsection after we go through several concrete examples of kernels.
Suppose x, z ∈ ℝᵈ, and let's first consider the function K(·, ·) defined as:

$$K(x, z) = (x^T z)^2.$$



We can also write this as

$$\begin{aligned}
K(x, z) &= \left(\sum_{i=1}^{d} x_i z_i\right)\left(\sum_{j=1}^{d} x_j z_j\right) \\
&= \sum_{i=1}^{d}\sum_{j=1}^{d} x_i x_j z_i z_j \\
&= \sum_{i,j=1}^{d} (x_i x_j)(z_i z_j)
\end{aligned}$$

Thus, we see that K(x, z) = 〈φ(x), φ(z)〉 is the kernel function that corresponds to the feature mapping φ given (shown here for the case of d = 3) by

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}.$$
Revisiting the computational efficiency perspective of kernels, note that whereas calculating the high-dimensional φ(x) requires O(d²) time, finding K(x, z) takes only O(d) time, linear in the dimension of the input attributes.
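As a quick numerical sanity check of this correspondence (the helper below is illustrative, not from the notes), the O(d) kernel evaluation agrees with the inner product of the explicit O(d²)-dimensional feature vectors:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit O(d^2)-dimensional feature vector with entries x_i x_j for all i, j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
x, z = rng.normal(size=3), rng.normal(size=3)

# (x^T z)^2 computed in O(d) time equals <phi(x), phi(z)> computed in O(d^2) time.
assert np.isclose(np.dot(x, z) ** 2, phi_quadratic(x) @ phi_quadratic(z))
```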
For another related example, also consider K(·, ·) defined by

$$\begin{aligned}
K(x, z) &= (x^T z + c)^2 \\
&= \sum_{i,j=1}^{d} (x_i x_j)(z_i z_j) + \sum_{i=1}^{d} \left(\sqrt{2c}\, x_i\right)\left(\sqrt{2c}\, z_i\right) + c^2.
\end{aligned}$$

(Check this yourself.) This function K is a kernel function that corresponds to the feature mapping (again shown for d = 3)


 
$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix},$$

and the parameter c controls the relative weighting between the xᵢ (first order) and the xᵢxⱼ (second order) terms.

More broadly, the kernel K(x, z) = (xᵀz + c)ᵏ corresponds to a feature mapping to a $\binom{d+k}{k}$-dimensional feature space, consisting of all monomials of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$ that are up to order k. However, despite working in this O(dᵏ)-dimensional space, computing K(x, z) still takes only O(d) time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
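A brief sketch of this point, with c and k as free parameters: evaluating the kernel is an O(d) computation, while the count below only illustrates how large the implicit feature space would be.

```python
import math
import numpy as np

def poly_kernel(x, z, c=1.0, k=3):
    """K(x, z) = (x^T z + c)^k, evaluated in O(d) time."""
    return (np.dot(x, z) + c) ** k

d, k = 1000, 3
# Dimension of the corresponding space of monomials up to order k: (d + k choose k).
print(math.comb(d + k, k))   # 167668501 features, never materialized explicitly
```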

Kernels as similarity metrics. Now, let's talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if φ(x) and φ(z) are close together, then we might expect K(x, z) = φ(x)ᵀφ(z) to be large. Conversely, if φ(x) and φ(z) are far apart, say nearly orthogonal to each other, then K(x, z) = φ(x)ᵀφ(z) will be small. So, we can think of K(x, z) as some measurement of how similar φ(x) and φ(z) are, or of how similar x and z are.
Given this intuition, suppose that for some learning problem that you're working on, you've come up with some function K(x, z) that you think might be a reasonable measure of how similar x and z are. For instance, perhaps you chose

$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right).$$

This is a reasonable measure of x and z's similarity, and is close to 1 when x and z are close, and near 0 when x and z are far apart. Does there exist a feature map φ such that the kernel K defined above satisfies K(x, z) = φ(x)ᵀφ(z)? In this particular example, the answer is yes. This kernel is called the Gaussian kernel, and corresponds to an infinite dimensional feature mapping φ. We will give a precise characterization about what properties a function K needs to satisfy so that it can be a valid kernel function that corresponds to some feature map φ.
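As a small illustration, the Gaussian kernel can be written directly from its formula; the bandwidth σ is a free parameter, not something fixed by the notes.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

print(gaussian_kernel([0.0, 0.0], [0.1, 0.0]))   # close inputs: ~0.995
print(gaussian_kernel([0.0, 0.0], [5.0, 0.0]))   # distant inputs: ~3.7e-06
```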

Necessary conditions for valid kernels. Suppose for now that K is indeed a valid kernel corresponding to some feature mapping φ, and we will first see what properties it satisfies. Now, consider some finite set of n points (not necessarily the training set) x⁽¹⁾, …, x⁽ⁿ⁾, and let a square, n-by-n matrix K be defined so that its (i, j)-entry is given by Kᵢⱼ = K(x⁽ⁱ⁾, x⁽ʲ⁾). This matrix is called the kernel matrix. Note that we've overloaded the notation and used K to denote both the kernel function K(x, z) and the kernel matrix K, due to their obvious close relationship.

Now, if K is a valid kernel, then Kᵢⱼ = K(x⁽ⁱ⁾, x⁽ʲ⁾) = φ(x⁽ⁱ⁾)ᵀφ(x⁽ʲ⁾) = φ(x⁽ʲ⁾)ᵀφ(x⁽ⁱ⁾) = K(x⁽ʲ⁾, x⁽ⁱ⁾) = Kⱼᵢ, and hence K must be symmetric. Moreover, letting φₖ(x) denote the k-th coordinate of the vector φ(x), we find that for any vector z, we have
$$\begin{aligned}
z^T K z &= \sum_i \sum_j z_i K_{ij} z_j \\
&= \sum_i \sum_j z_i \phi(x^{(i)})^T \phi(x^{(j)}) z_j \\
&= \sum_i \sum_j z_i \sum_k \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j \\
&= \sum_k \sum_i \sum_j z_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j \\
&= \sum_k \left(\sum_i z_i \phi_k(x^{(i)})\right)^2 \\
&\geq 0.
\end{aligned}$$

The second-to-last step uses the fact that $\sum_{i,j} a_i a_j = \left(\sum_i a_i\right)^2$ for $a_i = z_i \phi_k(x^{(i)})$. Since z was arbitrary, this shows that K is positive semi-definite (K ⪰ 0).
Hence, we've shown that if K is a valid kernel (i.e., if it corresponds to some feature mapping φ), then the corresponding kernel matrix K ∈ ℝⁿˣⁿ is symmetric positive semidefinite.
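This necessary condition suggests a simple numerical check. The sketch below builds the kernel matrix for a finite set of points and tests symmetry and positive semi-definiteness via its eigenvalues; the candidate kernels and the random points are illustrative choices.

```python
import numpy as np

def kernel_matrix(kernel, points):
    """Kernel matrix K with K_ij = K(x^(i), x^(j)) for a finite set of points."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

def is_symmetric_psd(K, tol=1e-8):
    """Check the necessary condition: symmetry and nonnegative eigenvalues."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(2)
points = rng.normal(size=(10, 4))

valid = lambda x, z: (np.dot(x, z) + 1.0) ** 2     # a valid polynomial kernel
invalid = lambda x, z: -np.linalg.norm(x - z)      # not a valid kernel in general
print(is_symmetric_psd(kernel_matrix(valid, points)))    # True
print(is_symmetric_psd(kernel_matrix(invalid, points)))  # False: a negative eigenvalue exists
```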

Sucient conditions for valid kernels. More generally, the condition


above turns out to be not only a necessary, but also a sucient, condition
for K to be a valid kernel (also called a Mercer kernel). The following result
is due to Mercer.3

Theorem (Mercer). Let K : Rd × Rd 7→ R be given. Then for K


to be a valid (Mercer) kernel, it is necessary and sucient that for any
x(1) ,    , x(n) , (n < ∞), the corresponding kernel matrix is symmetric pos-
itive semi-denite.

Given a function K, apart from trying to find a feature mapping φ that corresponds to it, this theorem therefore gives another way of testing if it is a valid kernel. You'll also have a chance to play with these ideas more in problem set 2.
In class, we also briefly talked about a couple of other examples of kernels. For instance, consider the digit recognition problem, in which given an image (16x16 pixels) of a handwritten digit (0-9), we have to figure out which digit it was. Using either a simple polynomial kernel K(x, z) = (xᵀz)ᵏ or the Gaussian kernel, SVMs were able to obtain extremely good performance on this problem. This was particularly surprising since the input attributes x were just 256-dimensional vectors of the image pixel intensity values, and the system had no prior knowledge about vision, or even about which pixels are adjacent to which other ones. Another example that we briefly talked about in lecture was that if the objects x that we are trying to classify are strings (say, x is a list of amino acids, which strung together form a protein), then it seems hard to construct a reasonable, small set of features for most learning algorithms, especially if different strings have different lengths. However, consider letting φ(x) be a feature vector that counts the number of occurrences of each length-k substring in x. If we're considering strings of English letters, then there are 26ᵏ such strings. Hence, φ(x) is a 26ᵏ dimensional vector; even for moderate values of k, this is probably too big for us to efficiently work with. (E.g., 26⁴ ≈ 460000.) However, using (dynamic programming-ish) string matching algorithms, it is possible to efficiently compute K(x, z) = φ(x)ᵀφ(z), so that we can now implicitly work in this 26ᵏ-dimensional feature space, but without ever explicitly computing feature vectors in this space.
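As a simplified illustration of this idea (counting only contiguous length-k substrings with hash maps, rather than the dynamic-programming string kernels alluded to above), the inner product can be computed over just the substrings that actually occur, never forming the 26ᵏ-dimensional vector:

```python
from collections import Counter

def substring_kernel(x, z, k=3):
    """K(x, z) = <phi(x), phi(z)> where phi(x) counts each length-k substring of x.
    Only substrings that actually occur are stored, so the implicit 26^k-dimensional
    feature vector is never materialized. (A simple sketch, not the dynamic-programming
    string kernels mentioned in the text.)"""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cz = Counter(z[i:i + k] for i in range(len(z) - k + 1))
    # Inner product over the (sparse) set of shared substrings.
    return sum(cx[s] * cz[s] for s in cx.keys() & cz.keys())

print(substring_kernel("MKVLAAGIV", "MKVLSAGIV"))   # 4 shared length-3 substrings
```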

Application of kernel methods: We've seen the application of kernels to linear regression. In the next part, we will introduce the support vector machines, to which kernels can be directly applied. In fact, the idea of kernels has significantly broader applicability than linear regression and SVMs. Specifically, if you have any learning algorithm that you can write in terms of only inner products 〈x, z〉 between input attribute vectors, then by replacing this with K(x, z) where K is a kernel, you can magically allow your algorithm to work efficiently in the high-dimensional feature space corresponding to K. For instance, this kernel trick can be applied with the perceptron to derive a kernel perceptron algorithm (sketched below). Many of the algorithms that we'll see later in this class will also be amenable to this method, which has come to be known as the kernel trick.
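As an illustration of that last point, here is a hedged sketch of a kernel perceptron: the weight vector is never formed explicitly but is represented by per-example mistake counts, in the same spirit as the β coefficients of Section 5.3. Labels are assumed to be ±1, and the epoch count is an arbitrary choice.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, num_epochs=10):
    """Kernel perceptron: the weight vector is kept implicitly as
    sum_j beta_j y^(j) phi(x^(j)), so only kernel evaluations are needed."""
    y = np.asarray(y, dtype=float)             # labels assumed to be +/-1
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(num_epochs):
        for i in range(n):
            # Implicit score w^T phi(x^(i)) = sum_j beta_j y^(j) K(x^(j), x^(i)).
            if y[i] * np.dot(beta * y, K[:, i]) <= 0:
                beta[i] += 1.0                 # mistake: add y^(i) phi(x^(i)) to w
    return beta

def perceptron_predict(X_train, y_train, beta, kernel, x_new):
    score = sum(b * yi * kernel(xi, x_new) for b, yi, xi in zip(beta, y_train, X_train))
    return 1 if score > 0 else -1
```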
Chapter 6

Support vector machines

This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) off-the-shelf supervised learning algorithms. To tell the SVM story, we'll need to first talk about margins and the idea of separating data with a large gap. Next, we'll talk about the optimal margin classifier, which will lead us into a digression on Lagrange duality. We'll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite-dimensional) feature spaces, and finally, we'll close off the story with the SMO algorithm, which gives an efficient implementation of SVMs.

6.1 Margins: intuition


We'll start our story on SVMs by talking about margins. This section will give the intuitions about margins and about the confidence of our predictions; these ideas will be made formal in Section 6.3.

Consider logistic regression, where the probability p(y = 1 | x; θ) is modeled by hθ(x) = g(θᵀx). We then predict 1 on an input x if and only if hθ(x) ≥ 0.5, or equivalently, if and only if θᵀx ≥ 0. Consider a positive training example (y = 1). The larger θᵀx is, the larger also is hθ(x) = p(y = 1 | x; θ), and thus also the higher our degree of confidence that the label is 1. Thus, informally we can think of our prediction as being very confident that y = 1 if θᵀx ≫ 0. Similarly, we think of logistic regression as confidently predicting y = 0 if θᵀx ≪ 0. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find θ so that θᵀx⁽ⁱ⁾ ≫ 0 whenever y⁽ⁱ⁾ = 1, and θᵀx⁽ⁱ⁾ ≪ 0 whenever y⁽ⁱ⁾ = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples.
