Classification
Introduction
We have already discussed clustering, which is a means of finding patterns in our data to find sets of
similar features or of similar samples. Often, however, we have already clustered or selected our
samples by phenotype and would like to either determine which (genomic) features define the clusters
or use genomic features to assign new samples to the correct cluster. This is the classification
problem. For example, we might classify tissue biopsy samples into normal or cancerous. In the
computer science and machine learning literature, classification is sometimes called "supervised
learning" because the algorithm "learns" the estimated parameters of the classification rule from the
data with guidance from the known classes. By contrast, clustering is sometimes called "unsupervised
learning" because the algorithm must learn both the classes and the classification rule from the data.
The typical starting point is that we have some samples with known classification as well as other
measurements. The purpose is to use the known samples, which is called the training set, to get a
classification rule for new samples whose class is not known but for whom the other measurements are
available. As a simple example, we might think of determining whether a tissue sample is normal or
cancer based on features seen under a microscope. Typically for this course however, the
measurements will be high throughput measurements such as gene expression or methylation.
Since we will use the training set to tune the classification rule, we expect to do better classifying the
training samples than new samples using our rule. For this reason, to assess how well our
classification rule works, or to select among several competing rules, we also maintain another set of
samples, the validation set, for which we also know the correct classification. The validation set is not
used for tuning the classification rule; the elements of the validation set are classified using the
classification rule and these classifications are then compared with the correct classifications to
assess the misclassification rate. When choosing between different classifiers, we usually choose the
one with the lowest misclassification rate on the validation set. [1] We will discuss this more thoroughly
in the lesson on cross-validation and bootstraps.
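For concreteness, here is a minimal R sketch of this workflow. The data frame dat, its class column type, and the 2/3 - 1/3 split are hypothetical choices, and the classification tree is used only as a placeholder for whatever rule is being assessed.

library(rpart)                                  # recursive partitioning, one of the methods discussed below
set.seed(1)                                     # reproducible split
n     <- nrow(dat)                              # 'dat': one row per sample, 'type' = known class
train <- sample(n, size = round(2 * n / 3))     # e.g. 2/3 training, 1/3 validation
fit   <- rpart(type ~ ., data = dat[train, ], method = "class")
pred  <- predict(fit, newdata = dat[-train, ], type = "class")
mean(pred != dat$type[-train])                  # validation misclassification rate

Only the last line uses the validation samples; the rule itself is tuned entirely on the training set.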
Classification using genomic features can be useful to replace or enhance other methods. If a small set
of genes associated with different types of cancer can be identified, then an inexpensive microarray
might be able to diagnose whether a tissue sample is benign or cancerous more rapidly and accurately
than a pathologist.
However, there is a problem with our agenda - overfitting. Whenever the number of features is larger
than the number of samples, methods like linear discriminant analysis can classify the training sample
perfectly - even if the features are not actually associated with the phenotype. This means that with n
samples almost any set of n+1 features can achieve perfect classification of the training set. Although
we can assess how well the classification rule works using the validation set, we can end up with
millions of candidates as possible "good" classification rules. Both statisticians and computer
scientists have worked on this problem, attempting to find ways to find a close to optimal set of features
for classification [2].
We are not going to look at the methods for selecting a small number of features to use for
classification. However, we will look at 3 popular methods for developing classification rules once the
features have been selected: recursive partitioning, linear discriminant analysis, and support vector
machines (SVM). Recursive partitioning, which searches through all the features at each split and then
repeats the search within each resulting node (that is why it is "recursive"), can also be considered a
feature selection method, as usually only a subset of the features will actually be used in the
classification rule.
The first two of these methods have been around for a while. SVM, while not new, is the newest of the
three. There is a large literature on developing classification rules in both statistics and machine
learning. These three methods are very popular in both the statistics and machine learning worlds.
References:
[1] Lever, J., Krzywinski, M. & Altman, N. (2016). Points of Significance: Classification Evaluation. Nature Methods,
13(8), 603-604. doi:10.1038/nmeth.3945
[2] Lever, J., Krzywinski, M. & Altman, N. (2016). Points of Significance: Model Selection and Overfitting. Nature
Methods, 13(9), 703-704. doi:10.1038/nmeth.3968
We will explore classification using the 10 most differentially expressed genes from a study using 99 Affymetrix
microarrays from GSE47552 which focused on the differences in gene expression among normal, transitioning and
cancerous bone marrow samples. The number of samples for each of the 4 bone marrow types is summarized below:
Normal 5
MGUS 20
SMM 33
MM 41
To give some idea of the challenge of finding a classification rule for these 99 samples, we start with a cluster diagram
based on the 120 most differentially expressed genes. As can be seen below, even with 120 genes, the samples do not
cluster cleanly by marrow type. We will see how well we can do with classifying the samples using only 10 genes.
As always, before starting an analysis, we should explore the data. I usually use scatterplots. Since our objective is to
classify the samples using the gene expression, let's start by looking at all the samples using a scatterplot matrix based
on the genes. Here I display only the 5 most differentially expressed genes, due to space limitations. The sample types
are plotted in color where Normal is green, "G" is black, "M" is red and "S" is blue. We can see that we should be able to
do well in classifying the Normal samples, and reasonably well in classifying the "G" samples, but distinguishing between
"M" and "S" might be difficult.
We can see that these highly differentially expressed genes are also highly correlated with each other. As well, we
can see that the Normal group (green) is fairly well separated from the other samples using any pair of genes but that the
cancerous samples are intermingled.
Note as well that the usefulness of a gene set for classification (or prediction) is not the same as the usefulness of the
gene set for understanding the biological process which created the condition. The genes useful for classification need
only be associated with the phenotype - they are not necessarily causal.
Recursive partitioning is based on a very simple idea. It is the inverse of hierarchical clustering.
In hierarchical clustering, we start with individual items and cluster those that are closest in some
metric. In recursive partitioning, we start with a single cluster and then split into clusters that have the
smallest within cluster distances in some metric. When we are doing recursive partitioning for
clustering, the "within cluster distance" is just a measure of how homogeneous the cluster is with
respect to the classes of the objects in it. For example, a cluster with all the same tissue type is more
homogeneous than a cluster with a mix of tissue types.
There are several measures of cluster homogeneity. Denote the proportion of the node coming from
type i as p(i|node). The most popular homogeneity measure is the Gini index:
Σ_{i≠j} p(i|node) p(j|node), where i and j range over the classes. If all the nodes are homogeneous the Gini index is
zero. The information (entropy) impurity index is a similar idea based on information theory:
−Σ_i p(i|node) log(p(i|node)).
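As a small illustration (these helper functions are my own, not part of the course code), both impurity measures can be computed directly from the class labels of the samples in a node:

gini <- function(classes) {
  p <- table(classes) / length(classes)    # p(i|node) for each class i present in the node
  1 - sum(p^2)                             # equals the sum over i != j of p(i|node) p(j|node)
}
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log(p))                         # information (entropy) impurity
}
gini(c("N", "N", "N", "M"))                # a slightly impure node: 0.375
gini(c("N", "N", "N", "N"))                # a pure node: 0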
For gene expression data, we start with a set of preselected genes and some measure, such as the
expression value, for each gene. These may have been selected based on a network analysis, or more simply
based on a differential expression analysis. We start with a single node (cluster) containing all the
samples, and recursively split into increasingly homogeneous clusters. At each step, we select a node
to split and split it independently of the other nodes and of any splits already performed.
The measures on each gene are the splitting variables. For each possible value of each splitting
variable, we partition the node into two groups - samples above the cut-off and samples less than or equal to the
cut-off. The algorithm simply considers all possible splits for all possible variables and selects the variable
and split that create the most homogeneous clusters. (Note that if a variable includes two values, say
A and B, with no values between, then any cut-off between A and B produces the same 2 clusters, so
the algorithm only needs to search through a finite number of splits - the midpoints between adjacent
observed values.)
We stop splitting when the leaves are all pure or have less than r elements (where r is chosen in
advance).
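A sketch of how such a tree might be grown in R with the rpart package follows; the data frame and column names are hypothetical, rpart uses the Gini index by default for classification, and the stopping rule is set through rpart.control.

library(rpart)
# Hypothetical sketch: 'genes' holds the log2 expression of the preselected genes
# (one column per gene, one row per sample) and 'type' is the known marrow type.
fit <- rpart(type ~ ., data = genes, method = "class",
             control = rpart.control(minsplit = 10,  # do not split nodes with fewer than 10 samples
                                     cp = 0.01))     # minimum improvement required for a split
plot(fit)
text(fit)                                            # show the splitting variables and cut-offs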
When splitting stops, the samples in each node are classified according to a majority rule. So the left
nodes above are declared red (upper) and yellow (lower) leading to 3 misclassified samples. The right
nodes are declared either black or red (upper) and black (lower) leading to one misclassified sample.
It is quite easy to overfit when using this method. Overfitting occurs when we fit the noise as well as the
signal. In this case, it would mean adding a branch that works well for this particular sample by chance.
An overfitted classifier works well with the training sample, but when new data are classified it makes
more mistakes than a classifier with fewer branches.
Determining when to stop splitting is more of an art than a science. It is usually not useful to partition
down to single elements as this increases overfitting. Cross-validation, which will be discussed in the
next lesson, is often used to help tune the tree selection without over-fitting.
How do you classify a new sample for which the category is unknown? The new sample will have the
same set of variables (measured features). You start at the top of the tree and "drop" the sample down the
tree. The classification for the new sample is the majority type in the terminal node. Sometimes instead
of a "hard" classification into a type, the proportion of training samples in the node is used to obtain a
probabilistic (fuzzy) classification. If the leaf is homogeneous, the new sample is assigned to that type
with probability one. If the leaf is a mixture, then you assign the new sample to each component of the
mixture with probability equal to the mixture proportion.
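Continuing the hypothetical rpart sketch above, both the hard and the fuzzy classification of new samples are available from predict():

# Dropping new samples down the fitted tree 'fit'; 'new_genes' (hypothetical) must
# contain the same gene variables that were used to grow the tree.
predict(fit, newdata = new_genes, type = "class")   # hard classification: majority type in the leaf
predict(fit, newdata = new_genes, type = "prob")    # fuzzy classification: leaf proportions per type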
A more recent (and now more popular) method is to grow multiple trees using random subsets of
the data. This is called a "random forest". The final classifier uses all of the tree rules. To classify a
new sample, it is dropped down all the trees (following each tree's rules). This gives a final classification
by "voting" - the sample is classified into the type obtained from the majority of the trees. This consensus
method will be discussed in more detail in the lesson on cross-validation, bootstraps and
consensus.
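One way to fit such a forest in R is the randomForest package; this is a sketch with the same hypothetical objects as above, not code from the course.

library(randomForest)
# Grow many trees on random subsets of the samples (and of the genes at each split)
# and classify by voting over the trees.
set.seed(1)
rf <- randomForest(type ~ ., data = genes, ntree = 500)
predict(rf, newdata = new_genes)                    # majority vote over the 500 trees
predict(rf, newdata = new_genes, type = "prob")     # proportion of trees voting for each type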
Recursive partitioning bases the split points on each variable, looking for a split that increases the
homogeneity of the resulting subsets. If two splits are equally good, the algorithm selects one. The
"goodness" of the split does not depend on any measure of distance between the nodes, only
the within node homogeneity. All the split points are based on one variable at a time; therefore they
are represented by vertical and horizontal partitioning of the scatterplot. We will demonstrate the
splitting algorithm using the two most differentially expressed genes as seen below.
The first split uses gene 2 and splits into two groups based on log2(expression) above or below 7.406.
Then the two groups are each split again. They could be split on either gene, but both are split on gene
1. This gives 4 groups:
gene 2 above 7.406, gene 1 above/below 5.6. (note that the "above" group is a bit less homogeneous
than it would be with a higher split, but the "below" group is more homogeneous)
The entire tree grown using just these 2 genes is shown below:
Each of the terminal (leaf) nodes is labelled with the category making up the majority. The numbers
below each leaf are the numbers of samples of each type, in the order G\M\N\S. We can see that none of these leaf nodes is
purely one category, so there are lots of misclassified samples. However, we can distinguish reasonably
well between the normal and cancerous samples, and between G and the other two cancer types. How
well can we do with the 10 best genes? That tree is displayed below:
Only 5 of the 10 most differentially expressed genes are used in the classification. We do a bit better
than with only two genes, but still have a lot of misclassified samples.
Notice that we have misclassified two tumor samples as normal. We might be less worried about
misclassifying the cancer samples among themselves and more worried about misclassifying them as
normal, since a sample classified as cancerous will likely be sent for further testing, while a cancer
sample classified as normal may not be. A weighting of the classification cost, to make it more costly to
misclassify cancer samples as normal, can provide a classification that does better at separating the
normal from the cancer samples. This is readily done by providing misclassification weights.
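For example, rpart accepts such weights through a loss matrix. This is a sketch: the weight of 5 and the class ordering are my own assumptions for illustration, not values used in this lesson.

# Make it five times as costly to classify a cancer sample (G, M or S) as normal ("N")
# than to make any other mistake. rpart takes a loss matrix with rows for the true class
# and columns for the predicted class (here assumed to be ordered G, M, N, S),
# with zeros on the diagonal.
loss <- matrix(1, nrow = 4, ncol = 4)
diag(loss) <- 0
loss[c(1, 2, 4), 3] <- 5                 # true G, M or S predicted as N costs 5 instead of 1
fit_wtd <- rpart(type ~ ., data = genes, method = "class",
                 parms = list(loss = loss))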
The simplest type of discriminant analysis is called linear discriminant analysis or LDA. The LDA
model is appropriate when the data in each class form an ellipsoidal cloud and the ellipsoids have the
same orientation, as shown below. A related method called quadratic discriminant analysis is
appropriate when the ellipsoids have different orientations.
To understand how the method works, consider the 2 ellipses in Figure 1a, which have the same
orientation and spread. Now consider drawing any line on the plot, and putting the centers of the
ellipses on the line by drawing an orthogonal to the line (Figure 1b shows 2 examples). Now consider
putting all the data on the line also by drawing an orthogonal from each data point to the line (Figure
1c). This is called the orthogonal projection.
For each ellipse, the projected data has a Normal distribution (Figure 1d) with mean the projected
center and some variance. As we change the orientation of the lines, the mean and variance change.
Note that only the orientation of the lines matters to the distance between the means and the 2
variances. For a given line, call the two means μ1 and μ2 and the two variances σ1² and σ2². Then,
similar to the two-sample t-test, we consider the difference between the means measured in SDs:
(μ1 − μ2) / √(σ1² + σ2²)
Now consider a point M on the line which minimizes (μ1 − M)²/σ1² + (μ2 − M)²/σ2². When the
variances are the same, M is just the midpoint between the two means. The orthogonal line going
through M is called a separating line (Figure 1e). (When there are more than two dimensions it is called
a separating hyperplane.) The best separating line (hyperplane) is the line going through M that is
orthogonal to the projection maximizing the standardized difference in means. Data points are classified
according to which side of the hyperplane they lie on or, alternatively, to the class whose projected mean
is closest to the projected point.
When the two data ellipses have the same orientation but different sizes (Figure 1f) M moves closer to
the mean of the class with the smaller variance. Otherwise the computation is the same.
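To make the computation of M concrete, here is a small sketch with hypothetical projected data. Setting the derivative of (μ1 − M)²/σ1² + (μ2 − M)²/σ2² to zero gives M as a precision-weighted average of the two means, which is exactly what pulls M toward the less variable class.

# 'x1' and 'x2' (hypothetical) hold the projections of the two classes onto a given line.
m1 <- mean(x1); m2 <- mean(x2)
v1 <- var(x1);  v2 <- var(x2)
(m1 - m2) / sqrt(v1 + v2)                       # standardized difference between the means
M <- (m1 / v1 + m2 / v2) / (1 / v1 + 1 / v2)    # minimizes (m1 - M)^2/v1 + (m2 - M)^2/v2
M                                               # the midpoint of m1 and m2 when v1 == v2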
It turns out that this method maximizes the ratio of the between group variance of the means to the
within group variance among the points. Unlike recursive partitioning, there is both a concept of
separating the groups AND of maximizing the distance between them.
To account for the orientation and differences in spread, a weighted distance called Mahalanobis
distance is usually used with LDA, because it provides a more justifiable idea of distance in terms of
SD along the line joining the point to the mean. We expect data from the cluster to be closer to the
center if it is in one of the "thin" directions than if it is in one of the "fat" directions. One effect of using
Mahalanobis distance is to move the split point between the means closer to the less variable class as
we have already seen in Figure 1f.
It is not important to know the exact equation for Mahalanobis distance, but for those who are
interested, if two data points are denoted by x and y (and each is a vector) and if S is a symmetric
positive definite matrix, the S-weighted distance between them is √((x − y)′ S⁻¹ (x − y)).
The Mahalanobis distance between x and the center ci of class i is the S-weighted distance where S is
the estimated variance-covariance matrix of the class.
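R has a built-in function for this distance; a minimal sketch (with hypothetical object names) of measuring how far a new sample is from one class center could look like this.

# Estimate the center and covariance of class i from its training samples (the rows of
# the hypothetical matrix 'class_i'), then compute the distance from a new sample 'x'
# to the class center. mahalanobis() returns the squared distance.
ci <- colMeans(class_i)
Si <- cov(class_i)
sqrt(mahalanobis(x, center = ci, cov = Si))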
When there are more than 2 classes and the ellipses are oriented in the same direction, the discriminant
directions (the normals to the separating hyperplanes) are eigenvectors of the ratio of the between and
within variance matrices. These are not
readily visualized on the scatterplot (and in any case, with multiple features it is not clear which
scatterplot should be used) so instead the data are often plotted by their projections onto the
eigenvectors. This produces the directions in which the ratio of between class to within class variance
is maximized, so we should be able to see the clusters on the plot as shown in the example. Basically,
the plot is a rotation to new axes in the directions of greatest spread.
If the classes are elliptical but have a different orientation, the minimizer of the ratio of within to between
class differences turns out to describe a quadratic surface rather than a linear surface. The method is
called quadratic discriminant analysis or QDA.
LDA and QDA often do a good job of classification even if the classes are not elliptical but neither can
do a good job if the classes are not convex. (See the sketch below). A newer method called kernel
discriminant analysis (KDA) uses an even more general weighting scheme than S-weights, and can
follow very curved boundaries. The ideas behind KDA are the same as the ideas behind the kernel
support vector machine and will be illustrated in the next section.
With "omics" data, we have a huge multiplicity problem. When you have more features or variables
than the number of samples we can get perfect separation even if the data are just noise! The solution
is to firstly select a small set of features based on other criteria such as differential expression. As with
recursive partitioning, we use a training sample and validation sample and/or cross validation to mimic
how well the classifier will work with new samples.
Let's see how linear discriminant analysis works for our bone marrow data. Recall that the scatterplot
matrix of the five most differentially expressed genes indicates that we should be able to do well in
classifying the Normal samples, and reasonably well in classifying the "G" samples, but distinguishing
between "M" and "S" might be difficult.
Let's start by looking at the two genes we selected for recursive partitioning. We have already seen the
plot of the two genes, which is repeated below.
We can see that we might be able to draw a line separating the "N" samples from most of the others,
and another separating the "G" samples from most of the others, but the "M" and "S" samples are not
clearly separated. Since we are using only 2 features, we have only 2 discriminant functions.
Projecting each sample onto the two directions and then plotting gives the plot below.
We can see that direction LD1 does a good job of separating out the "N" and "G" groups, with the
former having values less than -2 and the latter having values between -2 and 2. The utility of direction
LD2 is less clear, but it might help us distinguish between "G" and "M". Direction LD1 is essentially the
trend line on the scatterplot.
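Such discriminant plots are typically produced with the lda function in the MASS package; here is a sketch with hypothetical object names, not the course's own code.

library(MASS)
# 'gene1' and 'gene2' are the two selected genes in the data frame 'genes2' and
# 'type' is the marrow type of each sample.
ld  <- lda(type ~ gene1 + gene2, data = genes2)
lds <- predict(ld)$x                     # each sample projected onto LD1 and LD2
plot(lds, type = "n")
text(lds, labels = genes2$type)          # label each sample by its marrow type
predict(ld)$class                        # LDA's classification of the training samples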
Using the 10 most differentially expressed genes allows us to find 3 discriminant functions. Plotting
them pairwise gives the plot below:
Note that LD1 on this plot is not the same as on the previous plot, as it is sitting in a 10-dimensional
space instead of a 2-dimensional space. However, by concentrating on the first column of the
scatterplot matrix (LD1 is the x-axis for both plots) we can see that it is a very similar direction.
Samples with LD1 < -2 are mainly "N" and those with LD1 between -2 and 2 are mainly "G". Directions LD2 and LD3
help to disentangle the "G", "M" and "S" samples, but not well.
We saw from the scatterplot matrix of the features that at least the first 5 features have a very similar
pattern across the samples. We might do better in classification if we could find genes that are
differentially expressed, but have different patterns than these. There are a number of ways to do this.
One method is variable selection - start with a larger set of genes and then sequentially add genes that
do a good job of classification. However, when selecting variables we are very likely to overfit so
methods like the bootstrap and cross-validation must be used to provide less biased estimates of the
quality of the classifier.
The idea behind the basic support vector machine (SVM) is similar to LDA - find a hyperplane
separating the classes. In LDA, the separating hyperplane separates the means of the samples in each
class, which is suitable when the data are sitting inside an ellipsoid or other convex set. However,
intuitively what we really want is a rule that separates the samples in the different classes which appear
to be closest. When the classes are separable (i.e. can be separated by a hyperplane without
misclassification), the support vector machine (SVM) finds the hyperplane that maximizes the minimum
distance from the hyperplane to the nearest point in each set. The nearest points are called the support
vectors.
In discriminant analysis, the assumption is that the samples in each class cluster around the class mean.
Using that method, the focus is on separating the ellipsoids containing the bulk
of the samples within each class.
Using a SVM, the focus is on separating the closest points in different classes. The first step is to
determine which point or points are closest to the other classes. These points are called support
vectors because each sample is described by its vector of values on the genes (e.g. expression
values). After identifying the support vectors (circled below) for each pair of classes we compute the
separating hyperplane by computing the shortest distance from each support vector to the line. We
then select the line that maximizes the sum of these distances. If it is impossible to completely separate the
classes, then a penalty is applied for each misclassification - the selected line then maximizes the
sum of distances minus the penalties, which balances the distances from the support vectors to the line
against the misclassifications.
The separating hyperplane used in LDA is sensitive to all the data in each class because you are using
the entire ellipse of data and its center. The separating hyperplane used in SVM is sensitive only to the
data on the boundary, which provides computational savings and makes it more robust to outliers.
Suppose the data vector for the ith sample is xi , which is the expression vector for several genes. Also,
suppose we have only 2 classes denoted by yi = +1 or -1. The hyperplane is defined by a vector W
and a scalar b such that
yi (xi′ W + b) − 1 ≥ 0
for all i, with equality for the support vectors, and ||W|| is minimal. After
computing W and b using the training set, a new sample x is classified according to the sign of
x′ W + b, i.e. to the value of y satisfying the inequality.
If the clusters overlap, the "obvious" extension is to find the hyperplane that still separates the bulk of the
samples as well as possible, while penalizing misclassified samples.
This is the solution to the optimization problem: yi (xi′ W + b) − 1 + zi ≥ 0 for all i, with equality for
the support vectors, where all zi ≥ 0 and ||W|| + CΣi zi is minimal for some C.
The idea is that some points can be on the wrong side of the separating hyperplane, but there is a
penalty for each misclassification. The hyperplane is now found by minimizing a criterion that still
separates the support vectors as much as possible in this orthogonal direction, but also minimizes the
number of misclassifications.
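In R, this soft-margin SVM is available through the svm function in the e1071 package; here is a sketch using the same hypothetical data objects as before.

library(e1071)
# A linear, soft-margin SVM. The 'cost' argument is the C in ||W|| + C * sum(z_i),
# i.e. how heavily misclassifications are penalized relative to the margin.
sv <- svm(type ~ ., data = genes, kernel = "linear", cost = 1)
sv$index                                 # which training samples are the support vectors
table(fitted(sv), genes$type)            # training-set confusion matrix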
It is not clear that SVM does much better than linear discriminant analysis when the groups are actually
elliptical. When the groups are not elliptical because of outliers, SVM does better because the criterion
is only sensitive to the boundaries and not sensitive to the entire shape. However, like LDA, SVM
requires groups that can be separated by planes. Neither method will do well for an example like the
one below. However, there is a method called the kernel method (sometimes called the kernel trick) that
works for more general examples.
Suppose that we suspected that the clusters were in concentric circles (or spheres) as seen above. If
we designate the very center of the points as the origin, we can compute the distance from each point
to the center as a new variable. In this case, the clusters separate very nicely on the radius, creating a
circular boundary.
The general idea is to add more variables that are nonlinear functions of the original variables until we
have enough dimensions to separate the groups.
The kernel trick uses the fact that for both LDA and SVM, only the dot products of the data vectors, xi′ xj,
are used in the computations. Let's call K(xi , xj ) = xi′ xj. We replace K with a more general function
called a kernel function and use K(xi , xj ) as if it were the dot product in our algorithm. (For those who
are more mathematically inclined, the kernel function needs to be positive definite, and we will actually
be using the dot product in the basis described by its eigenfunctions.)
The most popular kernels are the radial kernel and the polynomial kernel. The radial kernel (sometimes
called the Gaussian kernel) is K(xi , xj ) = exp(−||xi − xj ||² / (2σ²)).
Here is an animation that shows how "kernel" SVM works with the concentric circles using the radial
kernel.
Here is another example of essentially the same problem using a polynomial kernel instead of the radial
(Normal density) kernel: K(xi , xj ) = (xi′ xj + 1)².
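To make the two kernels concrete, here they are as small R functions. This is an illustration of the formulas, not code from this lesson; sigma and the polynomial degree are tuning choices.

# x and y are feature vectors; replacing the ordinary dot product sum(x * y)
# by one of these functions is the "kernel trick".
radial_kernel <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))
poly_kernel   <- function(x, y, d = 2)     (sum(x * y) + 1)^d
radial_kernel(c(1, 0), c(0, 1))            # between 0 (far apart points) and 1 (identical points)
poly_kernel(c(1, 0), c(0, 1))              # (0 + 1)^2 = 1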
The plot below shows the results of ordinary (linear) SVM on our cancer data. I believe the blue "tail" is
an artifact of the plotting routine - the boundaries between the groups should be straight lines. The data
points marked with "X" are the support points - there are a lot of them because the data are so strung
out.
I also used SVM with the radial kernel (below). The plot looks nicer, but there are actually more support
points (15 instead of 12) and the same 2 samples are misclassified.
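A sketch of how these two fits and counts could be reproduced follows; the object names are hypothetical, and the numbers quoted above come from the lesson's own fits, not from this sketch.

# 'genes2' holds the two genes and 'type' the marrow type of each sample.
sv_lin <- svm(type ~ ., data = genes2, kernel = "linear")
sv_rad <- svm(type ~ ., data = genes2, kernel = "radial")
c(linear = length(sv_lin$index), radial = length(sv_rad$index))  # number of support points in each fit
plot(sv_rad, genes2)                     # e1071 marks the support points with "X" on this plot
sum(fitted(sv_rad) != genes2$type)       # misclassified training samples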