Classification
Introduction
We have already discussed clustering, which is a means of finding patterns in our data to find sets of
similar features or of similar samples. Often, however, we have already clustered or selected our
samples by phenotype and would like to either determine which (genomic) features define the clusters
or use genomic features to assign new samples to the correct cluster. This is the classification
problem. For example, we might classify tissue biopsy samples into normal or cancerous. In the
computer science and machine learning literature, classification is sometimes called "supervised
learning" because the algorithm "learns" the estimated parameters of the classification rule from the
data with guidance from the known classes. By contrast, clustering is sometimes called "unsupervised
learning" because the algorithm must learn both the classes and the classification rule from the data.
The typical starting point is that we have some samples with known classification as well as other
measurements. The purpose is to use the known samples, which is called the training set, to get a
classification rule for new samples whose class is not known but for whom the other measurements are
available. As a simple example, we might think of determining whether a tissue sample is normal or
cancer based on features seen under a microscope. Typically for this course however, the
measurements will be high throughput measurements such as gene expression or methylation.
Since we will use the training set to tune the classification rule, we expect to do better classifying the
training samples than new samples using our rule. For this reason, to assess how well our
classification rule works, or to select among several competing rules, we also maintain another set of
samples, the validation set, for which we also know the correct classification. The validation set is not
used for tuning the classification rule; the elements of the validation set are classified using the
classification rule and these classifications are then compared with the correct classifications to
assess the misclassification rate. When choosing between different classifiers, we usually choose the
one with the lowest misclassification rate on the validation set. [1] We will discuss this more thoroughly
in the lesson on cross-validation and bootstraps.
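For concreteness, here is a minimal R sketch of this workflow. The data frame dat, its class column type, and the 2/3 - 1/3 split are hypothetical choices, and the classification tree is used only as a placeholder for whatever rule is being assessed.

library(rpart)                                  # recursive partitioning, one of the methods discussed below
set.seed(1)                                     # reproducible split
n     <- nrow(dat)                              # 'dat': one row per sample, 'type' = known class
train <- sample(n, size = round(2 * n / 3))     # e.g. 2/3 training, 1/3 validation
fit   <- rpart(type ~ ., data = dat[train, ], method = "class")
pred  <- predict(fit, newdata = dat[-train, ], type = "class")
mean(pred != dat$type[-train])                  # validation misclassification rate

Only the last line uses the validation samples; the rule itself is tuned entirely on the training set.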
Classification using genomic features can be useful to replace or enhance other methods. If a small set
of genes associated with different types of cancer can be identified, then an inexpensive microarray
might be able to diagnose whether a tissue sample is benign or cancerous more rapidly and accurately
than a pathologist.
However, there is a problem with our agenda - overfitting. Whenever the number of features is larger
than the number of samples, methods like linear discriminant analysis can classify the training sample
perfectly - even if the features are not actually associated with the phenotype. This means that with n
samples almost any set of n+1 features can achieve perfect classification of the training set. Although
we can assess how well the classification rule works using the validation set, we can end up with
millions of candidates as possible "good" classification rules. Both statisticians and computer
scientists have worked on this problem, attempting to find ways to find a close to optimal set of features
for classification [2].
We are not going to look at the methods for selecting a small number of features to use for
classification. However, we will look at 3 popular methods for developing classification rules once the
features have been selected: recursive partitioning, linear discriminant analysis, and support vector
machines (SVM). Recursive partitioning, which searches through all the features at each split and then
repeats the search within each resulting node (that is why it is "recursive"), can also be considered a
feature selection method, as usually only a subset of the features will actually be used in the
classification rule.
The first two of these methods have been around for a while. SVM, while not new, is the newest of the
three. There is a large literature on developing classification rules in both statistics and machine
learning. These three methods are very popular in both the statistics and machine learning worlds.
References:
[1] Lever, J., Krzywinski, M. & Altman, N. (2016). Points of Significance: Classification Evaluation. Nature Methods,
13(8), 603-604. doi:10.1038/nmeth.3945
[2] Lever, J., Krzywinski, M. & Altman, N. (2016). Points of Significance: Model Selection and Overfitting. Nature
Methods, 13(9), 703-704. doi:10.1038/nmeth.3968
We will explore classification using the 10 most differentially expressed genes from a study using 99 Affymetrix
microarrays from GSE47552 which focused on the differences in gene expression among normal, transitioning and
cancerous bone marrow samples. The number of samples for each of the 4 bone marrow types is summarized below:
Normal 5
MGUS 20
SMM 33
MM 41
To give some idea of the challenge of finding a classification rule for these 99 samples, we start with a cluster diagram
based on the 120 most differentially expressed genes. As can be seen below, even with 120 genes, the samples do not
cluster cleanly by marrow type. We will see how well we can do with classifying the samples using only 10 genes.
As always, before starting an analysis, we should explore the data. I usually use scatterplots. Since our objective is to
classify the samples using the gene expression, let's start by looking at all the samples using a scatterplot matrix based
on the genes. Here I display only the 5 most differentially expressed genes, due to space limitations. The sample types
are plotted in color where Normal is green, "G" is black, "M" is red and "S" is blue. We can see that we should be able to
do well in classifying the Normal samples, and reasonably well in classifying the "G" samples, but distinguishing between
"M" and "S" might be difficult.
We can see that these highly differentially expressed genes are also highly correlated with each other. As well, we
can see that the Normal group (green) is fairly well separated from the other samples using any pair of genes but that the
cancerous samples are intermingled.
Note as well that the usefulness of a gene set for classification (or prediction) is not the same as the usefulness of the
gene set for understanding the biological process which created the condition. The genes useful for classification need
only be associated with the phenotype - they are not necessarily causal.
Recursive partitioning is based on a very simple idea. It is the inverse of hierarchical clustering.
In hierarchical clustering, we start with individual items and cluster those that are closest in some
metric. In recursive partitioning, we start with a single cluster and then split into clusters that have the
smallest within cluster distances in some metric. When we are doing recursive partitioning for
clustering, the "within cluster distance" is just a measure of how homogeneous the cluster is with
respect to the classes of the objects in it. For example, a cluster with all the same tissue type is more
homogeneous than a cluster with a mix of tissue types.
There are several measures of cluster homogeneity. Denote the proportion of the node coming from
type i as p(i|node). The most popular homogeneity measure is the Gini index:
Σ_{i≠j} p(i|node) p(j|node), where i and j range over the classes. If all the nodes are homogeneous the Gini index is
zero. The information (entropy) impurity index is a similar idea based on information theory:
−Σ_i p(i|node) log(p(i|node)).
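As a small illustration (these helper functions are my own, not part of the course code), both impurity measures can be computed directly from the class labels of the samples in a node:

gini <- function(classes) {
  p <- table(classes) / length(classes)    # p(i|node) for each class i present in the node
  1 - sum(p^2)                             # equals the sum over i != j of p(i|node) p(j|node)
}
entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log(p))                         # information (entropy) impurity
}
gini(c("N", "N", "N", "M"))                # a slightly impure node: 0.375
gini(c("N", "N", "N", "N"))                # a pure node: 0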
For gene expression data, we start with a set of preselected genes and some measure, such as the
expression value, for each gene. These may have been selected based on a network analysis, or more simply
based on a differential expression analysis. We start with a single node (cluster) containing all the
samples, and recursively split into increasingly homogeneous clusters. At each step, we select a node
to split and split it independently of the other nodes and of any splits already performed.
The measures on each gene are the splitting variables. For each possible value of each splitting
variable, we partition the node into two groups - samples above the cut-off and samples less than or equal to the
cut-off. The algorithm simply considers all possible splits for all possible variables and selects the variable
and split that create the most homogeneous clusters. (Note that if a variable includes two values, say
A and B, with no values between, then any cut-off between A and B produces the same 2 clusters, so
the algorithm only needs to search through a finite number of splits - the midpoints between adjacent
observed values.)
We stop splitting when the leaves are all pure or have less than r elements (where r is chosen in
advance).
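A sketch of how such a tree might be grown in R with the rpart package follows; the data frame and column names are hypothetical, rpart uses the Gini index by default for classification, and the stopping rule is set through rpart.control.

library(rpart)
# Hypothetical sketch: 'genes' holds the log2 expression of the preselected genes
# (one column per gene, one row per sample) and 'type' is the known marrow type.
fit <- rpart(type ~ ., data = genes, method = "class",
             control = rpart.control(minsplit = 10,  # do not split nodes with fewer than 10 samples
                                     cp = 0.01))     # minimum improvement required for a split
plot(fit)
text(fit)                                            # show the splitting variables and cut-offs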
When splitting stops, the samples in each node are classified according to a majority rule. So the left
nodes above are declared red (upper) and yellow (lower) leading to 3 misclassified samples. The right
nodes are declared either black or red (upper) and black (lower) leading to one misclassified sample.
It is quite easy to overfit when using this method. Overfitting occurs when we fit the noise as well as the
signal. In this case, it would mean adding a branch that works well for this particular sample by chance.
An overfitted classifier works well with the training sample, but when new data are classified it makes
more mistakes than a classifier with fewer branches.
Determining when to stop splitting is more of an art than a science. It is usually not useful to partition
down to single elements as this increases overfitting. Cross-validation, which will be discussed in the
next lesson, is often used to help tune the tree selection without over-fitting.
How do you classify a new sample for which the category is unknown? The new sample will have the
same set of variables (measured features). You start at the top of the tree and "drop" the sample down the
tree. The classification for the new sample is the majority type in the terminal node. Sometimes instead
of a "hard" classification into a type, the proportion of training samples in the node is used to obtain a
probabilistic (fuzzy) classification. If the leaf is homogeneous, the new sample is assigned to that type
with probability one. If the leaf is a mixture, then you assign the new sample to each component of the
mixture with probability equal to the mixture proportion.
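Continuing the hypothetical rpart sketch above, both the hard and the fuzzy classification of new samples are available from predict():

# Dropping new samples down the fitted tree 'fit'; 'new_genes' (hypothetical) must
# contain the same gene variables that were used to grow the tree.
predict(fit, newdata = new_genes, type = "class")   # hard classification: majority type in the leaf
predict(fit, newdata = new_genes, type = "prob")    # fuzzy classification: leaf proportions per type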
A more recent (and now more popular) method is to grow multiple trees using random subsets of
the data. This is called a "random forest". The final classifier uses all of the tree rules. To classify a
new sample, it is dropped down all the trees (following each tree's rules). This gives a final classification
by "voting" - the sample is classified into the type obtained from the majority of the trees. This consensus
method will be discussed in more detail in the lesson on cross-validation, bootstraps and
consensus.
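One way to fit such a forest in R is the randomForest package; this is a sketch with the same hypothetical objects as above, not code from the course.

library(randomForest)
# Grow many trees on random subsets of the samples (and of the genes at each split)
# and classify by voting over the trees.
set.seed(1)
rf <- randomForest(type ~ ., data = genes, ntree = 500)
predict(rf, newdata = new_genes)                    # majority vote over the 500 trees
predict(rf, newdata = new_genes, type = "prob")     # proportion of trees voting for each type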
Recursive partitioning bases the split points on each variable, looking for a split that increases the
homogeneity of the resulting subsets. If two splits are equally good, the algorithm selects one. The
"goodness" of the split does not depend on any measure of distance between the nodes, only
the within node homogeneity. All the split points are based on one variable at a time; therefore they
are represented by vertical and horizontal partitioning of the scatterplot. We will demonstrate the
splitting algorithm using the two most differentially expressed genes as seen below.
The first split uses gene 2 and splits into two groups based on log2(expression) above or below 7.406.
Then the two groups are each split again. They could be split on either gene, but both are split on gene
1. This gives 4 groups:
gene 2 above 7.406, gene 1 above/below 5.6. (note that the "above" group is a bit less homogeneous
than it would be with a higher split, but the "below" group is more homogeneous)
The entire tree grown using just these 2 genes is shown below:
Each of the terminal (leaf) nodes is labelled with the category making up the majority. The numbers
below each leaf are the numbers of samples of each type, in the order G\M\N\S. We can see that none of these leaf nodes is
purely one category, so there are lots of misclassified samples. However, we can distinguish reasonably
well between the normal and cancerous samples, and between G and the other two cancer types. How
well can we do with the 10 best genes? That tree is displayed below:
Only 5 of the 10 most differentially expressed genes are used in the classification. We do a bit better
than with only two genes, but still have a lot of misclassified samples.
Notice that we have misclassified two tumor samples as normal. We might be less worried about
misclassifying the cancer samples among themselves and more worried about misclassifying them as
normal, since a sample classified as cancerous will likely be sent for further testing, while a cancer
sample classified as normal may not be. A weighting of the classification cost, to make it more costly to
misclassify cancer samples as normal, can provide a classification that does better at separating the
normal from the cancer samples. This is readily done by providing misclassification weights.
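For example, rpart accepts such weights through a loss matrix. This is a sketch: the weight of 5 and the class ordering are my own assumptions for illustration, not values used in this lesson.

# Make it five times as costly to classify a cancer sample (G, M or S) as normal ("N")
# than to make any other mistake. rpart takes a loss matrix with rows for the true class
# and columns for the predicted class (here assumed to be ordered G, M, N, S),
# with zeros on the diagonal.
loss <- matrix(1, nrow = 4, ncol = 4)
diag(loss) <- 0
loss[c(1, 2, 4), 3] <- 5                 # true G, M or S predicted as N costs 5 instead of 1
fit_wtd <- rpart(type ~ ., data = genes, method = "class",
                 parms = list(loss = loss))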
The simplest type of discriminant analysis is called linear discriminant analysis or LDA. The LDA
model is appropriate when the data in each class form an ellipsoidal cloud and the ellipsoids have the
same orientation, as shown below. A related method called quadratic discriminant analysis is
appropriate when the ellipsoids have different orientations.
To understand how the method works, consider the 2 ellipses in Figure 1a, which have the same
orientation and spread. Now consider drawing any line on the plot, and putting the centers of the
ellipses on the line by drawing an orthogonal to the line (Figure 1b shows 2 examples). Now consider
putting all the data on the line also by drawing an orthogonal from each data point to the line (Figure
1c). This is called the orthogonal projection.
For each ellipse, the projected data has a Normal distribution (Figure 1d) with mean the projected
center and some variance. As we change the orientation of the lines, the mean and variance change.
Note that only the orientation of the lines matters to the distance between the means and the 2
variances. For a given line, call the two means μ1 and μ2 and the two variances σ1² and σ2². Then,
similar to the two-sample t-test, we consider the difference between the means measured in SDs:
(μ1 − μ2) / √(σ1² + σ2²)
Now consider a point M on the line which minimizes (μ1 − M)²/σ1² + (μ2 − M)²/σ2². When the
variances are the same, M is just the midpoint between the two means. The orthogonal line going
through M is called a separating line (Figure 1e). (When there are more than two dimensions it is called
a separating hyperplane.) The best separating line (hyperplane) is the line going through M that is
orthogonal to the projection maximizing the standardized difference in means. Data points are classified
according to which side of the hyperplane they lie on or, alternatively, to the class whose projected mean
is closest to the projected point.
When the two data ellipses have the same orientation but different sizes (Figure 1f) M moves closer to
the mean of the class with the smaller variance. Otherwise the computation is the same.
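To make the computation of M concrete, here is a small sketch with hypothetical projected data. Setting the derivative of (μ1 − M)²/σ1² + (μ2 − M)²/σ2² to zero gives M as a precision-weighted average of the two means, which is exactly what pulls M toward the less variable class.

# 'x1' and 'x2' (hypothetical) hold the projections of the two classes onto a given line.
m1 <- mean(x1); m2 <- mean(x2)
v1 <- var(x1);  v2 <- var(x2)
(m1 - m2) / sqrt(v1 + v2)                       # standardized difference between the means
M <- (m1 / v1 + m2 / v2) / (1 / v1 + 1 / v2)    # minimizes (m1 - M)^2/v1 + (m2 - M)^2/v2
M                                               # the midpoint of m1 and m2 when v1 == v2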
It turns out that this method maximizes the ratio of the between group variance of the means to the
within group variance among the points. Unlike recursive partitioning, there is both a concept of
separating the groups AND of maximizing the distance between them.
To account for the orientation and differences in spread, a weighted distance called Mahalanobis
distance is usually used with LDA, because it provides a more justifiable idea of distance in terms of
SD along the line joining the point to the mean. We expect data from the cluster to be closer to the
center if it is in one of the "thin" directions than if it is in one of the "fat" directions. One effect of using
Mahalanobis distance is to move the split point between the means closer to the less variable class as
we have already seen in Figure 1f.
It is not important to know the exact equation for Mahalanobis distance, but for those who are
interested, if two data points are denoted by x and y (and each is a vector) and if S is a symmetric
positive definite matrix, the S-weighted distance between them is √((x − y)′ S⁻¹ (x − y)).
The Mahalanobis distance between x and the center ci of class i is the S-weighted distance where S is
the estimated variance-covariance matrix of the class.
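R has a built-in function for this distance; a minimal sketch (with hypothetical object names) of measuring how far a new sample is from one class center could look like this.

# Estimate the center and covariance of class i from its training samples (the rows of
# the hypothetical matrix 'class_i'), then compute the distance from a new sample 'x'
# to the class center. mahalanobis() returns the squared distance.
ci <- colMeans(class_i)
Si <- cov(class_i)
sqrt(mahalanobis(x, center = ci, cov = Si))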
When there are more than 2 classes and the ellipses are oriented in the same direction, the discriminant
directions (the normals to the separating hyperplanes) are eigenvectors of the ratio of the between and
within variance matrices. These are not
readily visualized on the scatterplot (and in any case, with multiple features it is not clear which
scatterplot should be used) so instead the data are often plotted by their projections onto the
eigenvectors. This produces the directions in which the ratio of between class to within class variance
is maximized, so we should be able to see the clusters on the plot as shown in the example. Basically,
the plot is a rotation to new axes in the directions of greatest spread.
If the classes are elliptical but have a different orientation, the minimizer of the ratio of within to between
class differences turns out to describe a quadratic surface rather than a linear surface. The method is
called quadratic discriminant analysis or QDA.
LDA and QDA often do a good job of classification even if the classes are not elliptical but neither can
do a good job if the classes are not convex. (See the sketch below). A newer method called kernel
discriminant analysis (KDA) uses an even more general weighting scheme than S-weights, and can
follow very curved boundaries. The ideas behind KDA are the same as the ideas behind the kernel
support vector machine and will be illustrated in the next section.
With "omics" data, we have a huge multiplicity problem. When you have more features or variables
than the number of samples we can get perfect separation even if the data are just noise! The solution
is to firstly select a small set of features based on other criteria such as differential expression. As with
recursive partitioning, we use a training sample and validation sample and/or cross validation to mimic
how well the classifier will work with new samples.
Let's see how linear discriminant analysis works for our bone marrow data. Recall that the scatterplot
matrix of the five most differentially expressed genes indicates that we should be able to do well in
classifying the Normal samples, and reasonably well in classifying the "G" samples, but distinguishing
between "M" and "S" might be difficult.
Let's start by looking at the two genes we selected for recursive partitioning. We have already seen the
plot of the two genes, which is repeated below.
We can see that we might be able to draw a line separating the "N" samples from most of the others,
and another separating the "G" samples from most of the others, but the "M" and "S" samples are not
clearly separated. Since we are using only 2 features, we have only 2 discriminant functions.
Projecting each sample onto the two directions and then plotting gives the plot below.
We can see that direction LD1 does a good job of separating out the "N" and "G" groups, with the
former having values less than -2 and the latter having values between -2 and 2. The utility of direction
LD2 is less clear, but it might help us distinguish between "G" and "M". Direction LD1 is essentially the
trend line on the scatterplot.
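Such discriminant plots are typically produced with the lda function in the MASS package; here is a sketch with hypothetical object names, not the course's own code.

library(MASS)
# 'gene1' and 'gene2' are the two selected genes in the data frame 'genes2' and
# 'type' is the marrow type of each sample.
ld  <- lda(type ~ gene1 + gene2, data = genes2)
lds <- predict(ld)$x                     # each sample projected onto LD1 and LD2
plot(lds, type = "n")
text(lds, labels = genes2$type)          # label each sample by its marrow type
predict(ld)$class                        # LDA's classification of the training samples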
Using the 10 most differentially expressed genes allows us to find 3 discriminant functions. Plotting
them pairwise gives the plot below:
Note that LD1 on this plot is not the same as on the previous plot, as it is sitting in a 10-dimensional
space instead of a 2-dimensional space. However, by concentrating on the first column of the
scatterplot matrix (LD1 is the x-axis for both plots) we can see that it is a very similar direction.
Samples with LD1 < -2 are mainly "N" and those with LD1 between -2 and 2 are mainly "G". Directions LD2 and LD3
help to disentangle the "G", "M" and "S" samples, but not well.
We saw from the scatterplot matrix of the features that at least the first 5 features have a very similar
pattern across the samples. We might do better in classification if we could find genes that are
differentially expressed, but have different patterns than these. There are a number of ways to do this.
One method is variable selection - start with a larger set of genes and then sequentially add genes that
do a good job of classification. However, when selecting variables we are very likely to overfit so
methods like the bootstrap and cross-validation must be used to provide less biased estimates of the
quality of the classifier.
The idea behind the basic support vector machine (SVM) is similar to LDA - find a hyperplane
separating the classes. In LDA, the separating hyperplane separates the means of the samples in each
class, which is suitable when the data are sitting inside an ellipsoid or other convex set. However,
intuitively what we really want is a rule that separates the samples in the different classes which appear
to be closest. When the classes are separable (i.e. can be separated by a hyperplane without
misclassification), the support vector machine (SVM) finds the hyperplane that maximizes the minimum
distance from the hyperplane to the nearest point in each set. The nearest points are called the support
vectors.
In discriminant analysis, the assumption is that the samples in each class cluster around the class mean.
Using that method, the focus is on separating the ellipsoids containing the bulk
of the samples within each class.
Using a SVM, the focus is on separating the closest points in different classes. The first step is to
determine which point or points are closest to the other classes. These points are called support
vectors because each sample is described by its vector of values on the genes (e.g. expression
values). After identifying the support vectors (circled below) for each pair of classes we compute the
separating hyperplane by computing the shortest distance from each support vector to the line. We
then select the line that maximizes the sum of these distances. If it is impossible to completely separate the
classes, then a penalty is applied for each misclassification - the selected line then maximizes the
sum of distances minus the penalties, which balances the distances from the support vectors to the line
against the misclassifications.
The separating hyperplane used in LDA is sensitive to all the data in each class because you are using
the entire ellipse of data and its center. The separating hyperplane used in SVM is sensitive only to the
data on the boundary, which provides computational savings and makes it more robust to outliers.
Suppose the data vector for the ith sample is xi , which is the expression vector for several genes. Also,
suppose we have only 2 classes denoted by yi = +1 or -1. The hyperplane is defined by a vector W
and a scalar b such that
yi (xi′ W + b) − 1 ≥ 0
for all i, with equality for the support vectors, and ||W|| is minimal. After
computing W and b using the training set, a new sample x is classified according to the sign of
x′ W + b, i.e. to the value of y satisfying the inequality.
If the clusters overlap, the "obvious" extension is to find the hyperplane that still separates the bulk of the
samples as well as possible, while penalizing misclassified samples.
This is the solution to the optimization problem: yi (xi′ W + b) − 1 + zi ≥ 0 for all i, with equality for
the support vectors, where all zi ≥ 0 and ||W|| + CΣi zi is minimal for some C.
The idea is that some points can be on the wrong side of the separating hyperplane, but there is a
penalty for each misclassification. The hyperplane is now found by minimizing a criterion that still
separates the support vectors as much as possible in this orthogonal direction, but also minimizes the
number of misclassifications.
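In R, this soft-margin SVM is available through the svm function in the e1071 package; here is a sketch using the same hypothetical data objects as before.

library(e1071)
# A linear, soft-margin SVM. The 'cost' argument is the C in ||W|| + C * sum(z_i),
# i.e. how heavily misclassifications are penalized relative to the margin.
sv <- svm(type ~ ., data = genes, kernel = "linear", cost = 1)
sv$index                                 # which training samples are the support vectors
table(fitted(sv), genes$type)            # training-set confusion matrix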
It is not clear that SVM does much better than linear discriminant analysis when the groups are actually
elliptical. When the groups are not elliptical because of outliers, SVM does better because the criterion
is only sensitive to the boundaries and not sensitive to the entire shape. However, like LDA, SVM
requires groups that can be separated by planes. Neither method will do well for an example like the
one below. However, there is a method called the kernel method (sometimes called the kernel trick) that
works for more general examples.
Suppose that we suspected that the clusters were in concentric circles (or spheres) as seen above. If
we designate the very center of the points as the origin, we can compute the distance from each point
to the center as a new variable. In this case, the clusters separate very nicely on the radius, creating a
circular boundary.
The general idea is to add more variables that are nonlinear functions of the original variables until we
have enough dimensions to separate the groups.
The kernel trick uses the fact that for both LDA and SVM, only the dot products of the data vectors, xi′ xj,
are used in the computations. Let's call K(xi , xj ) = xi′ xj. We replace K with a more general function
called a kernel function and use K(xi , xj ) as if it were the dot product in our algorithm. (For those who
are more mathematically inclined, the kernel function needs to be positive definite, and we will actually
be using the dot product in the basis described by its eigenfunctions.)
The most popular kernels are the radial kernel and the polynomial kernel. The radial kernel (sometimes
called the Gaussian kernel) is K(xi , xj ) = exp(−||xi − xj ||² / (2σ²)).
Here is an animation that shows how "kernel" SVM works with the concentric circles using the radial
kernel.
Here is another example of essentially the same problem using a polynomial kernel instead of the radial
(Normal density) kernel: K(xi , xj ) = (xi′ xj + 1)².
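To make the two kernels concrete, here they are as small R functions. This is an illustration of the formulas, not code from this lesson; sigma and the polynomial degree are tuning choices.

# x and y are feature vectors; replacing the ordinary dot product sum(x * y)
# by one of these functions is the "kernel trick".
radial_kernel <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))
poly_kernel   <- function(x, y, d = 2)     (sum(x * y) + 1)^d
radial_kernel(c(1, 0), c(0, 1))            # between 0 (far apart points) and 1 (identical points)
poly_kernel(c(1, 0), c(0, 1))              # (0 + 1)^2 = 1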
The plot below shows the results of ordinary (linear) SVM on our cancer data. I believe the blue "tail" is
an artifact of the plotting routine - the boundaries between the groups should be straight lines. The data
points marked with "X" are the support points - there are a lot of them because the data are so strung
out.
I also used SVM with the radial kernel (below). The plot looks nicer, but there are actually more support
points (15 instead of 12) and the same 2 samples are misclassified.
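A sketch of how these two fits and counts could be reproduced follows; the object names are hypothetical, and the numbers quoted above come from the lesson's own fits, not from this sketch.

# 'genes2' holds the two genes and 'type' the marrow type of each sample.
sv_lin <- svm(type ~ ., data = genes2, kernel = "linear")
sv_rad <- svm(type ~ ., data = genes2, kernel = "radial")
c(linear = length(sv_lin$index), radial = length(sv_rad$index))  # number of support points in each fit
plot(sv_rad, genes2)                     # e1071 marks the support points with "X" on this plot
sum(fitted(sv_rad) != genes2$type)       # misclassified training samples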